Is Zyda-2 free?

Yes, it is publicly available via the Hugging Face Hub.

How do I access Zyda-2?

Load it with the Hugging Face datasets library by calling load_dataset('jobs-git/Zyda-2').

What license does it use?

Check the dataset card on Hugging Face for the exact license terms.

Zyda-2

A 5-trillion-token dataset for language model pretraining from combined open sources.

Dataset · Text & NLP · ↓ 177K/mo · Free

Updated 2026-06-18

What is Zyda-2?

Zyda-2 aggregates multiple open-source datasets into a single 5-trillion-token corpus via deduplication and filtering. It spans web pages, educational material, mathematics, code, and research papers.

The dataset supports pretraining of large language models for text generation. It is intended for researchers and developers building models that require broad, high-quality token coverage.

What you can build with Zyda-2

Pretrain large language models

Train 7B+ parameter models from scratch on the full 5T tokens for general-purpose text generation.

Fine-tune code and math assistants

Use the filtered code, math, and scientific subsets to adapt models for programming or STEM tasks.

Benchmark data filtering pipelines

Compare model-based quality filters and cross-deduplication effects against raw web crawls.

Load Zyda-2

Python

from datasets import load_dataset

ds = load_dataset("jobs-git/Zyda-2")

1. Install the library: pip install datasets
2. Import the loader: from datasets import load_dataset
3. Open the dataset in streaming mode: dataset = load_dataset('jobs-git/Zyda-2', streaming=True)
4. Access the train split and iterate over examples.
5. Tokenize with your tokenizer and feed the tokens to a training loop.
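
The last two steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the dataset's official recipe: the helper names stream_zyda2 and pack_tokens are hypothetical, the "text" field name is assumed rather than confirmed by this listing, and the repository may require a config name that is not shown here. Check the dataset card before relying on any of these.

```python
from typing import Callable, Iterable, Iterator, List


def stream_zyda2(repo_id: str = "jobs-git/Zyda-2", split: str = "train"):
    """Open the dataset lazily so nothing is downloaded up front.

    Assumption: the repo exposes a split named "train" and needs no config name.
    """
    # Imported here so pack_tokens below can be used without the library installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)


def pack_tokens(
    examples: Iterable[dict],
    tokenize: Callable[[str], List],
    block_size: int,
    text_field: str = "text",  # assumed field name, verify on the dataset card
) -> Iterator[List]:
    """Tokenize streamed examples and yield fixed-length token blocks."""
    buffer: List = []
    for example in examples:
        buffer.extend(tokenize(example[text_field]))
        # Emit as many full blocks as the buffer currently holds.
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
    # Tokens left over at the end of the stream are dropped in this sketch.
```

A training loop would then iterate over pack_tokens(stream_zyda2(), tokenizer.encode, 2048), where tokenizer is whichever tokenizer the model uses. Packing into fixed-length blocks keeps batch shapes constant, at the cost of documents running into one another within a block.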

Zyda-2: pros & cons

Pros

+ 5-trillion-token scale
+ Cross-deduplicated across four sources
+ Model-based quality filtering applied
+ Mix of web, code, math, and papers

Cons

– Requires massive storage or streaming
– License not detailed in description
– No built-in train/validation splits mentioned
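
Where no validation split is provided, one common workaround is to hold out a small, deterministic slice of the stream. The sketch below is illustrative only: is_validation and split_stream are hypothetical helper names, the "text" field is an assumption, and hashing the document text is just one of several reasonable ways to choose a hold-out set.

```python
import hashlib
from typing import Iterable, Iterator, Tuple


def is_validation(text: str, holdout_fraction: float = 0.001) -> bool:
    """Map a document to a stable value in [0, 1) and compare it to the fraction."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < holdout_fraction


def split_stream(
    examples: Iterable[dict],
    holdout_fraction: float = 0.001,
    text_field: str = "text",  # assumed field name, verify on the dataset card
) -> Iterator[Tuple[str, dict]]:
    """Label each streamed example as "train" or "validation"."""
    for example in examples:
        held_out = is_validation(example[text_field], holdout_fraction)
        yield ("validation" if held_out else "train"), example
```

Because the assignment depends only on the document text, the same example lands in the same split on every pass, which matters when streaming without a fixed on-disk copy.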

Frequently asked questions

What is Zyda-2?

A 5-trillion-token language modeling dataset built from Zyda, FineWeb, DCLM, and Dolma with deduplication and quality filtering.

Similar datasets

KakologArchives

wikitext

gsm8k
