Is Zyda-2 free to use?

Yes, it is hosted on the Hugging Face Hub and accessible at no cost via the datasets library.

How do I access Zyda-2?

Load it directly with the Hugging Face datasets library using load_dataset('Zyphra/Zyda-2').

What license applies to Zyda-2?

License details are not provided in the dataset description; check the repository for terms before commercial use.

Zyda-2

A 5-trillion-token dataset blending web, math, code, and scientific sources for language modeling.

Dataset · Text & NLP · ↓ 212K/mo · Free

Updated 2026-06-18

What is Zyda-2?

Zyda-2 aggregates and refines multiple open-source datasets into a unified 5-trillion-token corpus. Sources include web crawls, educational materials, mathematical texts, programming code, and academic papers, processed via deduplication and model-based filtering.

It is designed for pre-training large language models, and models trained on it are reported to outperform models trained on any one of its source datasets alone. Researchers and developers working on text generation can use its mix of web, educational, math, code, and scientific text as a general-purpose training corpus.

What you can build with Zyda-2

Pre-train large language models

Use the 5T token corpus to train or continue pre-training foundation models on a mix of web, math, code, and scientific text.

Domain-adapted text generation

Fine-tune smaller models on the filtered educational, math, and code subsets for specialized generation tasks.

Data quality research

Study effects of cross-deduplication and quality filtering by comparing model performance on Zyda-2 versus its source datasets.
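As a rough illustration of what cross-deduplication means, the sketch below drops any document whose normalized text has already appeared in an earlier source. This is a toy exact-match version under stated assumptions: the listing does not describe the method actually used to build Zyda-2, and the function names, source names, and sample documents here are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def cross_deduplicate(sources: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Keep the first occurrence of each document across all sources.

    Assumption: exact hashing of normalized text stands in for whatever
    deduplication a real pipeline would use (often a fuzzy method).
    """
    seen = set()  # hashes of every document kept so far, across sources
    kept: Dict[str, List[str]] = {}
    for name, docs in sources.items():
        kept[name] = []
        for doc in docs:
            digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # duplicate of a document from this or an earlier source
            seen.add(digest)
            kept[name].append(doc)
    return kept


# Hypothetical toy sources; the second one repeats a document from the first.
toy_sources = {
    "source_a": ["The cat sat.", "Proof of lemma 1."],
    "source_b": ["the  cat sat.", "print('hello')"],
}
deduped = cross_deduplicate(toy_sources)
```

Because sources are processed in order, the earlier source keeps a shared document and the later one loses it, so the ordering of sources is itself a design choice worth studying.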

Load Zyda-2

Python

from datasets import load_dataset

ds = load_dataset("Zyphra/Zyda-2")

1. pip install datasets
2. from datasets import load_dataset
3. ds = load_dataset('Zyphra/Zyda-2', split='train', streaming=True)
4. Iterate over the dataset with a DataLoader or a custom training loop.
5. Apply tokenization and any additional filtering your training setup requires.
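The last two steps can be sketched as plain Python generators that filter and batch any stream of records. This is a minimal sketch, not a definitive pipeline: the helper names are hypothetical, the "text" field name is an assumption about the record layout, and a small in-memory list stands in for the streamed dataset so the example runs offline.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def filter_by_length(
    records: Iterable[Dict], min_chars: int, text_key: str = "text"
) -> Iterator[Dict]:
    """Lazily yield records whose text field has at least min_chars characters."""
    for record in records:
        if len(record.get(text_key) or "") >= min_chars:
            yield record


def batched(records: Iterable[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Group a (possibly very long) stream into lists of up to batch_size records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for the streamed dataset so the sketch runs without network access.
# With the datasets library installed, sample_stream could instead be:
#   load_dataset('Zyphra/Zyda-2', split='train', streaming=True)
# (assumption: each record exposes its document under the "text" key).
sample_stream = [
    {"text": "short"},
    {"text": "a longer document about linear algebra"},
    {"text": "def add(a, b):\n    return a + b"},
    {"text": ""},
    {"text": "another sufficiently long web page"},
]

batches = list(batched(filter_by_length(sample_stream, min_chars=20), batch_size=2))
```

Both helpers are lazy, so nothing is held in memory beyond the current batch; tokenization can be added as another generator stage in the same chain.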

Zyda-2: pros & cons

Pros

+5 trillion tokens after deduplication
+Diverse mix of web, math, code, and papers
+Quality-filtered and cross-deduplicated
+Directly loadable via Hugging Face datasets

Cons

–Extremely large size requires streaming or significant storage
–License and exact terms not specified in dataset card
–No built-in train/validation splits documented

Frequently asked questions

What is Zyda-2?

Zyda-2 is a 5-trillion-token language modeling dataset assembled from Zyda, FineWeb, DCLM, and Dolma. It covers web, educational, math, code, and scientific content.


Similar datasets

KakologArchives

wikitext

gsm8k
