A 5-trillion-token dataset for language model pretraining, assembled from multiple open sources.
Zyda-2 aggregates multiple open-source datasets into a single 5-trillion-token corpus via deduplication and filtering. It spans web pages, educational material, mathematics, code, and research papers.
The dataset supports pretraining of large language models for text generation. It is intended for researchers and developers building models that require broad, high-quality token coverage.
Train 7B+ parameter models from scratch on the full 5T tokens for general-purpose text generation.
Use the filtered code, math, and scientific subsets to adapt models for programming or STEM tasks.
Compare model-based quality filters and cross-deduplication effects against raw web crawls.
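As a rough illustration of what cross-deduplication means when several corpora are merged, the sketch below drops any document whose normalized text has already been seen in an earlier source. The listing does not say how Zyda-2 performs deduplication, so the exact-match hashing used here is a simple stand-in, and the function names and toy documents are hypothetical.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()


def fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def cross_deduplicate(sources: dict[str, list[str]]) -> dict[str, list[str]]:
    """Keep the first occurrence of each document across all sources.

    Sources are visited in insertion order, so earlier sources take priority.
    """
    seen: set[str] = set()
    kept: dict[str, list[str]] = {}
    for name, docs in sources.items():
        kept[name] = []
        for doc in docs:
            key = fingerprint(doc)
            if key not in seen:
                seen.add(key)
                kept[name].append(doc)
    return kept


# Toy, made-up documents standing in for two source corpora (assumption).
toy_sources = {
    "source_a": ["The cat sat.", "Hello   world"],
    "source_b": ["hello world", "A new page."],
}
deduped = cross_deduplicate(toy_sources)
```

Exact hashing only catches documents that are identical after normalization. Near-duplicate detection would need a fuzzy method, which this sketch does not attempt.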
from datasets import load_dataset
ds = load_dataset("jobs-git/Zyda-2")
A 5-trillion-token language modeling dataset built from Zyda, FineWeb, DCLM, and Dolma with deduplication and quality filtering.
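At this scale, a plain load_dataset call may try to download the whole corpus before returning. A minimal sketch of a streaming alternative follows. It relies on the streaming=True option of load_dataset. The split name "train" and the helper names open_stream and preview are assumptions, since the listing does not document splits, configurations, or record fields.

```python
from itertools import islice
from typing import Any, Iterable


def open_stream(repo_id: str = "jobs-git/Zyda-2", split: str = "train") -> Iterable[dict[str, Any]]:
    """Open the dataset lazily; records are fetched as they are iterated.

    The split name "train" is an assumption, not taken from the listing.
    """
    # Imported here so preview() can be used without the library installed.
    from datasets import load_dataset

    # streaming=True returns an iterable dataset instead of downloading all files.
    return load_dataset(repo_id, split=split, streaming=True)


def preview(records: Iterable[dict[str, Any]], n: int = 3) -> list[dict[str, Any]]:
    """Return the first n records, reading no further into the stream."""
    return list(islice(records, n))


# Example use (needs network access and the `datasets` package):
#   for record in preview(open_stream(), n=3):
#       print(sorted(record))  # list the field names rather than assuming them
```

Printing the field names first avoids guessing the schema. If the repository defines several configurations, load_dataset may also need a configuration name, which the listing does not provide.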