A 5-trillion-token dataset blending web, math, code, and scientific sources for language modeling.
Zyda-2 aggregates and refines multiple open-source datasets into a unified 5-trillion-token corpus. Sources include web crawls, educational materials, mathematical texts, programming code, and academic papers, processed via deduplication and model-based filtering.
It is designed for training large language models, with reported performance gains over models trained on any of its source datasets alone. Researchers and developers working on text generation can use its diverse, filtered content to improve model capabilities.
Use the 5-trillion-token corpus to train or continue pre-training foundation models on a mix of web, math, code, and scientific text.
Fine-tune smaller models on the filtered educational, math, and code subsets for specialized generation tasks.
Study effects of cross-deduplication and quality filtering by comparing model performance on Zyda-2 versus its source datasets.
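To make the idea of cross-deduplication concrete, here is a minimal sketch that removes documents appearing in more than one source. It assumes exact matching on normalized text via SHA-256, which is a simplification chosen for illustration and not necessarily the method used to build Zyda-2. The names `normalize`, `cross_deduplicate`, and the toy sources are hypothetical.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return " ".join(text.lower().split())


def cross_deduplicate(sources: dict[str, list[str]]) -> dict[str, list[str]]:
    """Keep the first occurrence of each document across all sources.

    Sources are visited in insertion order, so earlier sources win ties.
    """
    seen: set[str] = set()
    kept: dict[str, list[str]] = {}
    for name, docs in sources.items():
        kept[name] = []
        for doc in docs:
            digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept[name].append(doc)
    return kept


# Hypothetical toy corpora, not real samples from any dataset.
toy = {
    "source_a": ["The cat sat.", "Prime numbers are infinite."],
    "source_b": ["the  cat SAT.", "def add(a, b): return a + b"],
}
deduped = cross_deduplicate(toy)
```

Comparing per-source counts before and after such a pass gives a rough sense of how much the sources overlap; near-duplicate detection would need a fuzzier technique than exact hashing.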
from datasets import load_dataset
ds = load_dataset("Zyphra/Zyda-2")
A 5-trillion-token language modeling dataset assembled from Zyda, FineWeb, DCLM, and Dolma with web, educational, math, code, and scientific content.
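Because the corpus is very large, it may be more practical to stream a few records than to download everything first. The sketch below uses the `streaming=True` option of `load_dataset`. It assumes a `train` split exists and that the repository resolves without an explicit configuration name; neither is confirmed here. The helpers `preview` and `stream_zyda2` are hypothetical names.

```python
from itertools import islice
from typing import Any, Iterable, Iterator


def preview(records: Iterable[Any], n: int = 3) -> Iterator[Any]:
    """Return an iterator over at most the first n records."""
    return islice(records, n)


def stream_zyda2(n: int = 3) -> list:
    """Fetch the first n records by streaming (needs network access)."""
    # Imported here so preview() is usable without the datasets package.
    from datasets import load_dataset

    # Assumption: a "train" split exists and no config name is required.
    ds = load_dataset("Zyphra/Zyda-2", split="train", streaming=True)
    return list(preview(ds, n))
```

Calling `stream_zyda2(3)` should return a short list of records for inspecting field names before committing to a full training run.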