dclm-baseline-1.0
4T-token open pretraining dataset for competitive 7B language models.
What is dclm-baseline-1.0?
DCLM-baseline is a 4T-token pretraining corpus built from 3 billion documents and intended for training large language models.
It is useful for researchers and organizations that require an open, high-volume dataset to train or benchmark 7B-scale models against closed-data baselines.
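As a rough sense of scale, the headline figures above imply an average document length and a token-to-parameter ratio. The sketch below is back-of-the-envelope arithmetic only. The 7B parameter count is an assumption taken from the "7B-scale" framing, and the 6 * N * D compute rule is a common approximation for dense transformer training, not a figure from this listing.

```python
# Back-of-the-envelope scale estimates from the listing's headline numbers.
TOTAL_TOKENS = 4e12   # "4T tokens" (from the listing)
TOTAL_DOCS = 3e9      # "3 billion documents" (from the listing)
MODEL_PARAMS = 7e9    # assumption: a 7B-parameter model


def avg_tokens_per_doc(tokens: float, docs: float) -> float:
    """Mean number of tokens per document."""
    return tokens / docs


def tokens_per_param(tokens: float, params: float) -> float:
    """Training tokens available per model parameter."""
    return tokens / params


def approx_train_flops(params: float, tokens: float) -> float:
    """Rough training compute using the common 6 * N * D approximation."""
    return 6 * params * tokens


print(f"avg tokens/doc:   {avg_tokens_per_doc(TOTAL_TOKENS, TOTAL_DOCS):,.0f}")
print(f"tokens per param: {tokens_per_param(TOTAL_TOKENS, MODEL_PARAMS):,.0f}")
print(f"approx FLOPs:     {approx_train_flops(MODEL_PARAMS, TOTAL_TOKENS):.2e}")
```

These numbers describe a single pass over the whole corpus; a real training run may use fewer tokens or a different model size.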
What you can build with dclm-baseline-1.0
Pretrain 7B-scale LLMs
Train base language models from scratch on 4T tokens to reach competitive CORE, MMLU and EXTENDED scores without relying on proprietary data mixtures.
Data ablation studies
Compare filtering and deduplication pipelines by swapping in DCLM-baseline subsets and measuring downstream benchmark deltas.
Open replication of closed models
Match or exceed the performance of Llama 3 8B or Qwen2 7B using only publicly released data: the provided 3B-document corpus.
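For the data ablation use case above, one possible workflow is to carve out a reproducible subset and then compare benchmark scores between two training runs. The sketch below is illustrative only: `in_subset` and `benchmark_deltas` are hypothetical helper names, hash-based sampling is one of several ways to pick a subset, and the scores are placeholder values, not measured results.

```python
import hashlib
from typing import Dict


def in_subset(text: str, keep_fraction: float, seed: str = "ablation-0") -> bool:
    """Deterministically decide whether a document falls in the subset.

    The same (seed, text) pair always gives the same answer, so the
    subset can be rebuilt without storing an index.
    """
    digest = hashlib.sha256((seed + text).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")      # integer in [0, 2**64)
    return bucket < int(keep_fraction * 2**64)      # integer comparison avoids float edge cases


def benchmark_deltas(baseline: Dict[str, float], variant: Dict[str, float]) -> Dict[str, float]:
    """Score difference (variant minus baseline) for benchmarks present in both runs."""
    return {name: variant[name] - baseline[name] for name in baseline if name in variant}


# Placeholder scores for illustration only; these are NOT real results.
baseline_scores = {"CORE": 0.5, "MMLU": 0.5}
variant_scores = {"CORE": 0.75, "MMLU": 0.25}
print(benchmark_deltas(baseline_scores, variant_scores))

docs = ["first document", "second document", "third document"]
print([d for d in docs if in_subset(d, keep_fraction=0.5)])
```

Changing `seed` yields a different but equally reproducible subset, which helps when repeating an ablation across several draws.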
Load dclm-baseline-1.0
from datasets import load_dataset
ds = load_dataset("mlfoundations/dclm-baseline-1.0")
- 1. pip install datasets
- 2. from datasets import load_dataset
- 3. ds = load_dataset('mlfoundations/dclm-baseline-1.0', split='train', streaming=True)
- 4. for batch in ds.iter(batch_size=1024): process(batch)
- 5. Use the 'text' field for next-token prediction training
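The steps above can be sketched as a small batching helper that pulls the 'text' field out of a stream of records. This is a minimal sketch, not a reference pipeline: `iter_text_batches` and `stream_dclm` are hypothetical helper names, and the demo runs on an in-memory sample so it works without network access or a multi-terabyte download.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def iter_text_batches(
    records: Iterable[Dict], batch_size: int, field: str = "text"
) -> Iterator[List[str]]:
    """Yield lists of up to `batch_size` strings taken from `field` of each record."""
    it = iter(records)
    while True:
        batch = [rec[field] for rec in islice(it, batch_size)]
        if not batch:
            return
        yield batch


def stream_dclm():
    """Open the dataset in streaming mode (needs `pip install datasets` and network access)."""
    # Imported lazily so the helper above works without the library installed.
    from datasets import load_dataset

    return load_dataset(
        "mlfoundations/dclm-baseline-1.0", split="train", streaming=True
    )


# Offline demo on stand-in records shaped like the dataset rows.
sample = [{"text": f"doc {i}"} for i in range(5)]
for batch in iter_text_batches(sample, batch_size=2):
    print(batch)
```

With network access, `iter_text_batches(stream_dclm(), 1024)` would batch the streamed records the same way. Streaming avoids downloading the full corpus before the first batch is available.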
dclm-baseline-1.0: pros & cons
Pros
- +4 trillion tokens / 3 billion documents, openly released
- +Demonstrated strong 7B-regime benchmark results
- +Ready-to-use on Hugging Face datasets library
- +Enables fully open-data model training
Cons
- –Extremely large; full download requires significant storage and bandwidth
- –No domain tags or metadata beyond raw text
- –License and exact curation details not specified on card
Frequently asked questions
What is dclm-baseline-1.0?
A 4-trillion-token pretraining corpus containing 3 billion documents, released to train competitive open language models.