dclm-baseline-1.0

Verified

4T-token open pretraining dataset for competitive 7B language models.

Dataset | AI & Machine Learning | 595K/mo | Free
Open dataset
Updated 2026-06-15

What is dclm-baseline-1.0?

DCLM-baseline is a 4T-token pretraining corpus built from 3 billion documents and intended for training large language models.

It is useful for researchers and organizations that require an open, high-volume dataset to train or benchmark 7B-scale models against closed-data baselines.
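The two headline figures also give a rough sense of scale per document. The sketch below uses only the numbers quoted in this listing (4T tokens, 3 billion documents); the 1-billion-token budget in the usage line is a hypothetical value chosen for illustration, and real documents vary widely in length, so treat the results as averages only.

```python
# Figures quoted in this listing.
TOTAL_TOKENS = 4 * 10**12      # 4T tokens
TOTAL_DOCUMENTS = 3 * 10**9    # 3 billion documents


def mean_tokens_per_document(total_tokens: int = TOTAL_TOKENS,
                             total_documents: int = TOTAL_DOCUMENTS) -> float:
    """Average number of tokens per document across the corpus."""
    return total_tokens / total_documents


def documents_for_budget(token_budget: int,
                         total_tokens: int = TOTAL_TOKENS,
                         total_documents: int = TOTAL_DOCUMENTS) -> int:
    """Approximate document count needed to reach token_budget.

    Assumes every document has the corpus-average length. Integer
    ceiling division avoids floating-point rounding surprises.
    """
    return -(-token_budget * total_documents // total_tokens)


if __name__ == "__main__":
    print(f"mean tokens per document: {mean_tokens_per_document():.0f}")
    # Hypothetical budget of 1 billion tokens for a small experiment.
    print(f"documents for 1B tokens: {documents_for_budget(10**9):,}")
```

On average this works out to roughly 1,333 tokens per document, so a 1-billion-token subset corresponds to about 750,000 documents.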

What you can build with dclm-baseline-1.0

Pretrain 7B-scale LLMs

Train base language models from scratch on 4T tokens to reach competitive CORE, MMLU and EXTENDED scores without relying on proprietary data mixtures.

Data ablation studies

Compare filtering and deduplication pipelines by swapping in DCLM-baseline subsets and measuring downstream benchmark deltas.
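As a minimal sketch of such a comparison, the code below runs a small in-memory sample through two pipeline variants and reports how many documents survive each. The stages shown (exact-match deduplication on normalized text and a minimum word count) are simple stand-ins chosen for illustration, not the pipeline used to build DCLM-baseline, and all function names are hypothetical. Training on each variant and measuring benchmark deltas is outside the scope of the sketch.

```python
import hashlib
from typing import Callable, Dict, Iterable, Iterator, List

Doc = Dict[str, str]
Stage = Callable[[Iterable[Doc]], Iterator[Doc]]


def exact_dedup(docs: Iterable[Doc]) -> Iterator[Doc]:
    """Yield only the first document for each normalized text."""
    seen = set()
    for doc in docs:
        # Normalize case and whitespace before hashing.
        normalized = " ".join(doc["text"].split()).lower()
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield doc


def min_length_filter(min_words: int) -> Stage:
    """Build a stage that drops documents shorter than min_words."""
    def apply(docs: Iterable[Doc]) -> Iterator[Doc]:
        return (d for d in docs if len(d["text"].split()) >= min_words)
    return apply


def run_pipeline(docs: Iterable[Doc], stages: List[Stage]) -> List[Doc]:
    """Apply each stage in order and materialize the result."""
    for stage in stages:
        docs = stage(docs)
    return list(docs)


# Toy sample standing in for a subset of the corpus.
sample = [
    {"text": "The quick brown fox jumps over the lazy dog."},
    {"text": "the quick  brown fox jumps over the lazy dog."},  # duplicate once normalized
    {"text": "Too short."},
    {"text": "A second, sufficiently long document about language models."},
]

baseline = run_pipeline(sample, [])
variant = run_pipeline(sample, [exact_dedup, min_length_filter(5)])
print(f"baseline: {len(baseline)} docs, variant: {len(variant)} docs")
```

Keeping each stage as a function from an iterable of documents to an iterator makes it easy to swap stages in and out between runs, which is the core of an ablation setup.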

Open replication of closed models

Attempt to match or exceed the performance of models such as Llama 3 8B or Qwen2 7B using only the publicly released 3B-document corpus.

Load dclm-baseline-1.0

Python
# Stream the data instead of downloading the full corpus up front.
from datasets import load_dataset

ds = load_dataset("mlfoundations/dclm-baseline-1.0", split="train", streaming=True)
  1. pip install datasets
  2. from datasets import load_dataset
  3. ds = load_dataset('mlfoundations/dclm-baseline-1.0', split='train', streaming=True)
  4. for batch in ds.iter(batch_size=1024): process(batch)  # process is a placeholder for your own function
  5. Use the 'text' field for next-token prediction training
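The steps above leave the per-batch processing open. The sketch below shows one possible shape for it: flattening batches into individual texts and cutting a token sequence into input/target windows shifted by one position, which is the usual setup for next-token prediction. The helper names and the byte-level `toy_tokenize` are hypothetical stand-ins (a real run would use a trained tokenizer), and the demo uses fake batches so it runs without network access.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def iter_texts(batches: Iterable[Dict[str, List[str]]]) -> Iterator[str]:
    """Flatten column-oriented batches ({'text': [...]}) into single strings."""
    for batch in batches:
        yield from batch["text"]


def toy_tokenize(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def next_token_pairs(token_ids: List[int],
                     context_len: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (inputs, targets) windows where targets are inputs shifted by one."""
    for start in range(0, len(token_ids) - 1, context_len):
        chunk = token_ids[start:start + context_len + 1]
        yield chunk[:-1], chunk[1:]


# Fake batches in the same shape that ds.iter(batch_size=...) yields.
fake_batches = [{"text": ["hello world", "open data"]}, {"text": ["dclm"]}]

for text in iter_texts(fake_batches):
    for inputs, targets in next_token_pairs(toy_tokenize(text), context_len=4):
        # Each target is the token that follows the input at the same index.
        assert inputs[1:] == targets[:-1]

# With the real dataset (requires network access and `pip install datasets`):
#   from datasets import load_dataset
#   ds = load_dataset("mlfoundations/dclm-baseline-1.0", split="train", streaming=True)
#   for text in iter_texts(ds.iter(batch_size=1024)):
#       ...
```

Writing the helpers against plain iterables keeps them testable offline; the streaming dataset is only needed at the call site.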

dclm-baseline-1.0: pros & cons

Pros

  • 4 trillion tokens / 3 billion documents, openly available
  • Demonstrated strong benchmark results in the 7B regime
  • Ready to use with the Hugging Face datasets library
  • Enables fully open-data model training

Cons

  • Extremely large; full download requires significant storage and bandwidth
  • No domain tags or metadata beyond raw text
  • License and exact curation details not specified on card

Frequently asked questions

dclm-baseline-1.0 is a 4-trillion-token pretraining corpus containing 3 billion documents, released for training competitive open language models.
