Is the dataset free to use?

Yes, it is publicly hosted on Hugging Face and can be streamed or downloaded at no cost.

What license applies?

License information is not stated on the dataset page; users should verify terms before commercial use.

How do I load it?

Use the Hugging Face datasets library with load_dataset('mlfoundations/dclm-baseline-1.0'). Streaming mode is recommended because of the corpus size.

dclm-baseline-1.0 — Free Dataset Docs, Examples & Alternatives (2026)

What is dclm-baseline-1.0?

DCLM-baseline is a 4T-token pretraining corpus built from 3 billion documents and intended for training large language models.

It is useful for researchers and organizations that require an open, high-volume dataset to train or benchmark 7B-scale models against closed-data baselines.

What you can build with dclm-baseline-1.0

Pretrain 7B-scale LLMs

Train base language models from scratch on 4T tokens to reach competitive CORE, MMLU and EXTENDED scores without relying on proprietary data mixtures.
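As a rough planning aid, the two headline figures can be turned into a token budget. This is only a back-of-the-envelope sketch: the totals are the rounded numbers quoted in this listing, and the function names and batch settings are hypothetical, not part of any DCLM tooling.

```python
# Figures quoted in this listing; both are rounded.
TOTAL_TOKENS = 4_000_000_000_000   # about 4T tokens
TOTAL_DOCS = 3_000_000_000         # about 3B documents


def avg_tokens_per_doc(tokens: int = TOTAL_TOKENS, docs: int = TOTAL_DOCS) -> float:
    """Mean document length implied by the two headline numbers."""
    return tokens / docs


def optimizer_steps(tokens: int, seqs_per_batch: int, seq_len: int) -> int:
    """Steps needed to see `tokens` once, rounding a final partial batch up."""
    tokens_per_step = seqs_per_batch * seq_len
    # Integer ceiling division avoids float rounding on very large counts.
    return (tokens + tokens_per_step - 1) // tokens_per_step
```

With the rounded totals, avg_tokens_per_doc() comes out at roughly 1,333 tokens per document. Calling optimizer_steps(TOTAL_TOKENS, seqs_per_batch=2048, seq_len=2048) gives the step count for one pass under those assumed settings.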

Data ablation studies

Compare filtering and deduplication pipelines by swapping in DCLM-baseline subsets and measuring downstream benchmark deltas.
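The subset-building side of such an ablation might look like the sketch below. It is illustrative only: in_subset, exact_dedup, min_length and kept_counts are hypothetical helpers, the filters are deliberately simple stand-ins rather than the curation steps used for DCLM-baseline, and training and benchmark evaluation are out of scope here.

```python
import hashlib
from typing import Callable, Dict, Iterable, Iterator


def in_subset(text: str, keep_fraction: float, salt: str = "ablation-0") -> bool:
    """Deterministically assign a document to a subset by hashing its text."""
    digest = hashlib.sha256((salt + text).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform over [0, 2**64)
    return bucket < keep_fraction * 2**64


def exact_dedup(records: Iterable[Dict]) -> Iterator[Dict]:
    """Drop records whose text is identical to one already seen."""
    seen = set()
    for record in records:
        key = hashlib.sha256(record["text"].encode("utf-8")).digest()
        if key not in seen:
            seen.add(key)
            yield record


def min_length(n_chars: int) -> Callable[[Iterable[Dict]], Iterator[Dict]]:
    """Build a filter that keeps records with at least `n_chars` characters."""
    def apply(records: Iterable[Dict]) -> Iterator[Dict]:
        return (r for r in records if len(r["text"]) >= n_chars)
    return apply


def kept_counts(records: Iterable[Dict], pipelines: Dict[str, Callable]) -> Dict[str, int]:
    """Run each named pipeline over the same records and count the survivors."""
    records = list(records)  # materialize so every pipeline sees the same input
    return {name: sum(1 for _ in fn(records)) for name, fn in pipelines.items()}
```

Because in_subset depends only on the text and the salt, the same documents land in the subset on every run, which keeps comparisons between pipelines repeatable.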

Open replication of closed models

Match or exceed the performance of Llama 3 8B or Qwen2 7B using only the publicly released 3B-document corpus.

Load dclm-baseline-1.0

Python

from datasets import load_dataset

ds = load_dataset("mlfoundations/dclm-baseline-1.0")

1. pip install datasets
2. from datasets import load_dataset
3. ds = load_dataset('mlfoundations/dclm-baseline-1.0', split='train', streaming=True)
4. for batch in ds.iter(batch_size=1024): process(batch)
5. Use the 'text' field for next-token prediction training.
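The steps above can be sketched as a small helper that groups the 'text' field into batches. This is a minimal sketch, not an official loader: text_batches and preview are hypothetical names, and the helper accepts any iterable of records so it can be tried on in-memory data before touching the full corpus.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def text_batches(records: Iterable[Dict], batch_size: int = 1024,
                 text_field: str = "text") -> Iterator[List[str]]:
    """Yield lists of up to `batch_size` strings taken from `text_field`."""
    it = iter(records)
    while True:
        batch = [record[text_field] for record in islice(it, batch_size)]
        if not batch:
            return
        yield batch


def preview(n_batches: int = 2, batch_size: int = 4) -> None:
    """Print the size and first 80 characters of a few batches (needs network access)."""
    # Imported here so text_batches stays usable without the library installed.
    from datasets import load_dataset  # pip install datasets
    ds = load_dataset("mlfoundations/dclm-baseline-1.0", split="train", streaming=True)
    # islice stops early, so only a handful of records are ever fetched.
    for batch in islice(text_batches(ds, batch_size), n_batches):
        print(len(batch), batch[0][:80])
```

Calling preview() streams only the first few records, which is a cheap way to check the record layout before committing to a long run.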

dclm-baseline-1.0: pros & cons

Pros

+ 4 trillion tokens / 3 billion documents, openly available
+ Demonstrated strong benchmark results in the 7B regime
+ Ready to use with the Hugging Face datasets library
+ Enables fully open-data model training

Cons

– Extremely large; a full download requires significant storage and bandwidth
– No domain tags or metadata beyond raw text
– License and exact curation details not specified on the dataset card

Frequently asked questions

What is DCLM-baseline-1.0?

A 4-trillion-token pretraining corpus containing 3 billion documents, released to train competitive open language models.

Similar datasets

Other AI & machine learning options worth comparing.

FineNews

AI & Machine Learning · ksolovev

Verified

News dataset for AI and machine learning workflows.

Dataset · ↓ 1.5M · Free

hd_tmp

AI & Machine Learning · ayuo

Verified

Temporary AI/ML dataset for Hugging Face prototyping.

Dataset · ↓ 1.5M · Free

results

AI & Machine Learning · mteb

Verified

MTEB benchmark results for text embedding model evaluations.

Dataset · ↓ 1.3M · Free
