Is this dataset free to use?

Yes, it is hosted publicly on the Hugging Face Hub and can be downloaded at no cost.

How do I access the dataset?

Load it directly with the Hugging Face datasets library using load_dataset('mlfoundations/dclm-pool-7b-2x').

What license applies to this data?

License information is not provided in the current dataset card; users should check the repository for terms before commercial use.

dclm-pool-7b-2x — Free Dataset Docs, Examples & Alternatives (2026)

What is dclm-pool-7b-2x?

It is a web-text corpus tagged for LLM pretraining and data composition experiments.

It is aimed at researchers who run large-scale language model training and source their data from the Hugging Face Hub.

What you can build with dclm-pool-7b-2x

Pretrain 7B-scale language models

Use the pool as the primary corpus to train, or continue pretraining, decoder-only models of around 7 billion parameters.

Data filtering and ablation studies

Run experiments that subsample or re-weight portions of the pool to measure the impact of different curation strategies on downstream performance.
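As a rough illustration of the subsampling idea, the sketch below keeps a deterministic fraction of records by hashing their text, so the same subset is selected on every run. The `text` field name, the `salt` value, and the helper names are assumptions made for this example; the dataset card does not document the schema, so check a sample record first.

```python
import hashlib
from typing import Callable, Dict, Iterable, Iterator

Record = Dict[str, str]


def keep_fraction(key: str, fraction: float, salt: str = "ablation-0") -> bool:
    """Deterministically keep roughly `fraction` of keys by hashing them."""
    digest = hashlib.sha256((salt + key).encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64) and compare it
    # against the threshold that corresponds to the requested fraction.
    return int.from_bytes(digest[:8], "big") < fraction * 2**64


def subsample(
    records: Iterable[Record],
    fraction: float,
    predicate: Callable[[Record], bool] = lambda record: True,
    text_field: str = "text",  # assumed field name, not confirmed by the card
) -> Iterator[Record]:
    """Yield records that pass `predicate` and fall inside the hashed sample."""
    for record in records:
        if predicate(record) and keep_fraction(record[text_field], fraction):
            yield record
```

Changing `salt` selects a different but equally reproducible subset, which can help when comparing several ablation runs at the same sampling rate.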

Synthetic data mixing research

Combine slices of the pool with other public datasets to study optimal mixing ratios for instruction-tuned or domain-adapted models.
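One way to prototype a mixing ratio is to draw each next record from one of several sources with a fixed probability. The sketch below does this in plain Python over any iterables; the function name and the stop-at-first-exhausted behaviour are choices made for this example, not something the listing prescribes. The datasets library also ships `interleave_datasets`, which accepts a `probabilities` argument and may be a better fit for real runs.

```python
import random
from typing import Any, Dict, Iterable, Iterator, Sequence


def mix_streams(
    sources: Sequence[Iterable[Dict[str, Any]]],
    weights: Sequence[float],
    seed: int = 0,
) -> Iterator[Dict[str, Any]]:
    """Yield records sampled from `sources` in proportion to `weights`.

    Stops as soon as the source chosen for the next draw is exhausted.
    """
    if not sources or len(sources) != len(weights):
        raise ValueError("sources and weights must be non-empty and equal in length")
    rng = random.Random(seed)  # seeded so the mixing order is reproducible
    iterators = [iter(source) for source in sources]
    indices = range(len(iterators))
    while True:
        chosen = rng.choices(indices, weights=weights, k=1)[0]
        try:
            yield next(iterators[chosen])
        except StopIteration:
            return
```

With weights of `[3, 1]`, about three quarters of the emitted records come from the first source.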

Load dclm-pool-7b-2x

Python

from datasets import load_dataset

ds = load_dataset("mlfoundations/dclm-pool-7b-2x")

1. pip install datasets
2. from datasets import load_dataset
3. ds = load_dataset('mlfoundations/dclm-pool-7b-2x', split='train')
4. Iterate over the dataset, or pass streaming=True to load_dataset to stream it for large-scale training.
5. Save filtered subsets locally with ds.save_to_disk() (available when the dataset is loaded without streaming).
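The steps above can be sketched as a small streaming helper. This is a minimal example under assumptions: it takes the `train` split named in step 3 at face value, and the helper names `open_pool_stream` and `peek` are made up for illustration. Because the record schema is not documented in this listing, the usage note only prints field names rather than relying on any particular one.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def open_pool_stream():
    """Open the pool in streaming mode (needs network access and `pip install datasets`)."""
    # Imported lazily so that `peek` below can be used without the library installed.
    from datasets import load_dataset

    return load_dataset("mlfoundations/dclm-pool-7b-2x", split="train", streaming=True)


def peek(records: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, Any]]:
    """Return the first `n` records without consuming the rest of the stream."""
    return list(islice(records, n))


# Usage (fetches only the first few records over the network):
# for record in peek(open_pool_stream(), n=3):
#     print(sorted(record.keys()))
```

Inspecting a handful of records this way is a cheap check of the schema before committing storage and compute to a full download.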

dclm-pool-7b-2x: pros & cons

Pros

+ Designed specifically for 7B-scale LLM pretraining
+ Publicly available via Hugging Face
+ Large, curated web-text pool
+ Supports streaming for memory-efficient access

Cons

- Exact composition and filtering details not documented in the provided metadata
- Size likely requires substantial storage and compute
- License and redistribution terms not specified

Frequently asked questions

What is dclm-pool-7b-2x?

A large-scale text dataset released by mlfoundations, intended for pretraining language models at around the 7B-parameter scale.

Similar datasets

Other AI & machine learning options worth comparing.

FineNews

AI & Machine Learning · ksolovev

Verified

News dataset for AI and machine learning workflows.

Dataset · ↓ 1.5M · Free

hd_tmp

AI & Machine Learning · ayuo

Verified

Temporary AI/ML dataset for Hugging Face prototyping.

Dataset · ↓ 1.5M · Free

results

AI & Machine Learning · mteb

Verified

MTEB benchmark results for text embedding model evaluations.

Dataset · ↓ 1.3M · Free
