dclm-pool-7b-2x

Verified

Large-scale web text dataset for LLM pretraining by mlfoundations.

Dataset | AI & Machine Learning | 143K/mo | Free
Updated 2026-06-16

What is dclm-pool-7b-2x?

dclm-pool-7b-2x is a web-text corpus from mlfoundations, intended for LLM pretraining and data composition experiments.

It is hosted on Hugging Face and aimed at researchers running large-scale language model training.

What you can build with dclm-pool-7b-2x

Pretrain 7B-scale language models

Use the pool as the primary training corpus to train or continue-pretrain decoder-only models around 7 billion parameters.

Data filtering and ablation studies

Run experiments that subsample or re-weight portions of the pool to measure the impact of different curation strategies on downstream performance.
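
One way to subsample reproducibly is to hash a stable key for each record and keep it when the hash falls below the target fraction. The sketch below is illustrative only: the helper name `keep_record`, the seed string, and the use of a URL as the key are assumptions, not part of the dataset's documentation.

```python
import hashlib


def keep_record(key: str, fraction: float, seed: str = "ablation-0") -> bool:
    """Return True for roughly `fraction` of keys, deterministically per (seed, key)."""
    digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a number between 0 and 1.
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction


# Toy keys stand in for per-record identifiers (hypothetical URLs).
urls = [f"https://example.com/page/{i}" for i in range(10_000)]
subset = [u for u in urls if keep_record(u, fraction=0.1)]
print(f"kept {len(subset)} of {len(urls)} records")
```

Changing `seed` gives a different subset of about the same size, which is convenient when an ablation needs several independent samples.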

Synthetic data mixing research

Combine slices of the pool with other public datasets to study optimal mixing ratios for instruction-tuned or domain-adapted models.
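
A mixing ratio can be prototyped by drawing each next example from one of several sources with fixed probabilities. The following is a minimal sketch on toy iterators; `mix_streams`, the source names, and the 70/30 weights are hypothetical and chosen only for illustration. The `datasets` library also offers `interleave_datasets` with a `probabilities` argument for the same purpose.

```python
import itertools
import random
from collections import Counter
from typing import Dict, Iterator, List, Tuple


def mix_streams(sources: Dict[str, Iterator], weights: Dict[str, float],
                n: int, seed: int = 0) -> List[Tuple[str, object]]:
    """Draw `n` examples, picking the source of each one according to `weights`.

    Assumes every source can supply as many examples as are requested from it.
    """
    rng = random.Random(seed)
    names = list(sources)
    probs = [weights[name] for name in names]
    mixed = []
    for _ in range(n):
        name = rng.choices(names, weights=probs, k=1)[0]
        mixed.append((name, next(sources[name])))
    return mixed


# Toy endless sources stand in for the pool and a second public dataset.
sources = {
    "pool": itertools.cycle(["web doc A", "web doc B"]),
    "other": itertools.cycle(["domain doc X"]),
}
mixed = mix_streams(sources, {"pool": 0.7, "other": 0.3}, n=10_000, seed=0)
counts = Counter(name for name, _ in mixed)
print(counts["pool"] / len(mixed))
```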

Load dclm-pool-7b-2x

Python
from datasets import load_dataset

ds = load_dataset("mlfoundations/dclm-pool-7b-2x")
  1. pip install datasets
  2. from datasets import load_dataset
  3. ds = load_dataset('mlfoundations/dclm-pool-7b-2x', split='train')
  4. Iterate over the dataset or stream it with streaming=True for large-scale training
  5. Save filtered subsets locally with ds.save_to_disk()
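
The streaming and filtering steps can be sketched as below. This is a hedged example, not the dataset's documented workflow: the record field name `text`, the helper names `take_filtered` and `stream_pool`, and the length threshold are assumptions. The filter runs on toy records so it can be tried offline; `stream_pool()` is defined but not called.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator


def take_filtered(records: Iterable[Dict], min_chars: int, limit: int,
                  text_key: str = "text") -> Iterator[Dict]:
    """Return an iterator over at most `limit` records with at least `min_chars` of text."""
    kept = (r for r in records if len(r.get(text_key, "")) >= min_chars)
    return islice(kept, limit)


def stream_pool():
    """Open the pool in streaming mode (needs `pip install datasets` and network access)."""
    from datasets import load_dataset  # imported lazily so the filter works offline
    return load_dataset("mlfoundations/dclm-pool-7b-2x", split="train", streaming=True)


# Offline demonstration on toy records; pass stream_pool() instead for the real data.
toy = [{"text": "short"}, {"text": "a considerably longer document"}, {"text": "x" * 50}]
sample = list(take_filtered(toy, min_chars=10, limit=2))
print(len(sample))
```

A streamed dataset is an `IterableDataset`, which has no `save_to_disk()`. One option is to materialize a filtered sample first, for example with `Dataset.from_list(sample)`, and save that.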

dclm-pool-7b-2x: pros & cons

Pros

  • Designed specifically for 7B-scale LLM pretraining
  • Publicly available via Hugging Face
  • Large, curated web-text pool
  • Supports streaming for memory-efficient access

Cons

  • Exact composition and filtering details not documented in the provided metadata
  • Size likely requires substantial storage and compute
  • License and redistribution terms not specified

Frequently asked questions

A large-scale text dataset released by mlfoundations intended for pretraining language models around the 7B parameter scale.
