dclm-pool-7b-2x
Large-scale web text dataset for LLM pretraining by mlfoundations.
What is dclm-pool-7b-2x?
It is a web-text corpus hosted on Hugging Face and tagged for LLM pretraining and data composition experiments.
It is aimed at researchers who run large-scale language model training.
What you can build with dclm-pool-7b-2x
Pretrain 7B-scale language models
Use the pool as the primary training corpus to train or continue-pretrain decoder-only models around 7 billion parameters.
Data filtering and ablation studies
Run experiments that subsample or re-weight portions of the pool to measure the impact of different curation strategies on downstream performance.
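One reproducible way to subsample is to hash each document and keep it when the hash falls below the target rate, so the same documents are selected on every run. The sketch below is a minimal illustration under stated assumptions: records are dicts with a "text" field, and the names keep_record, subsample, and the seed string are hypothetical, not taken from the dataset card.

```python
import hashlib


def keep_record(key: str, rate: float, seed: str = "ablation-0") -> bool:
    """Decide deterministically whether a document belongs to the subsample."""
    digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
    # Compare the first 64 bits of the hash against rate * 2**64, so roughly
    # a `rate` fraction of all possible hash values pass.
    return int.from_bytes(digest[:8], "big") < int(rate * 2**64)


def subsample(records, rate, key_field="text", seed="ablation-0"):
    """Yield the records selected by keep_record; works on any iterable."""
    # ASSUMPTION: each record is a dict that has `key_field` (here "text").
    for record in records:
        if keep_record(record[key_field], rate, seed):
            yield record
```

Changing the seed gives a different subsample of about the same size, which is one way to repeat an ablation over several draws.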
Synthetic data mixing research
Combine slices of the pool with other public datasets to study optimal mixing ratios for instruction-tuned or domain-adapted models.
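A mixing ratio can be prototyped by drawing the next example from one of several sources with fixed probabilities. The following is a sketch in plain Python, not the method used by the dataset authors; mix_streams, the source names, and the weights are hypothetical.

```python
import random


def mix_streams(sources, weights, seed=0):
    """Yield (source_name, record) pairs sampled according to `weights`.

    `sources` maps a name to any iterable of records and `weights` maps the
    same names to sampling weights. Iteration stops as soon as the source
    that was picked has no records left.
    """
    rng = random.Random(seed)  # seeded so the mixing order is reproducible
    names = list(sources)
    iterators = {name: iter(sources[name]) for name in names}
    probs = [weights[name] for name in names]
    while True:
        name = rng.choices(names, weights=probs, k=1)[0]
        try:
            yield name, next(iterators[name])
        except StopIteration:
            return
```

The Hugging Face datasets library offers interleave_datasets with a probabilities argument for the same purpose; the version above only illustrates the idea without any dependency.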
Load dclm-pool-7b-2x
from datasets import load_dataset
ds = load_dataset("mlfoundations/dclm-pool-7b-2x")
1. pip install datasets
2. from datasets import load_dataset
3. ds = load_dataset('mlfoundations/dclm-pool-7b-2x', split='train')
4. Iterate over the dataset, or stream it with streaming=True for large-scale training
5. Save filtered subsets locally with ds.save_to_disk(), which takes an output path and is available only when the dataset is not streamed
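The streaming step above can be sketched as follows. The snippet assumes the 'train' split named in the steps exists; peek and open_pool_stream are hypothetical helper names, and the record fields are not documented here, so the usage comment only prints the keys for inspection.

```python
from itertools import islice


def peek(records, n=3):
    """Return the first n records of any iterable without reading the rest."""
    return list(islice(records, n))


def open_pool_stream(split="train"):
    """Open the pool in streaming mode; data files are fetched on demand."""
    # Imported here so peek() can be used without the library installed.
    from datasets import load_dataset

    return load_dataset(
        "mlfoundations/dclm-pool-7b-2x", split=split, streaming=True
    )


# Usage (needs network access and `pip install datasets`):
# for record in peek(open_pool_stream()):
#     print(sorted(record))  # inspect the available field names
```

Looking at a few records first is a cheap way to confirm field names before starting a long training or filtering job.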
dclm-pool-7b-2x: pros & cons
Pros
- Designed specifically for 7B-scale LLM pretraining
- Publicly available via Hugging Face
- Large, curated web-text pool
- Supports streaming for memory-efficient access
Cons
- Exact composition and filtering details not documented in the provided metadata
- Size likely requires substantial storage and compute
- License and redistribution terms not specified
Frequently asked questions
dclm-pool-7b-2x is a large-scale text dataset released by mlfoundations, intended for pretraining language models around the 7B-parameter scale.