FineFineWeb
Fine-grained multi-domain web corpus with iteration-wise token statistics.
What is FineFineWeb?
FineFineWeb is a web-derived text collection partitioned by fine-grained topical domains and accompanied by detailed token and sample statistics for each of three processing iterations.
It supports researchers and practitioners who need domain-specific web text for training, filtering, or benchmarking classification and generation models.
What you can build with FineFineWeb
Domain-specific pretraining
Train or continue-pretrain language models on targeted subsets such as aerospace or agronomy using the per-domain token counts provided.
Data filtering experiments
Compare the three iteration snapshots to study how successive filtering passes affect corpus quality and domain coverage.
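One simple way to compare iterations is to express each iteration's token count for a domain as a fraction of the first iteration. The sketch below is illustrative only: `retention_rates` is a hypothetical helper, and the counts are placeholder values, not real FineFineWeb statistics. Substitute the per-domain figures published with the corpus.

```python
from typing import List, Sequence


def retention_rates(token_counts: Sequence[float]) -> List[float]:
    """Return each iteration's token count as a fraction of the first iteration."""
    base = token_counts[0]
    if base <= 0:
        raise ValueError("the first iteration must have a positive token count")
    return [count / base for count in token_counts]


# Placeholder counts for one domain across three iterations.
# These are NOT real FineFineWeb numbers; they only make the sketch runnable.
placeholder_counts = [1000, 800, 600]

rates = retention_rates(placeholder_counts)
for iteration, rate in enumerate(rates, start=1):
    print(f"iteration {iteration}: {rate:.0%} of iteration 1")
```

A ratio above 1 would mean the domain slice grew between iterations, which the helper reports without special handling.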
Benchmark construction
Sample balanced or imbalanced domain slices to create evaluation sets that test model performance across fine-grained web topics.
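A balanced slice can be drawn by grouping records by domain and sampling the same number from each group. This is a minimal sketch, assuming each record is a dict with a 'domain' field; the helper name `sample_balanced` and the toy records are hypothetical. It holds all records in memory, so for the full corpus a streaming approach such as reservoir sampling would likely be a better fit.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List


def sample_balanced(records: Iterable[Dict], per_domain: int, seed: int = 0) -> List[Dict]:
    """Draw up to `per_domain` records from each domain, reproducibly."""
    by_domain = defaultdict(list)
    for record in records:
        # Assumption: every record has a 'domain' key.
        by_domain[record["domain"]].append(record)

    rng = random.Random(seed)
    sampled = []
    for domain in sorted(by_domain):  # sorted for a deterministic order
        pool = by_domain[domain]
        k = min(per_domain, len(pool))  # small domains contribute all they have
        sampled.extend(rng.sample(pool, k))
    return sampled


# Toy records standing in for real rows (hypothetical content).
toy_records = (
    [{"text": f"aero {i}", "domain": "aerospace"} for i in range(5)]
    + [{"text": f"agro {i}", "domain": "agronomy"} for i in range(2)]
)

balanced = sample_balanced(toy_records, per_domain=2)
```

Raising `per_domain` above the size of the smallest domain produces an imbalanced slice, since smaller domains are capped at their full size.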
Load FineFineWeb
from datasets import load_dataset
ds = load_dataset("m-a-p/FineFineWeb")
1. pip install datasets
2. from datasets import load_dataset
3. ds = load_dataset('m-a-p/FineFineWeb')
4. Access domain splits via the 'domain' column, or load specific configs if provided.
5. Iterate over the 'train' split and filter by domain name for custom subsets.
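Steps 4 and 5 can be sketched as a small filter over an iterable of records. This is a minimal sketch, assuming each record is a dict with a 'domain' field as the steps suggest; the helper name `filter_by_domain` and the sample records are hypothetical. The commented-out lines show how the same helper could wrap a streamed split, using the `streaming=True` option of `load_dataset` to avoid downloading the whole corpus first.

```python
from typing import Dict, Iterable, Iterator


def filter_by_domain(records: Iterable[Dict], domain: str) -> Iterator[Dict]:
    """Yield only the records whose 'domain' field equals `domain`."""
    for record in records:
        # Assumption: records carry a 'domain' key; records without it are skipped.
        if record.get("domain") == domain:
            yield record


# Stand-in records so the sketch runs offline (hypothetical content).
sample_records = [
    {"text": "Wing loading affects stall speed.", "domain": "aerospace"},
    {"text": "Crop rotation improves soil health.", "domain": "agronomy"},
    {"text": "Turbofan bypass ratios vary widely.", "domain": "aerospace"},
]

aerospace = list(filter_by_domain(sample_records, "aerospace"))

# With the real corpus, the same helper could wrap a streamed split:
# from datasets import load_dataset
# ds = load_dataset("m-a-p/FineFineWeb", split="train", streaming=True)
# for row in filter_by_domain(ds, "aerospace"):
#     ...
```

If the dataset exposes per-domain configs instead of a 'domain' column, loading the relevant config directly would avoid scanning unrelated rows.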
FineFineWeb: pros & cons
Pros
- Multi-billion-token web corpus with explicit domain labels
- Three successive filtering iterations allow quality ablation studies
- Token and sample counts reported per domain for easy budgeting
- Directly loadable via the Hugging Face datasets library
Cons
- Project page, arXiv paper, and license details are still marked 'coming soon'
- Extremely large total size may require substantial storage and compute
- Only web-sourced text; no curated or synthetic data included
Frequently asked questions
FineFineWeb is a large-scale, domain-labeled web text corpus released by m-a-p. It contains over 20 billion tokens across dozens of fine-grained domains and is published in three filtering iterations.