fineweb
Cleaned and deduplicated English web data from CommonCrawl for LLM training.
What is fineweb?
FineWeb consists of more than 18.5 trillion tokens of cleaned and deduplicated English web data extracted from CommonCrawl. The dataset was processed with the datatrove library, and the processing pipeline was optimized for downstream performance of large language models trained on it.
It supports text-generation tasks and is useful for researchers and developers building or evaluating large language models.
Dataset structure
| Subset | Split | Rows |
|---|---|---|
| default | train | 25,886,364,489 |
| CC-MAIN-2013-20 | train | 215,280,647 |
| CC-MAIN-2013-48 | train | 220,361,877 |
| CC-MAIN-2014-10 | train | 216,533,931 |
| CC-MAIN-2014-15 | train | 200,396,707 |
| CC-MAIN-2014-23 | train | 234,422,740 |
| CC-MAIN-2014-35 | train | 216,436,591 |
| CC-MAIN-2014-41 | train | 223,672,312 |
| CC-MAIN-2014-42 | train | 200,827,578 |
| CC-MAIN-2014-49 | train | 171,461,960 |
| CC-MAIN-2014-52 | train | 211,679,967 |
| CC-MAIN-2015-06 | train | 193,273,022 |
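Each subset in the table can be requested by name when loading. The sketch below assumes the subset names are passed as the `name` argument of `load_dataset` and that streaming is preferred to a full download; `LISTED_ROWS`, `listed_share`, and `load_snapshot` are hypothetical names introduced for illustration. It also adds up the row counts shown above.

```python
# Row counts copied from the table above; `default` is kept separate.
DEFAULT_ROWS = 25_886_364_489
LISTED_ROWS = {
    "CC-MAIN-2013-20": 215_280_647,
    "CC-MAIN-2013-48": 220_361_877,
    "CC-MAIN-2014-10": 216_533_931,
    "CC-MAIN-2014-15": 200_396_707,
    "CC-MAIN-2014-23": 234_422_740,
    "CC-MAIN-2014-35": 216_436_591,
    "CC-MAIN-2014-41": 223_672_312,
    "CC-MAIN-2014-42": 200_827_578,
    "CC-MAIN-2014-49": 171_461_960,
    "CC-MAIN-2014-52": 211_679_967,
    "CC-MAIN-2015-06": 193_273_022,
}


def listed_share(rows=LISTED_ROWS, total=DEFAULT_ROWS):
    """Fraction of the `default` row count covered by the listed snapshots."""
    return sum(rows.values()) / total


def load_snapshot(name):
    """Stream one snapshot subset (hypothetical helper; needs network access)."""
    # Imported lazily so the arithmetic above works without the library installed.
    from datasets import load_dataset

    return load_dataset(
        "HuggingFaceFW/fineweb", name=name, split="train", streaming=True
    )


if __name__ == "__main__":
    print(f"listed snapshots: {sum(LISTED_ROWS.values()):,} rows")
    print(f"share of default: {listed_share():.1%}")
```

The eleven listed snapshots add up to 2,304,347,332 rows, roughly 9% of the `default` row count, so a single snapshot is a much lighter starting point than the full dataset.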
What you can build with fineweb
Pretrain large language models
Use the 18.5T tokens of cleaned English web text as the primary pretraining corpus for building or replicating LLMs from scratch.
Benchmark data filtering pipelines
Compare custom cleaning or deduplication methods against the datatrove-based pipeline used to create FineWeb.
Continued pretraining or domain adaptation
Load subsets of FineWeb to continue pretraining existing English models on fresh web-scale data.
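For the filtering benchmark use case, one simple starting point is to measure how many documents a custom filter keeps on a small sample. The sketch below is not the datatrove pipeline used for FineWeb: it assumes documents are dicts with a `text` field, and the word-count filter, the exact-match deduplication, and all function names are hypothetical stand-ins for whatever method is being compared.

```python
import hashlib


def min_words_filter(doc, min_words=5):
    """Toy quality filter: keep documents with at least `min_words` words."""
    return len(doc["text"].split()) >= min_words


def exact_dedup(docs):
    """Yield documents whose normalized text has not been seen before."""
    seen = set()
    for doc in docs:
        normalized = doc["text"].strip().lower()
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def retention_rate(docs, keep=min_words_filter):
    """Share of input documents that survive filtering and deduplication."""
    docs = list(docs)
    if not docs:
        return 0.0
    kept = list(exact_dedup(doc for doc in docs if keep(doc)))
    return len(kept) / len(docs)


# Synthetic sample: one short document, one duplicate pair, one unique document.
sample = [
    {"text": "too short"},
    {"text": "The quick brown fox jumps over the lazy dog."},
    {"text": "the quick brown fox jumps over the lazy dog.  "},
    {"text": "FineWeb is built from CommonCrawl snapshots."},
]

if __name__ == "__main__":
    print(retention_rate(sample))
```

Running the same `retention_rate` over a streamed sample of FineWeb documents gives a rough sense of how much a custom filter would remove on top of the existing cleaning.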
Load fineweb
from datasets import load_dataset
ds = load_dataset("HuggingFaceFW/fineweb")
1. Install the library: pip install datasets
2. Import the loader: from datasets import load_dataset
3. Open the dataset in streaming mode: dataset = load_dataset('HuggingFaceFW/fineweb', streaming=True)
4. Access splits or samples via dataset['train']
5. Iterate over documents for training loops
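The steps above can be sketched as a small streaming loop. This is a minimal illustration, assuming each document exposes its content under a `text` field; `stream_fineweb`, `take_texts`, and `batched_texts` are hypothetical helper names, and the batching is a placeholder for a real tokenization and training step.

```python
from itertools import islice


def stream_fineweb():
    """Open the train split in streaming mode (needs network access)."""
    # Imported lazily so the helpers below can be used without the library.
    from datasets import load_dataset

    return load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)


def take_texts(dataset, n):
    """Return the `text` field of the first `n` documents."""
    return [doc["text"] for doc in islice(dataset, n)]


def batched_texts(dataset, batch_size):
    """Group document texts into lists of at most `batch_size` items."""
    batch = []
    for doc in dataset:
        batch.append(doc["text"])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


if __name__ == "__main__":
    dataset = stream_fineweb()
    for batch in islice(batched_texts(dataset, batch_size=8), 2):
        print(len(batch), "documents; first starts with:", batch[0][:60])
```

Streaming avoids downloading the whole dataset before the first document is read, which matters at this scale.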
fineweb: pros & cons
Pros
- Massive 18.5 trillion token scale
- Cleaned and deduplicated with datatrove
- Explicitly optimized for LLM performance
- Open replication of RefinedWeb
Cons
- English only
- Extremely large size requires heavy resources
- Web-sourced data may contain biases
Frequently asked questions
A dataset of over 18.5 trillion tokens of cleaned English web text from CommonCrawl, created as an open replication of RefinedWeb.