fineweb

Verified

Cleaned and deduplicated English web data from CommonCrawl for LLM training.

Dataset · Text & NLP · 442K/mo · Free
Updated 2026-06-15

What is fineweb?

FineWeb consists of more than 18.5 trillion tokens of cleaned and deduplicated English web data extracted from CommonCrawl. The dataset was processed with the datatrove library, and the pipeline was tuned to improve the performance of large language models trained on the data.

It supports text-generation tasks and is useful for researchers and developers building or evaluating large language models.

Dataset structure

Total rows: 52,453,695,892
Columns: 9
Size on disk: 98.4 TB

Subset            Split   Rows
default           train   25,886,364,489
CC-MAIN-2013-20   train   215,280,647
CC-MAIN-2013-48   train   220,361,877
CC-MAIN-2014-10   train   216,533,931
CC-MAIN-2014-15   train   200,396,707
CC-MAIN-2014-23   train   234,422,740
CC-MAIN-2014-35   train   216,436,591
CC-MAIN-2014-41   train   223,672,312
CC-MAIN-2014-42   train   200,827,578
CC-MAIN-2014-49   train   171,461,960
CC-MAIN-2014-52   train   211,679,967
CC-MAIN-2015-06   train   193,273,022

What you can build with fineweb

Pretrain large language models

Use the 18.5T tokens of cleaned English web text as the primary pretraining corpus for building or replicating LLMs from scratch.

Benchmark data filtering pipelines

Compare custom cleaning or deduplication methods against the datatrove-based pipeline used to create FineWeb.
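
As a rough illustration of such a comparison, the sketch below runs a simple exact-match deduplication over a sample of documents and reports the share it removes. This is a stand-in baseline chosen for illustration, not the datatrove pipeline used to create FineWeb; the function names and the normalization rule (lowercasing and collapsing whitespace) are assumptions.

```python
import hashlib


def exact_dedup(texts):
    """Keep the first occurrence of each text, comparing normalized content."""
    seen = set()
    kept = []
    for text in texts:
        # Hypothetical normalization: collapse whitespace and lowercase.
        normalized = " ".join(text.split()).lower()
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept


def removal_rate(before, after):
    """Fraction of documents a filtering step removed."""
    return 1 - len(after) / len(before) if before else 0.0


sample = ["Hello  world", "hello world", "Another page"]
kept = exact_dedup(sample)
rate = removal_rate(sample, kept)
```

Running the same removal_rate measurement on the output of a custom pipeline and on a FineWeb sample gives one simple number to compare, alongside whatever downstream evaluation you use.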

Continued pretraining or domain adaptation

Load subsets of FineWeb to continue pretraining existing English models on fresh web-scale data.
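
One way to work with a subset is to stream a single crawl snapshot and stop once a token budget is reached. The sketch below assumes each record carries a token_count field, as described on the dataset card (the field is not listed on this page), and uses the CC-MAIN-2014-10 subset from the table above as an example. The helper names are hypothetical.

```python
def stream_subset(subset):
    """Open one FineWeb subset as a stream (requires the datasets library)."""
    # Imported lazily so the budget helper below works on any iterable.
    from datasets import load_dataset

    return load_dataset(
        "HuggingFaceFW/fineweb", name=subset, split="train", streaming=True
    )


def take_until_token_budget(documents, max_tokens):
    """Yield documents until adding the next one would exceed max_tokens."""
    used = 0
    for doc in documents:
        count = doc["token_count"]  # assumed field name
        if used + count > max_tokens:
            break
        used += count
        yield doc


def main():
    # Example: roughly one million tokens from a single snapshot.
    stream = stream_subset("CC-MAIN-2014-10")
    for doc in take_until_token_budget(stream, 1_000_000):
        print(doc["token_count"])
```

Call main() to run the example against the hosted dataset; take_until_token_budget itself accepts any iterable of records, so it can be checked on a small in-memory list first.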

Load fineweb

Python
from datasets import load_dataset

# Stream the data rather than downloading the full 98.4 TB to disk.
dataset = load_dataset("HuggingFaceFW/fineweb", streaming=True)

  1. Install the library: pip install datasets
  2. Import the loader: from datasets import load_dataset
  3. Load in streaming mode: dataset = load_dataset('HuggingFaceFW/fineweb', streaming=True)
  4. Access splits or samples via dataset['train']
  5. Iterate over documents in your training loop
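
The last two steps can be sketched as follows. This is a minimal example, assuming each record exposes its content under a "text" field (as on the dataset card; the column names are not listed on this page). The helper names are hypothetical.

```python
from itertools import islice


def open_train_stream():
    """Open the train split in streaming mode (requires the datasets library)."""
    # Imported lazily so take_texts can be used without the library installed.
    from datasets import load_dataset

    dataset = load_dataset("HuggingFaceFW/fineweb", streaming=True)
    return dataset["train"]


def take_texts(documents, n):
    """Return the "text" field of the first n documents from any iterable."""
    return [doc["text"] for doc in islice(documents, n)]


def main():
    # Print the first 200 characters of the first three documents.
    for text in take_texts(open_train_stream(), 3):
        print(text[:200])
```

Call main() to try it against the hosted dataset. Because the stream is consumed lazily, only the records actually read are fetched.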

fineweb: pros & cons

Pros

  • Massive 18.5 trillion token scale
  • Cleaned and deduplicated with datatrove
  • Explicitly optimized for LLM performance
  • Open replication of RefinedWeb

Cons

  • English only
  • At 98.4 TB on disk, the full dataset demands substantial storage, bandwidth, and compute
  • Web-sourced data may contain biases

Frequently asked questions

What is fineweb?

A dataset of more than 18.5 trillion tokens of cleaned and deduplicated English web text from CommonCrawl, created as an open replication of RefinedWeb.
