Is FineWeb free to use?

Yes, it is hosted publicly on the Hugging Face Hub and accessible via the datasets library at no cost.

How do I access FineWeb?

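Load it with the Hugging Face datasets library using load_dataset('HuggingFaceFW/fineweb'). Because the full dataset runs to tens of terabytes, pass streaming=True or pick a single subset rather than downloading everything.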

What is the license for FineWeb?

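FineWeb is released under the Open Data Commons Attribution License (ODC-By) v1.0, and its use is also subject to CommonCrawl's Terms of Use. Check the dataset card on the Hugging Face Hub for the current license and usage terms.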

fineweb — Free Dataset Docs, Examples & Alternatives (2026)

What is fineweb?

FineWeb consists of more than 18.5 trillion tokens of cleaned and deduplicated English web data extracted from CommonCrawl. The processing pipeline ran on the datatrove library and was tuned to improve the performance of large language models trained on the data.

It supports text-generation tasks and is useful for researchers and developers building or evaluating large language models.

Dataset structure

Total rows: 52,453,695,892

Columns: 9

Size on disk: 98.4 TB

Subset	Split	Rows
default	train	25,886,364,489
CC-MAIN-2013-20	train	215,280,647
CC-MAIN-2013-48	train	220,361,877
CC-MAIN-2014-10	train	216,533,931
CC-MAIN-2014-15	train	200,396,707
CC-MAIN-2014-23	train	234,422,740
CC-MAIN-2014-35	train	216,436,591
CC-MAIN-2014-41	train	223,672,312
CC-MAIN-2014-42	train	200,827,578
CC-MAIN-2014-49	train	171,461,960
CC-MAIN-2014-52	train	211,679,967
CC-MAIN-2015-06	train	193,273,022
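Each crawl in the table is listed as its own subset, so a single snapshot can be loaded by name instead of the full default configuration. The following is a minimal sketch, assuming the subset names are passed as the `name` argument of `load_dataset` and that streaming is used to avoid a full download. The helper names `crawl_kwargs` and `stream_crawl` are hypothetical and not part of any library.

```python
import re

# Subset names in the table follow the CommonCrawl snapshot pattern CC-MAIN-YYYY-WW.
CRAWL_PATTERN = re.compile(r"^CC-MAIN-\d{4}-\d{2}$")


def crawl_kwargs(crawl: str) -> dict:
    """Build load_dataset keyword arguments for one crawl subset (hypothetical helper)."""
    if not CRAWL_PATTERN.match(crawl):
        raise ValueError(f"unexpected subset name: {crawl!r}")
    return {
        "path": "HuggingFaceFW/fineweb",
        "name": crawl,
        "split": "train",
        "streaming": True,  # read records lazily instead of downloading the subset
    }


def stream_crawl(crawl: str):
    """Open one crawl subset as a streaming dataset (requires network access)."""
    # Imported lazily so crawl_kwargs can be used without the datasets package.
    from datasets import load_dataset

    return load_dataset(**crawl_kwargs(crawl))
```

For example, `stream_crawl("CC-MAIN-2013-20")` would open the smallest-numbered snapshot in the table without touching the other subsets.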

What you can build with fineweb

Pretrain large language models

Use the 18.5T tokens of cleaned English web text as the primary pretraining corpus for building or replicating LLMs from scratch.

Benchmark data filtering pipelines

Compare custom cleaning or deduplication methods against the datatrove-based pipeline used to create FineWeb.
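As a starting point for such a comparison, a toy baseline might combine a length filter with exact deduplication. This sketch is not the datatrove pipeline used for FineWeb. It assumes each record is a dict with a `text` field, and the function names and the `min_words` threshold are illustrative choices.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def filter_and_dedup(records, min_words: int = 5):
    """Yield records that pass a length filter and are not exact duplicates."""
    seen = set()
    for record in records:
        norm = normalize(record["text"])
        if len(norm.split()) < min_words:
            continue  # drop documents shorter than the threshold
        digest = hashlib.sha1(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # drop exact duplicates after normalization
        seen.add(digest)
        yield record
```

Running the same sample of raw documents through this baseline and through a custom pipeline gives a rough, like-for-like view of how many documents each one keeps.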

Continued pretraining or domain adaptation

Load subsets of FineWeb to continue pretraining existing English models on fresh web-scale data.
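For continued pretraining it is often enough to draw a fixed token budget from one subset. The sketch below assumes each record carries an integer `token_count` field, as described on the dataset card; `take_token_budget` is a hypothetical helper, and the budget value is up to the caller.

```python
def take_token_budget(records, budget: int):
    """Collect documents in order until adding one more would exceed the budget."""
    selected = []
    total = 0
    for record in records:
        if total + record["token_count"] > budget:
            break  # stop before the budget is exceeded
        selected.append(record)
        total += record["token_count"]
    return selected, total
```

Because the function only iterates once over its input, it works with a streaming dataset as well as with an in-memory list.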

Load fineweb

Python

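from datasets import load_dataset

# Stream the train split; a plain load_dataset call would download the full dataset.
ds = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)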

1. pip install datasets
2. from datasets import load_dataset
3. dataset = load_dataset('HuggingFaceFW/fineweb', streaming=True)
4. Access splits or samples via dataset['train']
5. Iterate over documents for training loops
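The steps above can be sketched as a pair of small helpers. This is an illustration under stated assumptions: the sample configuration name `sample-10BT` and the `text` field come from the dataset card rather than from this page, and `open_stream` and `first_texts` are hypothetical names.

```python
from itertools import islice


def open_stream(name: str = "sample-10BT"):
    """Open a FineWeb configuration in streaming mode (requires network access)."""
    # Imported lazily so first_texts can be used without the datasets package.
    from datasets import load_dataset

    return load_dataset(
        "HuggingFaceFW/fineweb", name=name, split="train", streaming=True
    )


def first_texts(records, n: int):
    """Return the text of the first n records from any iterable of dict records."""
    return [record["text"] for record in islice(records, n)]
```

Calling `first_texts(open_stream(), 3)` would fetch three documents without downloading the rest of the configuration, which is a quick way to inspect the data before wiring it into a training loop.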

fineweb: pros & cons

Pros

+ Massive 18.5 trillion token scale
+ Cleaned and deduplicated with datatrove
+ Explicitly optimized for LLM performance
+ Open replication of RefinedWeb

Cons

- English only
- Extremely large size requires heavy resources
- Web-sourced data may contain biases

Frequently asked questions

What is FineWeb?

A dataset of over 18.5 trillion tokens of cleaned English web text from CommonCrawl, created as an open replication of RefinedWeb.

Similar datasets

Other text & NLP options worth comparing.

KakologArchives

Text & NLP · KakologArchives

Verified

Archive of 11 years of Nico Nico Jikkyo live commentary logs.

Dataset · 1.8M downloads · Free

wikitext

Text & NLP · Salesforce

Verified

Over 100 million tokens from Wikipedia for language modeling benchmarks.

Dataset · 1.3M downloads · Free

gsm8k

Text & NLP · openai

Verified

8.5K grade school math word problems requiring multi-step arithmetic reasoning.

Dataset · 901K downloads · Free
