fineweb-edu

Verified

1.3 trillion tokens of educational web pages filtered from FineWeb.

Dataset | Text & NLP | 476K/mo | Free
Updated 2026-06-15

What is fineweb-edu?

FineWeb-Edu is a 1.3-trillion-token subset of FineWeb. It consists of web pages scored for educational value by a classifier trained on annotations generated by Llama-3-70B-Instruct.

The dataset is intended for training or fine-tuning language models on educational content and is categorized for text-generation workloads.

Data preview

A real sample from the dataset. Six of its 11 columns are shown.

Columns shown (all of type string): text, id, dump, url, date, file_path. Text values are cut off by the preview, and no date value appears in it.

Row 1
  text:      No. 24; Updated March 2011 Click here to download and print a PDF version of this document. Parents are usually the first to recognize that their child has a problem with emotions or behavior. Still,
  id:        <urn:uuid:673b1bf6-2c30-40ae-992b-c387d00a836a>
  dump:      CC-MAIN-2013-20
  url:       http://aacap.org/page.ww?name=When+to+Seek+Help+for+Your+Child&section=Facts+for+Families
  file_path: s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz

Row 2
  text:      Previous abstract Next abstract Session 40 - The Interstellar Medium. Display session, Tuesday, June 09 Gamma Ray Burst (GRB) explosions can make kpc-size shells and holes in the interstellar media (I
  id:        <urn:uuid:e2300ad5-01dd-4e80-92b3-7ec88785cc9d>
  dump:      CC-MAIN-2013-20
  url:       http://aas.org/archives/BAAS/v30n2/aas192/abs/S040015.html
  file_path: s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz

Row 3
  text:      Question: How is bipolar disorder different from unipolar depression or 'regular' depression? Answer: Both bipolar disorder and major depression are typically associated with depressive episodes. So b
  id:        <urn:uuid:e6ba92ad-ed0a-4cac-8e5d-204b78cdd250>
  dump:      CC-MAIN-2013-20
  url:       http://abcnews.go.com/Health/BipolarOverview/story?id=4359993
  file_path: s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz

Row 4
  text:      Making the Case for Action This fact sheet(pdf) and slide deck provide essential state-specific information that addresses the economic imperative, the equity imperative, and the expectations imperati
  id:        <urn:uuid:3b2c1a91-4f52-464d-ad69-49c1cbadaba8>
  dump:      CC-MAIN-2013-20
  url:       http://achieve.org/Idaho
  file_path: s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz

Row 5
  text:      A land whose rich cultural heritage is discovered not only from within the walls of numerous museums, galleries and churches, many of which today, as zero category monuments are included in a part of
  id:        <urn:uuid:a69aabbc-f529-4d67-843a-a5c3cb4e8fe0>
  dump:      CC-MAIN-2013-20
  url:       http://adriatictraveller.com/ru/croatia-essential/heritage.html
  file_path: s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz

Dataset structure

Total rows: 3,496,736,741
Columns: 11
Size on disk: 9.4 TB

Subset            Split   Rows
default           train   1,525,223,056
CC-MAIN-2013-20   train   11,002,672
CC-MAIN-2013-48   train   10,797,532
CC-MAIN-2014-10   train   10,987,331
CC-MAIN-2014-15   train   10,372,732
CC-MAIN-2014-23   train   11,739,487
CC-MAIN-2014-35   train   11,107,062
CC-MAIN-2014-41   train   11,439,517
CC-MAIN-2014-42   train   10,668,528
CC-MAIN-2014-49   train   9,388,950
CC-MAIN-2014-52   train   11,198,125
CC-MAIN-2015-06   train   10,328,622
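
For a rough sense of scale, the row, size, and token figures in this listing can be turned into per-row averages. The sketch below rests on two assumptions that the listing does not state: that 9.4 TB means 9.4 × 10^12 bytes covering all 3,496,736,741 rows, and that the 1.3 trillion tokens correspond to the 1,525,223,056-row default subset. Treat the results as order-of-magnitude estimates only.

```python
# Figures copied from the listing. How they pair up is an assumption:
# size on disk is divided by the total row count, and the token count
# is divided by the row count of the default subset.
TOTAL_ROWS = 3_496_736_741
SIZE_ON_DISK_BYTES = 9.4e12      # assumes decimal terabytes
DEFAULT_SUBSET_ROWS = 1_525_223_056
TOTAL_TOKENS = 1.3e12


def per_row_averages():
    """Return (average bytes per row, average tokens per row)."""
    bytes_per_row = SIZE_ON_DISK_BYTES / TOTAL_ROWS
    tokens_per_row = TOTAL_TOKENS / DEFAULT_SUBSET_ROWS
    return bytes_per_row, tokens_per_row


bytes_per_row, tokens_per_row = per_row_averages()
print(f"~{bytes_per_row:,.0f} bytes per row, ~{tokens_per_row:,.0f} tokens per row")
```

Under those assumptions a row averages a few kilobytes on disk and several hundred tokens, which helps when budgeting storage for a single crawl subset rather than the whole corpus.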

What you can build with fineweb-edu

Train educational LLMs

Use the 1.3T tokens to pre-train language models, or to continue pre-training existing ones, so that they specialize in generating learning materials and explanations.

Build quality-filtered training sets

Select high-scoring subsets to create smaller, cleaner corpora for fine-tuning text-generation models in education.

Benchmark educational content classifiers

Use the existing quality scores as reference labels to evaluate or improve new models that score web text for educational value.
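
One way to run such a comparison is sketched below. It takes pairs of (reference score, predicted score) and reports the mean absolute error plus how often both scores fall on the same side of a cut-off. The function name `compare_scores` and the default threshold of 3.0 are illustrative assumptions, not part of the dataset or of any library.

```python
from typing import Dict, Iterable, Tuple


def compare_scores(
    pairs: Iterable[Tuple[float, float]],
    threshold: float = 3.0,  # hypothetical cut-off for "educational"
) -> Dict[str, float]:
    """Compare a new classifier's scores against the dataset's scores.

    Each pair is (reference_score, predicted_score).
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (reference, predicted) pair")

    n = len(pairs)
    # Average distance between the two scores.
    mae = sum(abs(ref - pred) for ref, pred in pairs) / n
    # Fraction of pairs where both scores land on the same side of the cut-off.
    agreement = sum(
        (ref >= threshold) == (pred >= threshold) for ref, pred in pairs
    ) / n
    return {"mae": mae, "agreement": agreement}


if __name__ == "__main__":
    sample = [(3.0, 3.5), (4.0, 2.0), (2.5, 2.5)]
    print(compare_scores(sample))
```

Because the reference scores are themselves model-generated, high agreement shows consistency with the existing classifier rather than correctness against human judgment.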

Load fineweb-edu

Python
from datasets import load_dataset

# Without streaming=True, this downloads the dataset before returning.
ds = load_dataset("HuggingFaceFW/fineweb-edu")
  1. pip install datasets
  2. from datasets import load_dataset
  3. dataset = load_dataset('HuggingFaceFW/fineweb-edu', split='train')
  4. Iterate over samples, or pass streaming=True to stream at large scale
  5. Filter by the 'score' column to select educational quality tiers
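
Steps 3 to 5 can be combined into a short streaming pipeline. The following is a minimal sketch, assuming the 'score' column holds a numeric value and using the CC-MAIN-2013-20 subset from the table in this listing. The helper names (`keep_high_scoring`, `stream_subset`, `preview`) and the 3.5 threshold are hypothetical choices for illustration.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator


def keep_high_scoring(
    records: Iterable[Dict[str, Any]], min_score: float
) -> Iterator[Dict[str, Any]]:
    """Yield only records whose 'score' is at least min_score."""
    for record in records:
        score = record.get("score")
        if score is not None and score >= min_score:
            yield record


def stream_subset(subset: str = "CC-MAIN-2013-20"):
    """Open one subset in streaming mode so rows are fetched lazily."""
    # Imported here so the filter above works without the library installed.
    from datasets import load_dataset

    return load_dataset(
        "HuggingFaceFW/fineweb-edu", name=subset, split="train", streaming=True
    )


def preview(min_score: float = 3.5, limit: int = 5) -> None:
    """Print the score and URL of the first few rows that pass the filter."""
    rows = keep_high_scoring(stream_subset(), min_score)
    for record in islice(rows, limit):
        print(record["score"], record["url"])
```

Calling `preview()` needs network access and the `datasets` package. The filter itself is a plain generator, so it can be tried on any iterable of dictionaries first.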

fineweb-edu: pros & cons

Pros

  • 1.3 trillion tokens of web text
  • Educational quality labels from Llama-3-70B
  • Ready for text-generation NLP tasks
  • Directly loadable via Hugging Face datasets

Cons

  • Extremely large; requires significant storage/compute
  • Labels are model-generated and may carry biases
  • Contains only web-sourced text

Frequently asked questions

What is fineweb-edu?

A 1.3T-token web corpus filtered for educational quality using a classifier trained on Llama-3-70B annotations of the original FineWeb data.
