fineweb-edu
1.3 trillion tokens of educational web pages filtered from FineWeb.
What is fineweb-edu?
FineWeb-Edu is a 1.3-trillion-token subset of FineWeb consisting of web pages scored for educational value by a classifier derived from Llama3-70B-Instruct annotations.
The dataset is intended for training or fine-tuning language models on educational content and is categorized for text-generation workloads.
Data preview
A real sample from the dataset. The dataset has 11 columns; the first six are shown.
| text (string) | id (string) | dump (string) | url (string) | date (string) | file_path (string) |
|---|---|---|---|---|---|
| No. 24; Updated March 2011 Click here to download and print a PDF version of this document. Parents are usually the first to recognize that their child has a problem with emotions or behavior. Still, | <urn:uuid:673b1bf6-2c30-40ae-992b-c387d00a836a> | CC-MAIN-2013-20 | http://aacap.org/page.ww?name=When+to+Seek+Help+for+Your+Child&section=Facts+for+Families | | s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz |
| Previous abstract Next abstract Session 40 - The Interstellar Medium. Display session, Tuesday, June 09 Gamma Ray Burst (GRB) explosions can make kpc-size shells and holes in the interstellar media (I | <urn:uuid:e2300ad5-01dd-4e80-92b3-7ec88785cc9d> | CC-MAIN-2013-20 | http://aas.org/archives/BAAS/v30n2/aas192/abs/S040015.html | | s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz |
| Question: How is bipolar disorder different from unipolar depression or 'regular' depression? Answer: Both bipolar disorder and major depression are typically associated with depressive episodes. So b | <urn:uuid:e6ba92ad-ed0a-4cac-8e5d-204b78cdd250> | CC-MAIN-2013-20 | http://abcnews.go.com/Health/BipolarOverview/story?id=4359993 | | s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz |
| Making the Case for Action This fact sheet(pdf) and slide deck provide essential state-specific information that addresses the economic imperative, the equity imperative, and the expectations imperati | <urn:uuid:3b2c1a91-4f52-464d-ad69-49c1cbadaba8> | CC-MAIN-2013-20 | http://achieve.org/Idaho | | s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz |
| A land whose rich cultural heritage is discovered not only from within the walls of numerous museums, galleries and churches, many of which today, as zero category monuments are included in a part of | <urn:uuid:a69aabbc-f529-4d67-843a-a5c3cb4e8fe0> | CC-MAIN-2013-20 | http://adriatictraveller.com/ru/croatia-essential/heritage.html | | s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz |
Dataset structure
| Subset | Split | Rows |
|---|---|---|
| default | train | 1,525,223,056 |
| CC-MAIN-2013-20 | train | 11,002,672 |
| CC-MAIN-2013-48 | train | 10,797,532 |
| CC-MAIN-2014-10 | train | 10,987,331 |
| CC-MAIN-2014-15 | train | 10,372,732 |
| CC-MAIN-2014-23 | train | 11,739,487 |
| CC-MAIN-2014-35 | train | 11,107,062 |
| CC-MAIN-2014-41 | train | 11,439,517 |
| CC-MAIN-2014-42 | train | 10,668,528 |
| CC-MAIN-2014-49 | train | 9,388,950 |
| CC-MAIN-2014-52 | train | 11,198,125 |
| CC-MAIN-2015-06 | train | 10,328,622 |
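The subset names in the table can be grouped by crawl year before deciding which ones to load. The sketch below assumes the `CC-MAIN-YYYY-WW` suffix follows Common Crawl's year and week naming; `parse_dump` and `subsets_for_year` are hypothetical helper names, not part of any library.

```python
import re
from typing import List, Tuple

# Assumed naming scheme: CC-MAIN-<4-digit year>-<2-digit week>.
_DUMP_PATTERN = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def parse_dump(name: str) -> Tuple[int, int]:
    """Split a crawl subset name into (year, week)."""
    match = _DUMP_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a crawl subset name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def subsets_for_year(names: List[str], year: int) -> List[str]:
    """Keep only the subset names whose crawl year equals `year`."""
    return [name for name in names if parse_dump(name)[0] == year]


# A few subset names copied from the table above.
SUBSETS = ["CC-MAIN-2013-20", "CC-MAIN-2013-48", "CC-MAIN-2014-10", "CC-MAIN-2015-06"]
print(subsets_for_year(SUBSETS, 2013))
```

The `default` subset does not match the pattern, so it should be left out of the list passed to these helpers.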
What you can build with fineweb-edu
Train educational LLMs
Use the 1.3T tokens to pre-train, or continue pre-training, language models that specialize in generating learning materials and explanations.
Build quality-filtered training sets
Select high-scoring subsets to create smaller, cleaner corpora for fine-tuning text-generation models in education.
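One way to carve out a smaller corpus is to keep rows above a score threshold until a size budget is reached. This is a minimal sketch, not the dataset authors' method: the threshold and budget are illustrative, `build_corpus` is a hypothetical helper, and whitespace-separated words stand in as a crude proxy for tokens. It only assumes each row is a dict with `text` and `score` keys.

```python
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


def build_corpus(
    rows: Iterable[Row], min_score: float, max_words: int
) -> Tuple[List[Row], int]:
    """Collect rows with score >= min_score until the word budget would be exceeded.

    Returns the selected rows and the number of words they contain.
    """
    selected: List[Row] = []
    used = 0
    for row in rows:
        if row["score"] < min_score:
            continue  # below the quality threshold, skip
        n_words = len(row["text"].split())  # crude stand-in for a token count
        if used + n_words > max_words:
            break  # adding this row would exceed the budget, stop here
        selected.append(row)
        used += n_words
    return selected, used


# Tiny synthetic rows, only to show the behaviour.
sample = [
    {"text": "a b c", "score": 3.5},
    {"text": "d e", "score": 1.0},
    {"text": "f g", "score": 4.0},
    {"text": "h i j k", "score": 3.2},
]
corpus, words = build_corpus(sample, min_score=3.0, max_words=6)
print(len(corpus), words)
```

Because the function takes any iterable, it should also work on a streamed dataset without loading everything into memory.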
Benchmark educational content classifiers
Leverage the existing quality labels to evaluate or improve new models that score web text for educational value.
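A simple way to compare a new scorer against the existing labels is to measure the average score gap and how often both agree on which pages clear a threshold. The sketch below assumes two equal-length lists of scores; `compare_scores` and the threshold of 3.0 are illustrative choices, not something the dataset prescribes.

```python
from typing import Dict, Sequence


def compare_scores(
    reference: Sequence[float], predicted: Sequence[float], threshold: float
) -> Dict[str, float]:
    """Compare predicted scores with reference scores.

    Returns the mean absolute error, plus precision and recall when a page
    counts as "educational" if its score is >= threshold.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    if not reference:
        raise ValueError("need at least one score pair")

    mae = sum(abs(r - p) for r, p in zip(reference, predicted)) / len(reference)

    tp = fp = fn = 0
    for r, p in zip(reference, predicted):
        ref_pos, pred_pos = r >= threshold, p >= threshold
        if ref_pos and pred_pos:
            tp += 1
        elif pred_pos:
            fp += 1
        elif ref_pos:
            fn += 1

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"mae": mae, "precision": precision, "recall": recall}


# Synthetic scores for illustration only.
report = compare_scores([3.2, 1.0, 4.1, 2.9], [3.0, 2.0, 2.5, 3.4], threshold=3.0)
print(report)
```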
Load fineweb-edu
from datasets import load_dataset
ds = load_dataset("HuggingFaceFW/fineweb-edu")
1. Install the library: pip install datasets
2. Import the loader: from datasets import load_dataset
3. Load the train split: dataset = load_dataset('HuggingFaceFW/fineweb-edu', split='train')
4. Iterate over samples, or pass streaming=True to stream the data at large scale.
5. Filter by the 'score' column to select educational quality tiers.
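The streaming and filtering steps can be combined so that nothing is downloaded up front. The sketch below assumes the `score` column holds a number; the threshold of 3.0, the choice of the `CC-MAIN-2013-20` subset, and the helper names are illustrative. The `datasets` import sits inside `stream_subset` so the filter helper can be tried without the library installed.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator

Row = Dict[str, Any]


def filter_by_score(rows: Iterable[Row], min_score: float) -> Iterator[Row]:
    """Lazily yield rows whose 'score' value is at least min_score."""
    for row in rows:
        if row["score"] >= min_score:
            yield row


def stream_subset(name: str = "CC-MAIN-2013-20") -> Iterable[Row]:
    """Open one crawl subset in streaming mode (requires network access)."""
    from datasets import load_dataset  # needs: pip install datasets

    return load_dataset(
        "HuggingFaceFW/fineweb-edu", name=name, split="train", streaming=True
    )


def preview(min_score: float = 3.0, limit: int = 5) -> None:
    """Print the score and URL of the first few rows that pass the filter."""
    for row in islice(filter_by_score(stream_subset(), min_score), limit):
        print(row["score"], row["url"])
```

Calling `preview()` starts the stream; the functions above do nothing until then, so importing the snippet is cheap.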
fineweb-edu: pros & cons
Pros
- 1.3 trillion tokens of web text
- Educational quality scores from a classifier trained on Llama3-70B-Instruct annotations
- Ready for text-generation NLP tasks
- Loadable directly with the Hugging Face datasets library
Cons
- Extremely large; requires significant storage and compute
- Labels are model-generated and may carry biases
- Contains only web-sourced text
Frequently asked questions
FineWeb-Edu is a 1.3T-token web corpus filtered for educational quality using a classifier trained on Llama3-70B-Instruct annotations of the original FineWeb data.