essential-web-v1.0
24-trillion-token web dataset with metadata for custom curation.
What is essential-web-v1.0?
Essential-Web v1.0 is a web dataset of 24 trillion tokens across 23.6 billion documents, each annotated with document-level metadata.
It is useful for researchers who need to filter and curate custom training datasets based on classification, page type, complexity, or quality attributes.
What you can build with essential-web-v1.0
Pretrain large language models
Use the full 24T tokens or filtered high-quality subsets to train or continue-pretrain foundation models at scale.
Create domain-specific corpora
Filter by subject classification and quality scores to build specialized datasets for code, science, or legal model training.
Benchmark data filtering pipelines
Leverage built-in metadata on complexity and web page type to test and compare custom quality filtering strategies.
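One way to compare filtering strategies is to measure how much of a fixed sample each one keeps. The sketch below is a minimal illustration on made-up rows. The keys `page_type` and `complexity` and their values are assumptions for the example and are not the dataset's confirmed schema.

```python
def keep_rate(rows, predicate):
    """Fraction of rows for which predicate(row) is true."""
    rows = list(rows)
    if not rows:
        return 0.0
    return sum(1 for row in rows if predicate(row)) / len(rows)


def compare_filters(rows, filters):
    """Map each filter name to its keep rate on the same sample."""
    rows = list(rows)
    return {name: keep_rate(rows, pred) for name, pred in filters.items()}


# Toy sample with hypothetical field names and values (not real data).
sample = [
    {"page_type": "article", "complexity": 3},
    {"page_type": "forum", "complexity": 1},
    {"page_type": "article", "complexity": 1},
    {"page_type": "product", "complexity": 2},
]

filters = {
    "articles_only": lambda r: r["page_type"] == "article",
    "complex_text": lambda r: r["complexity"] >= 2,
    "both": lambda r: r["page_type"] == "article" and r["complexity"] >= 2,
}

report = compare_filters(sample, filters)
```

In practice the sample would be a few thousand streamed rows rather than a hand-written list, and the keep rate would be only one of several signals to compare.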
Load essential-web-v1.0
from datasets import load_dataset
ds = load_dataset("EssentialAI/essential-web-v1.0")
1. pip install datasets
2. from datasets import load_dataset
3. ds = load_dataset('EssentialAI/essential-web-v1.0', split='train', streaming=True)
4. Filter rows using the subject, quality_score, and complexity metadata fields.
5. Save filtered subsets locally with save_to_disk(). A streamed dataset generally has to be collected into a regular Dataset first.
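The steps above can be sketched as follows. This is a minimal example under assumptions, not a reference pipeline. The keys `subject` and `quality_score` come from the step list, and the published schema may name or nest them differently. The helper names, the threshold, and the output directory are hypothetical.

```python
from itertools import islice


def keep_row(row, subject="science", min_quality=0.8):
    """Return True when a row matches the wanted subject and quality floor.

    The keys "subject" and "quality_score" are assumptions taken from the
    step list; check the dataset card for the actual field names.
    """
    score = row.get("quality_score")
    return row.get("subject") == subject and score is not None and score >= min_quality


def stream_subset(limit=1000, out_dir="essential_web_subset"):
    """Stream the dataset, keep up to `limit` matching rows, save them locally."""
    # Imported here so keep_row stays usable without the library installed.
    from datasets import Dataset, load_dataset

    ds = load_dataset("EssentialAI/essential-web-v1.0", split="train", streaming=True)
    # filter() on a streamed dataset is lazy; islice stops after `limit` matches.
    rows = list(islice(ds.filter(keep_row), limit))
    if rows:
        # Materialize the kept rows into a regular Dataset before saving.
        Dataset.from_list(rows).save_to_disk(out_dir)
    return len(rows)
```

Calling `stream_subset(limit=100)` needs network access and can run for a long time if few rows match, since streaming continues until the limit is reached.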
essential-web-v1.0: pros & cons
Pros
- Extremely large scale (24T tokens)
- Rich per-document metadata for filtering
- Supports streaming to avoid full download
- Designed for creating specialized subsets
Cons
- Full dataset too large for most local machines
- Requires significant bandwidth and storage even for subsets
- Web-sourced data may contain biases or low-quality text
Frequently asked questions
What is essential-web-v1.0?
A 24-trillion-token web dataset with 23.6 billion documents, each annotated with subject, page type, complexity, and quality metadata.