essential-web-v1.0

Verified

24-trillion-token web dataset with metadata for custom curation.

Dataset · AI & Machine Learning · 293K/mo · Free
Updated 2026-06-15

What is essential-web-v1.0?

Essential-Web v1.0 is a web dataset of 24 trillion tokens across 23.6 billion documents, each annotated with document-level metadata.

It is useful for researchers who need to filter and curate custom training datasets based on classification, page type, complexity, or quality attributes.

What you can build with essential-web-v1.0

Pretrain large language models

Use the full 24T tokens or filtered high-quality subsets to train or continue-pretrain foundation models at scale.

Create domain-specific corpora

Filter by subject classification and quality scores to build specialized datasets for code, science, or legal model training.
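As a rough illustration of this kind of curation, the sketch below routes rows into per-domain buckets. The field names `subject` and `quality_score`, the label values, and the 0.5 threshold are assumptions made for the example, not the confirmed schema; check the dataset card for the real column names.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Mapping

# Domains named in the text; the exact label strings are hypothetical.
DOMAINS = {"code", "science", "legal"}


def bucket_by_domain(
    rows: Iterable[Mapping[str, Any]], min_quality: float = 0.5
) -> Dict[str, List[Mapping[str, Any]]]:
    """Group rows by subject, keeping only rows at or above min_quality."""
    buckets: Dict[str, List[Mapping[str, Any]]] = defaultdict(list)
    for row in rows:
        subject = row.get("subject")              # assumed field name
        quality = row.get("quality_score") or 0.0  # assumed numeric field
        if subject in DOMAINS and quality >= min_quality:
            buckets[subject].append(row)
    return dict(buckets)
```

Because the function accepts any iterable of dict-like rows, it should work on a small in-memory sample as well as on a streamed dataset.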

Benchmark data filtering pipelines

Leverage built-in metadata on complexity and web page type to test and compare custom quality filtering strategies.
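One simple way to compare filtering strategies is to measure what fraction of a sample each one keeps. The sketch below assumes hypothetical `quality_score` and `complexity` fields with numeric values, and the thresholds are arbitrary placeholders.

```python
from typing import Any, Callable, Dict, Iterable, Mapping

Row = Mapping[str, Any]
Strategy = Callable[[Row], bool]


def retention_rates(
    rows: Iterable[Row], strategies: Mapping[str, Strategy]
) -> Dict[str, float]:
    """Return the fraction of sampled rows that each strategy keeps."""
    sample = list(rows)
    if not sample:
        return {name: 0.0 for name in strategies}
    return {
        name: sum(1 for row in sample if keep(row)) / len(sample)
        for name, keep in strategies.items()
    }


# Example strategies; field names and thresholds are assumptions.
STRATEGIES: Dict[str, Strategy] = {
    "quality_only": lambda row: (row.get("quality_score") or 0.0) >= 0.8,
    "quality_and_complexity": lambda row: (
        (row.get("quality_score") or 0.0) >= 0.8
        and (row.get("complexity") or 0) >= 2
    ),
}
```

Retention rate only says how aggressive a filter is; judging the quality of what it keeps would still need a downstream evaluation.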

Load essential-web-v1.0

Python
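from datasets import load_dataset

# Stream rows on demand instead of downloading the full dataset up front.
ds = load_dataset("EssentialAI/essential-web-v1.0", split="train", streaming=True)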
  1. pip install datasets
  2. from datasets import load_dataset
  3. ds = load_dataset('EssentialAI/essential-web-v1.0', split='train', streaming=True)
  4. Filter rows using the subject, quality_score, and complexity metadata fields
  5. Save filtered subsets locally. A streamed dataset is iterated lazily, so first collect the rows you want into a regular Dataset (for example with Dataset.from_list()), then call save_to_disk()
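The steps above can be sketched end to end as follows. This is a minimal sketch under some assumptions: the field names `subject` and `quality_score` are taken from step 4 and may not match the actual schema, and the `"science"` label, the 0.8 threshold, and the helper names are hypothetical. The filtered rows are collected into a regular `Dataset` before saving.

```python
from itertools import islice
from typing import Any, Iterable, List, Mapping

# Assumed field names from the steps above; verify against the dataset card.
SUBJECT_FIELD = "subject"
QUALITY_FIELD = "quality_score"


def keep_row(row: Mapping[str, Any], subject: str, min_quality: float) -> bool:
    """True if the row matches the subject and meets the quality threshold."""
    quality = row.get(QUALITY_FIELD) or 0.0
    return row.get(SUBJECT_FIELD) == subject and quality >= min_quality


def take_filtered(
    rows: Iterable[Mapping[str, Any]], subject: str, min_quality: float, limit: int
) -> List[Mapping[str, Any]]:
    """Lazily filter rows and stop after `limit` matches."""
    matches = (row for row in rows if keep_row(row, subject, min_quality))
    return list(islice(matches, limit))


def stream_and_save(out_dir: str, subject: str = "science",
                    min_quality: float = 0.8, limit: int = 1000) -> None:
    """Stream the dataset, keep up to `limit` matching rows, and save them."""
    # Imported here so the helpers above work without the library installed.
    from datasets import Dataset, load_dataset

    stream = load_dataset(
        "EssentialAI/essential-web-v1.0", split="train", streaming=True
    )
    rows = take_filtered(stream, subject, min_quality, limit)
    # Materialize the collected rows as a regular Dataset, then write to disk.
    Dataset.from_list([dict(row) for row in rows]).save_to_disk(out_dir)
```

Calling `stream_and_save("essential_web_science")` would then write the subset to a local directory; the `limit` argument caps how much of the stream is held in memory.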

essential-web-v1.0: pros & cons

Pros

  • Extremely large scale (24T tokens)
  • Rich per-document metadata for filtering
  • Supports streaming to avoid full download
  • Designed for creating specialized subsets

Cons

  • Full dataset too large for most local machines
  • Requires significant bandwidth and storage even for subsets
  • Web-sourced data may contain biases or low-quality text

Frequently asked questions

A 24-trillion-token web dataset with 23.6 billion documents, each annotated with subject, page type, complexity, and quality metadata.
