Is C4 free to use?

Yes, it is publicly available through the Hugging Face Hub at no cost.

How do I access the dataset?

Load it directly with the Hugging Face datasets library, passing a configuration name, for example load_dataset('allenai/c4', 'en').

What license applies to C4?

C4 is released under the ODC-BY license, and the underlying content is also subject to the Common Crawl terms of use; users should review both before using web-scraped content.

c4 — Free Dataset Docs, Examples & Alternatives (2026)

What is c4?

C4 (Colossal Clean Crawled Corpus) is a cleaned web-crawl corpus built from Common Crawl. It is offered in English variants (en, en.noclean, en.noblocklist, realnewslike) and a multilingual variant (mC4), each with a different size.

It is useful for researchers and developers training large language models on diverse web text for generation and masking objectives.

What you can build with c4

Pretrain language models

Use the en or realnewslike variants to pretrain transformer models like BERT or GPT-style architectures on massive web text for downstream NLP tasks.
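For GPT-style pretraining, documents are often concatenated and cut into fixed-length blocks. The sketch below illustrates that packing step only; it is not taken from the C4 documentation. The names pack_into_blocks and c4_token_stream are hypothetical, and the tokenizer argument is assumed to behave like a Hugging Face tokenizer (callable on a string, returning a mapping with "input_ids").

```python
from typing import Iterable, Iterator, List


def pack_into_blocks(
    tokenized_docs: Iterable[List[int]],
    block_size: int,
    eos_id: int,
) -> Iterator[List[int]]:
    """Concatenate tokenized documents, separated by eos_id, and yield
    fixed-length blocks. A trailing partial block is dropped."""
    buffer: List[int] = []
    for ids in tokenized_docs:
        buffer.extend(ids)
        buffer.append(eos_id)  # mark the document boundary
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]


def c4_token_stream(tokenizer, variant: str = "en") -> Iterator[List[int]]:
    """Yield token ids for each C4 document (hypothetical helper).

    Assumes network access and a Hugging Face style tokenizer.
    """
    from datasets import load_dataset  # imported lazily; only needed here

    stream = load_dataset("allenai/c4", variant, split="train", streaming=True)
    for example in stream:
        yield tokenizer(example["text"])["input_ids"]
```

With a real tokenizer, the two pieces would be combined as pack_into_blocks(c4_token_stream(tokenizer), block_size, tokenizer.eos_token_id); the block size is a modeling choice, not something C4 prescribes.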

Train text generation systems

Train on the multilingual mC4 variant to build generation models that handle the many languages found in Common Crawl sources.

Benchmark data cleaning pipelines

Compare model performance across en.noclean and en.noblocklist variants to evaluate the impact of filtering heuristics on training data quality.
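Before training on each variant, it can help to measure how much text a given heuristic removes. The sketch below is illustrative only: the two filters are simplified stand-ins, not the actual C4 cleaning rules, and TOY_BLOCKLIST is a hypothetical placeholder rather than the real word list.

```python
import re
from typing import Callable, Dict, Iterable

TERMINAL_PUNCTUATION = (".", "!", "?", '"')
# Hypothetical placeholder; the real blocklist is not reproduced here.
TOY_BLOCKLIST = {"badword"}


def ends_with_punctuation(text: str) -> bool:
    """Keep a document only if every non-empty line ends in terminal punctuation."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return bool(lines) and all(line.endswith(TERMINAL_PUNCTUATION) for line in lines)


def passes_blocklist(text: str) -> bool:
    """Keep a document only if it contains no blocklisted word."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return words.isdisjoint(TOY_BLOCKLIST)


def retention_rates(
    docs: Iterable[str],
    filters: Dict[str, Callable[[str], bool]],
) -> Dict[str, float]:
    """Return the fraction of documents each filter keeps."""
    docs = list(docs)
    if not docs:
        return {name: 0.0 for name in filters}
    return {
        name: sum(1 for doc in docs if keep(doc)) / len(docs)
        for name, keep in filters.items()
    }
```

Running retention_rates over a sample drawn from en.noclean gives a rough view of how aggressive each heuristic is before any model is trained.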

Load c4

Python

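# Install the library first: pip install datasets
from datasets import load_dataset

# C4 has several configurations, so a name such as "en" must be passed.
# streaming=True reads examples lazily instead of downloading the whole variant.
dataset = load_dataset("allenai/c4", "en", streaming=True)

# Access splits with dataset["train"] or dataset["validation"].
train = dataset["train"]

# Iterate over examples for tokenization and training loops.
for example in train:
    print(example["text"][:200])  # each example also has "url" and "timestamp"
    break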
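When streaming is not an option, one way to avoid a full download is to request only a few shard files. This is a minimal sketch: the file naming pattern and the default of 1024 training shards are assumptions about the repository layout and should be checked against the dataset card. The helpers shard_paths and load_first_shards are hypothetical names.

```python
from typing import List


def shard_paths(variant: str, split: str, count: int, total: int) -> List[str]:
    """Build relative paths for the first `count` shards of a variant.

    The naming pattern is an assumption about the repository layout.
    """
    if not 0 < count <= total:
        raise ValueError("count must be between 1 and total")
    return [
        f"{variant}/c4-{split}.{i:05d}-of-{total:05d}.json.gz"
        for i in range(count)
    ]


def load_first_shards(
    variant: str = "en",
    split: str = "train",
    count: int = 2,
    total: int = 1024,  # assumed shard count for the en training split
):
    """Download and load only the first few shards (needs network access)."""
    from datasets import load_dataset  # imported lazily; only needed here

    files = shard_paths(variant, split, count, total)
    return load_dataset("allenai/c4", data_files={split: files}, split=split)
```

Calling load_first_shards(count=1) would fetch a single shard, which is usually enough for smoke-testing a tokenization or training loop.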

c4: pros & cons

Pros

+ Massive scale with variants up to 9.7TB
+ Multiple filtered versions for different needs
+ Directly loadable via Hugging Face datasets
+ Covers text-generation and fill-mask use cases

Cons

– Extremely large downloads require significant storage
– Web-sourced data can contain noise and biases
– English-dominant with limited non-English coverage outside mC4

Frequently asked questions

What is the C4 dataset?

A processed version of Google's C4 corpus derived from Common Crawl, offered in five variants including cleaned English and multilingual versions.

Similar datasets

Other text and NLP datasets worth comparing.

KakologArchives

Text & NLP · KakologArchives

Verified

Archive of 11 years of Nico Nico Jikkyo live commentary logs.

Dataset · ↓ 1.8M · Free

wikitext

Text & NLP · Salesforce

Verified

Over 100 million tokens from Wikipedia for language modeling benchmarks.

Dataset · ↓ 1.3M · Free

gsm8k

Text & NLP · openai

Verified

8.5K grade school math word problems requiring multi-step arithmetic reasoning.

Dataset · ↓ 901K · Free
