
c4

Verified

Cleaned Common Crawl corpus with multiple language variants for NLP training.

Dataset | Text & NLP | 827K/mo | Free
Updated 2026-06-15

What is c4?

C4 (Colossal Clean Crawled Corpus) is a cleaned web-text corpus built from Common Crawl, offered in English and multilingual variants of different sizes.

It is useful for researchers and developers training large language models on diverse web text for generation and masking objectives.

What you can build with c4

Pretrain language models

Use the en or realnewslike variants to pretrain transformer models like BERT or GPT-style architectures on massive web text for downstream NLP tasks.
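As a rough illustration of the fill-mask objective mentioned above, here is a minimal sketch of token masking. It assumes whitespace tokenization and a flat 15% masking rate; the function name mask_tokens and the "[MASK]" string are illustrative choices, and real BERT pretraining uses a subword tokenizer plus an 80/10/10 replacement scheme that this sketch leaves out.

```python
import random

MASK = "[MASK]"  # illustrative placeholder; real tokenizers define their own mask token


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (inputs, labels) for a simplified masked-language-model objective.

    Each token is independently replaced by MASK with probability mask_prob.
    labels holds the original token at masked positions and None elsewhere.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels
```

A call such as mask_tokens(example["text"].split()) would turn one C4 document into a training pair; the model is then trained to predict the labels at the masked positions.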

Train text generation systems

Train or fine-tune on the multilingual (mC4) variant to build generation models that handle the many languages found in Common Crawl.

Benchmark data cleaning pipelines

Compare model performance across en.noclean and en.noblocklist variants to evaluate the impact of filtering heuristics on training data quality.
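Before training anything, a cheap first look at how two variants differ is to compare simple statistics over a sample of each. The sketch below is one possible approach, not an established benchmark: corpus_stats and flagged_terms are hypothetical names, and the term list is whatever the reader supplies. It only assumes each example is a dict with a "text" field, as C4 records are.

```python
def corpus_stats(examples, flagged_terms=frozenset()):
    """Summarize an iterable of {'text': ...} records.

    Returns the document count, mean whitespace-token count per document,
    and the fraction of documents containing at least one flagged term.
    """
    flagged_terms = frozenset(t.lower() for t in flagged_terms)
    n_docs = n_words = n_flagged = 0
    for example in examples:
        words = example["text"].lower().split()
        n_docs += 1
        n_words += len(words)
        if flagged_terms.intersection(words):
            n_flagged += 1
    if n_docs == 0:
        return {"docs": 0, "mean_words": 0.0, "flagged_fraction": 0.0}
    return {
        "docs": n_docs,
        "mean_words": n_words / n_docs,
        "flagged_fraction": n_flagged / n_docs,
    }
```

Running this over the first few thousand records of en.noclean and en.noblocklist gives numbers that can be set side by side; it complements, and does not replace, comparing downstream model performance.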

Load c4

Python
from datasets import load_dataset

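# Pick a variant explicitly (here 'en'); streaming avoids downloading the full corpus
ds = load_dataset("allenai/c4", "en", streaming=True)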
  1. pip install datasets
  2. from datasets import load_dataset
  3. dataset = load_dataset('allenai/c4', 'en', streaming=True)
  4. Access splits with dataset['train'] or dataset['validation']
  5. Iterate over examples for tokenization and training loops
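The last two steps can be sketched as follows. This is a minimal example under stated assumptions: stream_c4, first_texts, and whitespace_token_count are hypothetical helper names, and the whitespace count stands in for a real tokenizer. The only library call used is load_dataset with streaming=True, and each record is assumed to carry a "text" field.

```python
from itertools import islice
from typing import Iterable, Iterator


def stream_c4(variant: str = "en", split: str = "train"):
    """Open one C4 variant lazily; records are fetched as they are iterated."""
    # Imported here so the helpers below can be used without the library installed.
    from datasets import load_dataset

    return load_dataset("allenai/c4", variant, split=split, streaming=True)


def first_texts(examples: Iterable[dict], n: int) -> Iterator[str]:
    """Yield the 'text' field of the first n examples."""
    for example in islice(examples, n):
        yield example["text"]


def whitespace_token_count(text: str) -> int:
    """Stand-in for a real tokenizer: count whitespace-separated tokens."""
    return len(text.split())
```

For instance, looping over first_texts(stream_c4("en"), 3) and printing whitespace_token_count(text) inspects three documents without storing the dataset locally. In a real training loop the stand-in would be replaced by the tokenizer of the model being trained.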

c4: pros & cons

Pros

  • +Massive scale with variants up to 9.7TB
  • +Multiple filtered versions for different needs
  • +Directly loadable via Hugging Face datasets
  • +Covers text-generation and fill-mask use cases

Cons

  • Extremely large downloads require significant storage
  • Web-sourced data can contain noise and biases
  • English-dominant with limited non-English coverage outside mC4

Frequently asked questions

A processed version of Google's C4 corpus derived from Common Crawl, offered in five variants including cleaned English and multilingual versions.
