c4
Cleaned Common Crawl corpus with multiple language variants for NLP training.
What is c4?
C4 (Colossal Clean Crawled Corpus) is a cleaned web-text corpus built from Common Crawl. It is offered in English and multilingual variants, each with a documented size.
It is useful for researchers and developers training large language models on diverse web text for generation and masking objectives.
What you can build with c4
Pretrain language models
Use the en or realnewslike variants to pretrain transformer models like BERT or GPT-style architectures on massive web text for downstream NLP tasks.
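As a rough illustration of the pretraining data path, the sketch below packs documents into fixed-length token blocks, a common preprocessing step for GPT-style training. The whitespace tokenizer, the `pack_into_blocks` helper, the `<eos>` marker, and the block size are illustrative assumptions, not part of c4 or the datasets library; a real run would swap in a trained subword tokenizer.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def whitespace_tokenize(text: str) -> List[str]:
    # Stand-in tokenizer (assumption): real pretraining uses a subword tokenizer.
    return text.split()


def pack_into_blocks(
    examples: Iterable[Dict[str, str]],
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
    block_size: int = 8,
    eos: str = "<eos>",
) -> Iterator[List[str]]:
    """Concatenate documents, separated by `eos`, and yield full blocks."""
    buffer: List[str] = []
    for example in examples:
        # C4 records carry the document body in the "text" field.
        buffer.extend(tokenize(example["text"]))
        buffer.append(eos)
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
    # Any trailing partial block is dropped in this sketch.
```

Any iterable of records with a `text` field works here, including a dataset opened with `streaming=True`, so the corpus never has to be materialized on disk.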
Train text generation systems
Fine-tune on the multilingual (mC4) variant to build generation models that handle many languages drawn from Common Crawl.
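When mixing languages for multilingual training, high-resource languages can swamp the rest. One commonly used remedy is to sample each language with probability proportional to its size raised to an exponent below 1. The sketch below computes such probabilities; the helper name, the toy counts, and the default exponent are illustrative assumptions, not something c4 prescribes.

```python
from typing import Dict


def language_sampling_probs(doc_counts: Dict[str, int], alpha: float = 0.3) -> Dict[str, float]:
    """Return p(lang) proportional to count ** alpha.

    alpha = 1.0 keeps the natural proportions; smaller values flatten the
    distribution toward uniform, boosting low-resource languages.
    """
    if not doc_counts:
        raise ValueError("doc_counts must not be empty")
    weights = {lang: count ** alpha for lang, count in doc_counts.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}


# Toy counts (made up for illustration, not real mC4 statistics).
toy_counts = {"en": 1_000_000, "sw": 1_000}
probs = language_sampling_probs(toy_counts, alpha=0.3)
```

Probabilities like these could then be handed to `datasets.interleave_datasets` through its `probabilities` argument to mix per-language streams.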
Benchmark data cleaning pipelines
Compare model performance across en.noclean and en.noblocklist variants to evaluate the impact of filtering heuristics on training data quality.
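Before committing to full training runs, it can help to compare cheap text statistics on a sample from each variant. The sketch below computes two such proxies; the `summarize_sample` helper, the choice of statistics, and the punctuation set are assumptions made for illustration and are not the filters used to build C4.

```python
from itertools import islice
from typing import Dict, Iterable

# Characters treated as sentence-ending in this sketch (an assumption).
TERMINAL_PUNCTUATION = (".", "!", "?", '"')


def summarize_sample(examples: Iterable[Dict[str, str]], limit: int = 1000) -> Dict[str, float]:
    """Compute simple quality proxies over the first `limit` documents."""
    n_docs = 0
    n_words = 0
    n_lines = 0
    n_terminated = 0
    for example in islice(examples, limit):
        n_docs += 1
        n_words += len(example["text"].split())
        for line in example["text"].splitlines():
            line = line.strip()
            if not line:
                continue  # skip blank lines
            n_lines += 1
            n_terminated += line.endswith(TERMINAL_PUNCTUATION)
    return {
        "docs": n_docs,
        "mean_words_per_doc": n_words / n_docs if n_docs else 0.0,
        "terminated_line_fraction": n_terminated / n_lines if n_lines else 0.0,
    }
```

Running the same function over samples of en.noclean and en.noblocklist gives a quick side-by-side view of how much boilerplate-like text each variant retains.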
Load c4
from datasets import load_dataset
ds = load_dataset("allenai/c4", "en", streaming=True)
- 1. pip install datasets
- 2. from datasets import load_dataset
- 3. dataset = load_dataset('allenai/c4', 'en', streaming=True)
- 4. Access splits with dataset['train'] or dataset['validation']
- 5. Iterate over examples for tokenization and training loops
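The steps above can be sketched as two small helpers: one that opens a variant in streaming mode and one that peeks at the first few records. The helper names `stream_c4` and `take_texts` are hypothetical; only `load_dataset` and its `streaming` and `split` arguments come from the datasets library.

```python
from itertools import islice
from typing import Dict, Iterable, List


def stream_c4(config: str = "en", split: str = "train"):
    """Open a C4 variant in streaming mode (needs `datasets` and network access)."""
    # Imported lazily so take_texts below can be used without the library installed.
    from datasets import load_dataset

    return load_dataset("allenai/c4", config, split=split, streaming=True)


def take_texts(examples: Iterable[Dict[str, str]], n: int) -> List[str]:
    """Return the "text" field of the first `n` records without reading further."""
    return [example["text"] for example in islice(examples, n)]


# Example (fetches data on demand, so it is left commented out):
# for text in take_texts(stream_c4("realnewslike", "validation"), 3):
#     print(text[:80])
```

Because streaming yields records lazily, this avoids downloading a full variant just to inspect a handful of documents.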
c4: pros & cons
Pros
- Massive scale with variants up to 9.7TB
- Multiple filtered versions for different needs
- Directly loadable via Hugging Face datasets
- Covers text-generation and fill-mask use cases
Cons
- Extremely large downloads require significant storage
- Web-sourced data can contain noise and biases
- English-dominant with limited non-English coverage outside mC4
Frequently asked questions
c4 is a processed version of Google's C4 corpus, derived from Common Crawl and offered in five variants, including cleaned English and multilingual versions.