Cleaned web documents in 191 languages from the HPLT project.
HPLT2.0_cleaned is the cleaned variant of the HPLT Datasets v2.0. It contains web-crawled text documents across 191 languages, converted semi-automatically to Parquet from the original JSONL files produced by the HPLT project.
The resource is intended for multilingual NLP work including model pretraining and evaluation. It is suitable for researchers and developers building language models that require large-scale cleaned web text.
Use the trillion-token corpus to train or continue-pretrain large language models across 191 languages in a single pipeline.
Filter the Parquet files by language to create targeted training sets for low-resource languages before fine-tuning.
Leverage the cleaned documents directly for fill-mask pretraining objectives on diverse web text.
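As a rough sketch of the language-filtering use case above, the snippet below keeps only the records whose language tag matches a target code. The field name `lang`, the language codes, and the sample rows are assumptions made for illustration, not confirmed details of this dataset; check the actual Parquet schema before relying on them.

```python
from typing import Any, Dict, Iterable, Iterator


def filter_by_language(
    records: Iterable[Dict[str, Any]],
    lang_code: str,
    lang_field: str = "lang",  # assumed column name; verify against the real schema
) -> Iterator[Dict[str, Any]]:
    """Yield only the records whose language field equals lang_code."""
    for record in records:
        if record.get(lang_field) == lang_code:
            yield record


# Tiny in-memory stand-in for rows read from the Parquet files (made-up data).
sample_rows = [
    {"lang": "isl_Latn", "text": "Halló heimur"},
    {"lang": "eng_Latn", "text": "Hello world"},
    {"lang": "isl_Latn", "text": "Góðan daginn"},
]

icelandic_rows = list(filter_by_language(sample_rows, "isl_Latn"))
print(len(icelandic_rows))
```

Because the function is a generator over plain dictionaries, it should work with any iterable of dict-like rows, including a dataset opened in streaming mode, without holding the full corpus in memory.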
from datasets import load_dataset
ds = load_dataset("jobs-git/HPLT2.0_cleaned")
A cleaned collection of web-crawled documents in 191 languages, drawn mostly from the Internet Archive and Common Crawl, released in Parquet format and exceeding one trillion tokens.
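Given the size of the corpus, a full download may be impractical for a first look. The sketch below shows one possible way to inspect a few rows lazily using the `streaming=True` option of `load_dataset`. The split name `train` and the helper names `preview` and `open_stream` are assumptions introduced here, and the repository may also require a configuration name that is not shown.

```python
from itertools import islice
from typing import Any, Iterable, List


def preview(rows: Iterable[Any], n: int = 3) -> List[Any]:
    """Return the first n items, reading no more than n from the iterable."""
    return list(islice(rows, n))


def open_stream(split: str = "train"):
    """Open the dataset lazily; the split name is an assumption."""
    # Imported here so that preview() can be used without the datasets package.
    from datasets import load_dataset

    return load_dataset("jobs-git/HPLT2.0_cleaned", split=split, streaming=True)


# Requires network access and the datasets package, so it is left commented out:
# for row in preview(open_stream(), n=3):
#     print(row)
```

Streaming returns an iterable rather than an indexable dataset, so random access and `len()` are not available; that trade-off is usually acceptable for a quick inspection of the schema.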