How do I access the dataset?

Load it directly with the Hugging Face datasets library using load_dataset('jobs-git/HPLT2.0_cleaned').

Is HPLT2.0_cleaned free to use?

It is publicly available on Hugging Face; check the repository for any usage restrictions or license terms.

What tasks is it intended for?

It supports fill-mask and text-generation tasks.

HPLT2.0_cleaned

Cleaned web documents in 191 languages from the HPLT project.

Dataset · Text & NLP · 185K downloads/month · Free

Updated 2026-06-18

What is HPLT2.0_cleaned?

HPLT2.0_cleaned is the cleaned variant of the HPLT Datasets v2.0. It contains web-crawled text documents across 191 languages, converted semi-automatically to Parquet from the original JSONL files produced by the HPLT project.

The resource is intended for multilingual NLP work including model pretraining and evaluation. It is suitable for researchers and developers building language models that require large-scale cleaned web text.

What you can build with HPLT2.0_cleaned

Pretrain multilingual LLMs

Use the trillion-token corpus to train or continue-pretrain large language models across 191 languages in a single pipeline.

Build language-specific corpora

Filter the Parquet files by language to create targeted training sets for low-resource languages before fine-tuning.
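
A minimal sketch of language filtering, assuming each record carries its language code in a field named `lang`. That field name and the codes below are hypothetical; check the dataset card for the real schema. The helper accepts any iterable of dict records, which is what iterating a `datasets` split yields, so it is shown here on invented toy rows rather than the real corpus.

```python
from typing import Dict, Iterable, Iterator


def filter_by_language(
    records: Iterable[Dict[str, str]],
    lang_code: str,
    lang_field: str = "lang",  # hypothetical column name, not confirmed by the listing
) -> Iterator[Dict[str, str]]:
    """Yield only the records whose language field equals lang_code."""
    for record in records:
        if record.get(lang_field) == lang_code:
            yield record


# Toy rows standing in for dataset records (invented for illustration).
toy_rows = [
    {"lang": "isl_Latn", "text": "Halló heimur"},
    {"lang": "eng_Latn", "text": "Hello world"},
    {"lang": "isl_Latn", "text": "Góðan daginn"},
]

icelandic = list(filter_by_language(toy_rows, "isl_Latn"))
print(len(icelandic))
```

With a real split, the dataset object would take the place of `toy_rows`. If the repository happens to expose one configuration or directory per language, selecting that at load time would avoid scanning the other languages at all.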

Masked language modeling experiments

Leverage the cleaned documents directly for fill-mask pretraining objectives on diverse web text.
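
To make the fill-mask objective concrete, here is a simplified sketch of random token masking. It is not the HPLT project's or any library's implementation: the `[MASK]` string, the whitespace tokenization, and the helper name `mask_tokens` are all assumptions for illustration, and the 0.15 default only mirrors the rate commonly used in BERT-style training.

```python
import random
from typing import List, Optional, Tuple

MASK_TOKEN = "[MASK]"  # placeholder; real tokenizers define their own mask token


def mask_tokens(
    tokens: List[str], mask_prob: float = 0.15, seed: Optional[int] = None
) -> Tuple[List[str], List[Optional[str]]]:
    """Replace a random subset of tokens with MASK_TOKEN.

    Returns the masked sequence and a parallel label list that holds the
    original token at masked positions and None everywhere else.
    """
    rng = random.Random(seed)
    masked: List[str] = []
    labels: List[Optional[str]] = []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


document = "cleaned web text in many languages is useful for pretraining"
tokens = document.split()  # whitespace split stands in for a subword tokenizer
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=0)
print(masked)
print(labels)
```

A model trained on this objective predicts the entries of `labels` that are not `None` from the surrounding unmasked tokens. Production pipelines usually operate on subword IDs and add refinements such as occasionally keeping or randomly replacing the selected token.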

Load HPLT2.0_cleaned

Python

from datasets import load_dataset

ds = load_dataset("jobs-git/HPLT2.0_cleaned")

1. pip install datasets
2. from datasets import load_dataset
3. ds = load_dataset('jobs-git/HPLT2.0_cleaned')
4. Select the language splits or columns needed for your task.
5. Stream or download the Parquet shards for local processing.
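
The last two steps can be sketched as follows. Because the corpus is very large, streaming avoids downloading every shard before inspecting any data. The split name `train` and the column name `text` are assumptions, not confirmed by the listing, and `preview` and `stream_hplt` are hypothetical helpers; the demo runs on invented toy records so it needs no network access.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List, Sequence


def preview(
    records: Iterable[Dict[str, Any]], n: int, columns: Sequence[str]
) -> List[Dict[str, Any]]:
    """Take the first n records and keep only the requested columns."""
    return [
        {name: record.get(name) for name in columns}
        for record in islice(records, n)
    ]


def stream_hplt(split: str = "train"):
    """Open the dataset lazily; the split name is an assumption."""
    # Imported here so preview() can be used without the datasets library.
    from datasets import load_dataset

    return load_dataset("jobs-git/HPLT2.0_cleaned", split=split, streaming=True)


# Toy records (invented) show the behaviour without touching the network.
toy_records = [
    {"text": "first document", "url": "https://example.org/1"},
    {"text": "second document", "url": "https://example.org/2"},
    {"text": "third document", "url": "https://example.org/3"},
]
print(preview(toy_records, 2, ["text"]))

# Against the real corpus (requires network access and the datasets library):
# print(preview(stream_hplt(), 2, ["text"]))
```

Passing `streaming=True` makes `load_dataset` return an iterable dataset that fetches records on demand, which is why `preview` only consumes as many records as it is asked for.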

HPLT2.0_cleaned: pros & cons

Pros

+ Over one trillion tokens
+ 191 languages covered
+ Cleaned web data in efficient Parquet format
+ Ready for fill-mask and text-generation workloads

Cons

– Extremely large size demands substantial storage and compute
– Web-crawled origin may retain some noise or bias
– License and exact filtering details are not specified in the description

Frequently asked questions

What is HPLT2.0_cleaned?

A cleaned collection of web-crawled documents in 191 languages, drawn mostly from the Internet Archive and Common Crawl, released in Parquet format and exceeding one trillion tokens.


Similar datasets

KakologArchives

wikitext

gsm8k
