Is EuroWeb-2512 free to use?

Yes. It is hosted publicly on the Hugging Face Hub and can be loaded with the datasets library.

How do I access the dataset?

Use the Hugging Face datasets library with the identifier utter-project/EuroWeb-2512.

Where can I find more details?

See the EuroLLM-22B Technical Report and the full dataset page on Hugging Face.

EuroWeb-2512

Multilingual web data filtered for European LLM pretraining.

Dataset · Data & Analytics · ↓ 151K/mo · Free

Updated 2026-06-18

What is EuroWeb-2512?

EuroWeb-2512 is a dataset of multilingual web data gathered from multiple sources, processed with standard practices, and classified with EuroFilter-v1.

It supports LLM pretraining focused on European languages.

What you can build with EuroWeb-2512

Pretraining European-focused LLMs

Use the filtered web corpus to continue pretraining or fine-tune models like EuroLLM-22B on high-quality multilingual European data.

Evaluating web data filters

Benchmark new content classifiers by comparing against the EuroFilter-v1 labels already applied to this collection.
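As a rough illustration of such a comparison, the sketch below scores a candidate classifier against reference labels. The column names `text` and `label`, the binary label encoding, and the toy `length_filter` classifier are all assumptions made for the example; this listing does not describe the dataset's actual schema.

```python
from typing import Any, Callable, Iterable


def agreement(
    records: Iterable[dict[str, Any]],
    predict: Callable[[str], bool],
    text_key: str = "text",    # hypothetical column name
    label_key: str = "label",  # hypothetical column name, treated as binary
) -> dict[str, float]:
    """Compare a candidate filter's keep/drop decisions with reference labels."""
    tp = fp = fn = tn = 0
    for record in records:
        reference = bool(record[label_key])
        predicted = bool(predict(record[text_key]))
        if predicted and reference:
            tp += 1
        elif predicted and not reference:
            fp += 1
        elif reference:
            fn += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


def length_filter(text: str) -> bool:
    """Toy stand-in for a new classifier: keep texts with five or more words."""
    return len(text.split()) >= 5


# Hand-made records standing in for rows of the dataset.
sample = [
    {"text": "a long informative paragraph about trains", "label": 1},
    {"text": "buy now", "label": 0},
    {"text": "short but good", "label": 1},
    {"text": "click here click here click here now", "label": 0},
]
scores = agreement(sample, length_filter)
print(scores)
```

Because `agreement` accepts any iterable of dictionaries, the same function could be fed streamed rows once the real column names are confirmed on the dataset page.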

Multilingual web research

Analyze language distribution, domain coverage, and quality signals across the collected European web sources.
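A language-distribution analysis could start from something like the sketch below. It assumes each record carries a language tag under a `lang` key, which is a hypothetical column name; the real schema should be checked on the dataset page before running this against the data.

```python
from collections import Counter
from typing import Any, Iterable


def language_shares(
    records: Iterable[dict[str, Any]],
    lang_key: str = "lang",  # hypothetical column name
) -> dict[str, float]:
    """Return each language's share of the records seen, most frequent first."""
    counts = Counter(str(record.get(lang_key, "unknown")) for record in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: n / total for lang, n in counts.most_common()}


# Hand-made records standing in for rows of the dataset.
sample = [
    {"lang": "pt", "text": "Bom dia"},
    {"lang": "pt", "text": "Boa noite"},
    {"lang": "de", "text": "Hallo"},
    {"text": "no language tag"},
]
shares = language_shares(sample)
print(shares)
```

Records without the key are counted as `unknown` rather than dropped, so missing metadata shows up in the result instead of silently skewing the shares.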

Load EuroWeb-2512

Python

from datasets import load_dataset

ds = load_dataset("utter-project/EuroWeb-2512")

1. pip install datasets
2. from datasets import load_dataset
3. ds = load_dataset('utter-project/EuroWeb-2512')
4. Inspect available splits and columns
5. Stream subsets with streaming=True for large-scale use
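The last two steps can be sketched as follows. This is a minimal example, not a tested recipe for this dataset: the split name `train` and the optional configuration name are assumptions, since the listing documents neither splits nor subsets. `open_stream` and `peek` are helper names introduced here for illustration.

```python
from itertools import islice
from typing import Any, Iterable, Iterator, Optional


def open_stream(split: str = "train", name: Optional[str] = None) -> Iterator[dict[str, Any]]:
    """Open the dataset lazily instead of downloading it in full.

    The split "train" and any configuration name are assumptions;
    check the dataset page for the values that actually exist.
    """
    # Imported here so that peek() can be used without the library installed.
    from datasets import load_dataset

    stream = load_dataset(
        "utter-project/EuroWeb-2512",
        name=name,
        split=split,
        streaming=True,
    )
    return iter(stream)


def peek(records: Iterable[dict[str, Any]], n: int = 3) -> list[dict[str, Any]]:
    """Return up to the first n records without reading further."""
    return list(islice(records, n))


# Usage (requires network access and `pip install datasets`):
#   for row in peek(open_stream(), n=3):
#       print(sorted(row.keys()))  # shows which columns are available
```

With streaming enabled, records are fetched as they are iterated, which avoids storing the whole corpus locally.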

EuroWeb-2512: pros & cons

Pros

+ Multilingual European web data
+ Already processed with EuroFilter-v1
+ Linked to EuroLLM-22B technical report
+ Directly loadable via Hugging Face datasets

Cons

– Web data can still contain noise
– Exact license not stated in summary
– Size may require substantial storage

Frequently asked questions

What is EuroWeb-2512?

A multilingual web dataset collected from various sources, processed with standard practices, and classified using utter-project/EuroFilter-v1.

Similar datasets

GiftEvalPretrain

pretraining_v1-omega_books