Multilingual web data filtered for European LLM pretraining.
EuroWeb-2512 is a dataset of multilingual web data gathered from multiple sources, processed with standard practices, and classified with EuroFilter-v1.
It supports LLM pretraining focused on European languages.
Use the filtered web corpus to continue pretraining or fine-tune models like EuroLLM-22B on high-quality multilingual European data.
Benchmark new content classifiers by comparing against the EuroFilter-v1 labels already applied to this collection.
Analyze language distribution, domain coverage, and quality signals across the collected European web sources.
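The classifier-benchmarking and language-distribution use cases above can be sketched on a few in-memory records. The field names `lang` and `label`, the label values, and the toy records are assumptions for illustration only; the actual EuroWeb-2512 column names are not stated here and should be checked before adapting this.

```python
from collections import Counter

# Toy records standing in for dataset rows. The "lang" and "label" fields
# are hypothetical; check the real column names before reusing this.
records = [
    {"lang": "pt", "label": "keep"},
    {"lang": "pt", "label": "drop"},
    {"lang": "de", "label": "keep"},
    {"lang": "fr", "label": "keep"},
]


def language_distribution(rows, lang_field="lang"):
    """Return the share of rows per language as a dict of fractions."""
    counts = Counter(row[lang_field] for row in rows)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()} if total else {}


def label_agreement(rows, predict, label_field="label"):
    """Return the fraction of rows where predict(row) equals the stored label."""
    rows = list(rows)
    if not rows:
        return 0.0
    hits = sum(1 for row in rows if predict(row) == row[label_field])
    return hits / len(rows)


shares = language_distribution(records)
# A trivial baseline classifier that keeps every document.
agreement = label_agreement(records, lambda row: "keep")
print(shares, agreement)
```

A real comparison would replace the lambda with the new classifier's prediction function and would likely report per-language agreement as well, since quality filters often behave differently across languages.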
from datasets import load_dataset
ds = load_dataset("utter-project/EuroWeb-2512")
A multilingual web dataset collected from various sources, processed with standard practices and classified using utter-project/EuroFilter-v1.
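Because web-scale corpora can be expensive to download in full, it may help to inspect a few rows lazily first. The sketch below uses the `streaming=True` option of `load_dataset`; the split name `"train"` and the helper names `preview` and `open_stream` are assumptions, not something this listing confirms.

```python
from itertools import islice


def preview(rows, n=3):
    """Return at most the first n rows of any iterable."""
    return list(islice(rows, n))


def open_stream(split="train"):
    """Open the dataset lazily. The split name "train" is an assumption."""
    # Imported here so that preview() stays usable without the library installed.
    from datasets import load_dataset

    return load_dataset("utter-project/EuroWeb-2512", split=split, streaming=True)


# Usage (needs network access and the `datasets` package):
#     for row in preview(open_stream()):
#         print(row)
```

Keeping `preview` independent of the library means the same helper works on a local list of records when testing downstream code offline.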