Multilingual translations of FineWeb-Edu documents across 36 languages.
The dataset supplies parallel machine-translated versions of educational web documents, with all translations aligned across languages.
It is intended for work on multilingual translation and text-generation models that require large-scale aligned educational text.
Fine-tune models to categorize or rank educational web content in any of the 36 languages; because the translations are aligned, a label assigned to one version of a document carries over to its other language versions.
Benchmark OPUS-MT or custom MT systems by comparing outputs against the provided translations of 28 billion English educational tokens.
Develop search or recommendation engines that retrieve aligned educational documents in any of the 36 target languages from a single query.
from datasets import load_dataset
ds = load_dataset("Helsinki-NLP/fineweb-edu-translated")

A large collection of automatically translated educational documents from FineWeb-Edu, aligned across 36 languages with over 960 billion tokens total.
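Because the translations are aligned across languages, one common first step is to group rows by source document and pull out parallel pairs for a given language pair. The following is a minimal sketch of that idea on toy in-memory records. The field names "id", "lang", and "text", the helper names, and the sample rows are all assumptions made for illustration, not the dataset's documented schema; check the actual columns before adapting it.

```python
from collections import defaultdict

# Toy records standing in for dataset rows. The field names "id", "lang",
# and "text" are hypothetical and may not match the real columns.
records = [
    {"id": "doc-1", "lang": "en", "text": "Plants make food from sunlight."},
    {"id": "doc-1", "lang": "fi", "text": "Kasvit valmistavat ravintoa auringonvalosta."},
    {"id": "doc-2", "lang": "en", "text": "Water boils at 100 degrees Celsius."},
    {"id": "doc-1", "lang": "de", "text": "Pflanzen stellen Nahrung aus Sonnenlicht her."},
]


def group_by_document(rows):
    """Map each document id to a {language: text} dictionary."""
    aligned = defaultdict(dict)
    for row in rows:
        aligned[row["id"]][row["lang"]] = row["text"]
    return dict(aligned)


def parallel_pairs(aligned, src, tgt):
    """Return (source, target) text pairs for documents available in both languages."""
    return [
        (texts[src], texts[tgt])
        for texts in aligned.values()
        if src in texts and tgt in texts
    ]


aligned = group_by_document(records)
pairs = parallel_pairs(aligned, "en", "fi")
```

Here doc-2 has no Finnish version in the toy data, so it is skipped rather than paired with an empty string. The same grouping could feed an MT evaluation loop or a cross-lingual retrieval index.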