Deduplicated FineWeb-Edu with embeddings and occurrence counts.
Fineweb-Edu-Fortified contains deduplicated rows from FineWeb-Edu, each augmented with a bge-micro embedding and an occurrence count.
It is suited for text-generation workloads that require cleaned educational web text at large scale with precomputed embeddings.
Use the duplicate count column to weight samples or filter repeats when fine-tuning models on educational web text.
Leverage the included bge-micro embeddings to prototype semantic search or clustering over cleaned educational content.
Compare model performance or data statistics before and after exact-match deduplication using the fortified version.
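The count-based filtering and weighting mentioned above might look like the sketch below. It runs on a few toy in-memory rows rather than the real dataset. The field names `text` and `count`, the threshold `MAX_COUNT`, and the log-damped weighting are all illustrative assumptions, not something this listing documents, so check the dataset's actual schema before reusing them.

```python
import math

# Toy rows standing in for dataset records. The "text" and "count" field
# names are assumptions; verify them against the real schema.
rows = [
    {"text": "Photosynthesis converts light into chemical energy.", "count": 1},
    {"text": "Cookie notice repeated on many pages.", "count": 40},
    {"text": "Newton's second law relates force and acceleration.", "count": 3},
]

MAX_COUNT = 10  # hypothetical cutoff for heavily repeated text


def keep(row, max_count=MAX_COUNT):
    """Return True for rows whose occurrence count is at or below the cutoff."""
    return row["count"] <= max_count


def sample_weight(row):
    """Log-damped weight: a row seen once gets 1.0, repeats grow slowly."""
    return 1.0 + math.log(row["count"])


# Drop heavily repeated rows, then compute one weight per surviving row.
filtered = [r for r in rows if keep(r)]
weights = [sample_weight(r) for r in filtered]
```

With the Hugging Face `datasets` library, a predicate like `keep` could presumably be passed to `filter` on the loaded dataset. Whether to up-weight or down-weight repeated text depends on the training goal, so the weighting function is only one possible choice.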
from datasets import load_dataset
ds = load_dataset("airtrain-ai/fineweb-edu-fortified")
A deduplicated version of FineWeb-Edu with added embeddings and duplicate counts from a 500k sample.
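To illustrate the semantic search use case, here is a minimal sketch of brute-force cosine-similarity ranking over precomputed embeddings. It uses tiny hand-written 3-dimensional vectors in place of real bge-micro embeddings. The `embedding` and `text` field names and the helper `top_k` are assumptions made for illustration. In real use the query vector would need to come from the same embedding model as the stored vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, rows, k=2):
    """Rank rows by similarity to the query and return the k best (score, text) pairs."""
    scored = [(cosine(query_vec, r["embedding"]), r["text"]) for r in rows]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Toy rows with made-up 3-d vectors; the field names are assumptions.
rows = [
    {"text": "cell biology basics", "embedding": [0.9, 0.1, 0.0]},
    {"text": "intro to algebra", "embedding": [0.0, 0.2, 0.9]},
    {"text": "plant cells and photosynthesis", "embedding": [0.8, 0.3, 0.1]},
]

query = [1.0, 0.0, 0.0]  # stand-in for an embedded query string
results = top_k(query, rows, k=2)
```

A linear scan like this is fine for prototyping on a small slice. At the scale of the full dataset, an approximate nearest-neighbor index would likely be a better fit.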