finephrase
Synthetic text samples generated from FineWeb-Edu via SmolLM2-Instruct.
What is finephrase?
finephrase is a collection of synthetic text generated from FineWeb-Edu educational web samples using SmolLM2-1.7B-Instruct, a 1.7B-parameter instruction-tuned language model, with fixed sampling parameters (temperature 1.0, top_p 1.0).
It is useful for researchers and developers working on text-generation model training or evaluation involving rewritten and FAQ-style content.
Dataset structure
| Subset | Split | Rows |
|---|---|---|
| all | train | — |
| faq | train | — |
| math | train | 338,747,732 |
| table | train | 338,546,433 |
| tutorial | train | 337,711,099 |
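For a sense of scale, the row counts listed in the table can be summed. This is a small sketch only: the counts for `all` and `faq` are not given above, so they are kept as `None` and left out of the total. The names `ROWS` and `known_total` are illustrative.

```python
from typing import Dict, Optional

# Row counts copied from the table; None marks counts the table does not list.
ROWS: Dict[str, Optional[int]] = {
    "all": None,
    "faq": None,
    "math": 338_747_732,
    "table": 338_546_433,
    "tutorial": 337_711_099,
}


def known_total(rows: Dict[str, Optional[int]]) -> int:
    """Sum only the subsets whose row count is actually listed."""
    return sum(count for count in rows.values() if count is not None)


print(f"{known_total(ROWS):,}")  # 1,015,005,264
```

The three listed subsets alone pass one billion rows, which is why streaming is worth considering before a full download.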
What you can build with finephrase
Fine-tuning FAQ generators
Train models to produce FAQ pairs from raw educational web text using the synthetic rewrite examples.
Text-rewriting pipelines
Build systems that rewrite long-form content into clearer, structured formats for documentation or study aids.
Instruction data augmentation
Expand LLM training sets with billions of synthetic instruction-response pairs derived from high-quality sources.
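As one possible starting point for the use cases above, a row can be turned into a prompt/completion pair before fine-tuning. This is a sketch under assumptions: the dataset's column names are not documented in this listing, so the helper takes the field names as arguments, and the `source` and `rewrite` keys in the toy row are made up for illustration. The prompt wording is likewise a placeholder, not the prompt used to generate the dataset.

```python
from typing import Any, Dict, Mapping


def to_training_pair(
    row: Mapping[str, Any], source_field: str, target_field: str
) -> Dict[str, str]:
    """Turn one row into a prompt/completion pair.

    Field names are parameters because the real column names are not
    documented here; check the dataset before relying on any of them.
    """
    missing = [f for f in (source_field, target_field) if f not in row]
    if missing:
        raise KeyError(f"row is missing fields: {missing}")
    prompt = "Rewrite the following text as FAQ pairs:\n\n" + str(row[source_field]).strip()
    return {"prompt": prompt, "completion": str(row[target_field]).strip()}


# Toy row with hypothetical field names, for illustration only.
example = {
    "source": " Photosynthesis converts light into chemical energy. ",
    "rewrite": "Q: What does photosynthesis do?\nA: It converts light into chemical energy.",
}
pair = to_training_pair(example, "source", "rewrite")
```

Keeping the field names as arguments means the same helper can be pointed at whichever columns a given subset exposes.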
Load finephrase
from datasets import load_dataset
ds = load_dataset("HuggingFaceFW/finephrase")
1. pip install datasets
2. from datasets import load_dataset
3. ds = load_dataset('HuggingFaceFW/finephrase')
4. Select the train split, or use streaming mode, for the multi-billion-row collection
5. Feed examples into your text-generation training loop
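The steps above can be sketched with streaming, so that rows are read lazily instead of downloading the full collection first. This is a minimal sketch. It assumes the subset names from the Dataset structure table are valid configuration names for `load_dataset`. The helpers `stream_subset` and `take` are hypothetical names introduced here.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List

REPO_ID = "HuggingFaceFW/finephrase"
# Subset names from the Dataset structure table; assumed to be valid config names.
SUBSETS = ("all", "faq", "math", "table", "tutorial")


def stream_subset(subset: str) -> Iterable[Dict[str, Any]]:
    """Open one subset's train split in streaming mode (no full download)."""
    if subset not in SUBSETS:
        raise ValueError(f"unknown subset {subset!r}; expected one of {SUBSETS}")
    # Imported lazily so the validation above works without the dependency installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, subset, split="train", streaming=True)


def take(rows: Iterable[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Materialize only the first n rows of a (possibly huge) iterable."""
    return list(islice(rows, n))
```

For instance, `take(stream_subset("faq"), 3)` would fetch three rows for inspection. Printing each row's keys is a quick way to discover the actual column names before wiring the stream into a training loop.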
finephrase: pros & cons
Pros
- Billions of synthetic examples at scale
- Derived from curated FineWeb-Edu educational content
- Consistent generation settings (temperature 1.0, top_p 1.0)
- Directly supports text-generation and rewrite tasks
Cons
- Dataset size (1B–10B entries) demands heavy storage and compute
- Synthetic outputs may carry model-specific artifacts or biases
- Narrow focus on FAQ/rewrite prompt families only
Frequently asked questions
What is finephrase?
A large synthetic collection generated by SmolLM2-1.7B-Instruct from FineWeb-Edu using FAQ and rewrite prompts.