wikitext
Over 100 million tokens from Wikipedia for language modeling benchmarks.
What is wikitext?
WikiText is a language modeling collection drawn from verified Good and Featured articles on Wikipedia. It comes in two sizes: WikiText-2 (about 2 million tokens) and WikiText-103 (more than 100 million tokens).
It serves researchers and practitioners building or benchmarking models for text generation and masked language modeling in NLP.
What you can build with wikitext
Train autoregressive language models
Use WikiText-103 to pretrain or fine-tune GPT-style models on next-token prediction with over 100M tokens of clean text.
Benchmark fill-mask performance
Evaluate BERT-like models on the WikiText-2 validation split for masked language modeling accuracy.
Create domain-adapted tokenizers
Build and test subword tokenizers on the raw Wikipedia text before applying them to downstream corpora.
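The first two use cases above come down to how training targets are built from a token sequence. The following is a minimal sketch, not a reference implementation: the helper names and the "[MASK]" placeholder are hypothetical, tokens are plain Python strings, and the masking is simplified (every selected token becomes the mask token, which is less elaborate than the scheme BERT-style models typically train with).

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical placeholder; real tokenizers define their own


def next_token_pairs(tokens):
    """Pair each token with the token that follows it (autoregressive targets)."""
    return list(zip(tokens[:-1], tokens[1:]))


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace a random subset of tokens with MASK_TOKEN.

    Returns the masked sequence and a dict mapping each masked position
    to the original token at that position.
    """
    rng = random.Random(seed)
    masked, labels = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(MASK_TOKEN)
            labels[position] = token
        else:
            masked.append(token)
    return masked, labels


def masked_accuracy(predictions, labels):
    """Fraction of masked positions where the prediction equals the original token."""
    if not labels:
        return 0.0
    hits = sum(
        1 for position, token in labels.items()
        if predictions.get(position) == token
    )
    return hits / len(labels)
```

Here `predictions` is assumed to be a dict from position to predicted token, produced by whatever model is being evaluated.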
Load wikitext
from datasets import load_dataset
ds = load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1")
- 1. pip install datasets
- 2. from datasets import load_dataset
- 3. ds = load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1")
- 4. train_text = ds["train"]["text"]
- 5. Tokenize and batch the splits for your training loop
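The last step, tokenizing and batching, can be sketched as follows. This is one possible approach under stated assumptions: it uses whitespace tokenization as a stand-in for a real tokenizer, and the function names and block size are hypothetical. The loading helper requires the `datasets` package and network access, so it is defined but not called here.

```python
def iter_token_blocks(lines, block_size):
    """Yield fixed-length blocks of whitespace tokens from an iterable of text lines.

    Empty lines contribute no tokens. Leftover tokens that do not fill a
    complete block at the end are dropped.
    """
    buffer = []
    for line in lines:
        buffer.extend(line.split())
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]


def load_train_lines():
    """Load the raw WikiText-103 training lines (downloads the data on first use)."""
    from datasets import load_dataset

    ds = load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1")
    return ds["train"]["text"]
```

For example, `iter_token_blocks(load_train_lines(), 512)` would stream 512-token blocks into a training loop without materializing the whole token list up front.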
wikitext: pros & cons
Pros
- +Over 100 million tokens from vetted articles
- +Two size variants (WikiText-2 and WikiText-103), with the smaller one suited to quick experiments
- +Clean, deduplicated text with no markup
- +CC BY-SA license (attribution and share-alike terms apply)
Cons
- –English Wikipedia only
- –No sentence or document boundaries annotated
- –Contains occasional factual or stylistic biases
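Although document boundaries are not annotated, article starts can often be recovered heuristically. The sketch below assumes that a top-level article title appears on its own line in the form " = Title = " and that section headings use additional "=" signs (for example " = = History = = "). Both the convention and the helper name are assumptions to check against the actual data before relying on them.

```python
import re

# Assumption: a top-level title line looks like "= Title =" once stripped.
# The negative lookahead rejects section headings such as "= = History = =".
TITLE_RE = re.compile(r"^= (?!=)(.+?) =$")


def split_articles(lines):
    """Group text lines into articles, starting a new one at each title line.

    Blank lines are skipped, and lines before the first title are ignored.
    Section headings are kept as ordinary lines of the current article.
    """
    articles = []
    current = None
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        match = TITLE_RE.match(stripped)
        if match:
            current = {"title": match.group(1), "lines": []}
            articles.append(current)
        elif current is not None:
            current["lines"].append(stripped)
    return articles
```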
Frequently asked questions
A large collection of verified Wikipedia articles released in raw text form for language modeling research.