wikitext

Verified

Over 100 million tokens from Wikipedia for language modeling benchmarks.

Dataset · Text & NLP · 1.3M/mo · Free
Updated 2026-06-15

What is wikitext?

WikiText is a language modeling dataset drawn from Wikipedia articles rated Good or Featured. Its largest variant, WikiText-103, provides more than 100 million tokens; the smaller WikiText-2 is intended for faster experiments.

It serves researchers and practitioners building or benchmarking models for text generation and masked language modeling in NLP.

What you can build with wikitext

Train autoregressive language models

Use WikiText-103 to pretrain or fine-tune GPT-style models on next-token prediction with over 100M tokens of clean text.
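As a rough illustration of what next-token prediction data looks like, the sketch below cuts a flat list of token ids into fixed-length input/target pairs, where each target sequence is the input shifted by one position. The function name `make_next_token_examples` and the `block_size` parameter are hypothetical, and the token ids are assumed to come from whatever tokenizer you already use.

```python
def make_next_token_examples(token_ids, block_size):
    """Split a flat token id list into (inputs, targets) pairs.

    Hypothetical helper for illustration: targets are the inputs
    shifted left by one token, so each pair needs block_size + 1 tokens.
    """
    examples = []
    # Stop early enough that every chunk has block_size + 1 tokens;
    # leftover tokens at the end are dropped.
    for start in range(0, len(token_ids) - block_size, block_size):
        chunk = token_ids[start:start + block_size + 1]
        examples.append((chunk[:-1], chunk[1:]))
    return examples


pairs = make_next_token_examples(list(range(10)), block_size=4)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

Consecutive pairs share one boundary token, so every token except the first appears as a target exactly once.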

Benchmark fill-mask performance

Evaluate BERT-like models on the WikiText-2 validation split for masked language modeling accuracy.
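One way to score fill-mask accuracy is sketched below: hide a random subset of tokens, ask the model to fill them in, and count exact matches at the hidden positions. This is a simplified assumption, not a standard recipe: it always substitutes a literal `[MASK]` string (BERT-style pretraining also keeps or randomizes some selected tokens), and the names `mask_tokens` and `masked_accuracy` are hypothetical.

```python
import random

MASK = "[MASK]"  # assumed mask symbol; use your model's own mask token


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace a random subset of tokens with MASK.

    Returns the masked sequence and a dict mapping each masked
    position to the original token at that position.
    """
    rng = random.Random(seed)  # seeded so the masking is reproducible
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(MASK)
            labels[i] = tok
        else:
            masked.append(tok)
    return masked, labels


def masked_accuracy(predictions, labels):
    """Fraction of masked positions where the prediction equals the gold token."""
    if not labels:
        return 0.0
    correct = sum(1 for i, gold in labels.items() if predictions.get(i) == gold)
    return correct / len(labels)


# Toy check with hand-written predictions instead of a real model.
print(masked_accuracy({0: "the", 3: "cat"}, {0: "the", 3: "dog"}))
```

In a real evaluation, `predictions` would hold the model's top-1 token for each masked position.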

Create domain-adapted tokenizers

Build and test subword tokenizers on the raw Wikipedia text before applying them to downstream corpora.
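To make the idea of building a subword tokenizer concrete, here is a minimal pure-Python sketch of learning byte-pair-encoding (BPE) merges from lines of text. It is a toy for illustration only: the function name `learn_bpe_merges`, the `</w>` end-of-word marker, and the tie-breaking rule are assumptions, and a real project would more likely rely on an established tokenizer library.

```python
from collections import Counter


def learn_bpe_merges(lines, num_merges):
    """Learn up to num_merges BPE merges from an iterable of text lines."""
    # Represent each whitespace-separated word as a tuple of symbols
    # ending in an end-of-word marker, with its corpus frequency.
    word_freqs = Counter()
    for line in lines:
        for word in line.split():
            word_freqs[tuple(word) + ("</w>",)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in word_freqs.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair; ties broken by the lexicographically larger pair.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)

        # Rewrite every word with the chosen pair fused into one symbol.
        new_freqs = Counter()
        for symbols, freq in word_freqs.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_freqs[tuple(merged)] += freq
        word_freqs = new_freqs
    return merges


print(learn_bpe_merges(["low low low", "lower"], num_merges=2))
```

The learned merge list can then be applied in order to new text to test how well the vocabulary transfers to another corpus.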

Load wikitext

Python
from datasets import load_dataset

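# A config name selects the variant; "wikitext-103-raw-v1" is the full raw version
ds = load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1")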
  1. pip install datasets
  2. from datasets import load_dataset
  3. ds = load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1")
  4. train_text = ds["train"]["text"]
  5. Tokenize and batch the splits for your training loop
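The batching part of the last step might look like the sketch below. It assumes `train_text` is a list of strings as produced in step 4, and that some rows are blank or whitespace-only and should be skipped. The helper name `iter_batches` is hypothetical, and the tokenizer call is left to whichever library you use.

```python
def iter_batches(lines, batch_size):
    """Yield lists of up to batch_size non-empty, stripped text lines."""
    batch = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank or whitespace-only rows
        batch.append(text)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller batch


# Stand-in for train_text so the sketch runs without downloading anything.
sample_text = ["first line\n", "", " second line \n", "third line"]
for batch in iter_batches(sample_text, batch_size=2):
    print(batch)
```

Each yielded batch can be passed to a tokenizer and then to the training loop.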

wikitext: pros & cons

Pros

  • +Over 100 million tokens from vetted articles
  • +Two size variants (WikiText-2 and WikiText-103), the smaller suited to quick experiments
  • +Clean, deduplicated text with wiki markup removed
  • +CC BY-SA license permits reuse with attribution under share-alike terms

Cons

  • English Wikipedia only
  • No sentence or document boundaries annotated
  • Contains occasional factual or stylistic biases

Frequently asked questions

A large collection of verified Wikipedia articles released in raw text form for language modeling research.
