Is WikiText free to use?

Yes, it is publicly available at no cost via the Hugging Face Hub.

What license applies to WikiText?

Creative Commons Attribution-ShareAlike 3.0 (CC BY-SA 3.0). Reuse requires attribution, and derivative works must be shared under the same terms.

How do I load WikiText in Python?

Use the Hugging Face datasets library with load_dataset('wikitext', 'wikitext-103-raw-v1').

wikitext — Free Dataset Docs, Examples & Alternatives (2026)

What is wikitext?

WikiText is a language modeling dataset drawn from high-quality, vetted Wikipedia articles. It comes in multiple size variants: the largest, WikiText-103, provides more than 100 million tokens, while the smaller WikiText-2 suits quick experiments.

It serves researchers and practitioners building or benchmarking models for text generation and masked language modeling in NLP.

What you can build with wikitext

Train autoregressive language models

Use WikiText-103 to pretrain or fine-tune GPT-style models on next-token prediction with over 100M tokens of clean text.
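As a rough illustration of what next-token prediction needs from the data, the sketch below cuts a flat stream of token ids into fixed-length (input, target) pairs, where each target is the input shifted by one position. The function name, the `block_size` parameter, and the toy ids are assumptions made for this example; real ids would come from whichever tokenizer you use.

```python
from typing import List, Tuple


def make_next_token_examples(
    token_ids: List[int], block_size: int
) -> List[Tuple[List[int], List[int]]]:
    """Cut a flat token stream into (input, target) pairs.

    Each target sequence is its input sequence shifted left by one token.
    """
    examples = []
    # Every window needs block_size inputs plus one extra token for the
    # final target, so stop while a full window still fits.
    for start in range(0, len(token_ids) - block_size, block_size):
        window = token_ids[start : start + block_size + 1]
        examples.append((window[:-1], window[1:]))
    return examples


# Toy stand-in for tokenized WikiText (hypothetical ids).
toy_ids = list(range(10))
pairs = make_next_token_examples(toy_ids, block_size=4)
```

Trailing tokens that do not fill a whole window are dropped here for simplicity; padding them instead is an equally reasonable choice.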

Benchmark fill-mask performance

Evaluate BERT-like models on the WikiText-2 validation split for masked language modeling accuracy.
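One way to picture this evaluation: hide some tokens, ask the model to fill them in, and score only the hidden positions. The sketch below is a simplified version under stated assumptions. The `[MASK]` string, the function names, and the toy predictions are all hypothetical, and real BERT-style masking uses a more involved replacement scheme than this plain substitution.

```python
import random
from typing import List, Tuple

MASK = "[MASK]"  # placeholder mask token; real models define their own


def mask_tokens(
    tokens: List[str], mask_prob: float, seed: int
) -> Tuple[List[str], List[int]]:
    """Replace each token with MASK with probability mask_prob."""
    rng = random.Random(seed)  # seeded so the masking is reproducible
    positions = [i for i in range(len(tokens)) if rng.random() < mask_prob]
    masked = list(tokens)
    for i in positions:
        masked[i] = MASK
    return masked, positions


def masked_accuracy(
    original: List[str], predicted: List[str], positions: List[int]
) -> float:
    """Fraction of masked positions where the prediction matches the original."""
    if not positions:
        return 0.0
    correct = sum(1 for i in positions if predicted[i] == original[i])
    return correct / len(positions)


# Toy example: mask everything, then score made-up "model" predictions.
original = ["the", "cat", "sat", "down"]
masked, positions = mask_tokens(original, mask_prob=1.0, seed=0)
predicted = ["the", "dog", "sat", "down"]  # hypothetical model output
accuracy = masked_accuracy(original, predicted, positions)
```

In a real benchmark, `predicted` would come from the model's top-1 guess at each masked position on the validation split.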

Create domain-adapted tokenizers

Build and test subword tokenizers on the raw Wikipedia text before applying them to downstream corpora.
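To make "subword tokenizer" concrete, here is a toy byte-pair-encoding style merge learner in plain Python. It is a teaching sketch, not the algorithm of any particular library: it works on characters, omits end-of-word markers, and breaks ties lexicographically, all of which are choices made for this example. The function name and the sample lines are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def learn_bpe_merges(lines: List[str], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to num_merges symbol-pair merges from whitespace-split text."""
    # Represent each distinct word as a tuple of symbols with its frequency.
    word_freqs = Counter(word for line in lines for word in line.split())
    words: Dict[Tuple[str, ...], int] = {tuple(w): f for w, f in word_freqs.items()}
    merges: List[Tuple[str, str]] = []

    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts: Counter = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol

        # Most frequent pair wins; ties go to the lexicographically larger pair.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)

        # Rewrite every word with the chosen pair fused into one symbol.
        merged_words: Dict[Tuple[str, ...], int] = {}
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            merged_words[key] = merged_words.get(key, 0) + freq
        words = merged_words

    return merges


# Made-up sample lines standing in for raw WikiText text.
sample_lines = ["low low low lower", "lowest low"]
merges = learn_bpe_merges(sample_lines, num_merges=2)
```

For real work, a maintained tokenizer library trained on the raw text would be the usual route; this sketch only shows the merge idea.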

Load wikitext

Python

from datasets import load_dataset

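# A configuration name is required; wikitext-103-raw-v1 is the variant used elsewhere on this page.
ds = load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1")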

# Install the library first: pip install datasets
from datasets import load_dataset
ds = load_dataset('wikitext', 'wikitext-103-raw-v1')
train_text = ds['train']['text']
# Next: tokenize and batch the splits for your training loop
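Before tokenizing, it can help to group the raw lines back into articles, since the `text` column is a flat list of lines that includes empty strings. The sketch below assumes the raw variant's heading convention, where an article title appears as ` = Title = ` and subsections use spaced markers such as ` = = Section = = `; verify this against the data you actually load. The function name and sample lines are hypothetical.

```python
import re
from typing import List

# Assumption: a top-level title line looks like " = Title = ", while
# subsection headings start with " = = " and must not match.
TITLE_RE = re.compile(r"^ = (?!=)(.+?) = \n?$")


def split_into_articles(lines: List[str]) -> List[dict]:
    """Group raw lines into articles keyed by their top-level title."""
    articles: List[dict] = []
    current = None
    for line in lines:
        match = TITLE_RE.match(line)
        if match:
            # A new title starts a new article.
            current = {"title": match.group(1).strip(), "lines": []}
            articles.append(current)
        elif line.strip() and current is not None:
            # Keep non-empty lines; lines before the first title are dropped.
            current["lines"].append(line.strip())
    return articles


# Made-up lines imitating the assumed raw format.
sample = [
    "",
    " = First Article = \n",
    "",
    " Opening sentence of the first article . \n",
    " = = History = = \n",
    " A sentence in a subsection . \n",
    " = Second Article = \n",
    " Opening sentence of the second article . \n",
]
articles = split_into_articles(sample)
```

With the real dataset, `sample` would be replaced by `ds['train']['text']`, and each article's lines could then be joined and passed to a tokenizer.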

wikitext: pros & cons

Pros

+ Over 100 million tokens from vetted articles
+ Two size variants (WikiText-2 and WikiText-103) for quick experiments
+ Clean, deduplicated text with no markup
+ Open CC BY-SA license (copyleft: attribution and share-alike terms apply)

Cons

– English Wikipedia only
– No sentence or document boundaries annotated
– Contains occasional factual or stylistic biases


Frequently asked questions

What is the WikiText dataset?

A large collection of verified Wikipedia articles released in raw text form for language modeling research.


Similar datasets

Other Text & NLP options worth comparing.

KakologArchives

Text & NLP · KakologArchives

Verified

Archive of 11 years of Nico Nico Jikkyo live commentary logs.

Dataset · ↓ 1.8M · Free

gsm8k

Text & NLP · openai

Verified

8.5K grade school math word problems requiring multi-step arithmetic reasoning.

Dataset · ↓ 901K · Free

c4

Text & NLP · allenai

Verified

Cleaned Common Crawl corpus with multiple language variants for NLP training.

Dataset · ↓ 827K · Free
