
mmlu

Verified

Massive multitask benchmark of multiple-choice questions across 57 subjects.

Dataset · Text & NLP · 507K/mo · Free
Updated 2026-06-15

What is mmlu?

MMLU is a collection of multiple-choice questions spanning 57 tasks in the humanities, social sciences, and sciences. It was introduced to test broad knowledge and reasoning in language models.

The benchmark is used by researchers to measure and compare model performance across many domains at once.

Data preview

A real sample from the dataset — 4 columns.

| question (string) | subject (string) | choices (List) | answer (ClassLabel) |
|---|---|---|---|
| Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q. | abstract_algebra | ["0","4","2","6"] | 1 |
| Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the index of <p> in S_5. | abstract_algebra | ["8","2","24","120"] | 2 |
| Find all zeros in the indicated finite field of the given polynomial with coefficients in that field. x^5 + 3x^3 + x^2 + 2x in Z_5 | abstract_algebra | ["0","1","0,1","0,4"] | 3 |
| Statement 1 \| A factor group of a non-Abelian group is non-Abelian. Statement 2 \| If K is a normal subgroup of H and H is a normal subgroup of G, then K is a normal subgroup of G. | abstract_algebra | ["True, True","False, False","True, False","False, True"] | 1 |
| Find the product of the given polynomials in the given polynomial ring. f(x) = 4x - 5, g(x) = 2x^2 - 4x + 2 in Z_8[x]. | abstract_algebra | ["2x^2 + 5","6x^2 + 4x + 6","0","x^2 + 1"] | 1 |
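
The answer column stores an integer rather than the answer text. A minimal sketch of how a row might be decoded, assuming the integer is a zero-based index into choices (consistent with the first preview row, where 1 selects "4"). The helper name decode_answer and the A-D lettering are illustrative assumptions, not part of the dataset.

```python
# Hypothetical helper: not part of the dataset or the `datasets` library.
LETTERS = "ABCD"

def decode_answer(row):
    """Return (letter, choice_text) for a row's integer answer index."""
    idx = row["answer"]
    if not 0 <= idx < len(row["choices"]):
        raise ValueError(f"answer index {idx} out of range")
    return LETTERS[idx], row["choices"][idx]

# First row of the preview, copied by hand.
sample = {
    "question": "Find the degree for the given field extension "
                "Q(sqrt(2), sqrt(3), sqrt(18)) over Q.",
    "subject": "abstract_algebra",
    "choices": ["0", "4", "2", "6"],
    "answer": 1,
}

print(decode_answer(sample))
```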

Dataset structure

Total rows: 231,400
Columns: 4
Size on disk: 98.8 MB

| Subset | Split | Rows |
|---|---|---|
| abstract_algebra | test | 116 |
| abstract_algebra | validation | 116 |
| abstract_algebra | dev | 116 |
| all | test | 115,700 |
| all | validation | 115,700 |
| all | dev | 115,700 |
| all | auxiliary_train | 115,700 |
| anatomy | test | 154 |
| anatomy | validation | 154 |
| anatomy | dev | 154 |
| astronomy | test | 173 |
| astronomy | validation | 173 |

What you can build with mmlu

LLM Benchmarking

Run standardized evaluations of language models across 57 subjects to measure knowledge breadth in humanities, sciences, and professions.

Zero-shot Performance Testing

Assess models on multiple-choice question answering without additional training, using the built-in test, validation, and dev splits.
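
One way a zero-shot prompt might be assembled from a row is sketched below. The prompt layout (lettered options followed by "Answer:") and the function name format_prompt are assumptions for illustration; the listing does not prescribe a template.

```python
def format_prompt(row):
    """Build a zero-shot multiple-choice prompt from one row (assumed layout)."""
    lines = [row["question"].strip()]
    # Label the options A-D in the order they appear in `choices`.
    for letter, choice in zip("ABCD", row["choices"]):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)

# Second row of the data preview, copied by hand.
sample = {
    "question": "Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the index of <p> in S_5.",
    "subject": "abstract_algebra",
    "choices": ["8", "2", "24", "120"],
    "answer": 2,
}

print(format_prompt(sample))
```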

Subject-specific Analysis

Isolate individual subjects like mathematics or history to diagnose model strengths and weaknesses in targeted domains.
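
A per-subject breakdown can be computed by grouping results on the subject column. The sketch below assumes predictions are already available as integer indices aligned with the rows; accuracy_by_subject is a hypothetical helper, and the rows shown are toy data rather than real dataset entries.

```python
from collections import defaultdict

def accuracy_by_subject(rows, predictions):
    """Map each subject to the fraction of rows predicted correctly."""
    if len(rows) != len(predictions):
        raise ValueError("rows and predictions must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for row, pred in zip(rows, predictions):
        total[row["subject"]] += 1
        if pred == row["answer"]:
            correct[row["subject"]] += 1
    return {subject: correct[subject] / total[subject] for subject in total}

# Toy rows: only the two columns the helper reads.
rows = [
    {"subject": "abstract_algebra", "answer": 1},
    {"subject": "abstract_algebra", "answer": 2},
    {"subject": "anatomy", "answer": 0},
]
predictions = [1, 3, 0]

print(accuracy_by_subject(rows, predictions))
```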

Load mmlu

Python
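pip install datasets

from datasets import load_dataset

# cais/mmlu is organized into named subsets; "all" combines every subject.
dataset = load_dataset("cais/mmlu", "all")

  1. Install the library: pip install datasets
  2. Import the loader: from datasets import load_dataset
  3. Load a subset by name: dataset = load_dataset('cais/mmlu', 'all')
  4. Select a split such as dataset['test'] or dataset['auxiliary_train'], or load a single subject by passing its subset name (for example 'anatomy') in place of 'all'
  5. Parse each example's question, choices, and answer for evaluation scripts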

mmlu: pros & cons

Pros

  • +Broad coverage of 57 subjects
  • +Large scale: 231,400 rows in total
  • +Multiple-choice format simplifies automated scoring
  • +Direct support for question-answering evaluation
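
To illustrate why the multiple-choice format keeps scoring simple, here is a minimal sketch that pulls a letter out of free-text model output and compares it with the gold index. The regular expression is a rough heuristic and the names extract_choice and score are hypothetical; real evaluation harnesses may parse answers differently.

```python
import re

# Matches the first standalone uppercase letter A-D (heuristic, assumed format).
_LETTER = re.compile(r"\b([ABCD])\b")

def extract_choice(text):
    """Return the 0-based index of the first standalone A-D, or None."""
    match = _LETTER.search(text)
    return "ABCD".index(match.group(1)) if match else None

def score(outputs, gold):
    """Fraction of outputs whose extracted choice equals the gold index."""
    if len(outputs) != len(gold):
        raise ValueError("outputs and gold must have the same length")
    hits = sum(extract_choice(o) == g for o, g in zip(outputs, gold))
    return hits / len(gold)

outputs = ["The answer is B.", "(C) 24", "no idea", "D"]
gold = [1, 2, 0, 0]

print(score(outputs, gold))
```

Unparseable outputs count as wrong here, which is one possible design choice; a stricter setup might track them separately.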

Cons

  • -Restricted to multiple-choice questions only
  • -Subject splits must be handled manually
  • -Dataset size varies by subject

Frequently asked questions

A collection of multiple-choice questions spanning 57 academic and professional subjects for evaluating question-answering systems.
