mmlu
Massive multitask benchmark of multiple-choice questions across 57 subjects.
What is mmlu?
MMLU is a collection of multiple-choice questions spanning 57 tasks in the humanities, social sciences, and sciences. It was introduced to test broad knowledge and reasoning in language models.
The benchmark is used by researchers to measure and compare model performance across many domains at once.
Data preview
A real sample from the dataset, showing its 4 columns.
| question (string) | subject (string) | choices (List) | answer (ClassLabel) |
|---|---|---|---|
| Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q. | abstract_algebra | ["0","4","2","6"] | 1 |
| Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the index of <p> in S_5. | abstract_algebra | ["8","2","24","120"] | 2 |
| Find all zeros in the indicated finite field of the given polynomial with coefficients in that field. x^5 + 3x^3 + x^2 + 2x in Z_5 | abstract_algebra | ["0","1","0,1","0,4"] | 3 |
| Statement 1 \| A factor group of a non-Abelian group is non-Abelian. Statement 2 \| If K is a normal subgroup of H and H is a normal subgroup of G, then K is a normal subgroup of G. | abstract_algebra | ["True, True","False, False","True, False","False, True"] | 1 |
| Find the product of the given polynomials in the given polynomial ring. f(x) = 4x - 5, g(x) = 2x^2 - 4x + 2 in Z_8[x]. | abstract_algebra | ["2x^2 + 5","6x^2 + 4x + 6","0","x^2 + 1"] | 1 |
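The `answer` column is an integer index into `choices`. In the first preview row, index 1 selects "4", so the index is 0-based. A minimal sketch of decoding a row, using a plain dict in place of a loaded dataset row (the helper name `decode_answer` is hypothetical and not part of the dataset or the `datasets` library):

```python
# Hypothetical helper: map the integer `answer` field to a letter and the choice text.
LETTERS = "ABCD"

def decode_answer(row):
    """Return (letter, choice_text) for a row with `choices` and `answer` fields."""
    idx = row["answer"]
    return LETTERS[idx], row["choices"][idx]

# First row of the preview table, written out as a plain dict.
row = {
    "question": "Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.",
    "subject": "abstract_algebra",
    "choices": ["0", "4", "2", "6"],
    "answer": 1,
}

letter, text = decode_answer(row)
print(letter, text)  # B 4
```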
Dataset structure
| Subset | Split | Rows in subset (all splits combined) |
|---|---|---|
| abstract_algebra | test | 116 |
| abstract_algebra | validation | 116 |
| abstract_algebra | dev | 116 |
| all | test | 115,700 |
| all | validation | 115,700 |
| all | dev | 115,700 |
| all | auxiliary_train | 115,700 |
| anatomy | test | 154 |
| anatomy | validation | 154 |
| anatomy | dev | 154 |
| astronomy | test | 173 |
| astronomy | validation | 173 |
What you can build with mmlu
LLM Benchmarking
Run standardized evaluations of language models across 57 subjects to measure knowledge breadth in humanities, sciences, and professions.
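Because every question has exactly one correct index, a benchmark score reduces to plain accuracy. The sketch below assumes the model's outputs have already been converted to answer indices. The function name `accuracy` and the prediction values are illustrative only.

```python
def accuracy(predictions, references):
    """Fraction of predicted answer indices that match the reference indices."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Reference answers from the five preview rows; predictions are made up for illustration.
references = [1, 2, 3, 1, 1]
predictions = [1, 2, 0, 1, 3]
print(accuracy(predictions, references))  # 0.6
```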
Zero-shot Performance Testing
Assess models on multiple-choice question answering without additional training, using the built-in test, validation, and dev splits.
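A zero-shot prompt presents the question and lettered options with no solved examples in front of it. The template below is one possible layout, not a format prescribed by the dataset. The name `format_zero_shot` and the exact wording of the instruction line are assumptions.

```python
LETTERS = "ABCD"

def format_zero_shot(question, choices, subject):
    """Build a zero-shot prompt: instruction line, question, lettered options, answer cue."""
    topic = subject.replace("_", " ")
    header = f"The following is a multiple choice question about {topic}."
    options = "\n".join(f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices))
    return f"{header}\n\n{question}\n{options}\nAnswer:"

# Third row of the preview table.
prompt = format_zero_shot(
    "Find all zeros in the indicated finite field of the given polynomial "
    "with coefficients in that field. x^5 + 3x^3 + x^2 + 2x in Z_5",
    ["0", "1", "0,1", "0,4"],
    "abstract_algebra",
)
print(prompt)
```

The prompt ends with `Answer:` so that a model's next token can be compared directly against the letters A to D.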
Subject-specific Analysis
Isolate individual subjects like mathematics or history to diagnose model strengths and weaknesses in targeted domains.
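Since each row carries a `subject` field, a per-subject breakdown can be computed by grouping on it. This is a minimal sketch over toy rows. The function name `accuracy_by_subject` and the sample values are hypothetical.

```python
from collections import defaultdict

def accuracy_by_subject(rows, predictions):
    """Group rows by `subject` and return {subject: accuracy}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for row, pred in zip(rows, predictions):
        total[row["subject"]] += 1
        correct[row["subject"]] += int(pred == row["answer"])
    return {subject: correct[subject] / total[subject] for subject in total}

# Toy rows carrying only the fields the function reads; values are illustrative.
rows = [
    {"subject": "abstract_algebra", "answer": 1},
    {"subject": "abstract_algebra", "answer": 2},
    {"subject": "anatomy", "answer": 0},
    {"subject": "anatomy", "answer": 3},
]
predictions = [1, 0, 0, 3]
print(accuracy_by_subject(rows, predictions))
# {'abstract_algebra': 0.5, 'anatomy': 1.0}
```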
Load mmlu
from datasets import load_dataset
ds = load_dataset("cais/mmlu", "all")
1. pip install datasets
2. from datasets import load_dataset
3. dataset = load_dataset('cais/mmlu', 'all')
4. Select a split such as dataset['test'] or dataset['auxiliary_train']; to load a single subject, pass its name (for example 'abstract_algebra') in place of 'all'
5. Parse each example's question, choices, and answer for evaluation scripts
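An evaluation script also has to turn a model's free-text reply back into an answer index that can be compared with the `answer` field. The sketch below takes the first standalone capital letter A to D. This heuristic is deliberately simple and would misread a reply that begins with the article "A". The name `parse_choice` is hypothetical.

```python
import re

LETTERS = "ABCD"

def parse_choice(response):
    """Return the 0-based index of the first standalone A-D letter, or None if absent."""
    match = re.search(r"\b([A-D])\b", response)
    return LETTERS.index(match.group(1)) if match else None

print(parse_choice("Answer: B"))           # 1
print(parse_choice("The answer is (C)."))  # 2
print(parse_choice("I am not sure"))       # None
```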
mmlu: pros & cons
Pros
- Broad coverage of 57 subjects
- Large scale: 115,700 rows in the all subset
- Multiple-choice format simplifies automated scoring
- Direct support for question-answering evaluation
Cons
- Restricted to multiple-choice questions only
- Subject splits must be handled manually
- Dataset size varies by subject
Frequently asked questions
A collection of multiple-choice questions spanning 57 academic and professional subjects for evaluating question-answering systems.