Challenging benchmark of 12K complex questions for LLM evaluation.
MMLU-Pro is a dataset of 12K complex questions spanning many disciplines, designed to benchmark multi-task language understanding.
It helps researchers and developers evaluate the question-answering capabilities of large language models.
Run models on the 12K questions to measure accuracy across disciplines and compare results to public leaderboards.
Inspect incorrect predictions on complex items to identify weaknesses in reasoning or domain knowledge.
Use the questions as hard negatives or few-shot examples when fine-tuning or prompting newer models.
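The first two use cases above (per-discipline accuracy and inspection of wrong answers) can be sketched as a small scoring helper. This is a minimal illustration, not an official evaluation script: the helper name `score_predictions` is hypothetical, and it assumes each row carries a `category` field and an `answer` field holding the correct option letter, which you should confirm against the dataset card.

```python
from collections import defaultdict


def score_predictions(rows, predictions):
    """Return (accuracy per category, list of missed rows).

    rows: iterable of dicts with 'category' and 'answer' keys
          (assumed field names, check the dataset card).
    predictions: predicted option letters, aligned with rows.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    misses = []
    for row, pred in zip(rows, predictions):
        total[row["category"]] += 1
        if pred == row["answer"]:
            correct[row["category"]] += 1
        else:
            # Keep the row and the wrong prediction for later inspection.
            misses.append({"row": row, "predicted": pred})
    accuracy = {cat: correct[cat] / total[cat] for cat in total}
    return accuracy, misses
```

The returned `misses` list can be filtered by category to see which disciplines produce the most errors.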
from datasets import load_dataset
ds = load_dataset("TIGER-Lab/MMLU-Pro")

A harder multi-task benchmark with 12K complex questions designed to test LLMs more rigorously than the original MMLU.
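For the few-shot prompting use case, one possible way to turn rows into a multiple-choice prompt is sketched below. The helper `format_prompt` is hypothetical, the prompt layout is only one of many reasonable choices, and the field names `question`, `options` (a list of strings), and `answer` (an option letter) are assumptions to verify against the dataset card.

```python
import string


def format_prompt(row, shots=()):
    """Build a multiple-choice prompt, optionally preceded by solved examples.

    row, shots: dicts with 'question', 'options', and 'answer' keys
                (assumed field names, check the dataset card).
    """

    def render(item, with_answer):
        lines = ["Question: " + item["question"]]
        # Label options A, B, C, ... in the order they are stored.
        for letter, option in zip(string.ascii_uppercase, item["options"]):
            lines.append(f"{letter}. {option}")
        lines.append("Answer:" + (" " + item["answer"] if with_answer else ""))
        return "\n".join(lines)

    parts = [render(shot, True) for shot in shots]
    parts.append(render(row, False))
    return "\n\n".join(parts)
```

Passing a handful of rows as `shots` yields a few-shot prompt that ends with the unanswered target question.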