hellaswag
Commonsense NLI dataset for sentence completion benchmarks.
What is hellaswag?
Each HellaSwag item pairs a context with four candidate endings; selecting the correct continuation requires commonsense knowledge.
It is used by researchers building or benchmarking NLP models for commonsense reasoning and natural language inference.
What you can build with hellaswag
Benchmarking language models
Evaluate LLMs on sentence completion tasks requiring everyday commonsense to measure reasoning gaps beyond standard benchmarks.
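A minimal sketch of how such an evaluation could be scored, assuming each example is a dict with ctx, endings, and a label stored as a string index (as in the Hugging Face copy). The Scorer callable and the word_overlap stand-in are hypothetical; in practice the scorer would wrap a model's likelihood of the ending given the context.

```python
from typing import Callable, Mapping, Sequence

# Hypothetical scorer interface: (context, ending) -> score, higher = more plausible.
Scorer = Callable[[str, str], float]


def predict(example: Mapping, score: Scorer) -> int:
    """Return the index of the highest-scoring ending (first one wins ties)."""
    scores = [score(example["ctx"], ending) for ending in example["endings"]]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(examples: Sequence[Mapping], score: Scorer) -> float:
    """Fraction of examples whose predicted ending matches the gold label."""
    if not examples:
        return 0.0
    hits = sum(predict(ex, score) == int(ex["label"]) for ex in examples)
    return hits / len(examples)


def _tokens(text: str) -> set:
    # Lowercase, drop full stops, split on whitespace.
    return set(text.lower().replace(".", "").split())


def word_overlap(ctx: str, ending: str) -> float:
    """Toy stand-in for a model: counts words shared by context and ending."""
    return float(len(_tokens(ctx) & _tokens(ending)))


# Made-up items for illustration only; they are not taken from the dataset.
toy = [
    {"ctx": "A man pours water into a glass.",
     "endings": ["He drinks the water.", "He flies away.",
                 "The dog sings.", "Clouds melt."],
     "label": "0"},
    {"ctx": "She opens the umbrella.",
     "endings": ["Fish climb trees.", "Rain falls on the umbrella.",
                 "He eats soup.", "Cars sleep."],
     "label": "1"},
    {"ctx": "The chef cracks an egg.",
     "endings": ["The egg cracks the chef.", "It drops into the hot pan.",
                 "Birds paint.", "Snow burns."],
     "label": "1"},
]

toy_accuracy = accuracy(toy, word_overlap)
```

The third toy item is built so that the overlap heuristic picks a wrong ending, which is the kind of failure the benchmark is meant to expose.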
Training commonsense NLI models
Fine-tune transformer models on the multiple-choice endings to improve performance in narrative prediction and inference.
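One common way to prepare multiple-choice data for fine-tuning is to repeat the context once per ending, so that a model can score each (context, ending) pair. The sketch below shows only that reshaping step; to_choice_pairs is a hypothetical helper, and the label is assumed to be a string index.

```python
from typing import Dict, Mapping


def to_choice_pairs(example: Mapping) -> Dict:
    """Pair the shared context with every candidate ending."""
    endings = list(example["endings"])
    return {
        "contexts": [example["ctx"]] * len(endings),  # one copy per ending
        "endings": endings,
        "label": int(example["label"]),  # gold ending as an integer index
    }


# Made-up item for illustration only.
item = {
    "ctx": "A woman ties her running shoes.",
    "endings": ["She starts to jog.", "She boils the shoes.",
                "The road sings.", "A fish waves."],
    "label": "0",
}
pairs = to_choice_pairs(item)
```

The resulting pairs can then be tokenized together and grouped back into sets of four before being passed to a multiple-choice model head.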
Adversarial testing of AI systems
Use the dataset's tricky distractors to probe and harden models against superficial pattern matching in text generation.
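One simple probe for superficial cues is a baseline that never reads the context. The sketch below always picks the longest ending; if such a baseline scores well above chance (1 in 4) on a set of items, the endings alone may leak the answer. The helper name is hypothetical, and the heuristic is an illustration rather than a method described in this listing.

```python
from typing import Mapping, Sequence


def longest_ending_accuracy(examples: Sequence[Mapping]) -> float:
    """Accuracy of a context-free baseline that picks the longest ending."""
    if not examples:
        return 0.0
    hits = 0
    for ex in examples:
        lengths = [len(ending) for ending in ex["endings"]]
        guess = lengths.index(max(lengths))  # first longest ending
        hits += int(guess == int(ex["label"]))
    return hits / len(examples)


# Made-up items for illustration only.
probe_items = [
    {"ctx": "A painter finishes a wall.",
     "endings": ["He leaves.", "He carefully rinses the brush in the sink.",
                 "It ends.", "They go."],
     "label": "1"},
    {"ctx": "A singer ends her song.",
     "endings": ["She smiles at the large and noisy crowd nearby.",
                 "She bows.", "It rains.", "He runs."],
     "label": "1"},
]

baseline = longest_ending_accuracy(probe_items)
```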
Load hellaswag
from datasets import load_dataset
ds = load_dataset("Rowan/hellaswag")
1. pip install datasets
2. from datasets import load_dataset
3. dataset = load_dataset('Rowan/hellaswag')
4. Access splits via dataset['train'] or dataset['validation']
5. Process examples with activity_label, ctx, and endings fields
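As a sketch of the last step, the helper below turns one example into a readable prompt using the activity_label, ctx, and endings fields named above. format_example is a hypothetical name, and the commented usage lines assume the datasets library is installed and the dataset can be downloaded.

```python
from typing import Mapping


def format_example(example: Mapping) -> str:
    """Render one item as an activity line, a context line, and numbered endings."""
    lines = [
        f"Activity: {example['activity_label']}",
        f"Context: {example['ctx']}",
    ]
    for index, ending in enumerate(example["endings"]):
        lines.append(f"  ({index}) {ending}")
    return "\n".join(lines)


# Made-up item for illustration only.
sample = {
    "activity_label": "Washing dishes",
    "ctx": "A man fills the sink with water.",
    "endings": ["He scrubs a plate.", "He paints the wall."],
}
rendered = format_example(sample)

# Usage with the real data (requires network access):
# from datasets import load_dataset
# dataset = load_dataset("Rowan/hellaswag")
# print(format_example(dataset["validation"][0]))
```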
hellaswag: pros & cons
Pros
- +Large scale, with roughly 40k training and 10k validation examples
- +Adversarially filtered distractors that fooled the strong models of its time
- +Directly tests real-world commonsense
- +Easy loading via Hugging Face
Cons
- –English only with no multilingual support
- –Some examples contain minor annotation noise
- –Primarily designed for 2019-era model evaluation
Frequently asked questions
hellaswag is a commonsense natural language inference dataset that tests whether models can complete a context with the most plausible ending.