SWE-bench
Benchmark evaluating AI models on real GitHub software issues.
What is SWE-bench?
SWE-bench is an evaluation benchmark that measures how well large language models handle authentic software engineering tasks pulled from GitHub. Models receive a full codebase plus an issue description and must output a code patch that solves the reported problem.
Evaluations run inside Docker containers so that results are reproducible across different setups. Users load the dataset from Hugging Face, generate predictions, then run the harness, which applies each predicted patch and runs the repository's tests to check whether the issue is resolved. A verified subset of 500 tasks and a multimodal version for visual interfaces are also available.
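The "generate predictions" step produces a file that the harness reads. A minimal sketch of writing one is shown here, assuming the JSON Lines layout with the keys instance_id, model_name_or_path, and model_patch; check the key names against the SWE-bench documentation for your installed version. The function name write_predictions and the model name "my-model" are hypothetical.

```python
import json
from pathlib import Path


def write_predictions(predictions, path, model_name="my-model"):
    """Write one JSON object per line, one line per benchmark instance.

    predictions: dict mapping an instance ID to a unified-diff patch string.
    The three keys below are assumed to match what the harness expects.
    """
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for instance_id, patch in predictions.items():
            record = {
                "instance_id": instance_id,
                "model_name_or_path": model_name,  # hypothetical label
                "model_patch": patch,
            }
            f.write(json.dumps(record) + "\n")
    return path
```

Calling write_predictions({"some-instance-id": patch_text}, "preds.jsonl") yields a file with one record per task, which can then be passed to the evaluation harness.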
The project targets researchers and developers building AI coding agents who need standardized, challenging tests drawn from real-world development scenarios rather than synthetic problems.
What you can build with SWE-bench
Model Leaderboards
Compare different LLMs and agents on their ability to fix real issues using the public test splits and official scoring.
Agent Development
Test new agent architectures that interact with codebases by measuring patch success rates on the benchmark tasks.
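As a rough illustration of a patch success rate, the sketch below computes the percentage of tasks resolved. It assumes you already have the set of instance IDs your run resolved and the full list of IDs in the split; the function name resolved_rate is hypothetical and not part of the SWE-bench package. Tasks with no submitted patch are counted as unresolved here, which is an assumption about how you want to score.

```python
def resolved_rate(resolved_ids, all_ids):
    """Return the percentage of benchmark tasks that were resolved.

    resolved_ids: iterable of instance IDs whose patches passed evaluation.
    all_ids: iterable of every instance ID in the evaluated split.
    IDs missing from resolved_ids count as unresolved.
    """
    all_ids = set(all_ids)
    if not all_ids:
        return 0.0
    # Ignore any resolved ID that is not part of the evaluated split.
    resolved = set(resolved_ids) & all_ids
    return 100.0 * len(resolved) / len(all_ids)


print(resolved_rate({"a", "b", "c"}, ["a", "b", "c", "d"]))  # 75.0
```

Using the whole split as the denominator keeps runs comparable even when an agent skips some tasks.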
Reproducible Research
Run controlled experiments with Docker-based evaluation to publish reliable results on software engineering capabilities.
Install SWE-bench
pip install -e .

from datasets import load_dataset
swebench = load_dataset('princeton-nlp/SWE-bench', split='test')

1. Install Docker following the official platform instructions for your OS.
2. Clone the SWE-bench repository from GitHub and navigate into the directory.
3. Run pip install -e . to set up the package in editable mode.
4. Load the dataset using the Hugging Face datasets library in Python.
5. Execute the run_evaluation script with your predictions file and desired instance IDs.
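The final step can be sketched as a command built in Python. This example only assembles the argument list and does not start an evaluation. The module path swebench.harness.run_evaluation and the flags --dataset_name, --predictions_path, --max_workers, --run_id, and --instance_ids are assumptions based on the harness's command-line interface; confirm them with the script's --help output for your installed version. The helper name build_eval_command is hypothetical.

```python
import sys


def build_eval_command(predictions_path, run_id,
                       dataset_name="princeton-nlp/SWE-bench",
                       max_workers=4, instance_ids=None):
    """Assemble (but do not run) an evaluation command as an argument list."""
    cmd = [
        sys.executable, "-m", "swebench.harness.run_evaluation",  # assumed module path
        "--dataset_name", dataset_name,
        "--predictions_path", str(predictions_path),
        "--max_workers", str(max_workers),
        "--run_id", run_id,
    ]
    if instance_ids:
        # Restrict the run to the listed instance IDs (assumed flag).
        cmd += ["--instance_ids", *instance_ids]
    return cmd


print(build_eval_command("preds.jsonl", run_id="demo-run"))
```

Passing the returned list to subprocess.run(cmd, check=True) would launch the evaluation, provided Docker is running and the package is installed.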
SWE-bench: pros & cons
Pros
- Uses real GitHub issues for practical relevance
- Docker containers deliver reproducible evaluations
- Includes verified task subset and multimodal extension
- Cloud evaluation option via Modal reduces local setup burden
Cons
- Requires Docker, which adds setup complexity
- Multimodal test set evaluation remains private
- Full runs can be computationally expensive
Frequently asked questions
To load the dataset, use the Hugging Face datasets library with the princeton-nlp/SWE-bench identifier.