SWE-bench

Verified

Benchmark evaluating AI models on real GitHub software issues.

Autonomous Agents | General-Purpose | 5.2k | Open source
Updated 2026-06-15

What is SWE-bench?

SWE-bench is an evaluation benchmark that measures how well large language models handle authentic software engineering tasks pulled from GitHub. Models receive a full codebase plus an issue description and must output a code patch that solves the reported problem.

Evaluations run inside Docker containers so that results are reproducible across different setups. Users load the dataset from Hugging Face, generate predictions, then run the harness, which applies each patch and scores it by running the repository's tests. A human-validated subset of 500 tasks (SWE-bench Verified) and a multimodal version for visual interfaces are also available.
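The "generate predictions" step produces a file that the harness reads. A minimal sketch of writing one is shown here, assuming the JSON Lines layout with the keys instance_id, model_name_or_path and model_patch; check these key names against the repository's README before relying on them. The helper name write_predictions, the model name and the instance ID are hypothetical placeholders.

```python
import json
import tempfile
from pathlib import Path


def write_predictions(predictions, path, model_name="my-model"):
    """Write one JSON object per line, one line per task instance.

    `predictions` maps an instance ID to a unified-diff patch string.
    The key names below are assumed, not taken from this listing.
    """
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for instance_id, patch in predictions.items():
            record = {
                "instance_id": instance_id,
                "model_name_or_path": model_name,
                "model_patch": patch,
            }
            f.write(json.dumps(record) + "\n")
    return path


# Placeholder instance ID and patch text, for illustration only.
demo_dir = Path(tempfile.mkdtemp())
demo_file = write_predictions(
    {"example__repo-1": "diff --git a/x.py b/x.py\n"},
    demo_dir / "preds.jsonl",
)
print(demo_file.read_text(encoding="utf-8"))
```

Keeping one record per line means a long generation run can append results as they arrive instead of holding everything in memory.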

The project targets researchers and developers building AI coding agents who need standardized, challenging tests drawn from real-world development scenarios rather than synthetic problems.

What you can build with SWE-bench

Model Leaderboards

Compare different LLMs and agents on their ability to fix real issues using the public test splits and official scoring.

Agent Development

Test new agent architectures that interact with codebases by measuring patch success rates on the benchmark tasks.
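As a rough illustration of a patch success rate, the sketch below computes the fraction of instances marked as resolved. The outcomes mapping and the resolve_rate helper are hypothetical; the real harness writes its own report files, whose layout is not described here.

```python
def resolve_rate(outcomes):
    """Return the fraction of instances whose patch resolved the issue.

    `outcomes` maps an instance ID to True (resolved) or False.
    """
    if not outcomes:
        return 0.0
    resolved = sum(1 for ok in outcomes.values() if ok)
    return resolved / len(outcomes)


# Made-up outcomes for four placeholder task IDs.
outcomes = {"task-a": True, "task-b": False, "task-c": True, "task-d": False}
print(f"{resolve_rate(outcomes):.1%}")  # prints 50.0%
```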

Reproducible Research

Run controlled experiments with Docker-based evaluation to publish reliable results on software engineering capabilities.

Install SWE-bench

Install
pip install -e .
Quick start
from datasets import load_dataset
swebench = load_dataset('princeton-nlp/SWE-bench', split='test')
  1. Install Docker following the official platform instructions for your OS.
  2. Clone the SWE-bench repository from GitHub and navigate into the directory.
  3. Run pip install -e . to set up the package in editable mode.
  4. Load the dataset using the Hugging Face datasets library in Python.
  5. Execute the run_evaluation script with your predictions file and desired instance IDs.
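The last step can be sketched as a small helper that assembles the command line without running it. The module path swebench.harness.run_evaluation and the flags --dataset_name, --predictions_path, --max_workers, --run_id and --instance_ids are assumptions based on the harness's usual invocation and should be verified against the repository's README; build_eval_command, the run ID and the instance ID are placeholders.

```python
import shlex
import sys


def build_eval_command(predictions_path, run_id, instance_ids=None,
                       dataset_name="princeton-nlp/SWE-bench", max_workers=4):
    """Build (but do not execute) an evaluation command as an argument list.

    Flag names are assumed; confirm them in the SWE-bench README.
    """
    cmd = [
        sys.executable, "-m", "swebench.harness.run_evaluation",
        "--dataset_name", dataset_name,
        "--predictions_path", str(predictions_path),
        "--max_workers", str(max_workers),
        "--run_id", run_id,
    ]
    if instance_ids:
        # Restrict the run to the listed task instances.
        cmd += ["--instance_ids", *instance_ids]
    return cmd


cmd = build_eval_command("preds.jsonl", "demo-run",
                         instance_ids=["sympy__sympy-20590"])
print(shlex.join(cmd))
```

Passing the resulting list to subprocess.run(cmd, check=True) would start the evaluation once Docker is running and the package is installed.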

SWE-bench: pros & cons

Pros

  • Uses real GitHub issues for practical relevance
  • Docker containers deliver reproducible evaluations
  • Includes verified task subset and multimodal extension
  • Cloud evaluation option via Modal reduces local setup burden

Cons

  • Requires Docker, which adds setup complexity
  • Multimodal test set evaluation remains private
  • Full runs can be computationally expensive

Frequently asked questions

To load the dataset, use the Hugging Face datasets library with the princeton-nlp/SWE-bench identifier.
