SWE-bench_Verified

Verified

Human-validated subset of 500 GitHub issue resolution tasks from SWE-bench.

Dataset | AI & Machine Learning | 906K/mo | Free
Updated 2026-06-15

What is SWE-bench_Verified?

SWE-bench Verified contains 500 human-validated test instances drawn from the SWE-bench dataset. Each instance pairs a GitHub issue from a Python repository with the pull request that resolved it, and the task is to resolve the issue automatically.

It is useful for researchers and developers evaluating AI systems on real-world software engineering tasks, particularly those involving code changes verified by unit tests.

What you can build with SWE-bench_Verified

Benchmark LLM-based code repair agents

Run models on the 500 validated issue-PR pairs to measure how often generated patches pass the post-PR unit tests.
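The headline number for such a run is a pass fraction. A minimal sketch of that calculation follows, assuming you already have a per-instance pass/fail result from running the tests; the function name `resolved_rate` and the instance IDs are hypothetical and not part of the dataset or any official tooling.

```python
from typing import Mapping


def resolved_rate(outcomes: Mapping[str, bool], total_instances: int = 500) -> float:
    """Fraction of benchmark instances whose generated patch passed the tests.

    `outcomes` maps an instance ID to True when the patch passed.
    Instances missing from `outcomes` (no patch produced) count as failures,
    so the denominator is the benchmark size rather than len(outcomes).
    """
    if total_instances <= 0:
        raise ValueError("total_instances must be positive")
    if len(outcomes) > total_instances:
        raise ValueError("more outcomes than instances")
    resolved = sum(1 for passed in outcomes.values() if passed)
    return resolved / total_instances


# Made-up outcomes for illustration only; real IDs come from the dataset.
example = {"repo__proj-1": True, "repo__proj-2": False, "repo__proj-3": True}
print(f"{resolved_rate(example, total_instances=4):.2%}")  # prints 50.00%
```

Dividing by the full benchmark size rather than by the number of attempted instances keeps scores comparable between systems that skip hard instances and systems that attempt all of them.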

Compare agent performance on real GitHub issues

Use the dataset to evaluate different systems on their ability to resolve bugs from popular Python repositories with objective test verification.

Develop and test issue-resolution pipelines

Feed issue descriptions into retrieval or generation pipelines and score the outputs against the tests that each sample supplies through its 'test_patch' field.
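One way such a pipeline might collect its outputs is sketched below. It assumes each sample carries an `instance_id` field alongside `problem_statement`, and that predictions are stored as JSON Lines records with the keys `instance_id`, `model_name_or_path`, and `model_patch`, which is the layout the SWE-bench evaluation tooling is understood to read. Treat both as assumptions to check against the harness you use. `generate_patch` is a hypothetical stand-in for your own model or agent.

```python
import json
from typing import Callable, Dict, Iterable, List, Mapping


def build_predictions(
    instances: Iterable[Mapping[str, str]],
    generate_patch: Callable[[str], str],
    model_name: str = "my-model",  # hypothetical label for the system under test
) -> List[Dict[str, str]]:
    """Run a patch generator over each issue and collect one record per instance."""
    predictions = []
    for inst in instances:
        # Only the issue text is shown to the generator, never the gold patch.
        patch = generate_patch(inst["problem_statement"])
        predictions.append(
            {
                "instance_id": inst["instance_id"],
                "model_name_or_path": model_name,
                "model_patch": patch,
            }
        )
    return predictions


def to_jsonl(predictions: Iterable[Mapping[str, str]]) -> str:
    """Serialize predictions as JSON Lines, one record per line."""
    return "\n".join(json.dumps(dict(p)) for p in predictions)


def stub_generator(problem_statement: str) -> str:
    """Placeholder generator that returns an empty diff header; replace with a real model."""
    return "diff --git a/placeholder.py b/placeholder.py\n"


# Made-up rows standing in for dataset samples.
fake_rows = [
    {"instance_id": "repo__proj-1", "problem_statement": "Crash on empty input"},
    {"instance_id": "repo__proj-2", "problem_statement": "Wrong default value"},
]
print(to_jsonl(build_predictions(fake_rows, stub_generator)))
```

Keeping generation separate from serialization makes it easy to swap in a different generator without touching the output format.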

Load SWE-bench_Verified

Python
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified")
  1. pip install datasets
  2. from datasets import load_dataset
  3. ds = load_dataset('princeton-nlp/SWE-bench_Verified')
  4. Load the 'test' split to access the 500 samples
  5. Use fields such as 'problem_statement', 'patch', and 'test_patch' for evaluation
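The steps above can be sketched as a small script. `load_test_split` wraps the `load_dataset` call with `split="test"`, and `summarize_instance` reads the three fields named in the last step. The helper names are hypothetical, and counting `diff --git` headers assumes the 'patch' and 'test_patch' fields hold git-style unified diffs.

```python
from typing import Any, Dict, Mapping


def count_changed_files(diff_text: str) -> int:
    """Count files touched by a diff, assuming git-style 'diff --git' headers."""
    return sum(1 for line in diff_text.splitlines() if line.startswith("diff --git "))


def summarize_instance(row: Mapping[str, Any]) -> Dict[str, int]:
    """Report rough size indicators for one sample."""
    return {
        "issue_chars": len(row["problem_statement"]),
        "patch_files": count_changed_files(row["patch"]),
        "test_patch_files": count_changed_files(row["test_patch"]),
    }


def load_test_split():
    """Download the dataset and return its 'test' split (needs network access)."""
    # Imported lazily so the helpers above work without the library installed.
    from datasets import load_dataset

    return load_dataset("princeton-nlp/SWE-bench_Verified", split="test")


# Made-up row with the same three fields, so the helpers can be tried offline.
sample_row = {
    "problem_statement": "Function fails on empty list",
    "patch": "diff --git a/pkg/core.py b/pkg/core.py\n--- a/pkg/core.py\n+++ b/pkg/core.py\n",
    "test_patch": "diff --git a/tests/test_core.py b/tests/test_core.py\n",
}
print(summarize_instance(sample_row))
```

With network access, `summarize_instance(load_test_split()[0])` would apply the same summary to the first real sample.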

SWE-bench_Verified: pros & cons

Pros

  • Human-validated subset reduces noise
  • Objective unit-test scoring
  • Real issues from popular Python repos
  • Directly compatible with Hugging Face datasets

Cons

  • Only 500 examples total
  • Python-only repositories
  • Requires full repo checkout and test execution for full evaluation

Frequently asked questions

A curated 500-sample subset of SWE-bench with human-validated Issue-PR pairs from Python repositories, scored via unit tests.
