SWE-bench_Verified
Human-validated subset of 500 GitHub issue resolution tasks from SWE-bench.
What is SWE-bench_Verified?
SWE-bench Verified contains 500 human-validated test instances drawn from the SWE-bench dataset. Each instance pairs a GitHub issue from a Python repository with the pull request that resolved it, and the task is to resolve the issue automatically.
It is useful for researchers and developers evaluating AI systems on real-world software engineering tasks, particularly those involving code changes verified by unit tests.
What you can build with SWE-bench_Verified
Benchmark LLM-based code repair agents
Run models on the 500 validated issue-PR pairs to measure how often generated patches pass the post-PR unit tests.
Compare agent performance on real GitHub issues
Use the dataset to evaluate different systems on their ability to resolve bugs from popular Python repositories with objective test verification.
Develop and test issue-resolution pipelines
Feed issue descriptions into retrieval or generation pipelines and score outputs against the verified test suites included in each sample.
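As a rough illustration of the scoring idea described above, the sketch below computes a resolve rate from per-instance test results. It assumes each sample carries 'instance_id', 'FAIL_TO_PASS', and 'PASS_TO_PASS' fields, with the test lists stored either as JSON-encoded strings or as plain lists. These field names are not mentioned in this listing, so check them against the dataset before relying on them. The `results` mapping is hypothetical and stands in for whatever your own test runner reports.

```python
import json


def parse_test_list(raw):
    """Return a list of test names from a JSON-encoded string or a list."""
    if isinstance(raw, str):
        return json.loads(raw)
    return list(raw)


def is_resolved(sample, passed_tests):
    """Treat an instance as resolved only if every listed test passed."""
    # Assumed field names: FAIL_TO_PASS (tests the fix should turn green)
    # and PASS_TO_PASS (tests that must stay green).
    required = parse_test_list(sample["FAIL_TO_PASS"]) + parse_test_list(
        sample["PASS_TO_PASS"]
    )
    return all(name in passed_tests for name in required)


def resolve_rate(samples, results):
    """Fraction of samples resolved.

    `results` maps instance_id -> set of test names that passed after the
    generated patch was applied (a hypothetical structure for this sketch).
    """
    samples = list(samples)
    if not samples:
        return 0.0
    resolved = sum(
        is_resolved(s, results.get(s["instance_id"], set())) for s in samples
    )
    return resolved / len(samples)
```

An instance with no entry in `results` is counted as unresolved, which keeps missing or crashed runs from inflating the score.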
Load SWE-bench_Verified
1. Install the library: pip install datasets
2. Import the loader: from datasets import load_dataset
3. Load the dataset: ds = load_dataset("princeton-nlp/SWE-bench_Verified")
4. Select the 'test' split to access the 500 samples.
5. Use fields such as 'problem_statement', 'patch', and 'test_patch' for evaluation.
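The loading steps above can be sketched as a small script. The helper names `load_verified` and `summarize` are made up for this example. Only `load_dataset`, the dataset ID, the 'test' split, and the three field names come from the listing itself.

```python
def load_verified(split="test"):
    """Download (or reuse the cached copy of) the requested split."""
    # Imported lazily so that summarize() below works without the library.
    from datasets import load_dataset

    return load_dataset("princeton-nlp/SWE-bench_Verified", split=split)


def summarize(sample, width=80):
    """Return a short issue preview plus the line counts of both patches."""
    lines = sample["problem_statement"].strip().splitlines()
    first_line = lines[0] if lines else ""
    return {
        "issue_preview": first_line[:width],
        "patch_lines": len(sample["patch"].splitlines()),
        "test_patch_lines": len(sample["test_patch"].splitlines()),
    }
```

With network access and the library installed, something like `summarize(load_verified()[0])` should give a quick look at the first sample before you wire the data into an evaluation loop.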
SWE-bench_Verified: pros & cons
Pros
- Human-validated subset reduces noise
- Objective unit-test scoring
- Real issues from popular Python repos
- Directly compatible with Hugging Face datasets
Cons
- Only 500 examples total
- Python-only repositories
- Requires full repo checkout and test execution for full evaluation
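To give a sense of what the checkout step in the last point involves, here is a minimal sketch that builds (but does not run) the git commands for one sample. It assumes each sample has 'repo' (an "owner/name" GitHub slug) and 'base_commit' fields. Neither is named in this listing, so treat both as assumptions. The function name `checkout_plan` is hypothetical, and setting up the Python environment and running the tests is left out entirely.

```python
import os


def checkout_plan(sample, workdir, patch_file):
    """Return git commands (as argv lists) to prepare one sample for testing."""
    # Assumed fields: 'repo' is an "owner/name" slug, 'base_commit' a commit SHA.
    repo_url = f"https://github.com/{sample['repo']}.git"
    # `git -C` changes directory first, so the patch path is made absolute.
    patch_path = os.path.abspath(patch_file)
    return [
        ["git", "clone", repo_url, workdir],
        ["git", "-C", workdir, "checkout", sample["base_commit"]],
        ["git", "-C", workdir, "apply", patch_path],
    ]
```

Each argv list could be passed to `subprocess.run(cmd, check=True)` in order. Returning the commands instead of running them keeps the sketch easy to inspect and to adapt, for example to containerized runs.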
Frequently asked questions
A curated 500-sample subset of SWE-bench with human-validated Issue-PR pairs from Python repositories, scored via unit tests.