LLM-Agent-Benchmark-List
Curated collection of benchmarks for LLM agents and tool use.
What is LLM-Agent-Benchmark-List?
LLM-Agent-Benchmark-List is a community-maintained directory that catalogs benchmarks designed to evaluate large language models when used as agents. It groups resources into categories such as surveys, tool-use evaluations, reasoning tasks, knowledge integration, and graph-based assessments.
Users browse the compiled list to discover relevant papers, GitHub repositories, and datasets without searching across scattered sources. Each entry includes publication dates, authors, arXiv links, and project pages to support quick access and citation.
The resource primarily serves AI researchers, benchmark creators, and engineers who need standardized ways to measure agent capabilities before deploying models in real applications.
Capabilities
What you can do with LLM-Agent-Benchmark-List
Selecting evaluation suites
Researchers scan the categorized lists to identify suitable benchmarks for testing new agent frameworks on tool calling or multi-step reasoning.
Tracking recent progress
Developers review the latest survey papers and benchmark releases to stay current on evaluation methods in the fast-moving LLM field.
Contributing new entries
Contributors submit pull requests to add overlooked benchmarks, keeping the collection comprehensive for the broader community.
Get started with LLM-Agent-Benchmark-List
1. Visit the GitHub repository page for LLM-Agent-Benchmark-List.
2. Review the README sections organized by Survey, ToolUse, Reasoning, Knowledge, and Graph.
3. Click any linked paper or project page to access the original benchmark materials.
4. Fork the repo and open a pull request to suggest additions or corrections.
5. Watch the repository to receive notifications about future updates; starring only bookmarks it.
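For readers who would rather scan the categories programmatically than by eye, the sketch below groups Markdown links under the heading they appear beneath. It is a minimal illustration, not the repository's own tooling: the sample text, the `example.org` URLs, and the function name `links_by_section` are all hypothetical, and it assumes the README uses `#`-style headings and inline `[title](url)` links, which the real file may not.

```python
import re
from collections import defaultdict

# Hypothetical README excerpt; the real file's layout and entries may differ.
SAMPLE_README = """\
## Survey
- [Example Agent Survey](https://example.org/survey) (2024)

## ToolUse
- [Example Tool Benchmark](https://example.org/tool-paper), [code](https://example.org/tool-code)

## Reasoning
- [Example Reasoning Benchmark](https://example.org/reasoning)
"""

HEADING = re.compile(r"^#{1,6}\s+(.*\S)\s*$")   # Markdown ATX heading
LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")  # inline [title](url) link


def links_by_section(markdown_text):
    """Group (title, url) pairs under the most recent Markdown heading."""
    sections = defaultdict(list)
    current = None
    for line in markdown_text.splitlines():
        heading = HEADING.match(line)
        if heading:
            current = heading.group(1)
            continue
        if current is None:
            continue  # ignore links that appear before the first heading
        sections[current].extend(LINK.findall(line))
    return dict(sections)


grouped = links_by_section(SAMPLE_README)
for section, links in grouped.items():
    print(f"{section}: {len(links)} link(s)")
```

To try this on the actual list, save the README locally and pass its text to `links_by_section`; the heading and link patterns may need adjusting if the file uses tables or HTML instead.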
LLM-Agent-Benchmark-List: pros & cons
Pros
- Organizes scattered benchmarks into clear topical categories
- Includes direct links to papers and code for fast follow-up
- Actively maintained with community contributions welcomed
- Covers multiple evaluation dimensions relevant to agents
Cons
- Provides only links rather than runnable benchmark code
- Quality and coverage depend on external submissions
- No built-in tooling for running or comparing results
Frequently asked questions
It is a curated reading list of benchmarks with links, not executable software.