How often is the list updated?

The maintainers describe the list as continuously updated and accept contributions via pull requests.

Can I add a new benchmark I found?

Yes, the project explicitly invites additions through GitHub pull requests or issues.

LLM-Agent-Benchmark-List — Autonomous Agents Review, Install & Alternatives (2026)

What is LLM-Agent-Benchmark-List?

LLM-Agent-Benchmark-List is a community-maintained directory that catalogs benchmarks designed to evaluate large language models when used as agents. It groups resources into categories such as surveys, tool-use evaluations, reasoning tasks, knowledge integration, and graph-based assessments.

Users browse the compiled list to discover relevant papers, GitHub repositories, and datasets without searching across scattered sources. Each entry includes publication dates, authors, arXiv links, and project pages to support quick access and citation.
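As a rough illustration of the entry fields described above, here is a minimal Python sketch of how one entry could be modeled for local note-taking or citation tracking. The `BenchmarkEntry` class, its field names, and the sample values are hypothetical and are not taken from the repository; only the general shape of arXiv abstract URLs is assumed.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Matches new-style arXiv identifiers (e.g. 0000.00000) in abs/ or pdf/ URLs.
ARXIV_ID = re.compile(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})")


@dataclass
class BenchmarkEntry:
    """Hypothetical record for one list entry; field names are illustrative."""
    title: str
    published: str  # publication date, kept as written in the list
    authors: List[str] = field(default_factory=list)
    arxiv_url: Optional[str] = None
    project_url: Optional[str] = None

    def arxiv_id(self) -> Optional[str]:
        """Return the arXiv identifier from arxiv_url, or None if absent."""
        if not self.arxiv_url:
            return None
        match = ARXIV_ID.search(self.arxiv_url)
        return match.group(1) if match else None


# Placeholder values only; this is not a real benchmark or paper.
example = BenchmarkEntry(
    title="Example Agent Benchmark",
    published="2024-01",
    authors=["A. Author", "B. Author"],
    arxiv_url="https://arxiv.org/abs/0000.00000",
    project_url="https://example.org/benchmark",
)
```

Keeping the date as a plain string avoids guessing at a date format the list may not use consistently.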

The resource primarily serves AI researchers, benchmark creators, and engineers who need standardized ways to measure agent capabilities before deploying models in real applications.

Capabilities

Curate LLM benchmarks

Support model evaluation

List agent tests

Provide evaluation resources

What you can build with LLM-Agent-Benchmark-List

Selecting evaluation suites

Researchers scan the categorized lists to identify suitable benchmarks for testing new agent frameworks on tool calling or multi-step reasoning.

Tracking recent progress

Developers review the latest survey papers and benchmark releases to stay current on evaluation methods in the fast-moving LLM field.

Contributing new entries

Contributors submit pull requests to add overlooked benchmarks, keeping the collection comprehensive for the broader community.

Install LLM-Agent-Benchmark-List

1. Visit the GitHub repository page for LLM-Agent-Benchmark-List.
2. Review the README sections organized by Survey, ToolUse, Reasoning, Knowledge, and Graph.
3. Click any linked paper or project page to access the original benchmark materials.
4. Fork the repo and open a pull request to suggest additions or corrections.
5. Watch the repository to receive notifications about future updates; starring it only bookmarks the repo.
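The browsing steps above can also be approximated offline once the README has been saved locally. The following is a minimal Python sketch, assuming the README is Markdown with `#`-style headings for each category and `[title](url)` links for entries. That layout, the `group_links_by_heading` helper name, and the sample text are illustrative assumptions, not the repository's confirmed format.

```python
import re
from collections import defaultdict

# Assumed layout: "#"-style headings, with entries written as [title](url).
HEADING = re.compile(r"^#{1,6}\s+(.*\S)\s*$")
LINK = re.compile(r"\[([^\]]+)\]\((https?://[^)\s]+)\)")


def group_links_by_heading(markdown_text):
    """Return {heading: [(title, url), ...]} for a Markdown document."""
    groups = defaultdict(list)
    current = "(no heading)"  # bucket for links that appear before any heading
    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            current = match.group(1)
            continue
        groups[current].extend(LINK.findall(line))
    return dict(groups)


# Made-up sample text; the real README content will differ.
SAMPLE = """\
# ToolUse
- [Example Tool Benchmark](https://example.org/tool-bench)
# Reasoning
- [Example Reasoning Benchmark](https://example.org/reasoning-bench)
"""

if __name__ == "__main__":
    for heading, links in group_links_by_heading(SAMPLE).items():
        print(heading, len(links))
```

If the README uses a different structure (for example tables or HTML), the two regular expressions would need to be adapted.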

LLM-Agent-Benchmark-List: pros & cons

Pros

+ Organizes scattered benchmarks into clear topical categories
+ Includes direct links to papers and code for fast follow-up
+ Actively maintained with community contributions welcomed
+ Covers multiple evaluation dimensions relevant to agents

Cons

– Provides only links rather than runnable benchmark code
– Quality and coverage depend on external submissions
– No built-in tooling for running or comparing results

Frequently asked questions

Is this a runnable tool or just a reading list?

It is a curated reading list of benchmarks with links, not executable software.

Similar agents

Other research options worth comparing.

generativeagents

Agent · Research

Verified

Simulate believable human behaviors using generative agents in a virtual town.

21.5k · Open source

A Survey on Large Language Model based Autonomous Agents

Agent · Research

Verified

In-depth survey mapping the architecture of LLM-driven autonomous agents.

2.9k · Open source

Awesome-AgenticLLM-RL-Papers

Agent · Research

Verified

Official repo for a survey on agentic RL methods for LLMs.

1.8k · Open source
