AgentBench

Verified

Benchmark LLMs thoroughly in agent roles and tasks.

Autonomous Agents · General-Purpose · 3.5k · Open source
Updated 2026-06-15

What is AgentBench?

AgentBench is an open-source benchmark created to measure how effectively LLMs function when deployed as agents in defined scenarios.

It runs evaluations through standardized tasks that assess planning, interaction, and outcome achievement using consistent scoring methods.

The benchmark targets researchers and developers who need reliable data to compare and improve agent-oriented language models.

What you can build with AgentBench

Model Comparison

Run side-by-side tests to rank multiple LLMs on identical agent benchmarks.
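A minimal sketch of such a ranking, assuming per-task scores have already been collected into a plain dictionary. The model names, task scores, and the `rank_models` helper are made-up placeholders, not AgentBench output or its result format, and the unweighted mean is an illustrative choice; the benchmark's own aggregation may differ.

```python
from statistics import mean

# Hypothetical scores: model -> task -> score in [0, 1].
# Placeholders only; not real AgentBench results.
scores = {
    "model-a": {"os_interaction": 0.40, "dbbench": 0.60},
    "model-b": {"os_interaction": 0.50, "dbbench": 0.30},
    "model-c": {"os_interaction": 0.20, "dbbench": 0.90},
}

def rank_models(scores):
    """Return (model, mean score) pairs, best first.

    Only tasks that every model has a score for are averaged, so the
    ranking compares identical task sets.
    """
    common = set.intersection(*(set(tasks) for tasks in scores.values()))
    if not common:
        raise ValueError("no task is shared by all models")
    averages = {
        model: mean(tasks[t] for t in common)
        for model, tasks in scores.items()
    }
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

ranking = rank_models(scores)
for position, (model, avg) in enumerate(ranking, start=1):
    print(f"{position}. {model}: {avg:.2f}")
```

Restricting the average to shared tasks keeps a model from ranking higher simply because it skipped a hard task.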

Capability Tracking

Measure specific agent skills such as task completion and error recovery over time.
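One way to track these two skills across runs is sketched below. The `Episode` record, its field names, and the definition of error recovery (episodes that hit an error but still completed) are assumptions made for illustration, not part of AgentBench.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    """One hypothetical agent run; field names are illustrative."""
    run_label: str    # e.g. a date or checkpoint name
    completed: bool   # did the agent finish the task?
    hit_error: bool   # did any step fail along the way?

def summarize(episodes):
    """Per run label: task completion rate and error recovery rate.

    Recovery rate is the share of errored episodes that still completed;
    it is None when no episode in the run hit an error.
    """
    by_run = {}
    for ep in episodes:
        by_run.setdefault(ep.run_label, []).append(ep)
    summary = {}
    for label, eps in by_run.items():
        errored = [ep for ep in eps if ep.hit_error]
        summary[label] = {
            "completion_rate": sum(ep.completed for ep in eps) / len(eps),
            "recovery_rate": (
                sum(ep.completed for ep in errored) / len(errored)
                if errored else None
            ),
        }
    return summary

episodes = [
    Episode("checkpoint-1", completed=True, hit_error=False),
    Episode("checkpoint-1", completed=False, hit_error=True),
    Episode("checkpoint-2", completed=True, hit_error=True),
    Episode("checkpoint-2", completed=True, hit_error=False),
]
history = summarize(episodes)
for label, stats in history.items():
    print(label, stats)
```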

Baseline Establishment

Create reference scores that new agent designs can be measured against.
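As a rough sketch of measuring a new design against stored reference scores: the task scores, the `compare_to_baseline` helper, and the 0.02 tolerance below are all hypothetical values chosen for illustration.

```python
# Hypothetical reference scores per task; placeholders, not published results.
baseline = {"os_interaction": 0.40, "dbbench": 0.50}
candidate = {"os_interaction": 0.45, "dbbench": 0.35}

def compare_to_baseline(baseline, candidate, tolerance=0.02):
    """Label each baseline task as improved, regressed, unchanged, or missing."""
    report = {}
    for task, ref in baseline.items():
        if task not in candidate:
            report[task] = ("missing", None)
            continue
        delta = candidate[task] - ref
        if delta > tolerance:
            verdict = "improved"
        elif delta < -tolerance:
            verdict = "regressed"
        else:
            verdict = "unchanged"
        report[task] = (verdict, round(delta, 4))
    return report

report = compare_to_baseline(baseline, candidate)
for task, (verdict, delta) in report.items():
    print(f"{task}: {verdict} ({delta})")
```

The tolerance band keeps small run-to-run noise from being reported as a change; a suitable width depends on how variable the scores are in practice.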

Install AgentBench

Quick start
# Run these commands from the root of the cloned repository (the paths are relative).

# dbbench: pull the MySQL 8 image
docker pull mysql:8

# os_interaction: build the three local images
docker build -t local-os/default -f ./data/os_interaction/res/dockerfiles/default data/os_interaction/res/dockerfiles
docker build -t local-os/packages -f ./data/os_interaction/res/dockerfiles/packages data/os_interaction/res/dockerfiles
docker build -t local-os/ubuntu -f ./data/os_interaction/res/dockerfiles/ubuntu data/os_interaction/res/dockerfiles
  1. Obtain the source code from the public repository.
  2. Set up the Python environment and install listed packages.
  3. Select or configure the target LLMs and task suites.
  4. Launch the benchmark runner with chosen parameters.
  5. Review output logs and performance summaries.
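Before launching a run, it can help to confirm that the Docker images from the quick-start commands are present locally. The sketch below is not part of AgentBench; it shells out to the standard `docker image inspect` command, and the helper names `image_present` and `missing_images` are illustrative.

```python
import subprocess

# Image names taken from the quick-start commands above.
REQUIRED_IMAGES = [
    "mysql:8",
    "local-os/default",
    "local-os/packages",
    "local-os/ubuntu",
]

def image_present(name):
    """True if `docker image inspect` succeeds for the given image name."""
    try:
        result = subprocess.run(
            ["docker", "image", "inspect", name],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except FileNotFoundError:
        # The docker CLI itself is not installed or not on PATH.
        return False
    return result.returncode == 0

def missing_images(images=REQUIRED_IMAGES, check=image_present):
    """Return the images for which the check fails, preserving order."""
    return [name for name in images if not check(name)]

if __name__ == "__main__":
    missing = missing_images()
    if missing:
        print("Missing images:", ", ".join(missing))
    else:
        print("All required images are available.")
```

The `check` parameter exists so the lookup can be swapped out, for example to exercise the logic on a machine without Docker.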

AgentBench: pros & cons

Pros

  • Broad collection of agent-focused evaluation scenarios
  • Fully open-source with transparent methodology
  • Enables reproducible comparisons across models
  • Provides quantitative metrics for agent behaviors

Cons

  • Runs only in simulated rather than live environments
  • Demands notable compute for large-scale testing
  • Scope limited to predefined task types

Frequently asked questions

AgentBench measures LLM performance specifically when models act as agents.
