ClawBench

Verified

Open benchmark for testing AI agents on computer-use tasks.

Autonomous Agents | Web & Browser | 393 | Open source
Updated 2026-06-16

What is ClawBench?

ClawBench is an open-source benchmark designed to assess how well AI agents handle computer interaction tasks. It supplies curated datasets hosted on Hugging Face along with an evaluation library that scores agent traces against defined criteria.
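As a rough illustration of what scoring a trace against defined criteria can look like, here is a minimal sketch. It does not use the clawbench-eval API: the trace layout, the `Criterion` class, and `score_trace` are all hypothetical names chosen for this example, and the real library may structure things differently.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical trace format: a list of step dictionaries.
# ClawBench's actual trace schema may differ.
Trace = list[dict]


@dataclass
class Criterion:
    """A named check that a trace either satisfies or does not."""
    name: str
    check: Callable[[Trace], bool]


def score_trace(trace: Trace, criteria: list[Criterion]) -> dict:
    """Evaluate every criterion and return per-criterion results plus the pass fraction."""
    results = {c.name: c.check(trace) for c in criteria}
    passed = sum(results.values())
    score = passed / len(criteria) if criteria else 0.0
    return {"results": results, "score": score}


# Illustrative trace and criteria (made up for this sketch).
trace = [
    {"action": "navigate", "url": "https://example.com/form"},
    {"action": "type", "field": "email", "value": "a@example.com"},
    {"action": "click", "target": "submit"},
]

criteria = [
    Criterion("visited_form", lambda t: any(s.get("url", "").endswith("/form") for s in t)),
    Criterion("filled_email", lambda t: any(s.get("field") == "email" for s in t)),
    Criterion("uploaded_file", lambda t: any(s["action"] == "upload" for s in t)),
]

report = score_trace(trace, criteria)
print(report["results"])
print(report["score"])
```

Here the trace meets two of the three criteria, so the score is the fraction 2/3. Keeping criteria as independent predicates makes it easy to see which requirement an agent missed, not only the aggregate number.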

Users run agents through the provided harnesses, collect execution traces, and compare results using the shared scoring pipeline. The framework supports both fixed-model and fixed-harness experiments to isolate performance factors.
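The difference between a fixed-model and a fixed-harness experiment can be sketched as a grouping operation over run results: hold one factor constant and compare success rates across the other. The record fields, model names, and harness names below are assumptions made for illustration, not ClawBench output.

```python
from collections import defaultdict

# Hypothetical run records; field names and values are invented for this sketch.
runs = [
    {"model": "model-a", "harness": "harness-x", "task": "t1", "success": True},
    {"model": "model-a", "harness": "harness-x", "task": "t2", "success": False},
    {"model": "model-a", "harness": "harness-y", "task": "t1", "success": True},
    {"model": "model-a", "harness": "harness-y", "task": "t2", "success": True},
    {"model": "model-b", "harness": "harness-x", "task": "t1", "success": False},
    {"model": "model-b", "harness": "harness-x", "task": "t2", "success": False},
]


def success_rate_by(runs, vary, fixed_key, fixed_value):
    """Success rate per value of `vary`, using only runs where fixed_key == fixed_value."""
    totals = defaultdict(lambda: [0, 0])  # value -> [successes, attempts]
    for run in runs:
        if run[fixed_key] != fixed_value:
            continue
        totals[run[vary]][0] += int(run["success"])
        totals[run[vary]][1] += 1
    return {value: wins / attempts for value, (wins, attempts) in totals.items()}


# Fixed-model experiment: same model, different harnesses.
fixed_model = success_rate_by(runs, vary="harness", fixed_key="model", fixed_value="model-a")

# Fixed-harness experiment: same harness, different models.
fixed_harness = success_rate_by(runs, vary="model", fixed_key="harness", fixed_value="harness-x")

print(fixed_model)
print(fixed_harness)
```

Filtering on the fixed factor before aggregating keeps the comparison from mixing effects: any gap in the resulting rates can only come from the factor that was varied.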

Researchers and developers building GUI agents or computer-use systems rely on ClawBench to obtain reproducible metrics and to participate in community comparisons featured across multiple AI agent resource lists.

Capabilities

Benchmark browser agents on real sites
Run 153 everyday tasks
Evaluate across 15 categories
Intercept submissions for safe testing
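The listing does not describe how ClawBench intercepts submissions, so the following is only a sketch of the general idea: let an agent browse a real site freely, but capture the final state-changing request instead of sending it. The `SubmissionInterceptor` class and the blocked path prefixes are hypothetical.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class SubmissionInterceptor:
    """Blocks POST requests to submission-like paths and records them for scoring."""
    # Assumed path prefixes that count as a "final submission" in this sketch.
    blocked_paths: tuple[str, ...] = ("/checkout", "/submit")
    captured: list[dict] = field(default_factory=list)

    def handle(self, method: str, url: str, body: str = "") -> bool:
        """Return True if the request may proceed, False if it was intercepted."""
        path = urlparse(url).path
        is_submission = method.upper() == "POST" and path.startswith(self.blocked_paths)
        if is_submission:
            # Keep the payload so it can be evaluated, but never send it.
            self.captured.append({"method": method.upper(), "url": url, "body": body})
            return False
        return True


interceptor = SubmissionInterceptor()
allowed_get = interceptor.handle("GET", "https://example.com/checkout")
allowed_post = interceptor.handle("POST", "https://example.com/checkout/confirm", "item=42")

print(allowed_get, allowed_post)
print(interceptor.captured)
```

In this sketch, reads pass through while the final POST is held back and stored, which is what makes testing on live sites safe: the captured payload can still be checked against the task's success criteria.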

What you can build with ClawBench

Agent Performance Evaluation

Run standardized tests to quantify how different models perform on the same set of computer-use tasks.

Harness Comparison

Fix the underlying model and vary the harness to measure the impact of different evaluation setups.

Dataset Exploration

Access trace datasets to analyze agent behavior patterns and failure modes in detail.
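A first pass at failure-mode analysis might simply tally failed traces by cause and by task category. The records below are invented for illustration: the field names, task names, and failure labels are assumptions and do not reflect the schema of the ClawBench trace datasets.

```python
from collections import Counter

# Hypothetical trace summaries; the real datasets may use different fields.
traces = [
    {"task": "book-flight", "category": "travel", "success": False,
     "failure_mode": "wrong_element_clicked"},
    {"task": "order-groceries", "category": "shopping", "success": True,
     "failure_mode": None},
    {"task": "fill-tax-form", "category": "finance", "success": False,
     "failure_mode": "timeout"},
    {"task": "reserve-hotel", "category": "travel", "success": False,
     "failure_mode": "wrong_element_clicked"},
]


def failure_summary(traces):
    """Count failed traces by failure mode and by task category."""
    failures = [t for t in traces if not t["success"]]
    by_mode = Counter(t["failure_mode"] for t in failures)
    by_category = Counter(t["category"] for t in failures)
    return by_mode, by_category


by_mode, by_category = failure_summary(traces)
print(by_mode.most_common())
print(by_category.most_common())
```

Sorting by frequency surfaces the dominant failure cause and the category where an agent struggles most, which is usually the starting point for a closer look at individual traces.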

Install ClawBench

Install
pip install clawbench-eval
Quick start
git clone https://github.com/reacher-z/ClawBench.git && cd ClawBench && ./run.sh
  1. Clone the ClawBench repository from GitHub.
  2. Install the clawbench-eval package via pip.
  3. Download the relevant Hugging Face datasets for your evaluation.
  4. Execute the benchmark script with your agent configuration.
  5. Review the generated scores and trace outputs.

ClawBench: pros & cons

Pros

  • Fully open-source with public datasets and code.
  • Includes both task and trace datasets for thorough analysis.
  • Provides a consistent scoring pipeline across experiments.
  • Actively maintained with community features like Discord.

Cons

  • Requires familiarity with agent harness setups to run effectively.
  • Evaluation depends on access to compatible agent implementations.
  • Limited to the specific task domains covered in the datasets.

Frequently asked questions

ClawBench's code, datasets, and evaluation tools are all open source and publicly available.
