What kind of agents does it support?

It targets GUI and computer-use agents that interact with environments through harnesses.

How does it differ from HarnessBench?

ClawBench varies the agent while keeping the harness fixed; HarnessBench does the opposite.

Where can I find the datasets?

They are hosted on Hugging Face under the NAIL-Group organization.
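As a rough illustration of how the datasets could be located programmatically, the sketch below lists dataset ids published under the NAIL-Group organization. It assumes the separate `huggingface_hub` package is installed; the helper names `dataset_url` and `list_org_datasets` are made up for this example and are not part of ClawBench.

```python
from typing import Callable, Iterable, List, Optional

HF_ORG = "NAIL-Group"  # organization name as given in the answer above


def dataset_url(dataset_id: str) -> str:
    """Return the Hugging Face page URL for a dataset id such as 'org/name'."""
    return f"https://huggingface.co/datasets/{dataset_id}"


def list_org_datasets(
    org: str = HF_ORG,
    lister: Optional[Callable[..., Iterable]] = None,
) -> List[str]:
    """Return the sorted dataset ids published by `org`.

    `lister` exists only so the function can be exercised offline. By default
    it falls back to huggingface_hub.list_datasets, which needs network access.
    """
    if lister is None:
        # Imported lazily so the module loads without the optional dependency.
        from huggingface_hub import list_datasets  # pip install huggingface_hub
        lister = list_datasets
    return sorted(info.id for info in lister(author=org))
```

Calling `list_org_datasets()` with no arguments queries the Hub, and each returned id can be passed to `dataset_url` to get a link to the dataset page.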

ClawBench — Autonomous Agents Review, Install & Alternatives (2026)

What is ClawBench?

ClawBench is an open-source benchmark designed to assess how well AI agents handle computer interaction tasks. It supplies curated datasets hosted on Hugging Face along with an evaluation library that scores agent traces against defined criteria.

Users run agents through the provided harnesses, collect execution traces, and compare results using the shared scoring pipeline. The framework supports both fixed-model and fixed-harness experiments to isolate performance factors.

Researchers and developers building GUI agents or computer-use systems rely on ClawBench to obtain reproducible metrics and to participate in community comparisons featured across multiple AI agent resource lists.

Capabilities

Benchmark browser agents on real sites

Run 153 everyday tasks

Evaluate across 15 categories

Intercept submissions for safe testing

What you can build with ClawBench

Agent Performance Evaluation

Run standardized tests to quantify how different models perform on the same set of computer-use tasks.

Harness Comparison

Fix the underlying model and vary the harness to measure the impact of different evaluation setups.
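Both experiment designs (fixed harness with varying models, and fixed model with varying harnesses) come down to the same operation: hold one field constant and group success rates by the other. The sketch below shows that idea on hand-written records. The record fields and the model and harness names are placeholders for illustration only; they are not the format produced by clawbench-eval.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def success_rate_by(
    results: Iterable[Mapping[str, object]],
    vary: str,
    fixed: Mapping[str, object],
) -> Dict[str, float]:
    """Group results by the `vary` field, keeping only rows that match `fixed`.

    Returns the fraction of successful tasks for each value of `vary`.
    """
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for row in results:
        if any(row.get(key) != value for key, value in fixed.items()):
            continue  # row belongs to a different experimental condition
        group = str(row[vary])
        total[group] += 1
        passed[group] += 1 if row["success"] else 0
    return {group: passed[group] / total[group] for group in total}


# Hypothetical records; every name here is a placeholder.
RESULTS = [
    {"model": "model-a", "harness": "harness-x", "task": "t1", "success": True},
    {"model": "model-a", "harness": "harness-x", "task": "t2", "success": False},
    {"model": "model-b", "harness": "harness-x", "task": "t1", "success": True},
    {"model": "model-b", "harness": "harness-x", "task": "t2", "success": True},
    {"model": "model-a", "harness": "harness-y", "task": "t1", "success": False},
    {"model": "model-a", "harness": "harness-y", "task": "t2", "success": False},
]

# Fixed harness, varying model.
by_model = success_rate_by(RESULTS, vary="model", fixed={"harness": "harness-x"})
# Fixed model, varying harness.
by_harness = success_rate_by(RESULTS, vary="harness", fixed={"model": "model-a"})
```

Filtering on the fixed field before grouping keeps the two conditions from mixing, which is the point of isolating one factor at a time.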

Dataset Exploration

Access trace datasets to analyze agent behavior patterns and failure modes in detail.
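One simple way to look for failure patterns is to count failed traces by task category and failure reason. The sketch below assumes each trace has been reduced to a small record with `category`, `success`, and `failure_reason` fields. That schema, and the category and reason labels, are assumptions for illustration; the actual trace datasets may be structured differently.

```python
from collections import Counter
from typing import Iterable, List, Mapping, Tuple

FailureKey = Tuple[str, str]  # (category, failure_reason)


def top_failure_modes(
    traces: Iterable[Mapping[str, object]],
    limit: int = 3,
) -> List[Tuple[FailureKey, int]]:
    """Count failed traces per (category, failure_reason), most frequent first."""
    counts = Counter(
        (str(trace["category"]), str(trace["failure_reason"]))
        for trace in traces
        if not trace["success"]  # successful traces carry no failure reason
    )
    return counts.most_common(limit)


# Hypothetical trace summaries; labels are placeholders.
TRACES = [
    {"category": "shopping", "success": False, "failure_reason": "wrong_element_clicked"},
    {"category": "shopping", "success": False, "failure_reason": "wrong_element_clicked"},
    {"category": "travel", "success": False, "failure_reason": "step_limit_reached"},
    {"category": "travel", "success": True, "failure_reason": None},
]

failure_modes = top_failure_modes(TRACES)
```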

Install ClawBench

Install

pip install clawbench-eval

Quick start

git clone https://github.com/reacher-z/ClawBench.git && cd ClawBench && ./run.sh

1. Clone the ClawBench repository from GitHub.
2. Install the clawbench-eval package via pip.
3. Download the relevant Hugging Face datasets for your evaluation.
4. Execute the benchmark script with your agent configuration.
5. Review the generated scores and trace outputs.
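The steps above can be sketched as a small Python driver that strings together the three commands this page actually gives (clone, pip install, `./run.sh`). The dataset download and score review steps are left out because the page gives no command for them, and whether `run.sh` takes arguments or fetches datasets itself is not stated here. The function names are invented for this example.

```python
import shlex
import subprocess
from typing import List

REPO_URL = "https://github.com/reacher-z/ClawBench.git"


def quickstart_commands(repo_url: str = REPO_URL) -> List[List[str]]:
    """Commands taken from the install and quick-start lines, in step order."""
    return [
        ["git", "clone", repo_url],
        ["pip", "install", "clawbench-eval"],
        ["./run.sh"],
    ]


def run_quickstart(dry_run: bool = True, workdir: str = "ClawBench") -> List[str]:
    """Return the commands as shell strings, and execute them unless dry_run."""
    rendered = []
    for index, command in enumerate(quickstart_commands()):
        rendered.append(shlex.join(command))
        if not dry_run:
            # The clone runs in the current directory; later steps run
            # inside the checkout it creates.
            cwd = None if index == 0 else workdir
            subprocess.run(command, cwd=cwd, check=True)
    return rendered
```

With the default `dry_run=True` nothing is executed, so the plan can be inspected before anything is cloned or installed.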

ClawBench: pros & cons

Pros

+ Fully open-source with public datasets and code.
+ Includes both task and trace datasets for thorough analysis.
+ Provides a consistent scoring pipeline across experiments.
+ Actively maintained with community features like Discord.

Cons

- Requires familiarity with agent harness setups to run effectively.
- Evaluation depends on access to compatible agent implementations.
- Limited to the specific task domains covered in the datasets.

Frequently asked questions

Is ClawBench free to use?

Yes, the code, datasets, and evaluation tools are all open-source and publicly available.

Similar agents

Other web & browser options worth comparing.

Steel Browser

Agent · Web & Browser

Verified

Open-source browser API that powers web interactions for AI agents and automation apps.

7.2k · Open source

Agent-E

Agent · Web & Browser

Verified

Open-source agent for natural language browser automation and task handling.

1.2k · Open source

BrowserTrace

Agent · Web & Browser

Verified

Local-first debugger for replaying browser agent runs and failures.

3 · Open source
