ClawBench
Open benchmark for testing AI agents on computer-use tasks.
What is ClawBench?
ClawBench is an open-source benchmark designed to assess how well AI agents handle computer interaction tasks. It supplies curated datasets hosted on Hugging Face along with an evaluation library that scores agent traces against defined criteria.
Users run agents through the provided harnesses, collect execution traces, and compare results using the shared scoring pipeline. The framework supports both fixed-model and fixed-harness experiments to isolate performance factors.
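As a rough illustration of the fixed-model and fixed-harness idea, the sketch below holds one factor constant and reports the success rate for each value of the other. It does not use the clawbench-eval API; the `RunResult` record, its field names, and the sample model and harness names are all hypothetical stand-ins for whatever the scoring pipeline actually emits.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """One scored agent run. All field names here are hypothetical."""
    model: str
    harness: str
    task_id: str
    success: bool


def success_rates(results, fixed_field, fixed_value, vary_field):
    """Hold one factor constant and return the success rate per value of the other."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for r in results:
        # Skip runs that do not belong to the slice being held constant.
        if getattr(r, fixed_field) != fixed_value:
            continue
        key = getattr(r, vary_field)
        totals[key] += 1
        passes[key] += int(r.success)
    return {key: passes[key] / totals[key] for key in totals}


# Made-up sample data, for illustration only.
results = [
    RunResult("model-a", "harness-x", "t1", True),
    RunResult("model-a", "harness-x", "t2", False),
    RunResult("model-a", "harness-y", "t1", True),
    RunResult("model-a", "harness-y", "t2", True),
    RunResult("model-b", "harness-x", "t1", False),
    RunResult("model-b", "harness-x", "t2", False),
]

# Fixed-model experiment: keep model-a, vary the harness.
by_harness = success_rates(results, "model", "model-a", "harness")
# Fixed-harness experiment: keep harness-x, vary the model.
by_model = success_rates(results, "harness", "harness-x", "model")

print(by_harness)
print(by_model)
```

Comparing only runs that share the same tasks keeps the two slices comparable; a real analysis would also want to check that each slice covers the same task set.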
Researchers and developers building GUI agents or computer-use systems rely on ClawBench to obtain reproducible metrics and to participate in community comparisons featured across multiple AI agent resource lists.
Capabilities
What you can build with ClawBench
Agent Performance Evaluation
Run standardized tests to quantify how different models perform on the same set of computer-use tasks.
Harness Comparison
Fix the underlying model and vary the harness to measure the impact of different evaluation setups.
Dataset Exploration
Access trace datasets to analyze agent behavior patterns and failure modes in detail.
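A minimal sketch of the kind of failure-mode analysis the trace datasets make possible. The record layout (`task_id`, `success`, `steps`, `action`) is an assumed schema chosen for illustration, not the documented format of the ClawBench datasets, and the sample traces are invented.

```python
from collections import Counter

# Hypothetical trace records; the real dataset schema may differ.
traces = [
    {"task_id": "t1", "success": True,
     "steps": [{"action": "click"}, {"action": "type"}]},
    {"task_id": "t2", "success": False,
     "steps": [{"action": "click"}, {"action": "click"}, {"action": "scroll"}]},
    {"task_id": "t3", "success": False,
     "steps": [{"action": "type"}]},
]


def last_action_of_failures(traces):
    """Count the final action of each failed trace, a crude proxy for where agents get stuck."""
    counts = Counter()
    for trace in traces:
        if trace["success"] or not trace["steps"]:
            continue
        counts[trace["steps"][-1]["action"]] += 1
    return counts


def mean_steps(traces, success):
    """Average trace length for successful (True) or failed (False) runs."""
    lengths = [len(t["steps"]) for t in traces if t["success"] == success]
    return sum(lengths) / len(lengths) if lengths else 0.0


failure_endings = last_action_of_failures(traces)
print(failure_endings)
print(mean_steps(traces, success=True), mean_steps(traces, success=False))
```

The same two functions would apply to records loaded from the published datasets once the field names are mapped to the real schema.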
Install ClawBench
pip install clawbench-eval
git clone https://github.com/reacher-z/ClawBench.git && cd ClawBench && ./run.sh
1. Clone the ClawBench repository from GitHub.
2. Install the clawbench-eval package via pip.
3. Download the relevant Hugging Face datasets for your evaluation.
4. Execute the benchmark script with your agent configuration.
5. Review the generated scores and trace outputs.
ClawBench: pros & cons
Pros
- Fully open-source with public datasets and code.
- Includes both task and trace datasets for thorough analysis.
- Provides a consistent scoring pipeline across experiments.
- Actively maintained with community features like Discord.
Cons
- Requires familiarity with agent harness setups to run effectively.
- Evaluation depends on access to compatible agent implementations.
- Limited to the specific task domains covered in the datasets.
Frequently asked questions
Is ClawBench open source?
Yes, the code, datasets, and evaluation tools are all open-source and publicly available.