How is LLM Evaluation performed?

GPT-4 reviews agent actions and outputs using WebVoyager's evaluation prompt to judge actual task completion.
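A minimal sketch of how such a judging step could be wired up, assuming the judge model replies with free text that ends in a verdict of SUCCESS or NOT SUCCESS. The instruction wording, the helper names (`build_judge_prompt`, `parse_verdict`, `judge_task`), and the `ask_model` callable are hypothetical; this is not WebVoyager's actual prompt or the benchmark's own code. The model call is injected as a plain function so the sketch runs without any API client.

```python
import re
from typing import Callable

# Hypothetical instruction text; the real WebVoyager evaluation prompt differs.
JUDGE_INSTRUCTIONS = (
    "You are evaluating a web agent. Given the task, the agent's actions, "
    "and its final answer, decide whether the task was actually completed. "
    "End your reply with 'SUCCESS' or 'NOT SUCCESS'."
)

# Matches a standalone verdict; "unsuccessful" does not match.
_VERDICT = re.compile(r"\b(NOT\s+)?SUCCESS\b", re.IGNORECASE)


def build_judge_prompt(task: str, actions: list[str], answer: str) -> str:
    """Assemble the text shown to the judge model."""
    steps = "\n".join(f"{i}. {a}" for i, a in enumerate(actions, start=1))
    return (
        f"{JUDGE_INSTRUCTIONS}\n\nTask: {task}\n\n"
        f"Actions:\n{steps}\n\nFinal answer: {answer}"
    )


def parse_verdict(reply: str) -> bool:
    """Return True only if the last verdict in the reply is a plain SUCCESS.

    A reply with no verdict at all is treated as a failure.
    """
    matches = list(_VERDICT.finditer(reply))
    if not matches:
        return False
    return matches[-1].group(1) is None  # group 1 holds the optional "NOT"


def judge_task(task: str, actions: list[str], answer: str,
               ask_model: Callable[[str], str]) -> bool:
    """Send the assembled prompt to the injected model and parse its verdict."""
    return parse_verdict(ask_model(build_judge_prompt(task, actions, answer)))
```

Taking the last verdict in the reply, rather than the first, keeps the parser from being misled when the judge restates the instructions before concluding.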

Can I add my own agent to the benchmark?

The setup is designed to be reproducible, so developers can extend it with additional agents following the published methodology.

open-operator-evals — Autonomous Agents Review, Install & Alternatives (2026)

What is open-operator-evals?

open-operator-evals is a reproducible benchmark suite that measures how well open-source web agents perform on realistic browser tasks. It executes each task multiple times under fixed constraints and records four core metrics: self-reported success, LLM-verified completion, average time, and task reliability across retries.
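As a rough illustration of how the per-run metrics could be aggregated, the sketch below summarises a list of run records. The `Run` fields and the `summarise` function are hypothetical names, not the benchmark's actual schema, and the `overclaim_gap` value is an extra derived number added here to show why self-reported and LLM-verified results are recorded separately. Task reliability is aggregated per task rather than per run, so it is left out of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One attempt at one task. Field names are hypothetical."""
    task_id: str
    self_reported_success: bool  # what the agent claimed
    llm_verified_success: bool   # what the LLM judge concluded
    seconds: float               # wall-clock duration of the attempt


def summarise(runs: list[Run]) -> dict[str, float]:
    """Aggregate per-run metrics over all recorded attempts."""
    if not runs:
        raise ValueError("no runs to summarise")
    n = len(runs)
    self_rate = sum(r.self_reported_success for r in runs) / n
    verified_rate = sum(r.llm_verified_success for r in runs) / n
    return {
        "self_reported_rate": self_rate,
        "llm_verified_rate": verified_rate,
        # Illustrative only: how far the agent's own claims exceed the judge's.
        "overclaim_gap": self_rate - verified_rate,
        "avg_seconds": sum(r.seconds for r in runs) / n,
    }
```

A positive `overclaim_gap` would mean the agent reports success more often than the judge confirms it.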

The evaluation uses the WebVoyager dataset of around 600 tasks. Agents operate in headless mode with strict step and time limits, allowing direct comparison of systems such as Notte, Browser-Use, and Convergence. All logs and replays are published so anyone can inspect or rerun the tests.
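The step and time limits can be pictured as a loop that stops an agent once either budget is exhausted. The sketch below is an assumption about how such a harness might look, not the benchmark's implementation: `run_with_limits`, the `step` callable, and the stop-reason strings are hypothetical, and the actual limit values are passed in because the listing does not state them.

```python
import time
from typing import Callable, Optional, Tuple


def run_with_limits(
    step: Callable[[], Optional[str]],
    max_steps: int,
    max_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Optional[str], int, str]:
    """Call step() until it returns an answer or a limit is hit.

    Returns (answer, steps_taken, stop_reason). The time budget is checked
    between steps, so a single slow step can still overrun it.
    """
    start = clock()
    for taken in range(1, max_steps + 1):
        if clock() - start > max_seconds:
            return None, taken - 1, "time_limit"
        answer = step()
        if answer is not None:
            return answer, taken, "done"
    return None, max_steps, "step_limit"
```

Applying the same budgets to every agent is what makes the resulting numbers comparable; an agent that needs more steps or more time simply records a failure for that attempt.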

It is intended for researchers and developers building or selecting web agents who need objective data rather than marketing claims. The benchmark encourages closer scrutiny of reported performance numbers in the open-source agent space.

Capabilities

evaluate web browser agents

run reproducible evals

benchmark agent performance

test browser-based tasks

What you can build with open-operator-evals

Compare agent performance

Rank open-source web agents using consistent metrics across identical tasks and runs.
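One possible way to turn per-agent metrics into a ranking is sketched below. The `AgentScore` type, the `rank` function, and the ordering rule (LLM-verified rate first, average time as a tie-breaker) are assumptions made for illustration; the benchmark may weigh its metrics differently.

```python
from typing import NamedTuple


class AgentScore(NamedTuple):
    """Summary metrics for one agent. Field names are hypothetical."""
    name: str
    llm_verified_rate: float  # fraction of runs the judge accepted
    avg_seconds: float        # mean time per run


def rank(scores: list[AgentScore]) -> list[AgentScore]:
    """Highest verified rate first; faster average time breaks ties."""
    return sorted(scores, key=lambda s: (-s.llm_verified_rate, s.avg_seconds))
```

Because every agent is scored on identical tasks and run counts, sorting on the same fields gives a like-for-like ordering.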

Verify published results

Reproduce evaluations locally to check claims made in blog posts or model cards.

Measure reliability under retries

Assess how consistently an agent succeeds when given multiple attempts on the same task.

Install open-operator-evals

Quick start

task: Book a journey with return option on same day from Edinburg to Manchester for Tomorrow, and show me the lowest price option available
url: https://www.google.com/travel/flights

1. Clone the repository from GitHub.
2. Install required Python dependencies.
3. Download or prepare the WebVoyager task set.
4. Run the evaluation script with chosen agents.
5. Review the generated metrics and replays.

open-operator-evals: pros & cons

Pros

+ Fully open logs and replays for every run
+ Uses both self-report and independent LLM judging
+ Includes reliability metric across multiple attempts
+ Strict time and step limits prevent unrealistic results

Cons

– Currently covers only three agents
– Relies on a single task dataset
– No documented built-in support for integrating custom agents

Frequently asked questions

What does Task Reliability measure?

It shows the percentage of tasks an agent completes successfully at least once across eight attempts.
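Read literally, that definition can be computed as follows. This is a sketch under stated assumptions: `task_reliability` and the `(task_id, success)` input shape are hypothetical, and the listing does not say whether "success" here means self-reported or LLM-verified, so the caller decides which flag to pass in.

```python
from collections import defaultdict


def task_reliability(
    attempts: list[tuple[str, bool]], attempts_per_task: int = 8
) -> float:
    """Percentage of tasks with at least one successful attempt."""
    by_task: dict[str, list[bool]] = defaultdict(list)
    for task_id, success in attempts:
        by_task[task_id].append(success)
    if not by_task:
        raise ValueError("no attempts recorded")
    # Guard against uneven retry counts, which would skew the comparison.
    for task_id, results in by_task.items():
        if len(results) != attempts_per_task:
            raise ValueError(
                f"{task_id}: expected {attempts_per_task} attempts, got {len(results)}"
            )
    solved = sum(any(results) for results in by_task.values())
    return 100.0 * solved / len(by_task)
```

For example, if one task succeeds on a single attempt out of eight and a second task fails all eight, the function returns 50.0.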

Similar agents

Other web & browser options worth comparing.

Steel Browser

Agent · Web & Browser

Verified

Open-source browser API that powers web interactions for AI agents and automation apps.

7.2k · Open source

Agent-E

Agent · Web & Browser

Verified

Open-source agent for natural language browser automation and task handling.

1.2k · Open source

ClawBench

Agent · Web & Browser

Verified

Open benchmark for testing AI agents on computer-use tasks.

393 · Open source
