open-operator-evals

Open benchmark comparing top open-source web agents on real tasks.

Autonomous Agents · Web & Browser 49 · Open source
Updated 2026-06-16

What is open-operator-evals?

open-operator-evals is a reproducible benchmark suite that measures how well open-source web agents perform on realistic browser tasks. It executes each task multiple times under fixed constraints and records four core metrics: self-reported success, LLM-verified completion, average time, and task reliability across retries.
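A minimal sketch of how the four metrics could be aggregated from per-attempt records. The `RunRecord` schema and the `summarize` function are hypothetical and are not the repository's actual log format or API. Task reliability is assumed here to mean the share of tasks verified at least once, following the definition in the FAQ on this page.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One attempt at one task (hypothetical schema, not the repo's log format)."""
    task_id: str
    self_reported_success: bool   # what the agent claimed
    llm_verified_success: bool    # what an independent LLM judge concluded
    duration_s: float             # wall-clock time for the attempt


def summarize(runs: list[RunRecord]) -> dict[str, float]:
    """Aggregate the four metrics over all attempts of all tasks."""
    if not runs:
        raise ValueError("no runs to summarize")

    n = len(runs)
    verified_by_task: dict[str, list[bool]] = defaultdict(list)
    for r in runs:
        verified_by_task[r.task_id].append(r.llm_verified_success)

    # Assumption: a task counts as reliable if at least one attempt was verified.
    reliable = sum(any(v) for v in verified_by_task.values())

    return {
        "self_reported_rate": sum(r.self_reported_success for r in runs) / n,
        "llm_verified_rate": sum(r.llm_verified_success for r in runs) / n,
        "avg_time_s": sum(r.duration_s for r in runs) / n,
        "task_reliability": reliable / len(verified_by_task),
    }
```

Keeping the self-reported and LLM-verified rates side by side makes any gap between what an agent claims and what a judge accepts visible.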

The evaluation uses the WebVoyager dataset of around 600 tasks. Agents operate in headless mode with strict step and time limits, allowing direct comparison of systems such as Notte, Browser-Use, and Convergence. All logs and replays are published so anyone can inspect or rerun the tests.
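The step and time limits can be pictured as a loop like the one below. This is only an illustrative sketch: `run_with_limits`, the `step` callable, and the returned status strings are hypothetical names, and the listing does not state the actual limit values or how the benchmark enforces them.

```python
import time
from typing import Callable


def run_with_limits(
    step: Callable[[], bool],
    max_steps: int,
    max_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call step() until it reports completion or a limit is hit.

    step is a hypothetical callable that performs one agent action and
    returns True once the agent considers the task finished.
    """
    start = clock()
    for _ in range(max_steps):
        # The time budget is checked between steps, not during a step.
        if clock() - start > max_seconds:
            return "timeout"
        if step():
            return "done"
    return "step_limit"
```

Because the clock is checked only between steps, a single slow step can overrun the budget; a stricter harness would also need to interrupt the step itself.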

It is intended for researchers and developers building or selecting web agents who need objective data rather than marketing claims. The benchmark encourages closer scrutiny of reported performance numbers in the open-source agent space.

Capabilities

evaluate web browser agents
run reproducible evals
benchmark agent performance
test browser-based tasks

What you can build with open-operator-evals

Compare agent performance

Rank open-source web agents using consistent metrics across identical tasks and runs.

Verify published results

Reproduce evaluations locally to check claims made in blog posts or model cards.

Measure reliability under retries

Assess how consistently an agent succeeds when given multiple attempts on the same task.

Install open-operator-evals

Quick start
task: Book a journey with return option on same day from Edinburg to Manchester for Tomorrow, and show me the lowest price option available
url: https://www.google.com/travel/flights
  1. Clone the repository from GitHub.
  2. Install required Python dependencies.
  3. Download or prepare the WebVoyager task set.
  4. Run the evaluation script with chosen agents.
  5. Review the generated metrics and replays.
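The two-line task record shown at the top of the quick start (a `task` instruction plus a starting `url`) could be loaded as follows. This is a sketch under assumptions: the listing does not say how the repository stores tasks, so the `Task` class and `parse_task` function are hypothetical and handle only the `key: value` layout shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    task: str   # natural-language instruction given to the agent
    url: str    # page the agent starts from


def parse_task(text: str) -> Task:
    """Parse 'key: value' lines; only the first ': ' on a line splits."""
    fields: dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, sep, value = line.partition(": ")
        if not sep:
            raise ValueError(f"expected 'key: value', got {line!r}")
        fields[key.strip()] = value.strip()
    return Task(task=fields["task"], url=fields["url"])
```

Splitting on the first `": "` only keeps the `https://` in the URL intact, since that colon is not followed by a space.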

open-operator-evals: pros & cons

Pros

  • Fully open logs and replays for every run
  • Uses both self-report and independent LLM judging
  • Includes reliability metric across multiple attempts
  • Strict time and step limits prevent unrealistic results

Cons

  • Currently covers only three agents
  • Relies on a single task dataset
  • No documented built-in support for integrating custom agents

Frequently asked questions

The task reliability metric shows the percentage of tasks an agent completes successfully at least once across eight attempts.
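As a rough guide to reading an at-least-once figure, suppose each attempt on a task succeeds independently with the same probability p. This is a simplifying assumption for illustration, not something the benchmark states. The chance of at least one success in eight attempts is then:

```latex
P(\text{at least one success in 8 attempts}) = 1 - (1 - p)^{8},
\qquad \text{e.g. } p = 0.5 \;\Rightarrow\; 1 - 0.5^{8} \approx 0.996
```

Under that assumption, an agent that succeeds only half the time on a task would still almost always be counted as having completed it, so an at-least-once figure is best read alongside the per-attempt success rates. Attempts on the same task are usually correlated in practice, so real values will differ from this idealized estimate.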
