open-operator-evals
Open benchmark comparing top open-source web agents on real tasks.
What is open-operator-evals?
open-operator-evals is a reproducible benchmark suite that measures how well open-source web agents perform on realistic browser tasks. It executes each task multiple times under fixed constraints and records four core metrics: self-reported success, LLM-verified completion, average time, and task reliability across retries.
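As an illustration of how per-run records could be rolled up into three of the metrics named above, here is a minimal Python sketch. The `RunRecord` fields and the `summarize` function are hypothetical names chosen for this example, not the benchmark's actual schema or API. Reliability across retries is left out because it requires grouping attempts by task.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunRecord:
    # Hypothetical record layout; the real log format may differ.
    task_id: str
    attempt: int
    self_reported_success: bool   # the agent's own claim
    llm_verified_success: bool    # verdict of an independent LLM judge
    duration_s: float             # wall-clock time for this run


def summarize(runs: list[RunRecord]) -> dict[str, float]:
    """Aggregate per-run records into rate and timing metrics."""
    if not runs:
        raise ValueError("no runs to summarize")
    n = len(runs)
    return {
        "self_reported_rate": sum(r.self_reported_success for r in runs) / n,
        "llm_verified_rate": sum(r.llm_verified_success for r in runs) / n,
        "avg_time_s": mean(r.duration_s for r in runs),
    }
```

A gap between `self_reported_rate` and `llm_verified_rate` is the kind of discrepancy the benchmark is meant to surface.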
The evaluation uses the WebVoyager dataset of around 600 tasks. Agents operate in headless mode with strict step and time limits, allowing direct comparison of systems such as Notte, Browser-Use, and Convergence. All logs and replays are published so anyone can inspect or rerun the tests.
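The step and time limits described above could be enforced with a loop like the following. This is a sketch under assumptions: `run_with_limits`, `step_fn`, and the status strings are hypothetical and do not reflect how any of the listed agents or the benchmark harness is actually implemented.

```python
import time
from typing import Callable


def run_with_limits(
    step_fn: Callable[[int], bool],
    max_steps: int,
    max_seconds: float,
) -> dict:
    """Call step_fn until it reports completion or a budget is exhausted.

    step_fn receives the 1-based step number and returns True once the
    agent considers the task finished (hypothetical interface).
    """
    start = time.monotonic()
    for step in range(1, max_steps + 1):
        # Check the time budget before spending another step.
        if time.monotonic() - start > max_seconds:
            return {"status": "timeout", "steps": step - 1}
        if step_fn(step):
            return {"status": "done", "steps": step}
    return {"status": "step_limit", "steps": max_steps}
```

Applying the same `max_steps` and `max_seconds` to every agent is what makes the resulting numbers comparable across systems.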
It is intended for researchers and developers building or selecting web agents who need objective data rather than marketing claims. The benchmark encourages closer scrutiny of reported performance numbers in the open-source agent space.
Capabilities
What you can build with open-operator-evals
Compare agent performance
Rank open-source web agents using consistent metrics across identical tasks and runs.
Verify published results
Reproduce evaluations locally to check claims made in blog posts or model cards.
Measure reliability under retries
Assess how consistently an agent succeeds when given multiple attempts on the same task.
Install open-operator-evals
Example task:
task: Book a journey with return option on same day from Edinburg to Manchester for Tomorrow, and show me the lowest price option available
url: https://www.google.com/travel/flights

1. Clone the repository from GitHub.
2. Install required Python dependencies.
3. Download or prepare the WebVoyager task set.
4. Run the evaluation script with chosen agents.
5. Review the generated metrics and replays.
open-operator-evals: pros & cons
Pros
- Fully open logs and replays for every run
- Uses both self-report and independent LLM judging
- Includes reliability metric across multiple attempts
- Strict time and step limits prevent unrealistic results
Cons
- Currently covers only three agents
- Relies on a single task dataset
- No built-in support for custom agent integration is shown
Frequently asked questions
The task reliability metric shows the percentage of tasks an agent completes successfully at least once across eight attempts.
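The "at least once across eight attempts" definition can be sketched in Python as follows. The function name `reliability` and the `(task_id, success)` input shape are assumptions made for this example. The source does not say whether success here means the self-reported or the LLM-verified outcome, so the sketch takes a plain boolean.

```python
from collections import defaultdict
from typing import Iterable


def reliability(attempts: Iterable[tuple[str, bool]], n_attempts: int = 8) -> float:
    """Percentage of tasks solved at least once within n_attempts tries.

    attempts is a sequence of (task_id, success) pairs in run order
    (hypothetical input shape).
    """
    by_task: dict[str, list[bool]] = defaultdict(list)
    for task_id, success in attempts:
        by_task[task_id].append(success)
    if not by_task:
        return 0.0
    # A task counts as solved if any of its first n_attempts runs succeeded.
    solved = sum(1 for results in by_task.values() if any(results[:n_attempts]))
    return 100.0 * solved / len(by_task)
```

For example, with two tasks where only one succeeds on any attempt, the function returns 50.0.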