imasimali/browser-tool-comparison

browser-tool-comparison

Benchmarks output size (token cost proxy) and wall time for two CLIs an AI agent might drive a browser with, in four configurations:

  • playwright-cli — baseline, default flags.
  • playwright-cli --raw snapshot --depth 6 — same tool, flags tuned to cut output.
  • agent-browser step-by-step — purpose-built AI-agent browser CLI, one subprocess per step.
  • agent-browser batch — same tool, all steps in one subprocess.

Scenario

All four variants run the same 10-step interactive scenario across three pages on https://the-internet.herokuapp.com, 3 trials each:

  1. open /login
  2. fill #username with tomsmith
  3. fill #password with SuperSecretPassword!
  4. click button[type=submit] (logs in)
  5. navigate to /dropdown
  6. select #dropdown 1 (picks "Option 1")
  7. navigate to /checkboxes
  8. click the first checkbox (toggles it)
  9. snapshot the final page
  10. close

The scenario exercises text inputs, a submit button, a <select>, checkboxes, and cross-page navigation. The URLs are set at the top of each run-*.sh; to re-run on a different site, swap them and adjust the selectors to match.

Prerequisites

  • playwright-cli: npm i -g @playwright/cli (or equivalent).
  • agent-browser: brew install agent-browser (or npm i -g agent-browser / cargo install agent-browser), then agent-browser install.
  • python3: used for millisecond timing in the harnesses and for analyze.py.
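
As a rough illustration of millisecond wall-time measurement with python3, here is a minimal sketch. It is not the harness's own code: how the run-*.sh scripts take their timestamps is not shown here, and the helper name time_command_ms and the demo step name are hypothetical.

```python
import subprocess
import sys
import time


def time_command_ms(argv):
    """Run argv as a subprocess and return (return_code, elapsed_ms).

    Hypothetical helper: the real harnesses may take timestamps differently.
    """
    start = time.perf_counter()  # monotonic clock, suited to measuring intervals
    proc = subprocess.run(argv, capture_output=True)
    elapsed_ms = round((time.perf_counter() - start) * 1000)
    return proc.returncode, elapsed_ms


# Demo: time a trivial Python subprocess instead of a real browser CLI.
rc, ms = time_command_ms([sys.executable, "-c", "pass"])
print(f"demo|rc={rc}|ms={ms}")
```

The printed line mirrors the step|rc=N|ms=N shape used by steps.log; in a real run, the argv would be one playwright-cli or agent-browser invocation.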

Run

./bench.sh
python3 analyze.py

bench.sh drives all four variants for 3 trials each. analyze.py reads raw/ and rewrites report.md.

Layout

  • bench.sh — orchestrates the 12 trial runs.
  • run-playwright.sh — baseline playwright-cli harness (outputs to raw/pw-tN/).
  • run-playwright-optimized.sh — playwright-cli with --raw + snapshot --depth 6 (outputs to raw/pwo-tN/).
  • run-agent-browser.sh — agent-browser step-by-step (outputs to raw/ab-tN/).
  • run-agent-browser-batch.sh — agent-browser all-steps-in-one-subprocess (outputs to raw/ab-batch-tN/).
  • analyze.py — aggregates raw/ into report.md.
  • report.md — generated summary (averages, per-trial table, per-step table, caveats).
  • raw/<tool>-<trial>/ — per-step *.stdout, *.stderr, and steps.log (format: step|rc=N|ms=N).
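
The steps.log format above (step|rc=N|ms=N) is simple enough to parse by hand. The sketch below shows one way to do it; it is not taken from analyze.py, and the step names and timings in the sample are made up purely for illustration.

```python
import re
from typing import NamedTuple


class StepRecord(NamedTuple):
    step: str  # step label as written by the harness
    rc: int    # process return code
    ms: int    # wall time in milliseconds


# One record per line: step|rc=N|ms=N
LINE_RE = re.compile(r"^(?P<step>[^|]+)\|rc=(?P<rc>-?\d+)\|ms=(?P<ms>\d+)$")


def parse_steps_log(text):
    """Parse steps.log content into a list of StepRecord."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"unrecognized steps.log line: {line!r}")
        records.append(
            StepRecord(match["step"], int(match["rc"]), int(match["ms"]))
        )
    return records


# Hypothetical sample; real step names and timings will differ.
sample = "01-open|rc=0|ms=812\n02-fill-username|rc=0|ms=95\n"
records = parse_steps_log(sample)
total_ms = sum(r.ms for r in records)
failed = [r.step for r in records if r.rc != 0]
print(total_ms, failed)
```

Summing ms gives the wall time of a trial, and any non-zero rc flags a step whose output size should not be compared across tools.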

How output size is measured

analyze.py sums stdout + stderr file sizes per step. "Tokens" is the crude bytes/4 estimate — fine for comparing the tools' ratio, not for absolute budget planning. See the caveats section in report.md.
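
The size measurement described above can be sketched as follows. This is an illustration under stated assumptions, not the code in analyze.py: the function name estimate_tokens, the use of integer division for bytes/4, and the file names in the demo directory are all hypothetical.

```python
import tempfile
from pathlib import Path


def estimate_tokens(trial_dir):
    """Return (total_bytes, estimated_tokens) for one trial directory.

    Sums the sizes of every *.stdout and *.stderr file, then applies the
    crude bytes/4 estimate (integer division is an assumption here).
    """
    total_bytes = sum(
        path.stat().st_size
        for path in Path(trial_dir).iterdir()
        if path.suffix in {".stdout", ".stderr"}
    )
    return total_bytes, total_bytes // 4


# Demo with a throwaway directory standing in for raw/<tool>-<trial>/.
with tempfile.TemporaryDirectory() as tmp:
    trial = Path(tmp)
    (trial / "01-open.stdout").write_bytes(b"x" * 400)
    (trial / "01-open.stderr").write_bytes(b"y" * 20)
    (trial / "steps.log").write_text("01-open|rc=0|ms=1\n")  # not counted
    total_bytes, tokens = estimate_tokens(trial)

print(total_bytes, tokens)
```

Because the same divisor is applied to every tool, it cancels out when comparing ratios, which is why the estimate is adequate for relative comparison but not for absolute budgeting.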
