Benchmarks output size (token cost proxy) and wall time for two CLIs an AI agent might drive a browser with, in four configurations:
- playwright-cli: baseline, default flags.
- playwright-cli --raw snapshot --depth 6: same tool, flags tuned to cut output.
- agent-browser step-by-step: purpose-built AI-agent browser CLI, one subprocess per step.
- agent-browser batch: same tool, all steps in one subprocess.
All four variants run the same 10-step interactive scenario across three pages on https://the-internet.herokuapp.com, 3 trials each:
- open /login
- fill #username with tomsmith
- fill #password with SuperSecretPassword!
- click button[type=submit] (logs in)
- navigate to /dropdown
- select #dropdown 1 (picks "Option 1")
- navigate to /checkboxes
- click the first checkbox (toggles it)
- snapshot the final page
- close
Exercises: text inputs, submit button, <select>, checkboxes, cross-page navigation. The URLs are set at the top of each run-*.sh; swap them (and adjust the selectors) to re-run on a different site.
- playwright-cli: npm i -g @playwright/cli (or equivalent).
- agent-browser: brew install agent-browser (or npm i -g agent-browser / cargo install agent-browser), then agent-browser install.
- python3: used for millisecond timing in the harnesses and for analyze.py.
./bench.sh
python3 analyze.py
bench.sh drives all four variants for 3 trials each. analyze.py reads raw/ and rewrites report.md.
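As a rough illustration of what one timed step involves, here is a minimal Python sketch. It is not the code in the run-*.sh harnesses (those are shell scripts that only call python3 for the millisecond clock); the function name `run_step` and the `<step>.stdout` / `<step>.stderr` file naming are assumptions made for this example, while the `step|rc=N|ms=N` line format is the one used in steps.log.

```python
import subprocess
import sys
import time
from pathlib import Path


def run_step(name: str, argv: list[str], out_dir: Path) -> str:
    """Run one step, save its output, and append a steps.log line.

    Hypothetical helper: the real harnesses are shell scripts.
    """
    out_dir.mkdir(parents=True, exist_ok=True)

    # Monotonic clock, so wall-clock adjustments cannot skew the timing.
    start = time.monotonic_ns()
    proc = subprocess.run(argv, capture_output=True)
    elapsed_ms = (time.monotonic_ns() - start) // 1_000_000

    # Assumed naming: one stdout and one stderr file per step.
    (out_dir / f"{name}.stdout").write_bytes(proc.stdout)
    (out_dir / f"{name}.stderr").write_bytes(proc.stderr)

    line = f"{name}|rc={proc.returncode}|ms={elapsed_ms}"
    with (out_dir / "steps.log").open("a") as log:
        log.write(line + "\n")
    return line


if __name__ == "__main__":
    # Stand-in command; a real run would invoke one of the browser CLIs here.
    print(run_step("demo", [sys.executable, "-c", "print('hi')"], Path("raw/demo-t1")))
```

Measuring around the whole subprocess call means process startup is included in each step's time, which is the cost the step-by-step and batch variants are meant to contrast.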
- bench.sh: orchestrates the 12 trial runs.
- run-playwright.sh: baseline playwright-cli harness (outputs to raw/pw-tN/).
- run-playwright-optimized.sh: playwright-cli with --raw + snapshot --depth 6 (outputs to raw/pwo-tN/).
- run-agent-browser.sh: agent-browser step-by-step (outputs to raw/ab-tN/).
- run-agent-browser-batch.sh: agent-browser all-steps-in-one-subprocess (outputs to raw/ab-batch-tN/).
- analyze.py: aggregates raw/ into report.md.
- report.md: generated summary (averages, per-trial table, per-step table, caveats).
- raw/<tool>-<trial>/: per-step *.stdout, *.stderr, and steps.log (format: step|rc=N|ms=N).
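For anyone inspecting a trial by hand, the steps.log format is simple to parse. The sketch below assumes one `step|rc=N|ms=N` record per line; `StepResult`, `parse_step_line`, and `total_ms` are hypothetical names for this example and are not taken from analyze.py.

```python
from typing import Iterable, NamedTuple


class StepResult(NamedTuple):
    step: str  # step name, e.g. "open"
    rc: int    # exit code of the step's subprocess
    ms: int    # wall time in milliseconds


def parse_step_line(line: str) -> StepResult:
    """Parse one 'step|rc=N|ms=N' record."""
    # Split from the right so a '|' inside the step name cannot break parsing.
    step, rc_field, ms_field = line.strip().rsplit("|", 2)
    rc = int(rc_field.partition("=")[2])
    ms = int(ms_field.partition("=")[2])
    return StepResult(step, rc, ms)


def total_ms(lines: Iterable[str]) -> int:
    """Sum wall time over all non-empty records of one trial."""
    return sum(parse_step_line(l).ms for l in lines if l.strip())


if __name__ == "__main__":
    sample = ["open|rc=0|ms=100", "close|rc=0|ms=50"]
    print(total_ms(sample))  # 150
```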
analyze.py sums stdout + stderr file sizes per step. "Tokens" is the crude bytes/4 estimate — fine for comparing the tools' ratio, not for absolute budget planning. See the caveats section in report.md.
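The size metric can be sketched in a few lines of Python. This is an illustration of the bytes/4 estimate described above, not the actual contents of analyze.py; the function names and the assumption that output files end in `.stdout` or `.stderr` are made up for the example.

```python
from pathlib import Path


def trial_bytes(trial_dir: Path) -> int:
    """Total size of every *.stdout and *.stderr file in one trial directory."""
    return sum(
        p.stat().st_size
        for p in trial_dir.iterdir()
        if p.suffix in {".stdout", ".stderr"}
    )


def estimate_tokens(n_bytes: int) -> int:
    """Crude estimate: roughly 4 bytes per token."""
    return n_bytes // 4


if __name__ == "__main__":
    print(estimate_tokens(1200))  # 300
```

Because the same divisor is applied to every variant, it cancels out when two variants are compared, which is why the estimate is usable for ratios even though the absolute token counts are only approximate.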