feat(evals): add offline verifier CLI by miguelg719 · Pull Request #2134 · browserbase/stagehand

miguelg719 · 2026-05-15T20:58:47Z

Why

Verifier iteration should not require rerunning browser automation. This PR adds offline saved-trajectory rescoring so prompts, approaches, and scoring behavior can be compared against the same trajectory artifacts without changing the existing CLI architecture.

What Changed

Added evals verify <trajectory-dir> for offline trajectory rescoring through the existing command-tree dispatch.
Preserved REPL quiet handling, first-run state, welcome behavior, and existing command-tree routing.
Added rubric cache utilities for generated rubrics.
Hardened rubric cache reads to verify both taskId and instruction hash before returning cached data.
Added live/offline verifier scripts.
Added TUI command parsing and help support for the verify command.
Ensured offline verification explicitly uses the verifier backend.
Removed upstream verifier references from comments.

Tests

pnpm --filter @browserbasehq/stagehand run typecheck
pnpm --filter @browserbasehq/stagehand-evals run typecheck
pnpm --dir packages/evals exec vitest run tests/framework/rubricCache.test.ts tests/framework/verifierAdapter.test.ts tests/tui/commandTree.test.ts tests/framework/claudeCodeRunner.test.ts tests/framework/codexRunner.test.ts tests/cli.test.ts
pnpm -w exec prettier --check packages/core/lib/v3/verifier packages/core/tests/unit/verifier-failure-step-parser.test.ts packages/evals/cli.ts packages/evals/framework packages/evals/tests/framework/rubricCache.test.ts packages/evals/tests/framework/verifierAdapter.test.ts packages/evals/tests/tui/commandTree.test.ts packages/evals/tui packages/evals/tasks/bench/agent
git diff --check

changeset-bot · 2026-05-15T20:59:02Z

⚠️ No Changeset found

Latest commit: 9b66b78

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


cubic-dev-ai

3 issues found across 9 files

Confidence score: 3/5

There is some concrete regression risk here: packages/evals/framework/rubricCache.ts read does not confirm parsed.taskId === taskSpec.id, so sanitized ID collisions can return the wrong cached rubric data when hashes align.
Two CLI/runtime behaviors are likely to confuse users but are straightforward to fix: packages/evals/tui/commands/verify.ts silently ignores trailing --model/--label without values, and packages/evals/scripts/verify-live-trajectory.ts passes timeoutMs to page.goto() (ignored), causing fallback to Playwright’s default timeout.
Given one medium-severity correctness issue plus two medium input/timeout handling issues, this looks mergeable with caution after targeted fixes rather than a hard block.
Pay close attention to packages/evals/framework/rubricCache.ts, packages/evals/tui/commands/verify.ts, and packages/evals/scripts/verify-live-trajectory.ts - cache key/task ID validation, missing flag-value errors, and ignored navigation timeout options need verification.
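
The cache-read concern above can be illustrated with a minimal sketch. This is not the code in packages/evals/framework/rubricCache.ts: the names `CacheEntry`, `sanitizeId`, `hashInstruction`, and `readCached` are hypothetical, the sanitization rule is an assumption based on the review text, and a 32-bit FNV-1a hash stands in for whatever hash the PR actually uses. It only shows why comparing both the task ID and the instruction hash keeps a sanitized-ID collision from returning another task's rubric.

```typescript
// Hypothetical shapes for illustration; the real types live in the PR.
interface TaskSpec {
  id: string;
  instruction: string;
}

interface CacheEntry {
  taskId: string;
  instructionHash: string;
  rubric: string[];
}

// Assumed sanitization: anything outside [A-Za-z0-9._-] becomes "_",
// so "suite:task/1" and "suite_task_1" share one cache key.
function sanitizeId(id: string): string {
  return id.replace(/[^A-Za-z0-9._-]/g, "_");
}

// Stand-in hash (32-bit FNV-1a); the PR's real hash function is not shown.
function hashInstruction(instruction: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < instruction.length; i++) {
    h ^= instruction.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, "0");
}

// Returns the cached rubric only when BOTH the task ID and the hash match.
function readCached(
  store: Map<string, CacheEntry>,
  task: TaskSpec,
): string[] | undefined {
  const entry = store.get(sanitizeId(task.id));
  if (entry === undefined) return undefined;
  if (entry.taskId !== task.id) return undefined; // sanitized-ID collision
  if (entry.instructionHash !== hashInstruction(task.instruction)) {
    return undefined; // instruction changed since the entry was written
  }
  return entry.rubric;
}
```

With only the hash check, two tasks that share an instruction and a sanitized ID would be indistinguishable; the extra `taskId` comparison turns that case into a cache miss.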

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/evals/tui/commands/verify.ts">

<violation number="1" location="packages/evals/tui/commands/verify.ts:89">
P2: Missing validation when `--model` or `--label` is passed without a following value. If either is the last argument, `args[++i]` is `undefined` and the flag is silently ignored rather than producing an error.</violation>
</file>

<file name="packages/evals/framework/rubricCache.ts">

<violation number="1" location="packages/evals/framework/rubricCache.ts:95">
P2: The `read` method does not verify `parsed.taskId` matches `taskSpec.id`. Since `entryPath` sanitizes characters (`:`, `/`, etc.) to `_`, distinct task IDs can map to the same file. When instruction hashes also happen to match, a stale/wrong rubric is served silently. Add a `taskId` equality check alongside the `instructionHash` check.</violation>
</file>

<file name="packages/evals/scripts/verify-live-trajectory.ts">

<violation number="1" location="packages/evals/scripts/verify-live-trajectory.ts:38">
P2: Playwright's `page.goto()` accepts `timeout`, not `timeoutMs`. This option is silently ignored, so the navigation falls back to the default 30s timeout instead of the intended 60s.</violation>
</file>

Architecture diagram

sequenceDiagram
    participant CLI as CLI / terminal
    participant CmdRouter as Command Router (cli.ts)
    participant VerifyCmd as verify Command
    participant Trajectory as Trajectory Dir (disk)
    participant RubricCache as RubricCache
    participant V3Eval as V3Evaluator (verifier backend)
    participant V3 as V3 instance (headless)
    participant TrajectoryRecorder as TrajectoryRecorder (live)

    Note over CLI,V3: NEW: Offline verify path (red arrow) vs existing live run (blue arrows)

    CLI->>CmdRouter: evals verify <trajectory-dir> [options]
    CmdRouter->>VerifyCmd: handleVerify(args)
    VerifyCmd->>Trajectory: read trajectory.json + task_data.json
    Trajectory-->>VerifyCmd: Trajectory + TaskSpec
    VerifyCmd->>V3Eval: new V3Evaluator(v3, {backend:"verifier"})
    Note over VerifyCmd,V3Eval: No browser launched — V3 constructed without init()
    VerifyCmd->>V3Eval: verify(trajectory, taskSpec)
    V3Eval->>RubricCache: getOrGenerate(taskSpec, evaluator)
    alt Cache hit (same instruction hash)
        RubricCache-->>V3Eval: cached Rubric
    else Cache miss or hash drift
        RubricCache->>RubricCache: hashInstruction(taskSpec.instruction)
        V3Eval->>V3Eval: generateRubric(taskSpec) — Step 0a
        V3Eval->>RubricCache: write(taskSpec, rubric)
        RubricCache-->>V3Eval: freshly generated Rubric
    end
    V3Eval->>V3Eval: score trajectory against rubric — Step 8
    V3Eval-->>VerifyCmd: Verdict (outcomeSuccess, processScore, perCriterion)
    alt --json flag
        VerifyCmd->>CLI: JSON stringified Verdict to stdout
    else default (human summary)
        VerifyCmd->>CLI: colored summary (score, criteria, findings)
        alt --dry-run not set
            VerifyCmd->>Trajectory: write scores/mmrubric_<label>.json
        end
    end

    Note over CLI,Trajectory: Live run path (unchanged, shown for context)
    CLI->>CmdRouter: evals run <target>
    CmdRouter->>V3: agent.execute(instruction)
    V3->>TrajectoryRecorder: start() — subscribe to bus events
    V3->>V3: perform browser automation steps
    V3->>TrajectoryRecorder: capture step events (screenshots, URLs, evidence)
    V3-->>CmdRouter: agent result
    TrajectoryRecorder->>Trajectory: persist() — write trajectory.json, screenshots, task_data.json, times.json
    CmdRouter->>CLI: run summary

    Note over CmdRouter,V3: Success mode plumbing (--success flag)
    CmdRouter->>CmdRouter: resolve successMode from --success / EVAL_SUCCESS_MODE / "outcome"
    CmdRouter->>V3: envOverrides.EVAL_SUCCESS_MODE = successMode


cubic-dev-ai · 2026-05-15T21:04:13Z

+      parsed.json = true;
+    } else if (a === "--dry-run") {
+      parsed.dryRun = true;
+    } else if (a === "--model") {


P2: Missing validation when --model or --label is passed without a following value. If either is the last argument, args[++i] is undefined and the flag is silently ignored rather than producing an error.

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At packages/evals/tui/commands/verify.ts, line 89:

<comment>Missing validation when `--model` or `--label` is passed without a following value. If either is the last argument, `args[++i]` is `undefined` and the flag is silently ignored rather than producing an error.</comment>

<file context>
@@ -0,0 +1,238 @@
+      parsed.json = true;
+    } else if (a === "--dry-run") {
+      parsed.dryRun = true;
+    } else if (a === "--model") {
+      parsed.model = args[++i];
+    } else if (a === "--label") {
</file context>
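
One way to address the missing-value case is sketched below. It is an illustration under assumptions, not the fix that landed: `parseVerifyArgs`, `VerifyArgs`, `takeValue`, and the `trajectoryDir` field are hypothetical names, and rejecting unknown flags or a value that itself starts with `--` is a design choice of this sketch. Only the four flags come from the diff and review text.

```typescript
// Hypothetical result shape; flag names follow the diff excerpt above.
interface VerifyArgs {
  trajectoryDir?: string;
  json: boolean;
  dryRun: boolean;
  model?: string;
  label?: string;
}

function parseVerifyArgs(args: string[]): VerifyArgs {
  const parsed: VerifyArgs = { json: false, dryRun: false };

  // Read the value that follows a flag, or fail loudly when it is absent.
  const takeValue = (flag: string, index: number): string => {
    const value: string | undefined = args[index];
    if (value === undefined || value.startsWith("--")) {
      throw new Error(`${flag} requires a value`);
    }
    return value;
  };

  for (let i = 0; i < args.length; i++) {
    const a = args[i] as string;
    if (a === "--json") {
      parsed.json = true;
    } else if (a === "--dry-run") {
      parsed.dryRun = true;
    } else if (a === "--model") {
      parsed.model = takeValue(a, ++i);
    } else if (a === "--label") {
      parsed.label = takeValue(a, ++i);
    } else if (a.startsWith("--")) {
      throw new Error(`Unknown flag: ${a}`);
    } else if (parsed.trajectoryDir === undefined) {
      parsed.trajectoryDir = a; // first positional argument
    } else {
      throw new Error(`Unexpected argument: ${a}`);
    }
  }
  return parsed;
}
```

With this shape, `evals verify <trajectory-dir> --label` would raise "--label requires a value" instead of silently running without a label.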

cubic-dev-ai

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/evals/tui/commandTree.ts">

<violation number="1" location="packages/evals/tui/commandTree.ts:337">
P2: Add `return process.exit(0)` instead of bare `process.exit(0)` to satisfy eslint no-fallthrough.

The old `case "exit"` block had an explicit eslint suppression because it relied on process.exit(0) as the terminator. After reordering, the block is now before `case "help"`/`case "help-q"` but the suppression was lost — causing a likely lint CI failure.</violation>
</file>


cubic-dev-ai · 2026-05-19T00:31:32Z

+        throw new Error("`exit` is not available outside the REPL");
+      }
+      console.log(dim("\n  Goodbye.\n"));
+      process.exit(0);


P2: Add return process.exit(0) instead of bare process.exit(0) to satisfy eslint no-fallthrough.

The old case "exit" block had an explicit eslint suppression because it relied on process.exit(0) as the terminator. After reordering, the block is now before case "help"/case "help-q" but the suppression was lost — causing a likely lint CI failure.

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At packages/evals/tui/commandTree.ts, line 337:

<comment>Add `return process.exit(0)` instead of bare `process.exit(0)` to satisfy eslint no-fallthrough. The old `case "exit"` block had an explicit eslint suppression because it relied on process.exit(0) as the terminator. After reordering, the block is now before `case "help"`/`case "help-q"` but the suppression was lost, causing a likely lint CI failure.</comment>

<file context>
@@ -337,6 +329,14 @@ async function runMeta(
+        throw new Error("`exit` is not available outside the REPL");
+      }
+      console.log(dim("\n  Goodbye.\n"));
+      process.exit(0);
+    }
+
</file context>

Suggested change

-      process.exit(0);
+      return process.exit(0);
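
The pattern behind this suggestion can be sketched in isolation. The dispatcher below is hypothetical and much smaller than the real `runMeta`; the `exit` callback is injected only so the sketch can run without terminating the process. The point it illustrates is that ESLint's no-fallthrough rule looks for a syntactic terminator (`break`, `return`, `throw`), so a call that never returns still needs an explicit `return` when another `case` follows it.

```typescript
// Hypothetical dispatcher; `exit` is injected so the sketch is testable.
type Exit = (code: number) => never;

function runMeta(command: string, exit: Exit): string {
  switch (command) {
    case "clear":
      return "cleared";
    case "exit":
      // The linter cannot tell that `exit` never returns, so the explicit
      // `return` marks the end of this case for the no-fallthrough rule.
      return exit(0);
    case "help":
      return "help text";
    default:
      throw new Error(`Unknown command: ${command}`);
  }
}
```

In a real CLI the callback would simply be `process.exit`, whose return type is also `never`.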

case "clear" now precedes case "exit", so there's no fall-through risk and the no-fallthrough disable comment is no longer needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cubic-dev-ai Bot reviewed May 15, 2026


miguelg719 force-pushed the miguelgonzalez/verifier-05-core-engine branch from 163db47 to ebe60bf on May 15, 2026 21:23

miguelg719 force-pushed the miguelgonzalez/verifier-06-offline-cli branch 2 times, most recently from 4923ce6 to dcc5bfc on May 15, 2026 21:45

miguelg719 force-pushed the miguelgonzalez/verifier-05-core-engine branch from ebe60bf to 191904b on May 15, 2026 21:45

miguelg719 force-pushed the miguelgonzalez/verifier-06-offline-cli branch from dcc5bfc to d736522 on May 15, 2026 22:33

miguelg719 force-pushed the miguelgonzalez/verifier-05-core-engine branch from 191904b to 62cb8db on May 15, 2026 22:33

miguelg719 force-pushed the miguelgonzalez/verifier-06-offline-cli branch from d736522 to cd1f8f4 on May 15, 2026 23:27

miguelg719 force-pushed the miguelgonzalez/verifier-05-core-engine branch 2 times, most recently from a6ee702 to 2e7ff0f on May 16, 2026 04:40

miguelg719 force-pushed the miguelgonzalez/verifier-06-offline-cli branch from cd1f8f4 to 95ada04 on May 16, 2026 04:40

miguelg719 force-pushed the miguelgonzalez/verifier-05-core-engine branch from 2e7ff0f to b725247 on May 16, 2026 05:50

miguelg719 force-pushed the miguelgonzalez/verifier-06-offline-cli branch 2 times, most recently from 4f141e7 to c4fee88 on May 18, 2026 00:13

miguelg719 force-pushed the miguelgonzalez/verifier-05-core-engine branch from cabb4c5 to 5f74a4b on May 18, 2026 23:54

miguelg719 force-pushed the miguelgonzalez/verifier-06-offline-cli branch from c4fee88 to ad66c72 on May 18, 2026 23:54

miguelg719 force-pushed the miguelgonzalez/verifier-05-core-engine branch from 5f74a4b to f0fefe5 on May 19, 2026 00:01

miguelg719 force-pushed the miguelgonzalez/verifier-06-offline-cli branch from ad66c72 to 7d418f6 on May 19, 2026 00:01

miguelg719 force-pushed the miguelgonzalez/verifier-05-core-engine branch from f0fefe5 to e68e8f0 on May 19, 2026 00:02

miguelg719 force-pushed the miguelgonzalez/verifier-06-offline-cli branch 2 times, most recently from 22e8732 to b388441 on May 19, 2026 00:26

cubic-dev-ai Bot reviewed May 19, 2026


miguelg719 force-pushed the miguelgonzalez/verifier-05-core-engine branch from 623bea5 to e5244c2 on May 19, 2026 00:34

miguelg719 force-pushed the miguelgonzalez/verifier-06-offline-cli branch from dfadc39 to bdc21d6 on May 19, 2026 00:34

miguelg719 force-pushed the miguelgonzalez/verifier-05-core-engine branch from e5244c2 to c3623e8 on May 19, 2026 00:36

miguelg719 force-pushed the miguelgonzalez/verifier-06-offline-cli branch from bdc21d6 to bfe4476 on May 19, 2026 00:36

miguelg719 added 3 commits May 18, 2026 17:49

feat(evals): add offline verifier CLI

95979e3

fix(evals): use camel raw verifier metadata

4cdf768

fix(evals): restore command tree verifier cli

fe1f0b6

miguelg719 and others added 4 commits May 18, 2026 17:49

fix(evals): include doctor in restored help

ff0e784

docs(evals): remove rollout comments from offline verifier

9b2aa39

fix(evals): align offline verifier result naming

b4a9f3a

refactor(tui): reorder case "clear"/case "exit" to drop eslint-disable

9b66b78

case "clear" now precedes case "exit", so there's no fall-through risk and the no-fallthrough disable comment is no longer needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

miguelg719 force-pushed the miguelgonzalez/verifier-05-core-engine branch from c3623e8 to 0d70b72 on May 19, 2026 00:49

miguelg719 force-pushed the miguelgonzalez/verifier-06-offline-cli branch from bfe4476 to 9b66b78 on May 19, 2026 00:50

miguelg719 wants to merge 7 commits into miguelgonzalez/verifier-05-core-engine from miguelgonzalez/verifier-06-offline-cli
