feat(evals): add offline verifier CLI#2134
Conversation
|
There was a problem hiding this comment.
3 issues found across 9 files
Confidence score: 3/5
- There is some concrete regression risk here:
packages/evals/framework/rubricCache.tsreaddoes not confirmparsed.taskId === taskSpec.id, so sanitized ID collisions can return the wrong cached rubric data when hashes align. - Two CLI/runtime behaviors are likely to confuse users but are straightforward to fix:
packages/evals/tui/commands/verify.tssilently ignores trailing--model/--labelwithout values, andpackages/evals/scripts/verify-live-trajectory.tspassestimeoutMstopage.goto()(ignored), causing fallback to Playwright’s default timeout. - Given one medium-severity correctness issue plus two medium input/timeout handling issues, this looks mergeable with caution after targeted fixes rather than a hard block.
- Pay close attention to
packages/evals/framework/rubricCache.ts,packages/evals/tui/commands/verify.ts, andpackages/evals/scripts/verify-live-trajectory.ts- cache key/task ID validation, missing flag-value errors, and ignored navigation timeout options need verification.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/evals/tui/commands/verify.ts">
<violation number="1" location="packages/evals/tui/commands/verify.ts:89">
P2: Missing validation when `--model` or `--label` is passed without a following value. If either is the last argument, `args[++i]` is `undefined` and the flag is silently ignored rather than producing an error.</violation>
</file>
<file name="packages/evals/framework/rubricCache.ts">
<violation number="1" location="packages/evals/framework/rubricCache.ts:95">
P2: The `read` method does not verify `parsed.taskId` matches `taskSpec.id`. Since `entryPath` sanitizes characters (`:`, `/`, etc.) to `_`, distinct task IDs can map to the same file. When instruction hashes also happen to match, a stale/wrong rubric is served silently. Add a `taskId` equality check alongside the `instructionHash` check.</violation>
</file>
<file name="packages/evals/scripts/verify-live-trajectory.ts">
<violation number="1" location="packages/evals/scripts/verify-live-trajectory.ts:38">
P2: Playwright's `page.goto()` accepts `timeout`, not `timeoutMs`. This option is silently ignored, so the navigation falls back to the default 30s timeout instead of the intended 60s.</violation>
</file>
Architecture diagram
sequenceDiagram
participant CLI as CLI / terminal
participant CmdRouter as Command Router (cli.ts)
participant VerifyCmd as verify Command
participant Trajectory as Trajectory Dir (disk)
participant RubricCache as RubricCache
participant V3Eval as V3Evaluator (verifier backend)
participant V3 as V3 instance (headless)
participant TrajectoryRecorder as TrajectoryRecorder (live)
Note over CLI,V3: NEW: Offline verify path (red arrow) vs existing live run (blue arrows)
CLI->>CmdRouter: evals verify <trajectory-dir> [options]
CmdRouter->>VerifyCmd: handleVerify(args)
VerifyCmd->>Trajectory: read trajectory.json + task_data.json
Trajectory-->>VerifyCmd: Trajectory + TaskSpec
VerifyCmd->>V3Eval: new V3Evaluator(v3, {backend:"verifier"})
Note over VerifyCmd,V3Eval: No browser launched — V3 constructed without init()
VerifyCmd->>V3Eval: verify(trajectory, taskSpec)
V3Eval->>RubricCache: getOrGenerate(taskSpec, evaluator)
alt Cache hit (same instruction hash)
RubricCache-->>V3Eval: cached Rubric
else Cache miss or hash drift
RubricCache->>RubricCache: hashInstruction(taskSpec.instruction)
V3Eval->>V3Eval: generateRubric(taskSpec) — Step 0a
V3Eval->>RubricCache: write(taskSpec, rubric)
RubricCache-->>V3Eval: freshly generated Rubric
end
V3Eval->>V3Eval: score trajectory against rubric — Step 8
V3Eval-->>VerifyCmd: Verdict (outcomeSuccess, processScore, perCriterion)
alt --json flag
VerifyCmd->>CLI: JSON stringified Verdict to stdout
else default (human summary)
VerifyCmd->>CLI: colored summary (score, criteria, findings)
alt --dry-run not set
VerifyCmd->>Trajectory: write scores/mmrubric_<label>.json
end
end
Note over CLI,Trajectory: Live run path (unchanged, shown for context)
CLI->>CmdRouter: evals run <target>
CmdRouter->>V3: agent.execute(instruction)
V3->>TrajectoryRecorder: start() — subscribe to bus events
V3->>V3: perform browser automation steps
V3->>TrajectoryRecorder: capture step events (screenshots, URLs, evidence)
V3-->>CmdRouter: agent result
TrajectoryRecorder->>Trajectory: persist() — write trajectory.json, screenshots, task_data.json, times.json
CmdRouter->>CLI: run summary
Note over CmdRouter,V3: Success mode plumbing (--success flag)
CmdRouter->>CmdRouter: resolve successMode from --success / EVAL_SUCCESS_MODE / "outcome"
CmdRouter->>V3: envOverrides.EVAL_SUCCESS_MODE = successMode
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.
Re-trigger cubic
| parsed.json = true; | ||
| } else if (a === "--dry-run") { | ||
| parsed.dryRun = true; | ||
| } else if (a === "--model") { |
There was a problem hiding this comment.
P2: Missing validation when --model or --label is passed without a following value. If either is the last argument, args[++i] is undefined and the flag is silently ignored rather than producing an error.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/evals/tui/commands/verify.ts, line 89:
<comment>Missing validation when `--model` or `--label` is passed without a following value. If either is the last argument, `args[++i]` is `undefined` and the flag is silently ignored rather than producing an error.</comment>
<file context>
@@ -0,0 +1,238 @@
+ parsed.json = true;
+ } else if (a === "--dry-run") {
+ parsed.dryRun = true;
+ } else if (a === "--model") {
+ parsed.model = args[++i];
+ } else if (a === "--label") {
</file context>
163db47 to
ebe60bf
Compare
4923ce6 to
dcc5bfc
Compare
ebe60bf to
191904b
Compare
dcc5bfc to
d736522
Compare
191904b to
62cb8db
Compare
d736522 to
cd1f8f4
Compare
a6ee702 to
2e7ff0f
Compare
cd1f8f4 to
95ada04
Compare
2e7ff0f to
b725247
Compare
4f141e7 to
c4fee88
Compare
cabb4c5 to
5f74a4b
Compare
c4fee88 to
ad66c72
Compare
5f74a4b to
f0fefe5
Compare
ad66c72 to
7d418f6
Compare
f0fefe5 to
e68e8f0
Compare
22e8732 to
b388441
Compare
There was a problem hiding this comment.
1 issue found across 1 file (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/evals/tui/commandTree.ts">
<violation number="1" location="packages/evals/tui/commandTree.ts:337">
P2: Add `return process.exit(0)` instead of bare `process.exit(0)` to satisfy eslint no-fallthrough.
The old `case "exit"` block had an explicit eslint suppression because it relied on process.exit(0) as the terminator. After reordering, the block is now before `case "help"`/`case "help-q"` but the suppression was lost — causing a likely lint CI failure.</violation>
</file>
Reply with feedback, questions, or to request a fix.
Fix all with cubic | Re-trigger cubic
| throw new Error("`exit` is not available outside the REPL"); | ||
| } | ||
| console.log(dim("\n Goodbye.\n")); | ||
| process.exit(0); |
There was a problem hiding this comment.
P2: Add return process.exit(0) instead of bare process.exit(0) to satisfy eslint no-fallthrough.
The old case "exit" block had an explicit eslint suppression because it relied on process.exit(0) as the terminator. After reordering, the block is now before case "help"/case "help-q" but the suppression was lost — causing a likely lint CI failure.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/evals/tui/commandTree.ts, line 337:
<comment>Add `return process.exit(0)` instead of bare `process.exit(0)` to satisfy eslint no-fallthrough.
The old `case "exit"` block had an explicit eslint suppression because it relied on process.exit(0) as the terminator. After reordering, the block is now before `case "help"`/`case "help-q"` but the suppression was lost — causing a likely lint CI failure.</comment>
<file context>
@@ -337,6 +329,14 @@ async function runMeta(
+ throw new Error("`exit` is not available outside the REPL");
+ }
+ console.log(dim("\n Goodbye.\n"));
+ process.exit(0);
+ }
+
</file context>
| process.exit(0); | |
| return process.exit(0); |
623bea5 to
e5244c2
Compare
dfadc39 to
bdc21d6
Compare
e5244c2 to
c3623e8
Compare
bdc21d6 to
bfe4476
Compare
case "clear" now precedes case "exit" so there's no fall-through risk and the no-fallthrough disable comment is no longer needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
c3623e8 to
0d70b72
Compare
bfe4476 to
9b66b78
Compare
Why
Verifier iteration should not require rerunning browser automation. This PR adds offline saved-trajectory rescoring so prompts, approaches, and scoring behavior can be compared against the same trajectory artifacts without changing the existing CLI architecture.
What Changed
evals verify <trajectory-dir>for offline trajectory rescoring through the existing command-tree dispatch.taskIdand instruction hash before returning cached data.Tests
pnpm --filter @browserbasehq/stagehand run typecheckpnpm --filter @browserbasehq/stagehand-evals run typecheckpnpm --dir packages/evals exec vitest run tests/framework/rubricCache.test.ts tests/framework/verifierAdapter.test.ts tests/tui/commandTree.test.ts tests/framework/claudeCodeRunner.test.ts tests/framework/codexRunner.test.ts tests/cli.test.tspnpm -w exec prettier --check packages/core/lib/v3/verifier packages/core/tests/unit/verifier-failure-step-parser.test.ts packages/evals/cli.ts packages/evals/framework packages/evals/tests/framework/rubricCache.test.ts packages/evals/tests/framework/verifierAdapter.test.ts packages/evals/tests/tui/commandTree.test.ts packages/evals/tui packages/evals/tasks/bench/agentgit diff --check