feat(verifier): record agent trajectories #2131
Open
miguelg719 wants to merge 9 commits into
🦋 Changeset detected
Latest commit: 7ba3a3f
The changes in this PR will be included in the next version bump. This PR includes changesets to release 4 packages.
Contributor
No issues found across 8 files
Confidence score: 5/5
- Automated review surfaced no issues in the provided summaries.
- No files require special attention.
Architecture diagram
sequenceDiagram
participant Agent as Agent Handlers
participant Bus as Event Bus
participant Recorder as TrajectoryRecorder
participant FS as File System
participant Page as Browser Page
Note over Agent,FS: NEW: Step-level evidence capture (DOM/Hybrid mode)
Agent->>Agent: onStepFinish callback fires
Agent->>Agent: stepCounter++ (per tool call)
Agent->>Bus: emit agent_step_finished_event
Note over Bus: stepIndex, actionName, actionArgs, reasoning, toolOutput, finishedAt
Agent->>Page: page.screenshot() (post-step probe)
Page-->>Agent: screenshot Buffer
Agent->>Agent: captureAriaTreeProbe(v3)
Note over Agent: Best-effort, token-budgeted a11y tree capture
Agent-->>Agent: ariaTree string | undefined
loop For each tool call in turn
Agent->>Bus: emit agent_screenshot_taken_event
Note over Bus: stepIndex, screenshot, url, evidenceRole: "probe"
Agent->>Bus: emit agent_step_observed_event
Note over Bus: stepIndex, url, ariaTree (optional), scroll (optional)
end
opt done tool call present
Agent->>Agent: Build lastFinalAnswer
Agent->>Bus: emit agent_final_answer_event
end
Note over Agent,FS: NEW: Step-level evidence capture (CUA mode)
Agent->>Agent: screenshotProvider called
Agent->>Page: page.screenshot()
Page-->>Agent: screenshot Buffer
Agent->>Bus: emit agent_screenshot_taken_event
Note over Bus: stepIndex++, screenshot, url, evidenceRole: "agent"
Agent->>Agent: executeAction(action)
Agent->>Agent: emitCuaActionStep()
Agent->>Bus: emit agent_step_finished_event
Note over Bus: stepIndex paired with preceding screenshot
Agent->>Page: page.screenshot() (post-action probe)
Page-->>Agent: probe screenshot
Agent->>Bus: emit agent_screenshot_taken_event
Note over Bus: same stepIndex, screenshot, url, evidenceRole: "probe"
Agent->>Agent: captureAriaTreeProbe(v3)
Agent->>Bus: emit agent_step_observed_event
Note over Bus: stepIndex, url, ariaTree (optional)
Note over Agent,FS: NEW: Trajectory assembly and persistence
Recorder->>Bus: subscribe to agent_step_finished_event
Recorder->>Bus: subscribe to agent_screenshot_taken_event
Recorder->>Bus: subscribe to agent_step_observed_event
Recorder->>Bus: subscribe to agent_final_answer_event
Bus-->>Recorder: events arrive (may be out-of-order)
Recorder->>Recorder: ensurePartial(stepIndex)
Recorder->>Recorder: Merge evidence into partial steps
alt persistEnabled (env-gated by VERIFIER_PERSIST_TRAJECTORIES)
Recorder->>Recorder: assembleSteps()
Recorder->>FS: mkdir -p .trajectories/{runId}/{taskId}/
Recorder->>FS: write trajectory.json
Recorder->>FS: write core.log
Recorder->>FS: write task_data.json
Recorder->>FS: write times.json
Recorder->>FS: write screenshots/probe/{1..N}.png
alt verdict provided
Recorder->>FS: write scores/mmrubric_v1.json
Recorder->>FS: update task_data.json with verdict
end
else persistence disabled
Recorder->>Recorder: Return in-memory Trajectory only
end
Recorder-->>Agent: Trajectory object
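The assembly phase of the diagram (subscribe, `ensurePartial`, merge, `assembleSteps`) can be sketched as below. This is a minimal illustration, not the PR's `TrajectoryRecorder`: `TinyBus`, `SketchRecorder`, and the payload interfaces are hypothetical, with field names borrowed from the diagram notes. The final-answer event and persistence are left out, and the sketch assumes steps are simply ordered by `stepIndex`.

```typescript
// Hypothetical payload shapes; only the field names come from the diagram notes.
interface StepFinished { stepIndex: number; actionName: string; reasoning?: string }
interface ScreenshotTaken {
  stepIndex: number;
  screenshot: Uint8Array;
  url: string;
  evidenceRole: "agent" | "probe";
}
interface StepObserved { stepIndex: number; url: string; ariaTree?: string }

// Evidence collected so far for one step; fields fill in as events arrive.
interface PartialStep {
  stepIndex: number;
  action?: StepFinished;
  screenshots: ScreenshotTaken[];
  observation?: StepObserved;
}

// Stand-in for the real event bus (assumption): synchronous, in-process.
class TinyBus {
  private handlers = new Map<string, Array<(payload: any) => void>>();
  on(event: string, handler: (payload: any) => void): void {
    const list = this.handlers.get(event) ?? [];
    list.push(handler);
    this.handlers.set(event, list);
  }
  emit(event: string, payload: unknown): void {
    for (const handler of this.handlers.get(event) ?? []) handler(payload);
  }
}

class SketchRecorder {
  private partials = new Map<number, PartialStep>();

  constructor(bus: TinyBus) {
    bus.on("agent_step_finished_event", (e: StepFinished) => {
      this.ensurePartial(e.stepIndex).action = e;
    });
    bus.on("agent_screenshot_taken_event", (e: ScreenshotTaken) => {
      this.ensurePartial(e.stepIndex).screenshots.push(e);
    });
    bus.on("agent_step_observed_event", (e: StepObserved) => {
      this.ensurePartial(e.stepIndex).observation = e;
    });
  }

  // Create the slot on first sight of a stepIndex, whichever event arrives first.
  private ensurePartial(stepIndex: number): PartialStep {
    let partial = this.partials.get(stepIndex);
    if (!partial) {
      partial = { stepIndex, screenshots: [] };
      this.partials.set(stepIndex, partial);
    }
    return partial;
  }

  // Arrival order is irrelevant: steps are ordered by stepIndex at assembly time.
  assembleSteps(): PartialStep[] {
    return [...this.partials.values()].sort((a, b) => a.stepIndex - b.stepIndex);
  }
}
```

Keying partial steps by `stepIndex` is what lets a probe screenshot or ARIA observation arrive before the step's own `agent_step_finished_event` without being dropped.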
miguelg719
commented
May 15, 2026
miguelg719
commented
May 15, 2026
Contributor
1 issue found across 4 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/core/lib/v3/verifier/trajectory.ts">
<violation number="1" location="packages/core/lib/v3/verifier/trajectory.ts:256">
P1: The new on-disk image format (`imagePath`) is not compatible with `loadTrajectoryFromDisk`, so reloaded trajectories can lose tier-1 image bytes.</violation>
</file>
  await fs.writeFile(path.join(dir, relPath), m.bytes);
  modalities.push({
    type: "image",
    imagePath: relPath,
Contributor
P1: The new on-disk image format (imagePath) is not compatible with loadTrajectoryFromDisk, so reloaded trajectories can lose tier-1 image bytes.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/core/lib/v3/verifier/trajectory.ts, line 256:
<comment>The new on-disk image format (`imagePath`) is not compatible with `loadTrajectoryFromDisk`, so reloaded trajectories can lose tier-1 image bytes.</comment>
<file context>
@@ -187,3 +189,138 @@ export async function loadTrajectoryFromDisk(dir: string): Promise<Trajectory> {
+ await fs.writeFile(path.join(dir, relPath), m.bytes);
+ modalities.push({
+ type: "image",
+ imagePath: relPath,
+ mediaType: m.mediaType,
+ });
</file context>
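One way to read the P1 finding: once the writer stores `imagePath` instead of inline bytes, the loader has to resolve that path and read the file back, or the reloaded trajectory carries no image bytes. The sketch below shows that read-back step in isolation. `rehydrateImage`, `demoRoundTrip`, and both interfaces are hypothetical (only `type`, `imagePath`, `mediaType`, and `bytes` appear in the diff context), and it uses synchronous reads for brevity even though `loadTrajectoryFromDisk` is async in the PR.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// On-disk shape, as pushed by the writer in the diff context above.
interface ImageOnDisk { type: "image"; imagePath: string; mediaType: string }

// Assumed in-memory shape: the same modality with the bytes restored.
interface ImageInMemory { type: "image"; bytes: Uint8Array; mediaType: string }

// Resolve imagePath relative to the trajectory directory and read the bytes back.
function rehydrateImage(dir: string, m: ImageOnDisk): ImageInMemory {
  const buffer = fs.readFileSync(path.join(dir, m.imagePath));
  return { type: "image", bytes: new Uint8Array(buffer), mediaType: m.mediaType };
}

// Round trip against a temp directory: write bytes as the writer would, then reload.
function demoRoundTrip(): ImageInMemory {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "trajectory-sketch-"));
  const relPath = path.join("screenshots", "probe", "1.png");
  fs.mkdirSync(path.dirname(path.join(dir, relPath)), { recursive: true });
  fs.writeFileSync(path.join(dir, relPath), new Uint8Array([137, 80, 78, 71]));
  return rehydrateImage(dir, { type: "image", imagePath: relPath, mediaType: "image/png" });
}
```

A round-trip test of this kind (write, reload, compare bytes) would presumably catch the incompatibility the review describes.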
Lift the on-disk persistence helpers from TrajectoryRecorder into verifier/trajectory.ts so #2137's harness adapter can share them. Also drop the recorder's no-op .replace("T","T") and the WHAT-narration comments per project policy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- v3AgentHandler / v3CuaAgentHandler use optional-chained listenerCount so test mocks without one (captcha-hooks, temperature) don't blow up.
- Add bus stub to the agent-temperature createV3() mock so bus.emit doesn't NPE on the new agent_step_finished_event emit.
- Add BUS_EVENTS, shouldPersistTrajectory, writeTrajectoryDir to the export-surface snapshot; these are intentional new public exports.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
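The optional-chained `listenerCount` guard mentioned in this commit might look roughly like the sketch below. `BusLike` and `hasListeners` are hypothetical names, and treating a missing `listenerCount` as "no listeners" is an assumption; the handlers in the PR may choose a different default.

```typescript
// Minimal bus surface; listenerCount is optional because some test mocks omit it.
interface BusLike {
  emit(event: string, payload: unknown): boolean;
  listenerCount?: (event: string) => number;
}

// Optional call: if the mock has no listenerCount, the expression yields undefined
// instead of throwing, and the nullish fallback treats that as zero listeners.
function hasListeners(bus: BusLike, event: string): boolean {
  return (bus.listenerCount?.(event) ?? 0) > 0;
}
```

A handler could use a check like this to skip post-step probes when nothing is subscribed, while still working against mocks that only stub `emit`.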
Why
The new verifier needs richer evidence than a final screenshot, especially for DOM and Hybrid agent modes where the important facts often live in tool returns, ARIA snapshots, and per-step observations. This PR adds trajectory recording without changing the verifier judgment engine.
What Changed
TrajectoryRecorder persistence and a smoke script for trajectory shape and disk layout.
Tests
pnpm --filter @browserbasehq/stagehand run typecheck
pnpm --filter @browserbasehq/stagehand-evals run typecheck
node --import tsx packages/evals/scripts/verify-trajectory-recorder.ts
git diff --check
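A disk-layout check of the kind the smoke script is described as covering could start from a list of expected paths. The sketch below derives that list from the layout shown in the architecture diagram. `expectedTrajectoryFiles` and its parameters are hypothetical and not part of the PR, and it assumes probe screenshots are numbered 1 through N as the diagram's `{1..N}.png` suggests.

```typescript
// Build the relative paths the diagram's persistence branch writes for one task.
function expectedTrajectoryFiles(
  runId: string,
  taskId: string,
  probeCount: number,
  hasVerdict: boolean,
): string[] {
  const root = `.trajectories/${runId}/${taskId}`;
  const files = ["trajectory.json", "core.log", "task_data.json", "times.json"].map(
    (name) => `${root}/${name}`,
  );
  // One probe screenshot per step, numbered from 1.
  for (let i = 1; i <= probeCount; i++) {
    files.push(`${root}/screenshots/probe/${i}.png`);
  }
  // The score file is only written when a verdict is provided.
  if (hasVerdict) {
    files.push(`${root}/scores/mmrubric_v1.json`);
  }
  return files;
}
```

A check could then assert that each returned path exists after a recorded run with persistence enabled.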