feat(verifier): record agent trajectories#2131

Open
miguelg719 wants to merge 9 commits into miguelgonzalez/verifier-02-backend-routing from miguelgonzalez/verifier-03-trajectory-recorder

@miguelg719 (Collaborator) commented May 15, 2026

Why

The new verifier needs richer evidence than a final screenshot, especially for DOM and Hybrid agent modes where the important facts often live in tool returns, ARIA snapshots, and per-step observations. This PR adds trajectory recording without changing the verifier judgment engine.

What Changed

  • Added typed agent bus events for screenshot, step-finished, step-observed, and final-answer events.
  • Added listener-gated post-step probes for screenshots and ARIA trees.
  • Attached the settled post-turn probe to every tool call in a DOM/Hybrid turn.
  • Added CUA step evidence pairing and final answer capture.
  • Added TrajectoryRecorder persistence and a smoke script for trajectory shape and disk layout.
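
To make the third bullet concrete, here is a minimal sketch of fanning one settled post-turn probe out to every tool call in a turn over a typed bus. The payload fields follow the diagram in this PR, but `TypedBus`, `SettledProbe`, and `fanOutProbe` are hypothetical names used only for illustration, not the PR's actual API.

```typescript
// Hypothetical payload shapes. Field names follow the PR's diagram; the real
// types in the PR may carry more fields.
interface ScreenshotTaken {
  stepIndex: number;
  screenshot: Uint8Array;
  url: string;
  evidenceRole: "agent" | "probe";
}

interface StepObserved {
  stepIndex: number;
  url: string;
  ariaTree?: string;
}

// Event name -> payload type. Emitting a mismatched payload fails to compile.
type AgentBusEvents = {
  agent_screenshot_taken_event: ScreenshotTaken;
  agent_step_observed_event: StepObserved;
};

// Minimal typed bus for illustration only (an assumption, not Stagehand's bus).
class TypedBus<E> {
  private readonly handlers = new Map<keyof E, Array<(payload: unknown) => void>>();

  on<K extends keyof E>(event: K, handler: (payload: E[K]) => void): void {
    const list = this.handlers.get(event) ?? [];
    list.push(handler as (payload: unknown) => void);
    this.handlers.set(event, list);
  }

  emit<K extends keyof E>(event: K, payload: E[K]): void {
    for (const handler of this.handlers.get(event) ?? []) handler(payload);
  }
}

// One probe captured after the turn has settled.
interface SettledProbe {
  screenshot: Uint8Array;
  url: string;
  ariaTree?: string;
}

// Attach the same probe to every tool call (step index) in the turn.
function fanOutProbe(
  bus: TypedBus<AgentBusEvents>,
  stepIndexes: number[],
  probe: SettledProbe,
): void {
  for (const stepIndex of stepIndexes) {
    bus.emit("agent_screenshot_taken_event", {
      stepIndex,
      screenshot: probe.screenshot,
      url: probe.url,
      evidenceRole: "probe",
    });
    bus.emit("agent_step_observed_event", {
      stepIndex,
      url: probe.url,
      ariaTree: probe.ariaTree,
    });
  }
}
```

The probe is captured once and reused, so a turn with several tool calls costs one screenshot and one ARIA capture rather than one per call.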

Tests

  • pnpm --filter @browserbasehq/stagehand run typecheck
  • pnpm --filter @browserbasehq/stagehand-evals run typecheck
  • node --import tsx packages/evals/scripts/verify-trajectory-recorder.ts
  • git diff --check

changeset-bot Bot commented May 15, 2026

🦋 Changeset detected

Latest commit: 7ba3a3f

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 4 packages
| Name | Type |
| --- | --- |
| @browserbasehq/stagehand | Patch |
| @browserbasehq/stagehand-evals | Patch |
| @browserbasehq/stagehand-server-v3 | Patch |
| @browserbasehq/stagehand-server-v4 | Patch |

Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

No issues found across 8 files

Confidence score: 5/5

  • Automated review surfaced no issues in the provided summaries.
  • No files require special attention.
Architecture diagram
sequenceDiagram
    participant Agent as Agent Handlers
    participant Bus as Event Bus
    participant Recorder as TrajectoryRecorder
    participant FS as File System
    participant Page as Browser Page

    Note over Agent,FS: NEW: Step-level evidence capture (DOM/Hybrid mode)

    Agent->>Agent: onStepFinish callback fires
    Agent->>Agent: stepCounter++ (per tool call)
    Agent->>Bus: emit agent_step_finished_event
    Note over Bus: stepIndex, actionName, actionArgs, reasoning, toolOutput, finishedAt
    Agent->>Page: page.screenshot() (post-step probe)
    Page-->>Agent: screenshot Buffer
    Agent->>Agent: captureAriaTreeProbe(v3)
    Note over Agent: Best-effort, token-budgeted a11y tree capture
    Agent-->>Agent: ariaTree string | undefined
    loop For each tool call in turn
        Agent->>Bus: emit agent_screenshot_taken_event
        Note over Bus: stepIndex, screenshot, url, evidenceRole: "probe"
        Agent->>Bus: emit agent_step_observed_event
        Note over Bus: stepIndex, url, ariaTree (optional), scroll (optional)
    end
    opt done tool call present
        Agent->>Agent: Build lastFinalAnswer
        Agent->>Bus: emit agent_final_answer_event
    end

    Note over Agent,FS: NEW: Step-level evidence capture (CUA mode)

    Agent->>Agent: screenshotProvider called
    Agent->>Page: page.screenshot()
    Page-->>Agent: screenshot Buffer
    Agent->>Bus: emit agent_screenshot_taken_event
    Note over Bus: stepIndex++, screenshot, url, evidenceRole: "agent"
    Agent->>Agent: executeAction(action)
    Agent->>Agent: emitCuaActionStep()
    Agent->>Bus: emit agent_step_finished_event
    Note over Bus: stepIndex paired with preceding screenshot
    Agent->>Page: page.screenshot() (post-action probe)
    Page-->>Agent: probe screenshot
    Agent->>Bus: emit agent_screenshot_taken_event
    Note over Bus: same stepIndex, screenshot, url, evidenceRole: "probe"
    Agent->>Agent: captureAriaTreeProbe(v3)
    Agent->>Bus: emit agent_step_observed_event
    Note over Bus: stepIndex, url, ariaTree (optional)

    Note over Agent,FS: NEW: Trajectory assembly and persistence

    Recorder->>Bus: subscribe to agent_step_finished_event
    Recorder->>Bus: subscribe to agent_screenshot_taken_event
    Recorder->>Bus: subscribe to agent_step_observed_event
    Recorder->>Bus: subscribe to agent_final_answer_event
    Bus-->>Recorder: events arrive (may be out-of-order)
    Recorder->>Recorder: ensurePartial(stepIndex)
    Recorder->>Recorder: Merge evidence into partial steps

    alt persistEnabled (env-gated by VERIFIER_PERSIST_TRAJECTORIES)
        Recorder->>Recorder: assembleSteps()
        Recorder->>FS: mkdir -p .trajectories/{runId}/{taskId}/
        Recorder->>FS: write trajectory.json
        Recorder->>FS: write core.log
        Recorder->>FS: write task_data.json
        Recorder->>FS: write times.json
        Recorder->>FS: write screenshots/probe/{1..N}.png
        alt verdict provided
            Recorder->>FS: write scores/mmrubric_v1.json
            Recorder->>FS: update task_data.json with verdict
        end
    else persistence disabled
        Recorder->>Recorder: Return in-memory Trajectory only
    end
    Recorder-->>Agent: Trajectory object
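
The assembly part of the diagram (events may arrive out of order, `ensurePartial(stepIndex)`, then `assembleSteps()`) can be sketched as below. This is an illustration under assumptions: `SketchRecorder`, `PartialStep`, and the handler names are hypothetical, and the real `TrajectoryRecorder` likely tracks more fields.

```typescript
// Hypothetical partial-step record keyed by stepIndex.
interface PartialStep {
  stepIndex: number;
  actionName?: string;
  url?: string;
  ariaTree?: string;
  probeScreenshots: Uint8Array[];
}

class SketchRecorder {
  private readonly partials = new Map<number, PartialStep>();

  // Create the partial on first sight of a stepIndex, whichever event comes first.
  private ensurePartial(stepIndex: number): PartialStep {
    let step = this.partials.get(stepIndex);
    if (!step) {
      step = { stepIndex, probeScreenshots: [] };
      this.partials.set(stepIndex, step);
    }
    return step;
  }

  onStepFinished(e: { stepIndex: number; actionName: string }): void {
    this.ensurePartial(e.stepIndex).actionName = e.actionName;
  }

  onScreenshot(e: { stepIndex: number; screenshot: Uint8Array }): void {
    this.ensurePartial(e.stepIndex).probeScreenshots.push(e.screenshot);
  }

  onObserved(e: { stepIndex: number; url: string; ariaTree?: string }): void {
    const step = this.ensurePartial(e.stepIndex);
    step.url = e.url;
    if (e.ariaTree !== undefined) step.ariaTree = e.ariaTree;
  }

  // Order by stepIndex so arrival order does not leak into the trajectory.
  assembleSteps(): PartialStep[] {
    return Array.from(this.partials.values()).sort(
      (a, b) => a.stepIndex - b.stepIndex,
    );
  }
}
```

Keying on `stepIndex` means a screenshot that arrives before its step-finished event still lands on the right step.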

@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-03-trajectory-recorder branch from 72774c7 to da0c152 Compare May 15, 2026 21:23
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-02-backend-routing branch from d7d2c59 to 2765781 Compare May 15, 2026 21:23
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-03-trajectory-recorder branch from da0c152 to d77e596 Compare May 15, 2026 21:45
Comment thread packages/core/lib/v3/agent/AnthropicCUAClient.ts Outdated
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-03-trajectory-recorder branch from 8e4fbe2 to 56a3465 Compare May 15, 2026 22:33
Comment thread packages/core/lib/v3/agent/AnthropicCUAClient.ts Outdated
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-03-trajectory-recorder branch 3 times, most recently from fd043bc to 635b3d2 Compare May 16, 2026 05:50
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-02-backend-routing branch from 60e4321 to 231a90d Compare May 18, 2026 23:54
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-03-trajectory-recorder branch from 635b3d2 to 0f37a65 Compare May 18, 2026 23:54
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 4 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/core/lib/v3/verifier/trajectory.ts">

<violation number="1" location="packages/core/lib/v3/verifier/trajectory.ts:256">
P1: The new on-disk image format (`imagePath`) is not compatible with `loadTrajectoryFromDisk`, so reloaded trajectories can lose tier-1 image bytes.</violation>
</file>

await fs.writeFile(path.join(dir, relPath), m.bytes);
modalities.push({
type: "image",
imagePath: relPath,
Contributor

@cubic-dev-ai cubic-dev-ai Bot May 19, 2026

P1: The new on-disk image format (`imagePath`) is not compatible with `loadTrajectoryFromDisk`, so reloaded trajectories can lose tier-1 image bytes.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/core/lib/v3/verifier/trajectory.ts, line 256:

<comment>The new on-disk image format (`imagePath`) is not compatible with `loadTrajectoryFromDisk`, so reloaded trajectories can lose tier-1 image bytes.</comment>

<file context>
@@ -187,3 +189,138 @@ export async function loadTrajectoryFromDisk(dir: string): Promise<Trajectory> {
+      await fs.writeFile(path.join(dir, relPath), m.bytes);
+      modalities.push({
+        type: "image",
+        imagePath: relPath,
+        mediaType: m.mediaType,
+      });
</file context>
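
One possible direction for the flagged issue, sketched under assumptions: when loading, resolve each stored `imagePath` against the trajectory directory and read the bytes back. `StoredImage`, `LoadedImage`, and `rehydrateImage` are hypothetical names, the real `loadTrajectoryFromDisk` is async and may be structured differently, and the sketch is synchronous only for brevity.

```typescript
import { readFileSync } from "node:fs";
import * as path from "node:path";

// Hypothetical on-disk and in-memory shapes for an image modality.
type StoredImage = { type: "image"; imagePath: string; mediaType: string };
type LoadedImage = { type: "image"; bytes: Uint8Array; mediaType: string };

// Injectable reader so the helper can be exercised without touching the disk.
type ReadBytes = (absolutePath: string) => Uint8Array;

// Turn a path-based entry back into a byte-carrying entry.
function rehydrateImage(
  dir: string,
  stored: StoredImage,
  read: ReadBytes = (p) => readFileSync(p),
): LoadedImage {
  const bytes = read(path.join(dir, stored.imagePath));
  return { type: "image", bytes, mediaType: stored.mediaType };
}
```

With a helper of this kind, a written-then-reloaded trajectory would carry the same image bytes as the in-memory one, which is the round-trip property the review comment is asking for.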

miguelg719 and others added 7 commits May 18, 2026 17:36
Lift the on-disk persistence helpers from TrajectoryRecorder into
verifier/trajectory.ts so #2137's harness adapter can share them. Also
drop the recorder's no-op .replace("T","T") and the WHAT-narration
comments per project policy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-02-backend-routing branch from 231a90d to 50cdf0a Compare May 19, 2026 00:36
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-03-trajectory-recorder branch from ddb59bb to 3a7ef3f Compare May 19, 2026 00:36
miguelg719 and others added 2 commits May 18, 2026 17:45
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- v3AgentHandler / v3CuaAgentHandler use optional-chained listenerCount
  so test mocks without one (captcha-hooks, temperature) don't blow up.
- Add bus stub to the agent-temperature createV3() mock so bus.emit
  doesn't NPE on the new agent_step_finished_event emit.
- Add BUS_EVENTS, shouldPersistTrajectory, writeTrajectoryDir to the
  export-surface snapshot — these are intentional new public exports.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
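
The optional-chained `listenerCount` check described in the commit message above might look roughly like the following. `BusLike`, `hasListeners`, and `probeIfObserved` are hypothetical names, and treating a missing `listenerCount` as "no listeners" is an assumption rather than something the PR states. The capture callback is synchronous here for brevity; a real screenshot capture is async.

```typescript
import { EventEmitter } from "node:events";

// Loose bus shape: test mocks may provide emit() but no listenerCount().
interface BusLike {
  emit(event: string, payload: unknown): unknown;
  listenerCount?(event: string): number;
}

// Assumption: a bus without listenerCount is treated as having no listeners.
function hasListeners(bus: BusLike | undefined, event: string): boolean {
  return (bus?.listenerCount?.(event) ?? 0) > 0;
}

// Skip the (potentially expensive) capture unless someone is listening.
function probeIfObserved(
  bus: BusLike | undefined,
  event: string,
  capture: () => Uint8Array,
): boolean {
  if (!bus || !hasListeners(bus, event)) return false;
  bus.emit(event, { screenshot: capture(), evidenceRole: "probe" });
  return true;
}

// Usage: Node's EventEmitter satisfies BusLike; a bare mock also works.
const demoBus = new EventEmitter();
demoBus.on("agent_screenshot_taken_event", () => {});
probeIfObserved(demoBus, "agent_screenshot_taken_event", () => new Uint8Array([1]));
```

Gating on listeners keeps runs without a recorder from paying for probes nobody consumes, and the optional chain keeps minimal mocks from throwing.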