Step 6: baseline browser-task generalist by rockfordlhotka · Pull Request #9 · MarimerLLC/foragent

rockfordlhotka · 2026-04-23T01:12:18Z

Summary

First v0.2 milestone (spec §9.1 step 6). Adds browser-task, the LLM-in-the-loop generalist capability, built natively on Microsoft.Playwright (no MCP sidecar and no Stagehand port; see Appendix A #16). The model works from ref-annotated aria snapshots, and element refs are resolved through aria-ref=eN locators.

- New capability in src/Foragent.Capabilities/BrowserTask/ with seven [AIFunction] tools: snapshot, navigate, click, type, wait_for, done, and fail. The step budget is enforced tool-side (default maxSteps=60, ceiling 150), alongside a wall-clock CancellationTokenSource (default 120 s, ceiling 600 s). Output is structured JSON: {status, summary, result, steps, navigations}.
- Foragent.Browser gains IBrowserAgentPage for ref-based interactions, plus an allowlist-scoped CreateSessionAsync(Func<Uri,bool>, ...) overload. Its context-level RouteAsync handler aborts off-list document and subframe navigations before Playwright issues the request (spec §7.1).
- HostAllowlist parses exact-host (bsky.app), wildcard-subdomain (*.example.com), and match-all (*) patterns; an empty list rejects every host.
- Program.cs now wires AddRockBotTieredChatClients, with one model aliased across the Low, Balanced, and High tiers (spec §3.7, Appendix #17). The AddRockBotChatClient call is removed, because the tiered registration already wraps the client with RockBotFunctionInvokingChatClient.
- Microsoft.Playwright bumped 1.50 → 1.59 for AriaSnapshotMode.Ai (ref-annotated snapshots). The matching base-image bump is deferred to release.
- browser-task is added first in ForagentCapabilities.Skills; the existing three specialists remain (the spec removes post-to-site at step 7, not step 6). deploy/rockbot-seed/ is updated to match.
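The allowlist rules listed above (exact host, *.domain wildcard, match-all *, empty list rejects) and the navigation guard can be sketched as two plain C# predicates. This is a minimal illustration under stated assumptions, not the PR's HostAllowlist or CreateSessionAsync code: ParseAllowlist and ShouldAbort are hypothetical names, and whether *.example.com also admits the bare apex example.com is an assumption (in this sketch it does not).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical parser for the three pattern forms:
//   "*"              matches any host
//   "*.example.com"  matches subdomains of example.com (not the apex; an assumption)
//   "bsky.app"       matches that exact host only
// An empty (or all-blank) pattern list yields a predicate that rejects everything.
Func<Uri, bool> ParseAllowlist(IEnumerable<string> patterns)
{
    var rules = patterns
        .Select(p => p.Trim().ToLowerInvariant())
        .Where(p => p.Length > 0)
        .ToList();

    if (rules.Count == 0)
        return _ => false;

    return uri =>
    {
        string host = uri.Host.ToLowerInvariant();
        foreach (var rule in rules)
        {
            if (rule == "*")
                return true;

            if (rule.StartsWith("*.", StringComparison.Ordinal))
            {
                // Keep the leading dot so "evilexample.com" cannot match "*.example.com".
                if (host.EndsWith(rule.Substring(1), StringComparison.Ordinal))
                    return true;
            }
            else if (host == rule)
            {
                return true;
            }
        }
        return false;
    };
}

// Hypothetical route-level check: only document navigations (main frame or
// subframe) are policed; non-navigation requests such as images pass through.
bool ShouldAbort(bool isNavigationRequest, string resourceType, Uri url, Func<Uri, bool> isAllowed) =>
    isNavigationRequest && resourceType == "document" && !isAllowed(url);

var allowed = ParseAllowlist(new[] { "bsky.app", "*.example.com" });
Console.WriteLine(allowed(new Uri("https://bsky.app/profile")));  // True
Console.WriteLine(allowed(new Uri("https://docs.example.com/")));  // True
Console.WriteLine(allowed(new Uri("https://evil.example/")));      // False
```

In a Playwright session the check would sit inside a context-wide handler, roughly context.RouteAsync("**/*", route => ShouldAbort(route.Request.IsNavigationRequest, route.Request.ResourceType, new Uri(route.Request.Url), allowed) ? route.AbortAsync() : route.ContinueAsync()). Deciding inside the route handler is what lets an off-list navigation be dropped before the request leaves the browser.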

Verification

- dotnet build -c Release is clean; dotnet test -c Release → 65 passed, 4 skipped (the skipped tests are the real-LLM integration benchmarks, which skip when FORAGENT_LLM_* is unset).
- Real-LLM benchmark (step 6 floor measurement, spec §9.1): all 3 scenarios pass on the first attempt against Azure AI Foundry (Balanced tier).

| Scenario | Result | Wall-clock |
|---|---|---|
| Click-through | ✅ done | 5 s |
| Form submit | ✅ done | 8 s |
| Multi-page nav | ✅ done | 7 s |

- HTTP gateway smoke test via docker compose up:
  - /.well-known/agent-card.json advertises all four skills.
  - POST / with skill: browser-task, url: https://example.com, allowedHosts: ["example.com"] returns {"status":"done","result":"Example Domain","steps":1}.
  - An off-allowlist URL (https://evil.example/) is rejected with "Starting URL host 'evil.example' is not in the allowlist."
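On the consuming side, the structured output can be read with System.Text.Json. This minimal sketch parses the exact body the smoke test reports. ReadOutput is a hypothetical helper; it reads only status, result, and steps, and treats result and steps as optional because the smoke-test body omits some of the documented fields (summary, navigations).

```csharp
using System;
using System.Text.Json;

// The response body reported by the HTTP gateway smoke test.
const string smokeBody = "{\"status\":\"done\",\"result\":\"Example Domain\",\"steps\":1}";

// Hypothetical reader: status is required; result and steps are read only if
// present with the expected JSON kind, otherwise null / 0 are returned.
(string Status, string Result, int Steps) ReadOutput(string json)
{
    using var doc = JsonDocument.Parse(json);
    JsonElement root = doc.RootElement;

    string status = root.GetProperty("status").GetString();
    string result = root.TryGetProperty("result", out JsonElement r) && r.ValueKind == JsonValueKind.String
        ? r.GetString()
        : null;
    int steps = root.TryGetProperty("steps", out JsonElement s) && s.ValueKind == JsonValueKind.Number
        ? s.GetInt32()
        : 0;
    return (status, result, steps);
}

var output = ReadOutput(smokeBody);
Console.WriteLine($"{output.Status}: {output.Result} ({output.Steps} step)");
// prints "done: Example Domain (1 step)"
```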

Framework observations surfaced (not worked around — see docs/framework-feedback.md)

- AddRockBotTieredChatClients subsumes AddRockBotChatClient, but this is undocumented; calling both can silently swap the wrapped registration.
- There is no per-request MaximumIterationsPerRequest path through ChatOptions, so Foragent enforces the step budget tool-side via BrowserTaskState.BudgetExhausted.
- The Microsoft.Playwright C# bindings gate ref-annotated aria snapshots behind Mode = AriaSnapshotMode.Ai rather than a boolean Ref option.
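A tool-side step budget of this kind needs no help from the chat client: each tool spends a step before doing browser work and refuses once the budget is gone, while a separate CancellationTokenSource carries the wall-clock limit. The defaults and ceilings (60/150 steps, 120/600 seconds) come from this PR; ResolveBudget, TrySpendStep, and the clamp-to-1 floor are hypothetical, and this is a sketch rather than the actual BrowserTaskState implementation.

```csharp
using System;
using System.Threading;

// Defaults and ceilings as stated in the PR description.
const int DefaultMaxSteps = 60, MaxStepsCeiling = 150;
const int DefaultTimeoutSeconds = 120, TimeoutCeilingSeconds = 600;

// Hypothetical: missing values fall back to the default, oversized values are
// capped at the ceiling, and values below 1 are raised to 1 (an assumption).
(int MaxSteps, TimeSpan Timeout) ResolveBudget(int? maxSteps, int? timeoutSeconds) =>
    (Math.Clamp(maxSteps ?? DefaultMaxSteps, 1, MaxStepsCeiling),
     TimeSpan.FromSeconds(Math.Clamp(timeoutSeconds ?? DefaultTimeoutSeconds, 1, TimeoutCeilingSeconds)));

// Hypothetical tool-side accounting: each tool call spends one step before
// touching the page. Once the budget is spent this returns false, and a real
// tool would report "budget exhausted" to the model instead of acting.
int stepsUsed = 0;
bool TrySpendStep(int maxSteps)
{
    if (stepsUsed >= maxSteps)
        return false;
    stepsUsed++;
    return true;
}

var budget = ResolveBudget(maxSteps: 500, timeoutSeconds: null);
Console.WriteLine($"maxSteps={budget.MaxSteps}, timeout={budget.Timeout.TotalSeconds}s");
// prints "maxSteps=150, timeout=120s"

// The wall-clock limit is independent of step counting: in a real loop the
// token would also be passed to browser calls and the chat client.
using (var wallClock = new CancellationTokenSource(budget.Timeout))
{
    while (TrySpendStep(budget.MaxSteps) && !wallClock.IsCancellationRequested)
    {
        // A real loop would ask the model for its next tool call here.
    }
}
Console.WriteLine($"steps used: {stepsUsed}");
```

Keeping the count inside the tools means the cap holds regardless of how many function-calling iterations the chat client itself allows.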

Test plan

- dotnet build -c Release
- dotnet test -c Release
- Real-LLM benchmark scenarios (3 / 3 passing, floor established)
- docker compose up --build; curl smoke test via the HTTP gateway
- Allowlist rejection verified end-to-end

🤖 Generated with Claude Code

First v0.2 capability. LLM-in-the-loop planner over ref-annotated aria snapshots + aria-ref=eN locator resolution, built on Microsoft.Playwright 1.59 (bumped from 1.50 for AriaSnapshotMode.Ai). Registered tiered chat clients via AddRockBotTieredChatClients with one model aliased across Low/Balanced/High per spec §3.7.

- HostAllowlist with *, *.domain, and exact-host patterns; an empty list rejects.
- A context-wide RouteAsync aborts off-list document/subframe navigations before Playwright issues the request (spec §7.1).
- [AIFunction] tools: snapshot / navigate / click / type / wait_for / done / fail. Budget enforced tool-side (default maxSteps=60, ceiling 150) plus a wall-clock CancellationTokenSource (default 120 s, ceiling 600 s).
- Structured JSON output: {status, summary, result, steps, navigations}.
- Unit tests: 14 new (HostAllowlist, and BrowserTaskCapability via ScriptedChatClient + FakeBrowserAgentPage).
- Real-LLM benchmark: 3 Kestrel scenarios, all passing on the first attempt against Azure AI Foundry Balanced (5 s / 8 s / 7 s). This establishes the unaided floor before step 7 adds priming.
- Smoke-tested via the docker-compose HTTP gateway (agent card lists four skills; browser-task returns "Example Domain" in 1 step; off-list URLs are rejected).

Framework observations captured in docs/framework-feedback.md: AddRockBotTieredChatClients subsumes AddRockBotChatClient (undocumented); no per-request iteration cap on the function-invoking chat client; Playwright aria-ref gating via AriaSnapshotMode.Ai rather than a boolean Ref option.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rockfordlhotka merged commit 041a7c5 into main Apr 23, 2026
1 check passed

rockfordlhotka deleted the step-6-browser-task branch April 23, 2026 01:22

rockfordlhotka merged 1 commit into main from step-6-browser-task
