From e0648bc9ac5a64669c0d128ad77db4904b8a55a9 Mon Sep 17 00:00:00 2001 From: Rockford Lhotka Date: Tue, 21 Apr 2026 22:08:59 -0500 Subject: [PATCH] Adopt spec v0.2: agentic generalist direction MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Pivots Foragent from five narrow verbs to one generalist browser-task capability plus a small set of fast-path specialists. Step 5 showed hand-written site-specific code doesn't scale and that structured typed skills are hostile to the natural-language callers (mostly other LLM agents) Foragent actually has. §5 wholesale rewrite: two-tier capability model (§5.1), v0.2 initial set with browser-task as the generalist (§5.2), multi-phase flows with returned artifacts (§5.5), learning substrate on RockBot's ISkillStore + ILongTermMemory (§5.6), human-in-the-loop explicitly caller-side (§5.7). §3.7 adds LLM tier routing via RockBot's TieredChatClientRegistry. §7.1 makes allowlists mandatory with wildcard support. §9.1 adds steps 6-9; §9.2 drops the Stagehand exclusion. §12 closes Q1/Q2; adds Q6/Q7/Q8 for the step 6-8 work. Appendix A gains decisions #16-#20: direct-SDK (no MCP/Stagehand), tier routing, mandatory wildcarded allowlists, framework persistence for learned knowledge, multi-phase as separate tasks. Working doc from the direction-setting discussion archived to docs/archive/foragent-spec-v0.2-proposal.md. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- CLAUDE.md | 2 +- docs/archive/foragent-spec-v0.2-proposal.md | 538 ++++++++++++++++++++ docs/foragent-specification.md | 391 ++++++++++---- 3 files changed, 839 insertions(+), 92 deletions(-) create mode 100644 docs/archive/foragent-spec-v0.2-proposal.md diff --git a/CLAUDE.md b/CLAUDE.md index 2b5d1c7..1159923 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -4,7 +4,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co ## Status -Foragent is at **milestone 5** (spec §9.1): the A2A surface is wired end-to-end against RockBot as the first real user via the `docker-compose.yml` harness, pinned to `rockylhotka/rockbot-agent:0.8.5`. Three capabilities are exercised — `fetch-page-title` (step 2, Playwright), `extract-structured-data` (step 3, Playwright + LLM), and `post-to-site` (step 4, Playwright + credential broker). Validation was scoped to "poster dispatches" — real Bluesky posting requires populating `FORAGENT_BLUESKY_*` in `.env` and is not yet covered by the milestone. Storage-state persistence, 2FA input-required flow, k8s-secrets broker, and per-tenant credential namespaces are still deferred — tracked in `docs/framework-feedback.md` step 4. The authoritative design document is `docs/foragent-specification.md` — read it before making non-trivial changes. Framework-level observations from each milestone are captured in `docs/framework-feedback.md`. +Foragent is at **milestone 5 shipped, v0.2 spec adopted, step 6 next**. Three capabilities are live (`fetch-page-title`, `extract-structured-data`, `post-to-site`); the A2A loop is wired end-to-end against RockBot via the `docker-compose.yml` harness pinned to `rockylhotka/rockbot-agent:0.8.5`. The governing spec is now `docs/foragent-specification.md` **v0.2** — read it before making non-trivial changes. 
v0.2 pivots Foragent to an agentic model: one generalist `browser-task` capability (built natively on `Microsoft.Playwright` NuGet — no MCP sidecar, no Stagehand port — see Appendix A #16) plus narrow fast-path specialists, with RockBot's `ISkillStore` + `ILongTermMemory` as the learning substrate. The v0.2 proposal document is archived at `docs/archive/foragent-spec-v0.2-proposal.md`. Storage-state persistence, 2FA input-required flow, k8s-secrets broker, and per-tenant credential namespaces remain deferred — tracked in `docs/framework-feedback.md` step 4. Framework-level observations from each milestone are captured in `docs/framework-feedback.md`. ## Build / test diff --git a/docs/archive/foragent-spec-v0.2-proposal.md b/docs/archive/foragent-spec-v0.2-proposal.md new file mode 100644 index 0000000..ed246d5 --- /dev/null +++ b/docs/archive/foragent-spec-v0.2-proposal.md @@ -0,0 +1,538 @@ +# Foragent Specification v0.2 — Proposal + +> **Status:** Proposed revision to `foragent-specification.md`. Not yet merged. +> **Date:** April 2026 +> **Author:** Rocky Lhotka / session notes from step 5 retrospective + +This document proposes a direction change for Foragent after completing +milestone 5. It captures the revised product vision, the implied spec +edits, and a step-by-step implementation plan for milestones 6 through 9. + +The v0.1 spec is still the governing document. Once this proposal is +reviewed and approved, the changes below are folded into +`foragent-specification.md` and this file is archived. + +--- + +## Part 1 — What's changing and why + +### What we learned in v0.1 (milestones 1–5) + +Milestones 1–5 shipped three narrow capabilities (`fetch-page-title`, +`extract-structured-data`, `post-to-site`) and validated the RockBot +framework boundary end-to-end. 
They also surfaced a design question the +spec didn't anticipate: **how does Foragent scale to N websites without +N site-specific capability implementations?** + +The step-4 `BlueskySitePoster` was a deliberate probe — ship one +hand-written site poster, learn what it costs to add the second. The +answer: it costs a full `ISitePoster` subclass, a CSS selector audit, a +fake-server integration test, and a re-verification every time the site +redesigns. That doesn't scale, and it isn't the product. + +Step 5 also exposed a second mismatch. RockBot's `invoke_agent` tool +passes a single free-text `message` argument to the called agent. Its +LLM tried to invoke `post-to-site` with `message="Create a post on +Bluesky with the text: ..."` — not a structured `{site, credentialId, +content}` object. Narrow typed skills are hostile to natural-language +callers. + +### What v0.2 is + +Foragent becomes an **agentic browser agent**: given a free-form intent +and a target URL, it plans and drives the browser to fulfill the intent, +using internal LLM reasoning to resolve selectors, form structure, and +retry strategy. Site-specific code is the exception, not the rule. + +This is what v0.1 §9.2 called "Stagehand-style natural-language-to-action +layers" and flagged as "may be revisited later, v1 selector-resolution +is sufficient." It wasn't. v0.2 revisits. + +### What v0.2 is *not* + +v0.2 is not a .NET port of Stagehand, and does not run Stagehand or +`@playwright/mcp` as a Node sidecar. Direct integration was evaluated +against the `Microsoft.Playwright` NuGet path already in use since +milestone 2, and the NuGet path won on every relevant axis: + +- **Ref-annotated aria snapshots are a Playwright feature, not an MCP + feature.** `Page.AriaSnapshotAsync()` already emits stable `[ref=e42]` + markers, and `Page.Locator("aria-ref=e42")` resolves them. The LLM + picks refs the same way it would through MCP, with no process hop. 
+- **Tool-schema wrapping is a trivial amount of C#.** Exposing + `snapshot`/`click`/`type`/`navigate`/`wait_for` as `[AIFunction]` + methods on an injected planner surface gives `IChatClient` the same + auto-discovered tool-calling experience MCP would, without the + JSON-RPC protocol overhead. +- **Session state already lives in `Foragent.Browser`.** `IBrowserSession` + / `IBrowserPage` own the shared browser and per-task `BrowserContext` + per spec §3.5. Moving to MCP would rebuild that management on the + far side of a process boundary. +- **Spec §6's credential boundary stays clean.** A Node sidecar handling + browser actions would also handle credential material (login flows, + form values). Keeping the inner layer in-process means credentials + never cross a process boundary — the §6.1 blast-radius guarantee + holds as written. +- **Spec §3.4 Decision #1 survives unchanged.** The v0.1 "Playwright via + NuGet, not via MCP server container" decision was made for the same + reasons; v0.2's agentic model does not invalidate any of them. + +This closes v0.1 §12 open question #2 ("Stagehand-equivalent for .NET") +as **build natively, not integrate or port.** + +### What stays + +- **A2A-native, RockBot-framework-hosted, self-hosted.** Unchanged. +- **Credentials by reference via `ICredentialBroker`.** Unchanged; still + the design v0.1 got right. +- **One shared browser, fresh `BrowserContext` per task.** Unchanged. +- **Prohibited-capability list in §7.3** (no account creation, no + financial transactions, no security-permission changes). Unchanged; + arguably more load-bearing under the broader model. + +### What changes + +- **§5 Capability surface.** The initial five-verb list becomes a + two-tier model: one generalist capability plus a small set of narrow + fast-path specialists. `BlueskySitePoster` becomes a regression test, + not a template. 
+- **New §5.5 Multi-phase flows.** First-class support for "learn then + execute" patterns with typed intermediate artifacts. +- **New §5.6 Learning substrate.** Foragent uses RockBot framework's + `ISkillStore` + `ILongTermMemory` to persist learned site knowledge + and retrieve it on subsequent tasks. +- **New §5.7 Human-in-the-loop.** Explicit statement that review gates + are the caller's responsibility; Foragent returns structured state. +- **§7 Security.** Tighter: a generalist capability needs per-task + domain allowlist + intent policy enforcement, not just "refuse to + navigate off-allowlist." +- **§9 Sequencing.** Steps 6–9 added. +- **§9.2 Out of scope.** Stagehand-style exclusion removed. + +--- + +## Part 2 — Proposed revised sections + +### §5 Capability surface (replacement for current §5.1–§5.4) + +#### §5.1 Capability model + +Foragent exposes capabilities at two tiers: + +1. **Generalist.** One capability (`browser-task`) that accepts + free-form intent plus optional URL and credential hints. Runs an + LLM-in-the-loop planner over the browser primitives, using any + learned site knowledge from the skills / memory store as priming. + This is the default surface — the thing most callers should invoke. + +2. **Fast-path specialists.** A small set of narrow, structured + capabilities that do one well-defined thing cheaply and + deterministically. `fetch-page-title` and `extract-structured-data` + are specialists. New specialists are added only when usage shows a + consistent, high-volume pattern that benefits from a typed interface + (e.g. "get the product price from an e-commerce page" if that + genuinely becomes the 10%-of-all-calls shape — which it probably + won't). + +Most real callers are themselves LLM agents. They'll default to the +generalist. Specialists exist to keep deterministic, programmatic +callers cheap — not to proliferate. 
+ +#### §5.2 Initial capability set (v0.2) + +| Capability | Tier | Description | +|------------|------|-------------| +| `browser-task` | Generalist | Given intent + optional URL/credential, plan and drive the browser to fulfill the intent. Uses RockBot skills + memory as priming knowledge. Returns structured result or intermediate artifact. | +| `learn-form-schema` | Specialist (phase-1) | Given a URL (and optional credential to log in first), introspect a form and return its schema — fields, types, dropdown dependencies, validation rules. Persists the schema as a skill for later reuse. Returns the schema to the caller. | +| `execute-form-batch` | Specialist (phase-2) | Given a previously-learned schema (by id or inline) and a batch of row data, submit the form once per row. Streams progress. Handles partial failure. | +| `fetch-page-title` | Specialist | (Existing, milestone 2) Return the `<title>` of a URL. | +| `extract-structured-data` | Specialist | (Existing, milestone 3) Extract structured data from a page matching a natural-language description. | + +The v0.1 `post-to-site` capability ships in the main codebase as a +regression test for step-4 credential handling. It is not advertised in +new agent-card skill lists after step 7; `browser-task` subsumes its +function. + +`monitor-page` and `fill-form` from v0.1 §5.1 fold into `browser-task`. + +#### §5.3 Capabilities explicitly out of scope (v1) + +- Test automation (Playwright already does this). +- Raw browser primitive exposure (Microsoft's `playwright/mcp` does this). +- Visual regression testing. +- Form-filling for sensitive financial transactions, account creation, + or modifying security permissions (see §7.3). +- Multi-tab orchestration as a primary feature (may be used internally + but not advertised). +- Code generation from browser traces (e.g. "generate a Playwright + script that reproduces this"). Traces stay inside the learning + substrate. 
+ +> ~~Stagehand-style natural-language-to-action layers~~ — removed. +> `browser-task` is that layer. + +#### §5.4 Capability design principles + +- **Task-level, not action-level.** Unchanged from v0.1. +- **Clear contracts even for the generalist.** `browser-task`'s input + shape is typed (intent, url?, credentialId?, allowlist?, budget?); + only the *plan* inside is LLM-generated. +- **Return structured state, not narrative, when the caller needs to + act on it.** A learned form schema is JSON, not prose. A submit-batch + progress report is a typed status update, not a sentence. +- **Delegate to the learning substrate, don't reinvent it.** Site + knowledge lives in RockBot skills + memory; the capability reads and + writes, it doesn't own its own cache. +- **Credentials by reference.** Unchanged. + +### §5.5 Multi-phase flows (new) + +Many real browser tasks are multi-phase with human review between +phases. The canonical example (motivating this revision): + +1. **Phase 1 — Learn.** Navigate to a form; introspect its fields and + dynamic dependencies; return a schema to the caller. +2. **Review.** The caller (human via Claude Code, or another agent) + inspects the schema, decides whether to proceed, assembles input + data, validates. +3. **Phase 2 — Execute.** Submit the form N times against the learned + schema, streaming progress. + +Foragent's role is Phase 1 and Phase 2. The review step is the +caller's responsibility — Foragent is not in the review loop. + +To make this work: + +- Phase-1 capabilities **return structured artifacts** (form schemas, + extracted data, observed flow traces), not just status text. +- Phase-1 artifacts are **persisted in the learning substrate** (§5.6) + so Phase 2 doesn't re-learn. They get an id the caller can reference. +- Phase-2 capabilities **accept a learned-artifact reference or inline + artifact** as input, alongside the per-invocation data. 
+- Phase-2 capabilities **stream progress and handle partial failure** + over A2A — not batch-atomic. + +This is not an A2A protocol change. A2A 1.0 already supports structured +response parts, streaming status updates, and task-id references. v0.2 +makes explicit use of all three. + +### §5.6 Learning substrate (new) + +Foragent uses the RockBot framework's existing persistence for learned +site knowledge, rather than building a Foragent-local store. + +**What's used:** + +- **`ISkillStore` (file-backed, BM25 + optional semantic retrieval).** + Stores site knowledge as markdown skills. Two origin categories: + - **Human-authored skills** — operator-written primers for a site + (e.g. `sites/bsky.app/overview`). Treated as priming hints for the + generalist planner. + - **Agent-learned skills** — written by the generalist on successful + task completion (e.g. `sites/bsky.app/learned/login-flow`). Tagged + with `metadata.source = "agent-learned"` and an importance score. +- **`ILongTermMemory` (file-backed, BM25 + semantic).** Declarative + observations that don't fit the procedural skill shape: failed + attempts, site-version notes, ambient facts ("bsky.app's home feed + heading is the login success signal"). + +**What's stored (skill shape):** + +- **Content:** markdown body describing the site, the flow, selectors, + success signals, known pitfalls. +- **Name convention:** `sites/{host}/{phase-or-intent}` — e.g. + `sites/bsky.app/login`, `sites/bsky.app/compose-post`. Hierarchical + `/` nesting supported by the store. +- **`seeAlso` links** across skills for the same site, so retrieval + surfaces a small knowledge cluster rather than one skill at a time. + +**What's stored (memory shape for non-procedural facts):** + +- **Category:** `sites/{host}` — so all site observations are + retrievable together. +- **Tags:** freeform (`selector`, `flow`, `failure`, `version`, etc.). +- **Importance:** ranked 0–1. 
Confirmed-working patterns get high + importance; one-off observations start low and drift with reuse. + +**Retrieval pattern at plan time:** + +1. Generalist capability computes a search query from the task intent + + target URL host. +2. Queries skill store and memory store in parallel, top-K by relevance. +3. Retrieved content becomes priming context for the LLM planner. +4. Planner proceeds; any new observation surfaces as a write after the + task completes. + +**Structured artifacts (the form-schema case):** + +Learned form schemas are typed JSON, not markdown. RockBot's skill store +holds markdown content. Two options, decision deferred to step 8: + +- **(A)** Embed the JSON in a fenced code block inside the skill + content. Loose — re-parse on retrieval. +- **(B)** Store the schema in an adjacent Foragent-local typed store, + reference it by id from a skill. Tighter but duplicates infrastructure. + +The framework-feedback log records (A) vs (B) as a candidate +`ISkillStore.AttachedArtifacts` extension for RockBot, if we hit the +shape often enough. + +### §5.7 Human-in-the-loop (new) + +Review gates are the **caller's** responsibility, not Foragent's. + +- Foragent returns structured state at phase boundaries (§5.5). +- The caller decides whether to proceed. Human callers use their own UI + (Claude Code, Blazor proxy, bespoke dashboards). Agent callers make + the decision programmatically. +- Foragent does **not** block waiting for review. Each phase is a + separate A2A task. + +A2A's `input-required` state is still used for credential 2FA prompts +(§6.6). It is **not** used as a general "stop and let the human review" +mechanism — that coupling would force Foragent to hold browser state +across potentially-long human delays, which conflicts with the +one-context-per-task model (§3.5). + +### §7.1 Domain allowlists (augmented) + +Under v0.2, allowlists become more load-bearing because the generalist +can navigate anywhere the LLM plans to navigate. 
Every `browser-task` +invocation: + +- **MUST** accept an explicit `allowedHosts` list (empty = reject). +- **MUST** refuse any navigation, fetch, or subframe load outside the + list. +- **SHOULD** have per-tenant defaults (future: §7.5) so individual tasks + can inherit rather than list everything. + +Ad-hoc "navigate to whatever looks relevant" is explicitly not +supported. The generalist is powerful but bounded. + +### §9.1 Milestones (extended) + +Existing milestones 1–5: unchanged (shipped). + +6. **Baseline `browser-task` generalist.** LLM-in-the-loop planner over + existing browser primitives. No learning substrate yet. Measure + unaided success rate on a small curated benchmark (e.g. 10 varied + sites). Goal: establish the floor before investing in priming. + +7. **Wire RockBot skills + memory as priming.** Register + `ISkillStore` + `ILongTermMemory` in Foragent's host. Retrieve + relevant skills into planner context. Write agent-learned skills on + success. Goal: prove the framework's persistence surface is the + right substrate; file issues if it isn't. + +8. **`learn-form-schema` + `execute-form-batch`.** First explicit + multi-phase capability pair. Structured JSON schema returned from + phase 1, batch execution streaming progress in phase 2. + +9. **Deprecate narrow specialists that `browser-task` covers.** Remove + `post-to-site` from the advertised skill list (keep as regression + test). Review whether `fetch-page-title` / `extract-structured-data` + still pay their way or fold into `browser-task` with equivalent + cost. Goal: land on the minimum capability set that v0.2 actually + needs. + +### §9.2 Out of scope (v1, revised) + +Unchanged except: + +- ~~Stagehand-style natural-language-to-action layers~~ — **removed.** + `browser-task` is that layer. + +### §12 Open questions (revised) + +1. **Internal LLM selection and tier routing.** (Unchanged from v0.1.) +2. 
~~Stagehand-equivalent for .NET.~~ — **closed.** v0.2 builds it + natively on `Microsoft.Playwright` NuGet, not via Stagehand port or + `@playwright/mcp` sidecar. See Part 1 "What v0.2 is not." +3. **Storage state encryption at rest.** (Unchanged.) +4. **Capability versioning.** (Unchanged.) +5. **Tenant identity model.** (Unchanged.) +6. **(New) Structured artifacts in `ISkillStore`.** Do we stretch the + skill-as-markdown shape to carry typed JSON (fenced code blocks, + parse on retrieval), or add a parallel Foragent-local typed store? + Decide at step 8 based on how ugly (A) feels in practice. +7. **(New) Per-task budget.** How do we cap an LLM-in-the-loop task — + max steps, max tokens, wall-clock, cost? Caller-specified, agent- + enforced, or both? Needed by step 6. +8. **(New) Retry and failure semantics for batches.** In + `execute-form-batch`, is a row failure fatal or per-row? Does the + caller get per-row errors streamed, or a final report? Needed by + step 8. + +--- + +## Part 3 — Implementation plan + +### Step 6 — Baseline `browser-task` generalist + +**Goal:** prove the LLM-in-the-loop-over-browser-primitives baseline +works on real sites without learned priming. Establish the floor. + +**Deliverables:** +- New `BrowserTaskCapability : ICapability` with skill id `browser-task`. +- Typed input: `{intent: string, url: string?, credentialId: string?, + allowedHosts: string[], maxSteps: int?}`. +- Pure .NET planner, no Node sidecar, no MCP transport. Built on + `Microsoft.Playwright` NuGet directly via a new `Foragent.Planner` + project that consumes `IBrowserPage` from `Foragent.Browser`. +- Snapshot/action bridge: extend `IBrowserPage` (or add a sibling + `IBrowserPlannerPage`) with `AriaSnapshotAsync()` returning + ref-annotated aria text, plus `ResolveRefAsync("e42")` returning an + `ILocator`. 
Playwright already emits `[ref=eN]` markers in aria + snapshots and accepts `aria-ref=eN` in the selector engine — we're + exposing that, not reimplementing it. +- Planner loop: snapshot → LLM selects next action → dispatch via ref → + repeat until the planner emits a terminal action or max-steps is hit. +- LLM contract: a small `[AIFunction]` tool set — `snapshot`, `click`, + `type`, `navigate`, `wait_for`, `done`, `fail` — surfaced through + `IChatClient`'s native function-calling. No MCP JSON-RPC layer. +- Per-task allowlist enforced on every `navigate` before Playwright + sees the URL. +- Integration test: real Kestrel host with a fixed form; `browser-task` + fills it via free-text intent. +- Framework-feedback update: what the planner loop wanted from the + framework that wasn't there. + +**Out of scope for step 6:** +- Any learning / persistence. +- Multi-phase / returned artifacts. +- Credentials beyond what `IBrowserSession` already supports. + +**Exit criteria:** +- `browser-task` completes the step-4 Bluesky poster flow end-to-end + against the step-4 fake Kestrel server, *without* any + Bluesky-specific code in Foragent's codebase (only the shared browser + primitives and the LLM planner). +- Runs on 3+ more varied form shapes in tests. +- `BlueskySitePoster` still passes its existing regression tests — + v0.2 does not break v0.1. + +### Step 7 — Learning substrate wired + +**Goal:** prove the framework's skills + memory is the right substrate +for site knowledge, and that retrieval-primed generalist runs beat +unaided runs. + +**Deliverables:** +- `builder.WithSkills()` + `builder.WithLongTermMemory()` added to + `Foragent.Agent/Program.cs`. +- `/data/foragent` volume in `docker-compose.yml` with + `AgentProfile__BasePath=/data/foragent`. +- `BrowserTaskCapability` queries both stores pre-plan; retrieved + content primes the planner. Query shape: intent + target host + + top-K. 
+- On task success, planner writes one skill per distinguishable flow + (login / action / success signal) keyed by host, tagged + `metadata.source=agent-learned`. +- `IEmbeddingGenerator` wired (Azure OpenAI text-embedding-3-small or + similar) for semantic retrieval. Falls back to BM25-only if not + configured. +- Seed one human-authored skill for `bsky.app` as a priming example; + check in as `deploy/skills-seed/sites/bsky.app/overview.md`. +- Integration test: cold run vs. primed run; assert primed run uses + fewer LLM steps. + +**Framework observations to capture:** +- Does `ISkillStore`'s markdown-content shape fit procedural site + knowledge, or does it strain? +- Does memory's category/tag/importance model fit site observations? +- Any gaps in retrieval (e.g. no host-prefix query shape) → file + rockbot issues. + +**Exit criteria:** +- Primed `browser-task` runs on the same task consistently use ≥30% + fewer planner LLM calls than the unprimed baseline from step 6. +- Agent-learned skills are readable and actionable when inspected by a + human. + +### Step 8 — Multi-phase: form learn + batch execute + +**Goal:** first-class support for the motivating scenario — introspect +a form, return a reviewable schema, later submit a batch against it. + +**Deliverables:** +- New `LearnFormSchemaCapability` with skill id `learn-form-schema`. + - Input: `{url, credentialId?, allowedHosts}`. + - Output: typed JSON schema — fields (name, type, visibility + rules), dropdown options, dependency graph, submit button locator. + - Persists the schema alongside a skill (open question #6 — decide + at step start). +- New `ExecuteFormBatchCapability` with skill id `execute-form-batch`. + - Input: `{url, credentialId?, schemaId | schema, rows[], + allowedHosts, onError: "abort"|"continue"}`. + - Streams A2A status updates per row. + - Returns a per-row result array on completion. +- Integration test: Kestrel-hosted form with a dynamic dropdown + (e.g. 
`category=alpha` reveals fields A/B, `category=beta` reveals + fields C/D). Schema round-trips; batch of 20 mixed rows submits. +- Open question #8 decided (in the deliverable, not as prose): + per-row continue-vs-abort, progress shape. + +**Exit criteria:** +- Schema learned in one task, batch submitted in a separate task + (different process invocations) against the persisted schema. +- Schema is human-reviewable: a developer can read it and understand + what Foragent will submit before consenting. + +### Step 9 — Deprecate subsumed specialists + +**Goal:** land on v0.2's actual advertised capability set. + +**Deliverables:** +- `post-to-site` removed from advertised `ForagentCapabilities.Skills` + and from `deploy/rockbot-seed/well-known-agents.json` and + `agent-trust.json` `approvedSkills`. Implementation stays in the + codebase as a regression test for credential handling; integration + tests remain green. +- `monitor-page` and `fill-form` from v0.1 §5.1 never shipped; remove + from spec. +- Review: do `fetch-page-title` and `extract-structured-data` still + pay their way? Measure runtime cost vs. equivalent `browser-task` + calls. Remove if `browser-task` is competitive; keep if they're 10×+ + cheaper on the hot path. +- Spec v0.2 merged into `foragent-specification.md`; this proposal + file archived to `docs/archive/`. + +**Exit criteria:** +- Advertised capability list matches §5.2 of spec v0.2. +- No codepath is exercised only by deprecated specialists — every + line has a live caller or a live test. + +--- + +## Part 4 — Open questions for you before step 6 starts + +1. **Generalist action set.** Start with `{snapshot, navigate, click, + type, wait_for, done, fail}` (aligning with what `@playwright/mcp` + exposes and what Playwright's ref-resolver supports natively), or + broader (`hover, select, keyboard_shortcut, file_upload`)? I'd start + narrow and grow on demand. **Your call?** + +2. 
**Planner LLM.** Use the same `IChatClient` as + `extract-structured-data` (Azure AI Foundry `gpt-5.3-chat`), or + wire a separate one? Separate would let us route planner ≠ + extraction cost-optimally. I'd start same, split if cost forces it. + +3. **Per-task budget default.** Propose: `maxSteps=30`, + `maxSeconds=120`, caller can raise within bounds. **OK, or do you + want these higher/lower?** + +4. **Allowlist default.** Refuse navigation if `allowedHosts` is empty, + or treat empty as "same-origin as `url`"? I lean refuse — forces + callers to be explicit, cheap to construct. + +5. **RockBot side.** Foragent's `invoke_agent` experience at step 5 + showed RockBot's tool only passes free-text. Does step 6's + `browser-task` fit that shape naturally (intent *is* free text), so + the problem dissolves? I think yes — worth confirming before + building. + +6. **Spec merge timing.** Merge this proposal into the main spec now + (it becomes the v0.2 spec and the project operates under it), or + keep it as a proposal until step 6 validates the core approach? + I lean: merge §5 + §9 now, leave §5.6 / §5.7 as "proposed" until + step 7 actually exercises them. diff --git a/docs/foragent-specification.md b/docs/foragent-specification.md index f67100e..4d105d2 100644 --- a/docs/foragent-specification.md +++ b/docs/foragent-specification.md @@ -1,30 +1,39 @@ # Foragent — Project Specification -> **Status:** Design specification, pre-implementation. +> **Status:** Governing specification, v0.2. > **Date:** April 2026 > **Author:** Rocky Lhotka, Marimer LLC -> **Repository (planned):** https://github.com/MarimerLLC/Foragent +> **Repository:** https://github.com/MarimerLLC/foragent --- ## 1. Summary -**Foragent** is an A2A-native browser agent for .NET. It exposes browser -automation capabilities — navigate, extract, fill forms, post to sites, -monitor pages — over the Agent2Agent (A2A) protocol. 
Other agents delegate -browser work to Foragent rather than reasoning about DOM selectors, session -state, or 2FA flows themselves. +**Foragent** is an A2A-native, self-hosted **agentic browser agent** for +.NET. Callers delegate free-form browser intent to Foragent — "submit +these rows to this form," "post this content on this site," "extract +this data from these pages" — and Foragent plans and drives the browser +to fulfill it, using internal LLM reasoning to resolve selectors, +interpret dynamic form structure, and recover from failure. Callers +do not reason about DOM, selectors, session state, or 2FA. Foragent is built on the **RockBot framework** (the NuGet packages -maintained at https://github.com/MarimerLLC/rockbot) and uses the official -**Microsoft.Playwright** NuGet package for browser automation. It is the -second consumer of the RockBot framework, after the RockBot personal agent -itself. +maintained at https://github.com/MarimerLLC/rockbot) and uses the +official **Microsoft.Playwright** NuGet package for browser automation, +driven directly in-process (no MCP sidecar, no Stagehand port — see +Appendix A decision #16). It is the second consumer of the RockBot +framework, after the RockBot personal agent itself. + +Foragent's product is **one generalist capability** (`browser-task`) +that handles the long tail of browser work, complemented by a small set +of narrow fast-path specialists where a structured typed interface pays +for itself. Site-specific code is the exception, not the scaling path +(see §5). Foragent is a standalone open-source project under Marimer LLC. RockBot -is its first user, but the project is designed to be generally useful to -anyone building agentic systems on .NET that need a self-hosted browser -worker. +is its first user, but the project is designed to be generally useful +to anyone building agentic systems on .NET that need a self-hosted +browser worker. --- @@ -168,6 +177,31 @@ as the base image. 
The agent process calls Microsoft.Playwright directly. - Concurrency-within-pod can be added later if profiling shows it's needed. +### 3.7 LLM tier routing + +Foragent uses the RockBot framework's `TieredChatClientRegistry` +(`RockBot.Llm`, exposed via `AddRockBotTieredChatClients`). The registry +provides three `IChatClient` instances — `Low`, `Balanced`, `High` — +and registers the `Balanced` client as the default `IChatClient` +singleton for consumers that inject without a tier hint. + +- **Capabilities request a tier appropriate to the work.** The generalist + planner loop (§5.2) targets `Balanced` for planning steps and may + request `High` for recovery from ambiguous states or complex reasoning. + Cheap structural operations (aria-snapshot summarization, extraction + shaping) target `Low` when a Low model is meaningfully cheaper. +- **For v0.2 Foragent ships with one configured model.** Operators wire + the same `IChatClient` into all three tiers; future cost-optimization + upgrades swap models per-tier without touching capability code. +- **Consumers that inject `IChatClient` directly continue to work.** They + transparently receive the `Balanced` client — the framework guarantees + this. No capability is required to be tier-aware; it's an opt-in + optimization surface. + +Direct injection of a single `IChatClient` (as used by v0.1's +`ExtractStructuredDataCapability`) remains supported and backwards- +compatible with the tiered registration. + --- ## 4. Project structure @@ -250,40 +284,162 @@ MIT license. Matches CSLA and the broader .NET OSS ecosystem. The capability list is the product. Foragent's value is what verbs it exposes via A2A, not what's inside. -### 5.1 Initial capability set (v0.x) - -Start narrow. Add only when usage demands it. - -| Capability | Description | -|------------|-------------| -| `fetch-page-content` | Navigate to a URL, return rendered text and structured page metadata. 
| -| `extract-structured-data` | Navigate to a URL, extract data matching a description (e.g. "the product price and availability"). | -| `fill-form` | Navigate to a URL, fill out a form given a description of the values, submit. | -| `post-to-site` | Authenticate against a configured site (using credential broker) and post content. First targets: Bluesky, Mastodon. | -| `monitor-page` | Periodically check a page for changes matching a description; emit A2A progress updates when changes occur. | - -### 5.2 Capabilities explicitly out of scope (v1) - -- Test automation (Playwright already does this) -- Raw browser primitive exposure (Microsoft's playwright/mcp does this) -- Visual regression testing -- Form-filling for sensitive financial transactions or account creation - (see Section 7.3) -- Multi-tab orchestration as a primary feature (may be supported - internally but not advertised as a capability) - -### 5.3 Capability design principles - -- Each capability has a **clear, named contract** — inputs, outputs, - error modes documented. -- Capabilities are **task-level, not action-level**. "Post to site" is - a capability; "click button" is not. -- Capabilities **may delegate to internal LLM reasoning** for - selector resolution, intent translation, and retry logic. This is - what makes Foragent an *agent* rather than a wrapper. -- Capabilities **respect the credential broker contract**. They - reference credentials by ID; they never receive raw values from - callers. +### 5.1 Capability model + +Foragent exposes capabilities at two tiers: + +1. **Generalist.** One capability — `browser-task` — that accepts + free-form intent plus optional URL and credential hints. Runs an + LLM-in-the-loop planner over the browser primitives, using any + learned site knowledge from the skills and memory stores (§5.6) as + priming. This is the default surface — the thing most callers should + invoke. +2. 
**Fast-path specialists.** A small set of narrow, structured + capabilities that do one well-defined thing cheaply and + deterministically. `fetch-page-title` and `extract-structured-data` + are specialists. New specialists are added only when usage shows a + consistent, high-volume pattern that benefits from a typed interface. + +Most real callers are themselves LLM agents. They default to the +generalist. Specialists exist to keep deterministic, programmatic +callers cheap — not to proliferate. + +### 5.2 Initial capability set (v0.2) + +| Capability | Tier | Description | +|------------|------|-------------| +| `browser-task` | Generalist | Given intent + optional URL, credential id, and allowed-hosts list, plan and drive the browser to fulfill the intent. Uses RockBot skills + memory as priming. Returns a result or a structured intermediate artifact (e.g. a learned form schema). | +| `learn-form-schema` | Specialist (phase-1) | Given a URL and optional credential, introspect a form and return its schema — fields, types, dropdown dependencies, validation rules. Persists the schema as a skill (§5.6). Returns the schema to the caller for review. | +| `execute-form-batch` | Specialist (phase-2) | Given a learned schema (by id or inline) and a batch of row data, submit the form once per row. Streams A2A progress updates. Handles partial failure. | +| `fetch-page-title` | Specialist | Return the `<title>` of a URL. Inherited from milestone 2. | +| `extract-structured-data` | Specialist | Extract structured data from a page matching a natural-language description. Inherited from milestone 3. | + +The v0.1 `post-to-site` capability ships in the main codebase as a +regression test for credential handling. After step 7 it is removed +from the advertised skill list; `browser-task` subsumes its function. + +The v0.1 `monitor-page` and `fill-form` capabilities fold into +`browser-task` and do not ship as separate advertised skills. 
+ +### 5.3 Capabilities explicitly out of scope (v1) + +- Test automation (Playwright already does this). +- Raw browser primitive exposure (Microsoft's `@playwright/mcp` does + this; Foragent operates one level up — task-shaped, not tool-shaped). +- Visual regression testing. +- Form-filling for sensitive financial transactions, account creation, + or modifying security permissions (see §7.3). +- Multi-tab orchestration as a primary feature (may be used internally + but not advertised). +- Code generation from browser traces (e.g. "generate a Playwright + script that reproduces this"). Traces stay inside the learning + substrate. + +### 5.4 Capability design principles + +- **Task-level, not action-level.** "Submit these rows to that form" + is a capability; "click button" is not. +- **Clear contracts even for the generalist.** `browser-task`'s input + shape is typed (intent, url?, credentialId?, allowedHosts, maxSteps?); + only the *plan* inside is LLM-generated. +- **Return structured state, not narrative, when the caller needs to + act on it.** A learned form schema is typed JSON, not prose. A + submit-batch progress report is a typed status update, not a sentence. +- **Delegate to the learning substrate, don't reinvent it.** Site + knowledge lives in RockBot skills + memory; the capability reads and + writes, it does not own its own cache. +- **Credentials by reference.** Capabilities receive a credential id; + the broker (§6) resolves inside the Foragent process. + +### 5.5 Multi-phase flows + +Many real browser tasks are multi-phase with human or caller-side +review between phases. The motivating example: + +1. **Phase 1 — Learn.** Navigate to a form; introspect its fields and + dynamic dependencies; return a schema to the caller. +2. **Review.** The caller (human via their own UI, or another agent) + inspects the schema, decides whether to proceed, assembles input + data, validates. +3. 
**Phase 2 — Execute.** Submit the form N times against the learned + schema, streaming progress. + +Foragent's role is Phase 1 and Phase 2. The review step between them is the +caller's responsibility — Foragent is not in the review loop. + +To make this work: + +- Phase-1 capabilities **return structured artifacts** (form schemas, + extracted data, observed flow traces), not just status text. +- Phase-1 artifacts are **persisted in the learning substrate** (§5.6) + and get an id the caller can reference in Phase 2. +- Phase-2 capabilities **accept a learned-artifact reference or inline + artifact** as input, alongside per-invocation data. +- Phase-2 capabilities **stream progress and handle partial failure** + over A2A — not batch-atomic. + +This is not an A2A protocol change. A2A 1.0 already supports structured +response parts, streaming status updates, and task-id references; v0.2 +makes explicit use of all three. + +### 5.6 Learning substrate + +Foragent uses the RockBot framework's existing persistence for learned +site knowledge, rather than building a Foragent-local store. + +**What's used:** + +- **`ISkillStore`** (file-backed, BM25 + optional semantic retrieval — + `RockBot.Host.Abstractions` + `RockBot.Host.AgentMemoryExtensions.WithSkills()`). + Stores site knowledge as markdown skills. Two origin categories: + - **Human-authored skills** — operator-written primers for a site + (e.g. `sites/bsky.app/overview`). Treated as priming hints for the + generalist planner. + - **Agent-learned skills** — written by the generalist on successful + task completion (e.g. `sites/bsky.app/learned/login-flow`). Tagged + with `metadata.source = "agent-learned"` and an importance score. +- **`ILongTermMemory`** (file-backed, BM25 + semantic — + `WithLongTermMemory()`). Declarative observations that don't fit the + procedural skill shape: failed attempts, site-version notes, ambient + facts. + +**Skill naming:** `sites/{host}/{phase-or-intent}` — e.g.
+`sites/bsky.app/login`, `sites/bsky.app/compose-post`. Hierarchical `/` +nesting is supported by the store. `seeAlso` links cross-reference +skills for the same site so retrieval surfaces a small knowledge +cluster, not one skill at a time. + +**Retrieval at plan time:** + +1. Capability computes a search query from task intent + target URL host. +2. Queries skill store and memory store in parallel, top-K by relevance. +3. Retrieved content becomes priming context for the LLM planner. +4. New observations surface as writes after the task completes. + +**Structured artifacts (the form-schema case):** + +Learned form schemas are typed JSON, not markdown. Skill store holds +markdown content. Resolution deferred to step 8; current options are +(A) embed JSON in a fenced code block inside a skill, re-parse on +retrieval, or (B) add a parallel Foragent-local typed store keyed by +skill id. Framework-feedback tracks this as a candidate +`ISkillStore.AttachedArtifacts` extension if the shape recurs. + +### 5.7 Human-in-the-loop + +Review gates are the **caller's** responsibility, not Foragent's. + +- Foragent returns structured state at phase boundaries (§5.5). +- The caller decides whether to proceed. Human callers use their own + UI; agent callers make the decision programmatically. +- Foragent does **not** block waiting for review. Each phase is a + separate A2A task. + +A2A's `input-required` state is used only for mid-task credential +flows (2FA, §6.6). It is not used as a general "stop and let the human +review" mechanism — that coupling would force Foragent to hold browser +state across potentially-long human delays, which conflicts with the +one-context-per-task model (§3.5). --- @@ -385,9 +541,24 @@ flow is the recommended pattern. ### 7.1 Domain allowlists -Per-task allowlists for navigable domains. The calling agent can -constrain a task to specific origins; Foragent refuses navigation -outside the allowlist. Default is restrictive, not permissive. 
+Every capability invocation — especially the generalist `browser-task` +(§5.2) — **must** carry an explicit allowed-hosts list. Empty list +**rejects** the task; there is no default-permissive mode. + +Wildcards are supported to keep callers from having to enumerate every +subdomain: + +- Exact host: `bsky.app` +- Subdomain wildcard: `*.example.com` (matches `foo.example.com`, + `foo.bar.example.com`; does not match `example.com` itself — list + both if both are desired). +- Fully unrestricted: `*` (explicit only; still callable, still logged). + +Foragent refuses any navigation, fetch, or subframe load outside the +list before Playwright sees the URL. Per-tenant defaults (future, §7.5) +will let individual tasks inherit rather than list everything on every +call. Ad-hoc "navigate to whatever looks relevant" is explicitly not +supported — the generalist is powerful but bounded. ### 7.2 Network egress policies @@ -505,42 +676,65 @@ hard design questions until usage forces them. ### 9.1 Milestones -1. **Empty agent on RockBot framework.** Stand up Foragent.Agent that - registers itself as an A2A server with one trivial capability - (`fetch-page-title`). No Playwright yet. Goal: feel the bootstrap - cost of building a new agent on RockBot. - -2. **Real Playwright integration for that capability.** Add - Microsoft.Playwright NuGet, implement `fetch-page-title` for real - against actual web pages. Goal: feel the integration story between - RockBot's agent loop and the Playwright library. - -3. **Add a second capability** (`extract-structured-data`). Goal: feel - how the framework supports growing the capability surface. - -4. **Add credentials and a third capability that needs them** - (`post-to-site` for Bluesky). Goal: end-to-end credential broker - story including ICredentialBroker abstraction and at least one - real implementation. - -5. **Wire RockBot the agent up to call Foragent via A2A.** Goal: - validate the full loop. 
RockBot becomes Foragent's first real user. - -Each milestone produces framework feedback. Capture it. Some will be -small ergonomic fixes; some may be "the framework should really have a -concept of X." +**Steps 1–5 — shipped (v0.1):** + +1. **Empty agent on RockBot framework.** `fetch-page-title` with no + Playwright. +2. **Real Playwright integration for that capability.** +3. **Second capability** — `extract-structured-data` (Playwright + LLM). +4. **Credentials and `post-to-site` for Bluesky.** `ICredentialBroker` + + `InMemoryCredentialBroker` + `BlueskySitePoster`. +5. **RockBot wired to Foragent via A2A.** Validation loop; RockBot + becomes Foragent's first real user. + +**Steps 6–9 — v0.2 sequence:** + +6. **Baseline `browser-task` generalist.** LLM-in-the-loop planner built + directly on `Microsoft.Playwright` NuGet (no MCP sidecar, no + Stagehand — see Appendix A #16). Exposes a small `[AIFunction]` + tool set — `snapshot`, `click`, `type`, `navigate`, `wait_for`, + `done`, `fail` — through `IChatClient`. Uses `Page.AriaSnapshotAsync()` + ref-annotated snapshots and `Page.Locator("aria-ref=eN")` for ref + resolution. No learning substrate yet. Measure unaided success rate + on a small curated benchmark. Goal: establish the floor before + investing in priming. + +7. **Wire RockBot skills + memory as priming.** Register `ISkillStore` + + `ILongTermMemory` in Foragent's host. Retrieve relevant skills + into planner context; write agent-learned skills on success. Seed + one human-authored skill for `bsky.app`. Wire `IEmbeddingGenerator` + for semantic retrieval. Remove `post-to-site` from the advertised + skill list once `browser-task` + the learned bsky skill cover it. + Goal: prove the framework's persistence is the right substrate; + file issues if it isn't. + +8. **`learn-form-schema` + `execute-form-batch`.** First explicit + multi-phase capability pair. 
Structured JSON schema returned from + phase 1, batch execution with streaming per-row progress in phase 2. + Resolve open question #6 (how to persist typed JSON alongside + markdown skills) in the deliverable. + +9. **Deprecate subsumed specialists.** Review whether `fetch-page-title` + / `extract-structured-data` still pay their way or fold into + `browser-task` with equivalent cost. Land on the minimum advertised + capability set v0.2 actually needs. + +Each milestone produces framework feedback. Capture it in +`docs/framework-feedback.md` — some will be small ergonomic fixes; some +may be "the framework should really have a concept of X." ### 9.2 What is explicitly out of scope for v1 -- Container packaging beyond a single working Dockerfile -- Helm charts and production k8s manifests -- KEDA autoscaling integration -- Multi-tenant credential broker UIs -- Agent self-improvement / learning -- Browser pool management -- Stagehand-style natural-language-to-action layers (may be revisited - later; the internal LLM-based selector resolution is sufficient for - v1) +- Container packaging beyond a single working Dockerfile. +- Helm charts and production k8s manifests. +- KEDA autoscaling integration. +- Multi-tenant credential broker UIs. +- Browser pool management (single shared Chromium per pod — §3.5). +- Non-browser automation (desktop, mobile, API-only flows). + +(The v0.1 "no Stagehand-style natural-language-to-action layers" item +is deliberately removed. v0.2's `browser-task` *is* that layer, built +natively on Playwright NuGet — see Appendix A #16.) --- @@ -600,13 +794,12 @@ identifier for .NET. These are real design questions deferred until usage forces an answer. -1. **Internal LLM selection and tier routing.** Foragent will use - Microsoft.Extensions.AI for internal reasoning. Which tier routing - patterns from RockBot apply directly, and which are RockBot-specific? -2. **Stagehand-equivalent for .NET.** Stagehand is Node-only. 
Should - Foragent build an equivalent natural-language `page.act()` layer in - C# using its internal LLM? Defer to v2 unless v1 selector-resolution - proves insufficient. +1. ~~Internal LLM selection and tier routing.~~ **Closed** in v0.2 + §3.7 — Foragent uses RockBot's `TieredChatClientRegistry`; ships + with one model aliased across tiers; tier-awareness is opt-in for + capabilities. +2. ~~Stagehand-equivalent for .NET.~~ **Closed** in v0.2 — built + natively on `Microsoft.Playwright` NuGet; no Stagehand port, no + `@playwright/mcp` sidecar. See Appendix A #16. 3. **Storage state encryption at rest.** Storage state is sensitive but not as sensitive as raw credentials. Does it need stronger protection than the credential broker provides, or is broker-level fine? @@ -615,6 +808,17 @@ These are real design questions deferred until usage forces an answer. Defer until a capability actually needs to change shape. 5. **Tenant identity model.** A2A 1.0-preview's identity model is still evolving. Lock in the tenant identity story once A2A 1.0 stabilizes. +6. **Structured artifacts in `ISkillStore`.** Learned form schemas + (§5.6) are typed JSON; skills store markdown. Stretch the skill + shape (fenced JSON, re-parse on retrieval) or add a parallel + Foragent-local typed store keyed by skill id? Decide at step 8. +7. **Per-task budget.** How do we cap an LLM-in-the-loop task — max + steps, max tokens, wall-clock, cost? Proposed defaults: + `maxSteps=30`, `maxSeconds=120`, caller can raise within bounds. + Needed by step 6. +8. **Retry and failure semantics for batches.** In `execute-form-batch`, + is a row failure fatal or per-row? How are partial results streamed? + Needed by step 8. --- @@ -640,3 +844,8 @@ need to be revisited. | 13 | MIT license | Matches CSLA and the broader .NET OSS ecosystem. | | 14 | .NET 10, C# latest | Current stable .NET as of project start. | | 15 | Name: Foragent | Distinctive, self-explaining, available domains, no dev-tools collision.
| +| 16 | Build generalist `browser-task` on Microsoft.Playwright NuGet directly — no Stagehand port, no `@playwright/mcp` sidecar | Ref-annotated aria snapshots and `aria-ref=eN` locator resolution are Playwright features, not MCP-exclusive. `[AIFunction]` tool wrapping over `IChatClient` gives MCP-equivalent function-calling in-process. Keeps credential boundary (§6.1) clean and preserves v0.1 decision #1. | +| 17 | Use RockBot's `TieredChatClientRegistry` (Low/Balanced/High) with Balanced as the injected default | Future cost-optimization can route cheaper classes of work (extraction, snapshot summarization) to Low without capability rewrites. v0.2 ships with one model aliased across tiers. | +| 18 | Allowlists are mandatory per-task with wildcard support (`*.example.com`, `*`) | Generalist LLM-in-the-loop planner has much wider blast radius than fixed-flow specialists; empty list must reject. Wildcards keep callers from enumerating subdomains. | +| 19 | Learned site knowledge lives in RockBot's `ISkillStore` + `ILongTermMemory`, not a Foragent-local store | Framework-owned persistence is already packable, DI-registerable, and has BM25+semantic hybrid retrieval with importance weighting. Building parallel infrastructure would be duplicate work and would miss the framework-validation goal (§8). | +| 20 | Multi-phase flows (learn → review → execute) are expressed as separate A2A tasks, not one long-running task with `input-required` | Review gates are the caller's concern; Foragent would otherwise have to hold browser state across arbitrary human delays, breaking the one-context-per-task isolation model (§3.5). |
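
Reviewer note on decision #18: the allowlist matching rules in §7.1 are small enough to pin down as executable pseudocode. The sketch below is a language-neutral illustration in Python, not Foragent's C# implementation. The function name `host_allowed` and its signature are hypothetical, and it assumes hosts are compared case-insensitively, which the spec does not state.

```python
def host_allowed(host: str, allowed_hosts: list[str]) -> bool:
    """Return True if `host` matches at least one allowlist entry.

    Hypothetical helper illustrating the §7.1 rules; names are assumptions.
    """
    if not allowed_hosts:
        # An empty list rejects everything: there is no default-permissive mode.
        return False

    # Assumption: case-insensitive comparison, trailing root dot ignored.
    host = host.lower().rstrip(".")

    for entry in allowed_hosts:
        entry = entry.lower()
        if entry == "*":
            # Fully unrestricted, but only when the caller lists it explicitly.
            return True
        if entry.startswith("*."):
            suffix = entry[1:]  # "*.example.com" -> ".example.com"
            # Matches subdomains at any depth, but not the apex itself,
            # and requires at least one character before the suffix.
            if host.endswith(suffix) and len(host) > len(suffix):
                return True
        elif host == entry:
            # Exact host match.
            return True
    return False
```

Under these assumptions, `*.example.com` admits `foo.bar.example.com` but not `example.com`, so a caller who wants both lists both, as §7.1 requires. Whether an empty list should fail each navigation (as here) or reject the whole task up front at validation time is a design choice the sketch leaves open.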