From 9fac64e827f084393a46fe2846e282f516a2b392 Mon Sep 17 00:00:00 2001 From: Rockford Lhotka Date: Wed, 22 Apr 2026 21:37:15 -0500 Subject: [PATCH] Step 7.5: daily dream loop with task-shaped directives MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Wire agent.AddScheduling() + agent.WithDreaming() in Program.cs. Enable five subtypes relevant to a browser worker (main orchestrator, skill-optimize, skill-gap, sequence-skill, memory-mining); disable the eight personality-agent subtypes (preference, episode, tier-routing, entity, graph-consolidation, identity, DLQ, Wisp). RockBot ships no default directive content — intentionally, since the framework can't know what any given agent needs. Foragent authors its own five directives under src/Foragent.Agent/directives/, shipped via CopyToOutputDirectory so they land at /app/directives/ in the container. DreamService resolves each directive relative to AgentProfileOptions.BasePath, which Program.cs configures to "directives" (resolved against AppContext.BaseDirectory for relative values — confirmed by IL inspection). Dreams are opt-in: ForagentDreams:Enabled defaults false so `dotnet run` smoke tests don't burn tokens; docker-compose.yml sets it true for the full harness. Cron defaults to daily 03:00 UTC — framework default of every 12h is too frequent. ProtectedSkillPrefixes stays empty deliberately so operator primers get improved rather than frozen. framework-feedback.md step-7.5 entry captures the directive intentionality, the AgentProfileOptions.BasePath resolution path, and a candidate companion-package offering (Directives.Personality / Directives.Task) that would reduce onboarding cost without compromising the no-hardcoded-content principle. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- CLAUDE.md | 19 ++++- docker-compose.yml | 6 ++ docs/framework-feedback.md | 75 +++++++++++++++++++ src/Foragent.Agent/Foragent.Agent.csproj | 7 ++ src/Foragent.Agent/Program.cs | 60 +++++++++++++++ src/Foragent.Agent/directives/dream.md | 57 ++++++++++++++ .../directives/memory-mining.md | 75 +++++++++++++++++++ .../directives/sequence-skill.md | 75 +++++++++++++++++++ src/Foragent.Agent/directives/skill-gap.md | 68 +++++++++++++++++ .../directives/skill-optimize.md | 68 +++++++++++++++++ 10 files changed, 508 insertions(+), 2 deletions(-) create mode 100644 src/Foragent.Agent/directives/dream.md create mode 100644 src/Foragent.Agent/directives/memory-mining.md create mode 100644 src/Foragent.Agent/directives/sequence-skill.md create mode 100644 src/Foragent.Agent/directives/skill-gap.md create mode 100644 src/Foragent.Agent/directives/skill-optimize.md diff --git a/CLAUDE.md b/CLAUDE.md index a5f2a9b..ffb3e72 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -4,7 +4,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co ## Status -Foragent is at **milestone 7 shipped, step 8 next**. Three capabilities are advertised (`browser-task`, `fetch-page-title`, `extract-structured-data`); the A2A loop is wired end-to-end against RockBot via the `docker-compose.yml` harness pinned to `rockylhotka/rockbot-agent:0.8.5`. Step 6 shipped the generalist `browser-task` planner (LLM-in-the-loop over ref-annotated aria snapshots + `aria-ref=eN` locator resolution, built on `Microsoft.Playwright` 1.59 — bumped from 1.50 for the Ai aria-snapshot mode; see Appendix A #16). Tiered chat clients are wired via `AddRockBotTieredChatClients` with one model aliased across Low/Balanced/High per spec §3.7. 
Step 7 wired the learning substrate: `ISkillStore` + `ILongTermMemory` via `WithSkills()` + `WithLongTermMemory()`, `BrowserTaskPriming` injects retrieved skill + memory content into the planner prompt, successful tasks write a learned skill at `sites/{host}/learned/{slug}`, and `BskySeedSkillService` seeds `sites/bsky.app/login` on first start (idempotent — only writes when absent). Embeddings are optional and configured separately under `ForagentEmbeddings` so they can live on a different Azure Foundry subscription than the chat model; missing embeddings downgrade retrieval to BM25-only with a single startup warning. The step-6 unaided benchmark (3/3) still passes after the priming wiring. `post-to-site` has been removed from both the advertised skill list and the codebase (greenfield deletion — `browser-task` + the learned bsky skill cover the use case). The governing spec is `docs/foragent-specification.md` **v0.2**. Storage-state persistence, 2FA input-required flow, k8s-secrets broker, and per-tenant credential namespaces remain deferred — tracked in `docs/framework-feedback.md`. Framework-level observations from each milestone are captured in `docs/framework-feedback.md`. +Foragent is at **milestone 7.5 shipped, step 8 next**. Three capabilities are advertised (`browser-task`, `fetch-page-title`, `extract-structured-data`); the A2A loop is wired end-to-end against RockBot via the `docker-compose.yml` harness pinned to `rockylhotka/rockbot-agent:0.8.5`. Step 6 shipped the generalist `browser-task` planner (LLM-in-the-loop over ref-annotated aria snapshots + `aria-ref=eN` locator resolution, built on `Microsoft.Playwright` 1.59 — bumped from 1.50 for the Ai aria-snapshot mode; see Appendix A #16). Tiered chat clients are wired via `AddRockBotTieredChatClients` with one model aliased across Low/Balanced/High per spec §3.7. 
Step 7 wired the learning substrate: `ISkillStore` + `ILongTermMemory` via `WithSkills()` + `WithLongTermMemory()`, `BrowserTaskPriming` injects retrieved skill + memory content into the planner prompt, successful tasks write a learned skill at `sites/{host}/learned/{slug}`, and `BskySeedSkillService` seeds `sites/bsky.app/login` on first start (idempotent — only writes when absent). Embeddings are optional and configured separately under `ForagentEmbeddings` so they can live on a different Azure Foundry subscription than the chat model; missing embeddings downgrade retrieval to BM25-only with a single startup warning. The step-6 unaided benchmark (3/3) still passes after the priming wiring. `post-to-site` has been removed from both the advertised skill list and the codebase (greenfield deletion — `browser-task` + the learned bsky skill cover the use case). The governing spec is `docs/foragent-specification.md` **v0.2**. Storage-state persistence, 2FA input-required flow, k8s-secrets broker, and per-tenant credential namespaces remain deferred — tracked in `docs/framework-feedback.md`. Framework-level observations from each milestone are captured in `docs/framework-feedback.md`. ## Build / test @@ -103,7 +103,22 @@ On successful completion (`state.IsDone`), `BrowserTaskCapability.TryWriteLearne `BskySeedSkillService` (IHostedService) seeds `sites/bsky.app/login` on first start by calling `ISkillStore.GetAsync` and only writing if absent — docker volume recreation reseeds cleanly; operator edits to the skill through other channels are preserved. -Skill naming follows spec §5.6: `sites/{host}/{intent}` for human-authored primers, `sites/{host}/learned/{slug}` for agent-generated. `Skill.SeeAlso` cross-references related skills to surface clusters rather than single entries. **Note:** `Skill` (from `RockBot.Host 0.8.5`) does not carry tags, metadata, or importance — the `agent-learned` distinction is encoded in the name prefix only. 
+Skill naming follows spec §5.6: `sites/{host}/{intent}` for human-authored primers, `sites/{host}/learned/{slug}` for agent-generated. `Skill.SeeAlso` cross-references related skills to surface clusters rather than single entries. **Note:** `Skill` (from `RockBot.Host 0.8.5`) does not carry tags, metadata, or importance — the `agent-learned` distinction is encoded in the name prefix only. The dream loop (below) keeps the distinction from mattering at retrieval time: skills get improved, merged, and deduped across origins on a daily cadence. + +## Dream loop (step 7.5) + +Foragent runs a daily RockBot dream pass to consolidate accumulated skills and memory. Wired via `agent.AddScheduling()` + `agent.WithDreaming(opts)` inside `AddRockBotHost`. Five subtypes are enabled, eight are off: + +- **Enabled:** main orchestrator (`dream.md`), skill-optimize (merge/dedup), skill-gap (detect missing coverage), sequence-skill (detect repeated tool patterns), memory-mining (promote durable observations to `ILongTermMemory`). +- **Disabled:** preference inference, episode extraction, tier-routing review, entity extraction, graph consolidation, identity reflection, DLQ review, Wisp failure analysis. All personality-agent territory. + +`ProtectedSkillPrefixes = []` — empty on purpose. Operator primers like `sites/bsky.app/login` are *improved in place* by the dream, not frozen; the seed service only writes on a cold boot, so later dream-authored improvements survive restarts. Operators who need to reset a primer can delete the stored skill file and bounce the host. + +Directive files live at `src/Foragent.Agent/directives/*.md` and ship with the binary via `CopyToOutputDirectory=PreserveNewest`. `DreamService` resolves each `DreamOptions.*DirectivePath` relative to `AgentProfileOptions.BasePath` (confirmed by IL inspection — relative base paths combine against `AppContext.BaseDirectory`, which is the binary output dir). 
Program.cs configures `AgentProfileOptions.BasePath = "directives"`; no `WithProfile()` call, Foragent doesn't need the personality-profile doc set. + +Dreams are **opt-in** via `ForagentDreams:Enabled`. `appsettings.json` defaults false so `dotnet run` smoke tests don't trigger scheduled LLM calls; `docker-compose.yml` sets `ForagentDreams__Enabled=true` because that's the "full operating mode" shape. `CronSchedule` defaults to `0 3 * * *` (03:00 UTC daily) — the framework default of every 12 hours is too frequent for a browser worker. `InitialDelay` is the framework default (5 minutes from start), which is fine in prod but worth noting if someone spins up the compose harness for a 10-minute smoke session. + +**Don't add directive content to the RockBot agent's `deploy/rockbot-seed/` set.** Foragent's directives are task-shaped (browser outcomes, site knowledge); RockBot's are personality-shaped (identity, preferences). Mixing them defeats the reason Foragent authored its own. ## Credentials diff --git a/docker-compose.yml b/docker-compose.yml index 2f0b7e1..bd170e4 100644 --- a/docker-compose.yml +++ b/docker-compose.yml @@ -81,6 +81,12 @@ services: # mounted volume below so learned site knowledge survives restarts. ForagentMemory__SkillsPath: /data/foragent/skills ForagentMemory__MemoryPath: /data/foragent/memory + # Step 7.5: daily dream pass to consolidate accumulated skills + + # memory. Disabled by default in appsettings.json so `dotnet run` smoke + # tests don't burn tokens; opt-in here because the compose harness is + # the "full operating mode" shape. CronSchedule default is 03:00 UTC + # daily — override via ForagentDreams__CronSchedule if needed. + ForagentDreams__Enabled: "true" # Optional Bluesky credential used by future credentialed browser-task # runs. Flat id (no slashes) because env-var keys use __ to separate # config segments. Leave unset to disable. 
diff --git a/docs/framework-feedback.md b/docs/framework-feedback.md index dcc86a8..08915c8 100644 --- a/docs/framework-feedback.md +++ b/docs/framework-feedback.md @@ -409,3 +409,78 @@ scenarios still pass on first attempt — the priming wiring itself adds no overhead when the stores return nothing, confirming the fail-soft contract. A separate benchmark with a populated store is step-8-or-later work (need a curated skill set worth priming against). + +## Step 7.5 — dream loop + +### Framework observations + +- **Dream directives don't ship with the framework.** `DreamOptions` + defaults to bare filenames (`dream.md`, `skill-optimize.md`, + `sequence-skill.md`, etc.) that `DreamService` reads at runtime. The + `RockBot.Host`/`RockBot.Host.Abstractions` assemblies carry **zero + embedded resources** — no `.md` defaults, no stub directives. The + RockBot agent ships its directive set inside its docker image + (`/app/agent/*.md`), and `docker-compose.yml`'s `rockbot-init` step + copies them to `/data/agent/`. This is intentional (per operator + guidance: the framework can't know what any given consumer needs), + but it means every new framework consumer carries a ~300-line + directive-authoring cost as a prerequisite to turning on dreams. + Candidate framework offering (not an ask, since the intentionality + is real): optional companion packages like + `RockBot.Host.Directives.Personality` and + `RockBot.Host.Directives.Task` that ship starter directive sets, + selectable by `WithDreaming(opts => opts.UsePersonalityDefaults())` + or similar. Reduces onboarding cost without compromising the + no-hardcoded-content principle. + +- **Directive paths resolve via `AgentProfileOptions.BasePath`.** IL + inspection of `DreamService`'s `ResolvePath` helper confirms: for + each directive (e.g. `opts.SkillOptimizeDirectivePath = + "skill-optimize.md"`), the final path is: + `Path.Combine(basePath, directive)` where `basePath` comes from + `IOptions<AgentProfileOptions>.Value.BasePath`. 
If `basePath` is + relative, it combines against `AppContext.BaseDirectory` (binary + output dir). Foragent configures `AgentProfileOptions.BasePath = + "directives"` and ships markdown files alongside the binary via + `CopyToOutputDirectory=PreserveNewest` — no `WithProfile()` call + needed. Worth documenting in RockBot's dream-loop guide: consumers + that don't load a personality profile still need to Configure the + options type because that's the single source of truth for directive + base paths. + +- **`DreamService`'s constructor pulls 17 dependencies.** Everything + the dream subtypes might need (`IConversationLog`, `IDlqSampler`, + `IWispExecutionLog`, `IKnowledgeGraph`, `TierRoutingLogger`, …) is a + hard ctor parameter, so the framework registers stub / no-op + implementations for the ones a given agent doesn't use. Works, but + consumers who turn off a subtype shouldn't need its stores in DI at + all. Candidate framework refactor: make the subtype dependencies + optional (`IEnumerable` or similar) so + `DreamService.StartAsync` enumerates whatever's registered and skips + what isn't. Lower priority than the directives ask. + +- **`ProtectedSkillPrefixes` literal-only.** The list is + `List<string>` and (from the IL) matched via `StartsWith` — no + wildcard expansion. Foragent ships it empty; operators can add + specific literals if they need to freeze a skill. Noting because + wildcard-style patterns (`sites/*/login`) would be a natural + extension and aren't there today. + +### Manual verification plan + +Automated tests for the dream loop would require faking the scheduler +and running an end-to-end pass — out of scope. Verified manually via +docker-compose: + +- Container starts with `ForagentDreams__Enabled=true` → startup log + shows `ForagentDreams enabled; daily dream pass on schedule '0 3 * * + *'`. +- Container starts with dreams disabled → log shows the opposite and + `DreamService` is not registered. 
+- Directive files present at `/app/directives/*.md` inside the + container (verified via `docker compose exec foragent ls + /app/directives/`). + +First live dream pass against a non-empty skill store will be observed +after enough `browser-task` runs accumulate — probably step 8 or when +the operator turns the harness on for a sustained session. diff --git a/src/Foragent.Agent/Foragent.Agent.csproj b/src/Foragent.Agent/Foragent.Agent.csproj index 42339da..0474bf6 100644 --- a/src/Foragent.Agent/Foragent.Agent.csproj +++ b/src/Foragent.Agent/Foragent.Agent.csproj @@ -25,4 +25,11 @@ + + + + diff --git a/src/Foragent.Agent/Program.cs b/src/Foragent.Agent/Program.cs index db6412c..a0fceef 100644 --- a/src/Foragent.Agent/Program.cs +++ b/src/Foragent.Agent/Program.cs @@ -83,6 +83,21 @@ var skillsPath = memorySection["SkillsPath"] ?? "data/skills"; var memoryPath = memorySection["MemoryPath"] ?? "data/memory"; +// Dream loop (step 7.5). Opt-in via ForagentDreams:Enabled — default off so +// local `dotnet run` smoke tests don't trigger scheduled LLM calls. The +// docker-compose harness sets this to true. CronSchedule defaults to 03:00 +// UTC daily; framework default is every 12 hours, too frequent for a browser +// worker. Directive files ship alongside the binary under ./directives/ — +// DreamService resolves each directive path relative to +// AgentProfileOptions.BasePath (confirmed via IL inspection; relative paths +// combine against AppContext.BaseDirectory). +var dreamsSection = builder.Configuration.GetSection("ForagentDreams"); +var dreamsEnabled = dreamsSection.GetValue<bool?>("Enabled") ?? false; +var directivesPath = dreamsSection["DirectivesPath"] ?? "directives"; +var dreamsCron = dreamsSection["CronSchedule"] ?? 
"0 3 * * *"; + +builder.Services.Configure<AgentProfileOptions>(opts => opts.BasePath = directivesPath); + builder.Services.AddRockBotHost(agent => { agent.WithIdentity(agentName); @@ -106,6 +121,38 @@ agent.WithSkills(opts => opts.BasePath = skillsPath); agent.WithLongTermMemory(opts => opts.BasePath = memoryPath); + if (dreamsEnabled) + { + agent.AddScheduling(); + agent.WithDreaming(opts => + { + opts.Enabled = true; + opts.CronSchedule = dreamsCron; + + // Task-shaped dream subtypes (see directives/dream.md). + opts.SkillGapEnabled = true; + opts.SequenceSkillDetectionEnabled = true; + opts.MemoryMiningEnabled = true; + + // Personality-shaped subtypes — not applicable to a browser + // worker. Disabling these skips both the LLM call and the + // directive-file lookup. + opts.PreferenceInferenceEnabled = false; + opts.EpisodeExtractionEnabled = false; + opts.TierRoutingReviewEnabled = false; + opts.EntityExtractionEnabled = false; + opts.GraphConsolidationEnabled = false; + opts.IdentityReflectionEnabled = false; + opts.DlqReviewEnabled = false; + opts.WispFailureAnalysisEnabled = false; + + // Empty protected list — the goal is that the dream improves + // primer skills over time, not that primers are frozen + // (operator can still edit them through other channels). + opts.ProtectedSkillPrefixes = []; + }); + } + agent.Services.AddForagentCapabilities(); agent.Services.AddHostedService(); }); @@ -157,6 +204,19 @@ + "Set ForagentEmbeddings:Endpoint/ModelId/ApiKey to enable semantic retrieval."); } +if (dreamsEnabled) +{ + app.Logger.LogInformation( + "ForagentDreams enabled; daily dream pass on schedule '{Cron}' will consolidate skills and memory.", + dreamsCron); +} +else +{ + app.Logger.LogInformation( + "ForagentDreams disabled. 
Learned skills will accumulate without consolidation; " + + "set ForagentDreams:Enabled=true to turn on the daily dream pass."); +} + app.Run(); public partial class Program; diff --git a/src/Foragent.Agent/directives/dream.md b/src/Foragent.Agent/directives/dream.md new file mode 100644 index 0000000..1ac1b1d --- /dev/null +++ b/src/Foragent.Agent/directives/dream.md @@ -0,0 +1,57 @@ +# Foragent dream loop + +You are the dream pass for **Foragent**, a task-level browser-automation +agent built on the RockBot framework. The framework fires this dream on +a daily schedule; your role is to improve the agent's accumulated site +knowledge without any user-facing interaction. + +## What Foragent does + +Foragent exposes one generalist capability (`browser-task`) and two +specialists. Every `browser-task` invocation runs an LLM-in-the-loop +planner over a small tool surface (`snapshot`, `click`, `type`, +`navigate`, `wait_for`, `done`, `fail`) against a real Chromium browser +in an isolated context. Each successful run writes a **learned skill** +at `sites/{host}/learned/{intent-slug}` describing the flow that +worked. Operators may also seed **primer skills** at `sites/{host}/{…}` +as hand-written site guides. + +## What this dream pass is for + +Turn an accumulating pile of single-shot learned skills into a smaller, +better, more retrievable body of site knowledge. Specific passes are +driven by their own directives: + +- `skill-optimize.md` — merge duplicate / overlapping skills for the + same site into a single clearer entry. +- `skill-gap.md` — look at recent failures and propose what skill would + have helped, flagging the gap in long-term memory. +- `sequence-skill.md` — find repeated tool-call patterns across many + runs and propose a canonicalised named sequence. +- `memory-mining.md` — promote durable observations from the tool-call + log into `ILongTermMemory` so they prime future planning. 
+ +Other RockBot subtypes (identity reflection, preference inference, +episode extraction, entity / knowledge-graph consolidation, tier-routing +review, Wisp failure analysis, DLQ review) are disabled for Foragent — +they serve personality-driven agents, not a browser worker. + +## Ground rules for every pass + +- **Do not invent site behaviour.** Every claim in a skill or memory + entry must trace back to tool-call log evidence. "When Bluesky login + fails, retry" is fine only if the trace log shows that pattern. +- **Never include credential values, typed field contents, or tokens.** + The trace log captures field *lengths*, not *values*. If you see any + string that looks like a password / code / token in content you're + producing, stop and strip it. +- **Prefer concrete selectors and landing URLs** ("click the element + labelled `Next`" / "navigate to `/compose`") over vague guidance ("go + to the compose page"). Future planners retrieve these to save + snapshot round-trips. +- **Protected skills** listed in `DreamOptions.ProtectedSkillPrefixes` + must never be deleted and should be *improved* in place rather than + replaced — edit their Content and Summary, keep the Name. +- **Drop data, don't grow it.** A consolidated skill should be *shorter* + than the sum of its sources, or the consolidation isn't earning its + keep. diff --git a/src/Foragent.Agent/directives/memory-mining.md b/src/Foragent.Agent/directives/memory-mining.md new file mode 100644 index 0000000..31f220d --- /dev/null +++ b/src/Foragent.Agent/directives/memory-mining.md @@ -0,0 +1,75 @@ +# Memory-mining pass + +Goal: promote durable observations from recent `browser-task` runs +into `ILongTermMemory` so they prime future planning without growing +the skill store. + +## What belongs in memory (not in skills) + +Skills are **procedural** — "how to do X on site Y." Memory is +**declarative** — facts and observations that don't fit the +how-to-do-X shape. 
Examples of good memory entries: + +- "bsky.app enforces an email-code challenge roughly 1-in-10 logins + from fresh contexts" — site behaviour, not a procedure. +- "example.com served a 503 maintenance page on 2026-04-21; retries + after 2026-04-22 succeeded" — time-bounded incident. +- "the bsky compose editor rejects `FillAsync` on its ProseMirror + root; only keystroke-based typing works" — tooling quirk worth + remembering across capabilities. +- "Cloudflare challenge pages show a checkbox labelled 'Verify you are + human'; no successful automated bypass has been observed" — a + concrete negative finding. + +## What does NOT belong in memory + +- Credential values, tokens, typed field contents. +- Specific user data (post text, usernames, message bodies). +- One-off successful runs that are already captured by a learned + skill. +- Generic observations ("sites take time to load") — too vague to + retrieve usefully. + +## Inputs + +Recent tool-call logs and browser-task results, plus the existing +memory entries (so you don't duplicate). + +## What to look for + +1. **Site-level behaviours** that appear across multiple runs (captcha + prompts, rate-limit responses, maintenance windows, DOM changes). +2. **Tooling quirks** — situations where a capability's tool call + behaved unexpectedly in a specific way that would save future runs + time to know about. +3. **Negative findings** — things that were tried and *didn't* work, + saving a future planner from repeating the attempt. + +## Output format + +For each memory worth recording: + +``` +MEMORY {category} | [tags] +{One-paragraph observation. Lead with the specific, observable fact. +Keep under ~80 words. If it has a date boundary, include it +explicitly.} +``` + +Category should be `sites/{host}` when the observation is site-specific, +or a general category like `browser-tooling`, `captcha-patterns`, +`rate-limit-patterns` otherwise. 
+ +Tags are a free-form subset of: `site-behaviour`, `tooling-quirk`, +`negative-finding`, `incident`, `rate-limit`, `captcha`, `dom-change`. +Keep to 1-3 tags per entry. + +## What not to do + +- **Do not** emit memory entries for already-captured facts. Skim the + existing memory first. +- **Do not** write essays. A memory entry should be a single tight + paragraph a future planner can retrieve and integrate quickly. +- **Do not** include credential values, typed content, or user data. +- **Do not** create memory entries for facts that would be better as a + skill — if it answers "how do I do X," it's a skill, not a memory. diff --git a/src/Foragent.Agent/directives/sequence-skill.md b/src/Foragent.Agent/directives/sequence-skill.md new file mode 100644 index 0000000..04dd3b4 --- /dev/null +++ b/src/Foragent.Agent/directives/sequence-skill.md @@ -0,0 +1,75 @@ +# Sequence-skill detection pass + +Goal: find repeated tool-call patterns across successful `browser-task` +runs and propose a named, reusable skill so future planners can +retrieve the pattern directly instead of rediscovering it. + +## Inputs + +The tool-call log for recent `browser-task` runs, grouped by primary +host. Each entry includes: +- The intent text. +- Ordered tool-call names (arguments omitted for privacy — you see + `type(ref, …)` without the value). +- Final outcome (`done` / `fail` / `incomplete`). +- Step count and duration. + +## What to look for + +1. **Recurrent prefixes.** Multiple successful runs that start with the + same 3+ tool calls (often "navigate to login URL, snapshot, click + Sign-in"). That's a candidate login primer. + +2. **Recurrent mid-sequences.** A 4–6 step pattern that appears inside + runs with different overall intents — e.g. "click menu → click + Settings → navigate Settings URL" appears in three different + settings-related tasks. That's a candidate navigation primer. + +3. 
**Recurrent error-recovery.** A pattern where the planner hits a + specific state, recovers, and succeeds (e.g. "dismiss cookie banner + → retry click"). Worth a primer so future runs skip the recovery + phase. + +## Threshold + +Require **at least 3 distinct successful runs** exhibiting the pattern +before proposing a skill. Two matches is coincidence; three is a +pattern worth remembering. + +## Output format + +For each sequence-skill candidate, emit: + +``` +UPSERT sites/{host}/{slug} | {summary-15-words-or-less} +# {Human-readable title} + +**When to use:** {one-sentence trigger — what a future planner's intent +text or current URL should look like} + +**Steps:** +1. {tool-call + reason, e.g. "navigate to https://{host}/login"} +2. … + +**Known pitfalls:** {brief; only if the evidence shows a recovery +pattern} + +**See also:** {list of existing related skills} +--- +``` + +Slug should be a short kebab-case name. Prefer verbs ("open-compose", +"dismiss-cookie-banner") over nouns. + +## What not to do + +- **Do not** propose a sequence skill where the only common element is + "navigate to site root, then snapshot." That's not a pattern, that's + the default shape of every task. +- **Do not** emit argument values — the log omits them for a reason. +- **Do not** emit sequence skills longer than ~8 steps. Long sequences + are fragile against site changes and benefit the next planner less + than a well-written shorter primer. +- **Do not** duplicate an existing skill. If `sites/{host}/login` + already exists and your candidate sequence matches it, either emit + nothing or emit an UPSERT that *improves* the existing one. 
diff --git a/src/Foragent.Agent/directives/skill-gap.md b/src/Foragent.Agent/directives/skill-gap.md new file mode 100644 index 0000000..5796aa1 --- /dev/null +++ b/src/Foragent.Agent/directives/skill-gap.md @@ -0,0 +1,68 @@ +# Skill-gap detection pass + +Goal: identify tasks that failed or struggled because Foragent was +missing a relevant site skill, and record the gap so future operator +priming or future dream passes can close it. + +## Inputs + +You will be given the recent `browser-task` traces where the outcome +was either: + +- `failed` (planner called `fail()`), or +- `incomplete` (budget exhausted before `done()` or `fail()`), or +- `done` but with an unusually high step count relative to peers for + the same host. + +Alongside each trace, you will see the skills (if any) that were +injected into the planner prompt as priming. + +## What to look for + +1. **Failures where no primer existed for the host.** If the task + targeted `sites/foo.example/*` and the skill store contains nothing + under `sites/foo.example/`, that's the clearest kind of gap. + +2. **Failures where the primer content did not cover the intent.** If + the task intent was "compose a post" but the only retrieved skill + was `sites/{host}/login`, the gap is a missing compose primer. + +3. **Recurring pain points within a single host.** Three failures on + bsky.app's 2FA email-code prompt in a week is worth a specific + entry, even if one successful run exists. + +4. **Tool thrash.** A trace with 30+ `snapshot` calls and no `click` + that succeeded usually means the planner didn't know which element + to target — a gap in selector-level guidance. + +## Output format + +For each gap identified, emit a memory entry: + +``` +MEMORY sites/{primary-host} | [tags] +Missing primer / selector coverage for {intent summary}. Evidence: {N +failed traces over {period}, most recent {date}}. Suggested content: +{one-paragraph hint on what a future primer should cover}. 
+``` + +Tags should be a subset of: `gap`, `failure-cluster`, +`missing-primer`, `selector-ambiguous`, `2fa-blocked`, +`budget-exhausted`. + +Multiple gaps per host are fine — keep them separate entries so +retrieval surfaces the specific flavour of gap that matches a future +query. + +## What not to do + +- **Do not** write a gap entry for a single failed task. Require at + least two failures or one failure plus one struggle (high step count) + before flagging. +- **Do not** invent selectors or URLs. The gap entry describes the + *shape* of the missing knowledge, not the content of it. +- **Do not** include the failing intent verbatim if it contains + personal content, usernames, or data — describe the *pattern*. +- **Do not** blame the planner for site changes. If the evidence shows + the site itself changed (new DOM, new domain), record that as a site + event, not a skill gap. diff --git a/src/Foragent.Agent/directives/skill-optimize.md b/src/Foragent.Agent/directives/skill-optimize.md new file mode 100644 index 0000000..079aaf7 --- /dev/null +++ b/src/Foragent.Agent/directives/skill-optimize.md @@ -0,0 +1,68 @@ +# Skill consolidation pass + +Goal: reduce the skill store to a smaller, clearer body of site +knowledge by merging duplicates and rewriting entries for clarity. + +## What to look for + +You will be given the current list of skills and their Summaries, +grouped by name prefix. Focus on: + +1. **Exact duplicates** — multiple skills under `sites/{host}/learned/` + that describe the same intent (e.g. `login-to-bsky-app`, + `sign-in-bluesky`, `authenticate-bsky`). Merge into one. + +2. **Primer + learned pairs** — an operator primer at `sites/{host}/{x}` + alongside one or more learned skills at `sites/{host}/learned/{…}` + describing the same flow. Improve the primer with whatever the + learned skills discovered (updated selectors, new failure modes, + faster paths). Delete the redundant learned entries. + +3. 
**Stale or superseded content** — a skill that claims "click the + button labelled X" when a later learned skill shows the label is now + Y. Prefer the newer evidence and say so. + +4. **Over-long skills** — anything past ~500 words that spends most of + its content on one-off anecdote rather than reusable procedure. + Rewrite for density. + +## When to merge + +Merge two skills when **all three** are true: +- They describe the same landing URL or same sequence of intents. +- Their successful flows overlap by more than half. +- The combined skill would still fit comfortably inside ~400 words. + +Do not merge skills that happen to target the same site but different +intents (e.g. `sites/bsky.app/login` vs `sites/bsky.app/compose-post`). +Those stay separate — different retrieval contexts. + +## Output format + +For each change you want to make, emit one of: + +- `DELETE {skill-name}` — remove a redundant skill. +- `UPSERT {skill-name} | {summary-15-words-or-less}` + followed by a markdown body on subsequent lines up to `---`. + +Example: + +``` +DELETE sites/bsky.app/learned/sign-in-with-app-password +DELETE sites/bsky.app/learned/log-into-bluesky +UPSERT sites/bsky.app/login | Log in to bsky.app with an app password; watch for 2FA challenges. +Bluesky's public web app is at https://bsky.app… +(full markdown body) +--- +``` + +## What not to do + +- **Do not** delete a skill whose name is in the protected-prefixes + list. Improve its content in place via UPSERT instead. +- **Do not** merge across sites. `sites/foo.example/login` and + `sites/bar.example/login` are different knowledge. +- **Do not** write speculative content. If the trace evidence does not + mention a specific selector, don't invent one — leave it vague. +- **Do not** drop citations — if a learned skill references a URL or + observed selector, preserve that detail in the merged output.