From 539aa437d4f1b354a906578279c6396a0fce22e0 Mon Sep 17 00:00:00 2001
From: Jonathan Jackson
Date: Fri, 22 May 2026 03:07:33 -0600
Subject: [PATCH] docs: compress SKILL.md change logs + compact April PM runs
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two cleanup items deferred from PR #406:

1. SKILL.md ## Change Log compression (28 files)
   - Drop "Initial version" entries (pure noise)
   - Keep most recent 3 substantive entries per change log
   - Drop entire section if only "Initial version" remained (4 sections)
   - ~25KB of historical narrative removed
   - Top trims: idea-to-pdd 14→3, ocs-chatbot-qa 11→3,
     connect-opp-setup 10→3, llo-launch 9→3, ocs-agent-setup 9→3
   - Older entries preserved in git history; the most recent 3
     entries are what an agent needs to understand current behavior.

2. .claude/pm/runs/ April logs compacted (10 files → 1 learning)
   - Apr 8 → Apr 29 covered 10 PM scout cycles that shaped the
     0.13.x platform: archetypes-as-first-class,
     real-run-beats-spec-review, class-level preventers,
     doctor-probes-as-invariant-enforcement,
     operator-can-fix vs can't, stale-metadata-is-dangerous.
   - Distilled into docs/learnings/2026-04-pm-runs-compacted.md
     (cross-cycle patterns + shipped-findings table +
     dropped-items list). Each pattern points at the
     Convention/Gotcha it hardened into.
   - May 10 perf-lens run kept as-is (recent enough to read raw).
   - Originals preserved in git history.

Files changed: 43 (28 SKILL.md compressions + 10 PM deletions +
1 new learning + 4 version files).

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .claude-plugin/marketplace.json               |   4 +-
 .claude-plugin/plugin.json                    |   2 +-
 .../runs/2026-04-08-focus-group-framework.md  |  77 ---------
 .../2026-04-15-end-to-end-user-journey.md     |  71 --------
 .../2026-04-16-core-workflow-end-to-end.md    |  70 --------
 .../runs/2026-04-17-internal-dimagi-admins.md |  67 --------
 .../runs/2026-04-19-qa-eval-iteration-loop.md | 109 -------------
 ...4-20-collection-clone-and-mcp-preflight.md | 105 ------------
 .claude/pm/runs/2026-04-20-dead-env-vars.md   |  92 -----------
 .../2026-04-20-env-drift-adoption-blockers.md |  86 ----------
 ...26-04-28-turmeric-dogfood-ocs-contracts.md | 151 ------------------
 ...29-eval-rubric-polish-operator-cant-fix.md |  95 -----------
 VERSION                                       |   2 +-
 docs/learnings/2026-04-pm-runs-compacted.md   |  64 ++++++++
 package.json                                  |   2 +-
 skills/app-deploy/SKILL.md                    |   1 -
 skills/app-release/SKILL.md                   |   2 --
 skills/app-screenshot-capture/SKILL.md        |   3 ---
 skills/connect-opp-setup/SKILL.md             |   7 -
 skills/connect-program-setup/SKILL.md         |   1 -
 skills/cycle-grade/SKILL.md                   |   1 -
 skills/email-communicator/SKILL.md            |   6 -
 skills/flw-data-review/SKILL.md               |   1 -
 skills/idea-to-pdd-eval/SKILL.md              |   3 ---
 skills/idea-to-pdd/SKILL.md                   |  11 --
 skills/learnings-summary/SKILL.md             |   6 -
 skills/llo-feedback/SKILL.md                  |   1 -
 skills/llo-launch/SKILL.md                    |   6 -
 skills/llo-onboarding/SKILL.md                |   3 ---
 skills/llo-uat/SKILL.md                       |   1 -
 skills/ocs-agent-setup/SKILL.md               |   6 -
 skills/ocs-chatbot-eval/SKILL.md              |   4 ----
 skills/ocs-chatbot-qa/SKILL.md                |   8 -
 skills/opp-closeout/SKILL.md                  |   1 -
 skills/pdd-to-app-journeys/SKILL.md           |   2 --
 skills/pdd-to-deliver-app/SKILL.md            |   4 ----
 skills/pdd-to-learn-app/SKILL.md              |   4 ----
 skills/pdd-to-test-prompts/SKILL.md           |   1 -
 skills/pdd-to-work-order-eval/SKILL.md        |   1 -
 skills/pdd-to-work-order-qa/SKILL.md          |   6 -
 skills/pdd-to-work-order/SKILL.md             |   1 -
 skills/solicitation-create/SKILL.md           |   3 ---
 skills/timeline-monitor/SKILL.md              |   6 -
 43 files changed, 69 insertions(+), 1028 deletions(-)
 delete mode 100644
.claude/pm/runs/2026-04-08-focus-group-framework.md delete mode 100644 .claude/pm/runs/2026-04-15-end-to-end-user-journey.md delete mode 100644 .claude/pm/runs/2026-04-16-core-workflow-end-to-end.md delete mode 100644 .claude/pm/runs/2026-04-17-internal-dimagi-admins.md delete mode 100644 .claude/pm/runs/2026-04-19-qa-eval-iteration-loop.md delete mode 100644 .claude/pm/runs/2026-04-20-collection-clone-and-mcp-preflight.md delete mode 100644 .claude/pm/runs/2026-04-20-dead-env-vars.md delete mode 100644 .claude/pm/runs/2026-04-20-env-drift-adoption-blockers.md delete mode 100644 .claude/pm/runs/2026-04-28-turmeric-dogfood-ocs-contracts.md delete mode 100644 .claude/pm/runs/2026-04-29-eval-rubric-polish-operator-cant-fix.md create mode 100644 docs/learnings/2026-04-pm-runs-compacted.md diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json index 97721f68..4bb50c03 100644 --- a/.claude-plugin/marketplace.json +++ b/.claude-plugin/marketplace.json @@ -6,13 +6,13 @@ "url": "https://github.com/jjackson" }, "metadata": { - "version": "0.13.332" + "version": "0.13.333" }, "plugins": [ { "name": "ace", "source": "./", - "version": "0.13.332", + "version": "0.13.333", "description": "AI Connect Engine — orchestrates the CRISPR-Connect lifecycle from idea through app building, Connect setup, LLO management, and closeout" } ] diff --git a/.claude-plugin/plugin.json b/.claude-plugin/plugin.json index 0a209d36..303ea470 100644 --- a/.claude-plugin/plugin.json +++ b/.claude-plugin/plugin.json @@ -1,6 +1,6 @@ { "name": "ace", - "version": "0.13.332", + "version": "0.13.333", "description": "AI Connect Engine — orchestrates the CRISPR-Connect lifecycle from idea through app building, Connect setup, LLO management, and closeout", "author": { "name": "Jonathan Jackson", diff --git a/.claude/pm/runs/2026-04-08-focus-group-framework.md b/.claude/pm/runs/2026-04-08-focus-group-framework.md deleted file mode 100644 index 0b8849dc..00000000 --- 
a/.claude/pm/runs/2026-04-08-focus-group-framework.md +++ /dev/null @@ -1,77 +0,0 @@ -## 2026-04-08 — focus-group-framework (custom lens) - -**Lens used:** `improved skill framework so we can run focus groups, see docs folder for background ideas` (custom — not one of the standard rotation lenses). - -**Background read:** `README.md`, `docs/superpowers/specs/2026-04-01-ace-design.md`, `docs/generated/playbook.md`, `docs/examples/idd-vaccine-hesitancy.md`, `docs/examples/idd-turmeric-market-survey.md`, `docs/examples/idd-stress-test-observations.md`, `templates/idd-template.md`, and the SKILL.md files for `idea-to-idd`, `idd-to-learn-app`, `idd-to-deliver-app`, `app-test`, `connect-opp-setup`, `flw-data-review`, `cycle-grade`. Also confirmed PR #3 added the example IDDs. - -**Core finding:** every existing skill is hard-coded to one delivery archetype — "one FLW visit = one photo + GPS + form." Never named as an assumption; baked into the IDD template, the section list in `idea-to-idd`, the Nova briefs, the verification vocabulary in `connect-opp-setup`, the quantitative queries in `flw-data-review`, and the grading dimensions in `cycle-grade`. A focus-group IDD walks in and silently breaks. The fix is **not** "add 4 new focus-group skills" (the stress-test doc's suggestion) — that forks the framework. The fix is to give skills variation points that branch on a declared archetype, plus a shared evidence-model vocabulary. - -### Do it - -1. **F2 — Stress-test rubric in `idea-to-idd`** — Effort: S — Status: **done, merged into PR #4** - - Branch: `emdash/pm-session-9n4` - - PR: jjackson/ace#4 - - Outcome: 5-question rubric (executability, verifiability, measurability, stage-gate clarity, resource realism) replaces the weak "is it complete enough" self-eval. Includes vaccine-hesitancy and turmeric IDDs as calibrated grading anchors. Stress-test results emitted as IDD appendix. - -2. 
**F1 — Delivery archetype as first-class concept** — Effort: M — Status: **done, in PR #4** - - Branch: `emdash/pm-session-9n4` - - PR: jjackson/ace#4 - - Outcome: New `Archetype:` field in `templates/idd-template.md` (atomic-visit | focus-group | multi-stage). `## Archetypes` section added to all 7 archetype-aware skills with concrete focus-group branches drawn from the stress-test doc and the vaccine-hesitancy IDD. atomic-visit is the default; new archetypes are additive PRs. - -3. **F3 — Evidence Model section + downstream consumption** — Effort: M — Status: **done, in PR #4** - - Branch: `emdash/pm-session-9n4` - - PR: jjackson/ace#4 - - Outcome: New `## Evidence Model` section in IDD template using Layer A (delivery proof) / Layer B (content proof) / Layer C (cross-delivery quality) vocabulary. `connect-opp-setup`, `app-test`, `flw-data-review`, `cycle-grade` now read from this section instead of re-deriving verification. Skills error if Evidence Model is missing. - -4. **F4 — CRISPR-Test-002 focus-group fixture** — Effort: M — Status: **done, in PR #4** - - Branch: `emdash/pm-session-9n4` - - PR: jjackson/ace#4 - - Outcome: Pair fixture to CRISPR-Test-001 (atomic-visit). Simplified vaccine-hesitancy IDD (Stage 1 only, 2 segments, 1 LLO), full Evidence Model, stress-test all-pass. README.md documents the regression spec for each archetype-aware skill against the fixture. Stub Learn + Deliver app summaries. - -### Backlog - -(none from this run — all 4 proposals were dispositioned "Do it" and shipped together) - -### Closed - -(none from this run — no proposals were rejected) - -### Skipped on this run (raised but not formally proposed) - -- **F5 — `skills/README.md` author contract**. Not yet urgent; would document required SKILL.md sections (frontmatter, Process, MCP Tools Used, Mode Behavior, Dry-Run Behavior, Change Log) plus optional Archetypes / LLM-as-Judge Rubric / Evidence Model. 
Natural follow-on once a third archetype or a non-Jon contributor starts touching skills. Hold for a future cycle's `tech-debt` lens. -- **Regenerate `docs/generated/playbook.md`**. `/ace:docs` is a slash command, not a script — it must be run by Claude. Noted in PR description as a post-merge step. Could become a hook (`afterMerge` for skills/ changes). -- **Add `archetype: atomic-visit` to CRISPR-Test-001's IDD**. Currently relies on the default fallback. Adding it explicitly would let the fixture demonstrate the new field. Trivial follow-up; not in this PR to keep the diff focused. - -### Meta-observations - -**What worked well:** -- Doing the docs read in parallel (Bash + Glob + Read in batched calls) was much faster than sequential. I read 7 files in 2 message turns. -- Bootstrapping `context.md` from the existing README + design spec + memory rather than asking 4 questions interactively saved a lot of round-trips. The skill explicitly allows skipping questions answerable from code — that flexibility was the right call here. -- The lens being a custom string (not one of the rotation 5) wasn't a problem. The skill's framing of "exploration lenses" as suggestive rather than prescriptive worked well — I just used the user's exact phrasing. -- The stress-test doc (`docs/examples/idd-stress-test-observations.md`) was load-bearing background. Without it I would have proposed a much weaker version of F2/F3, because that doc had already done the conceptual work — I was mostly formalizing what was already a one-off observation into framework-level structure. -- Using `AskUserQuestion` for per-proposal disposition kept the user in control without ambiguous bulk-chat answers. All 4 dispositions came through cleanly. - -**What was wasteful:** -- I made one duplicate-numbering mistake in the ordered list in `idea-to-idd/SKILL.md` (had two "step 4"s after the initial Edit) and had to do a follow-up renumber. 
Checking step numbering after each big ordered-list edit would have caught it inline. -- Same again in `app-test/SKILL.md` — needed a follow-up renumber after adding step 3. Pattern: any time I insert a step in the middle of an ordered list, I should grep `^[0-9]+\.` immediately after to verify sequential numbering. -- I read the focus-group example IDD twice — once to extract focus-group structure, once when checking the stress-test doc that links to it. Would have been fine to skip the second read. -- I checked `.gitignore` and `.claude/` tracking *after* writing files to `.claude/pm/` rather than before. Result was correct (those files stayed untracked, which was the right choice), but I could have established the rule before writing. - -**Prompt adjustments for next time:** -- For multi-skill framework changes like this one, the right number of proposals is 3–4, not the standard 3. The user dispositioned all 4 as "Do it" because they were tightly interdependent — splitting the 4 across two cycles would have shipped half a feature. The skill's "top 3" guidance is a soft cap, not a rule. -- When the user asks for a "framework" change, the wrong instinct is to add new skills/files. The right instinct is to add variation points to existing skills/files. I need to keep that as a working bias for any future "framework" lens. -- The `## Archetypes` section pattern (default + branches, declared once per skill) is reusable for *any* configuration that varies across IDD types. If a future cycle introduces something like `## Modalities` (online vs in-person) or `## Geographies` (regulatory branching), the same pattern should be considered before forking skills. - -**Confidence on validation:** -- Medium-high. F1, F3, F4 are well-instrumented in the SKILL.md text and the fixture has explicit pass/fail expectations per skill. 
Real validation requires running the skills against the fixture in a Claude session, which I can't do from the implementing session without round-tripping through the user. The test plan in PR #4 makes this explicit. -- Lower on F2 specifically — LLM-as-Judge rubrics are notoriously generous. The few-shot grading anchors (vaccine-hesitancy-as-fail, turmeric-as-near-pass) help, but the proof comes from running it. If the rubric grades the vaccine-hesitancy IDD as "pass" or grades the turmeric IDD as "fail," that's a false positive that needs the rubric tightened. - -### Self-improvement (canopy-skills meta-PRs) - -Three universal-improvement candidates surfaced from this run's meta-observations were proposed as PRs against `jjackson/canopy-skills`: - -1. **U1 — Custom lens support is first-class.** jjackson/canopy-skills#7. Adds a one-paragraph note to Phase 1 clarifying that custom lenses (not in the rotation list) are first-class. Two-line addition; no existing content removed. -2. **U2 — Top-N can exceed 3 when interdependent.** jjackson/canopy-skills#8. Softens Phase 2's "top 3" hard cap to a soft default with an explicit interdependence escape hatch. One-line edit. -3. **U3 — Lesson #9: Framework changes mean variation points, not new components.** jjackson/canopy-skills#9. Appends a 9th lesson encoding the parameterization-over-fork bias for "framework" lenses. - -All three PRs are open for jjackson review. Per the Self-Improvement Protocol, the skill is intentionally gated on human review before merging — no auto-merge. 
diff --git a/.claude/pm/runs/2026-04-15-end-to-end-user-journey.md b/.claude/pm/runs/2026-04-15-end-to-end-user-journey.md deleted file mode 100644 index c6091967..00000000 --- a/.claude/pm/runs/2026-04-15-end-to-end-user-journey.md +++ /dev/null @@ -1,71 +0,0 @@ -## 2026-04-15 — end-to-end-user-journey (custom lens) - -**Lens used:** "someone just tryin to use ace end to end to build out a program and deploy it" (custom — user-supplied phrasing, essentially an adoption-blocker / first-run-journey hybrid). - -**Background read:** `README.md`, `CLAUDE.md`, `agents/ace-orchestrator.md`, `agents/design-review.md`, `skills/idea-to-pdd/SKILL.md`, `commands/run.md`, `commands/step.md`, `commands/status.md`, `commands/setup.md`, `commands/doctor.md`, `commands/ocs-login.md`, `commands/ocs-bootstrap-template.md`, `bin/ace-doctor`, `.env.tpl`, top of `lib/artifact-manifest.ts`. Also confirmed recent restructure via `git log` (6-phase pipeline since 0.2.0, PDD rename in 0.3.0). - -**Core finding:** the install story is polished (`/ace:setup`, `/ace:doctor`, `/ace:update` all ship green checks), but the **first-run story is a silent cliff**. Three distinct gaps a new end-to-end user hits between "doctor green" and "first opp deployed": - -1. **`idea.md` has no bootstrap path.** `idea-to-pdd` reads from `ACE//idea.md`; the artifact manifest marks it `producedBy: 'external'`; the orchestrator's "Starting a New Opportunity" section (lines 119–124 pre-fix) just said "create folder, init state.yaml, begin Phase 1". A fresh `/ace:run my-new-opp` either fails deep in Phase 1 or has the LLM improvise an idea — neither is acceptable. -2. **README Quick Start doesn't match the real first-run.** Only listed `setup` / `doctor` / `run`. Missing the `.env` injection (required for OCS MCP + Gmail), `/ace:ocs-login`, and `/ace:ocs-bootstrap-template` — users succeed through Phase 3 then hit a wall at Phase 4. Architecture counts also stale (6 agents / 21 skills vs. real 8 / 22). -3. 
**`/ace:doctor` only checked install-time state.** No `.env` check, no OCS env var check, no OCS Playwright session check, no Gmail config check. A green doctor could still hand a user a broken runtime. - -### Do it - -1. **P1 — Orchestrator idea capture** — Effort: M — Status: **done, PR #30** - - Branch: `emdash/all-areas-argue-9lk` - - PR: jjackson/ace#30 - - Outcome: `ace-orchestrator.md` "Starting a New Opportunity" now checks for `ACE//idea.md` and prompts via `AskUserQuestion` for inline paste / Drive URL / abort if missing. `idea-to-pdd/SKILL.md` also fails fast with an actionable error when invoked via `/ace:step` without the file, instead of improvising an idea. Two layers of defense against the silent-failure mode. - -2. **P2 — README first-run walkthrough + stale counts** — Effort: S — Status: **done, PR #30** - - Branch: `emdash/all-areas-argue-9lk` - - PR: jjackson/ace#30 - - Outcome: New "First-Run Walkthrough" section with the ordered 8-step checklist (install → setup → GWS key → op inject .env → /ace:ocs-login → /ace:ocs-bootstrap-template → /ace:doctor → /ace:run --dry-run). Architecture section updated to 8 agents (with correct phase agent names) / 22 skills / 6 phases. - -3. **P3 — /ace:doctor runtime readiness** — Effort: S — Status: **done, PR #30** - - Branch: `emdash/all-areas-argue-9lk` - - PR: jjackson/ace#30 - - Outcome: `bin/ace-doctor` gains WARN-level checks for `env_file`, `ocs_env` (all three of OCS_BASE_URL / OCS_TEAM_SLUG / OCS_GOLDEN_TEMPLATE_ID), `gmail_config` (ACE_GMAIL_ACCOUNT), and `ocs_session` (~/.ace/ocs-session-.json with > 30 day freshness warning). Unresolved `op://…` references treated as missing. Each WARN has a concrete `fix:` hint pointing at the right command (`op inject`, `/ace:ocs-login`, `/ace:ocs-bootstrap-template`). Verified live: on my configured machine it reports 1 genuine WARN (gmail_config) where I hadn't populated 1Password. 
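The WARN-level runtime checks described in P3 might look roughly like the sketch below. It is illustrative TypeScript only, not the actual `bin/ace-doctor` code (whose implementation is not shown in this log). `parseEnv`, `isMissing`, `checkOcsEnv`, and the `CheckResult` shape are hypothetical names, and the `fix` wording is an assumption. The sketch covers the two behaviours the log calls out: stripping surrounding quotes from env values, and treating unresolved `op://` references as missing.

```typescript
// Hypothetical sketch of a WARN-level env check; NOT the real bin/ace-doctor.
type CheckResult = { name: string; level: "PASS" | "WARN"; fix?: string };

// Parse KEY=VALUE lines, stripping one pair of matching surrounding quotes
// so that OCS_TEAM_SLUG='connect-ace' yields connect-ace, not 'connect-ace'.
function parseEnv(text: string): Map<string, string> {
  const env = new Map<string, string>();
  for (const raw of text.split("\n")) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq <= 0) continue;
    const key = line.slice(0, eq).trim();
    let value = line.slice(eq + 1).trim();
    const first = value.charAt(0);
    if (value.length >= 2 && (first === "'" || first === '"') && value.endsWith(first)) {
      value = value.slice(1, -1);
    }
    env.set(key, value);
  }
  return env;
}

// A value counts as missing if absent, empty, or still an unresolved op:// reference.
function isMissing(value: string | undefined): boolean {
  return value === undefined || value === "" || value.startsWith("op://");
}

// The three OCS variables named in the run log.
const OCS_VARS = ["OCS_BASE_URL", "OCS_TEAM_SLUG", "OCS_GOLDEN_TEMPLATE_ID"];

function checkOcsEnv(env: Map<string, string>): CheckResult {
  const missing = OCS_VARS.filter((key) => isMissing(env.get(key)));
  if (missing.length === 0) return { name: "ocs_env", level: "PASS" };
  // Assumed hint text; the log only says each WARN carries a concrete `fix:` hint.
  return { name: "ocs_env", level: "WARN", fix: `populate ${missing.join(", ")} in .env` };
}
```

Returning WARN rather than FAIL matches the distinction drawn later in this log: the tool still runs, but a specific feature will fail until the value is populated.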
- -### Backlog - -(none from this run — all 3 proposals were dispositioned "Do it" and shipped together in PR #30) - -### Closed - -(none from this run) - -### Skipped on this run (raised but not formally proposed) - -- **Orchestrator pre-flight doctor call.** Considered proposing a 4th item: before dispatching Phase 1, have the orchestrator invoke `/ace:doctor` internally and bail if any WARN / FAIL is relevant to the phases about to run. With P1 (idea capture) and P3 (runtime WARNs) shipped, this becomes lower-value glue work — the user already gets actionable feedback if they run `/ace:doctor` first, and the P2 walkthrough puts that in their face. Revisit if users still report "I didn't know I needed X" after these changes land. -- **Make `/ace:docs` regenerate the README architecture counts.** Today's stale counts (fixed in P2) will rot again on the next restructure unless the numbers come from `/ace:docs` output rather than hand-edited prose. Potentially a one-line section in the generated playbook that README includes or links to. Hold for a future `tech-debt` lens. -- **Warning-to-step traceability in doctor.** `env_file` WARN has a `fix:` command, but that command assumes the user has 1Password CLI set up. A deeper version would probe for `op` availability and chain fixes. Out of scope for today's tight "fix the cliff" cycle. - -### Meta-observations - -**What worked well:** -- The lens was another custom string ("someone just tryin to use ACE end to end…"), and per the U1 improvement proposed last cycle, I used it directly without translating to the rotation list. Worked cleanly — the finding structure fell out of the lens itself. -- Running `bin/ace-doctor --here` on my own machine mid-implementation caught a real bug in my first pass (I'd forgotten to strip single quotes from env values, so `OCS_TEAM_SLUG='connect-ace'` was treated as `'connect-ace'` with quotes). Smoke-testing the script against a live environment before committing saved a round-trip. 
-- The 3-proposal cap worked here (unlike last cycle where 4 interdependent items wanted to ship together). These three are loosely coupled: P1 and P3 are both about "catch the failure before the user falls off the cliff," P2 is about narrative. They would have been fine to ship separately, but bundling was fine too. -- Pre-commit hook auto-synced `VERSION` to `package.json` / `plugin.json` / `marketplace.json` — the "edit VERSION only" rule from memory held up. -- Reading the artifact manifest (`lib/artifact-manifest.ts`) was load-bearing for P1. The line `producedBy: 'external'` was the smoking gun that confirmed idea.md had no programmatic source. Without that I might have wasted time hunting for an existing bootstrap skill. - -**What was wasteful:** -- I initially tried to `Write` VERSION without reading it first — hit the "must read first" guard. Small friction, but avoidable; the Read→Write rule should be reflex by now. -- Two `Read` calls on the same CHANGELOG (the Write guard again). Edit would have worked without either read since I was only appending at the top. -- Ran `npm install` silently then had to `ls node_modules/.bin/tsx` to confirm. `npm install` was ~30s of silent wait where I could have also been verifying the doctor script in parallel. - -**Prompt adjustments for next time:** -- The "fresh-user journey" lens is generative — it produced three proposals that together cover a cohesive user story (the first-run). Worth adding to the standard rotation or mental toolkit alongside `adoption-blockers`. It's more specific / actionable than `user-value`. -- When reviewing "runtime" (vs. "install") health, distinguish clearly between FAIL (tool won't run at all) and WARN (tool runs but a specific feature will fail). P3 was right to use WARN, not FAIL — the doctor would otherwise scream at a user who only wanted to do a design-review run. 
- -**Confidence on validation:** -- High on P3 — I ran `bin/ace-doctor --here` on my configured machine, saw the expected PASS / WARN / PASS / WARN / PASS output, and one WARN turned out to match reality (`ACE_GMAIL_ACCOUNT` really isn't set in my `.env`). -- Medium on P1 — the orchestrator + skill edits are prompt-level, so real validation requires running `/ace:run test-opp` in a fresh session with an empty Drive folder. The PR test plan flags this explicitly. -- High on P2 — purely documentation; I verified the walkthrough covers every prerequisite mentioned in `.env.tpl` comments and in `ocs-bootstrap-template.md`'s prereq list. - -### Self-improvement (canopy-skills meta-PRs) - -No universal-improvement candidates surfaced this cycle that weren't already proposed last time. The U1 / U2 / U3 PRs from 2026-04-08 (custom-lens support, soft top-N cap, framework-changes-mean-variation-points) are still the relevant unmerged learnings; this run re-validated U1 and U2 in practice without needing a new PR. - -One soft observation: the `product-management` skill's Phase 1 guidance doesn't emphasize **smoke-testing your implementation on your own machine before committing**, which caught a real bug for me on P3. The existing "If validation fails: Fix the issues and re-run validation" language in Phase 4 covers this implicitly but doesn't promote it to a core practice. Noting here for a future consolidation pass rather than a one-off PR. diff --git a/.claude/pm/runs/2026-04-16-core-workflow-end-to-end.md b/.claude/pm/runs/2026-04-16-core-workflow-end-to-end.md deleted file mode 100644 index c4173ba8..00000000 --- a/.claude/pm/runs/2026-04-16-core-workflow-end-to-end.md +++ /dev/null @@ -1,70 +0,0 @@ -## 2026-04-16 — core-workflow-end-to-end (custom lens) - -**Lens used:** "trying to make sure the core workflow works end to end" (custom — user-supplied arg, translated as a core-workflow-across-all-phases lens). 
- -**Background read:** `.claude/pm/context.md`, `.claude/pm/learnings.md`, previous run log (2026-04-15 end-to-end-user-journey). `lib/artifact-manifest.ts`, `commands/{run,step}.md`, all 6 phase agents (`ace-orchestrator`, `design-review`, `commcare-setup`, `connect-setup`, `ocs-setup`, `llo-manager`, `closeout`), `test/fixtures/{CRISPR-Test-001,CRISPR-Test-002}/`, `test/fixtures/artifact-manifest.test.ts`, `CRISPR-Test-002/validation-2026-04-08.md`, a handful of skills (`ocs-chatbot-qa`, `idea-to-pdd`, `app-deploy`, `llo-onboarding`) for dry-run and input-contract details. - -**Core finding:** 0.3.1 polished install and first-run. The remaining end-to-end risk is **across phases and across the full lifecycle**: fixture drift, silent prerequisite failures, and test coverage that stops at Phase 3. - -Three distinct gaps between "install green" and "full pipeline runs": - -1. **Fixtures' `state.yaml` predates the 0.2.0 phase restructure.** Listed 19 flat skills; missing `pdd-to-test-prompts` and any form of `ocs-chatbot-qa`. No phase grouping. Since the 2026-04-08 walk-through explicitly said "This is not an actual /ace:run," nobody had verified the current (post-0.2.0) pipeline against the current fixtures. -2. **`/ace:step` has no prerequisite check.** Violates the 2026-04-15 learning (`skills that read external-human inputs must fail loudly, not improvise`). `/ace:step ocs-chatbot-qa --deep` silently fails when `test-prompts.md` hasn't been produced. -3. **Manifest test only validates up to Phase 3.** `artifact-manifest.test.ts` line: `validateFixture(files, 'connect', ['README.md'])`. Phases 4–6 are uncovered — manifest drift in OCS, operate, or closeout won't trip CI. - -### Do it - -1. 
**P1 — Refresh CRISPR-Test-001 state.yaml + validation-2026-04-16.md** — Effort: M — Status: **done, shipped 0.3.2 (commit 29d7a45)** - - Branch: `emdash/new-pm-8uq` - - Outcome: `state.yaml` rewritten to a phases → skills nested map covering all 22 skills (including the three `ocs-chatbot-qa` modes). Gate list updated to the five actual review-mode gates. Fresh desk-trace walk-through at `test/fixtures/validation-2026-04-16.md` supersedes the 2026-04-08 doc; documents input/output flow through every phase, calls out remaining gaps for P2/P3. - -2. **P2 — Prerequisite check in /ace:step via artifact manifest** — Effort: S-M — Status: **done, shipped 0.3.2** - - Outcome: `commands/step.md` now specifies a manifest-driven check. Before dispatching, `artifactsConsumedBy()` is enumerated; any missing `required: true` artifact (skipping `producedBy: external` and dated/recurring paths) fails loudly with an error that names each missing file and its producer skill. Closes the silent-failure bypass path. - -3. **P3 (redirected) — CRISPR-Test-003-Turmeric complete E2E fixture + extended manifest test** — Effort: M+ — Status: **done, shipped 0.3.2** - - Redirect from the user: instead of narrowing/widening the manifest test, build a **new** complete E2E fixture seeded from `docs/examples/pdd-turmeric-market-survey.md` and use it to test. - - Outcome: new `test/fixtures/CRISPR-Test-003-Turmeric/` with every required artifact stubbed — idea/PDD/test-prompts through closeout/cycle-grade.md (27 artifact files + README + state). `artifact-manifest.test.ts` extended with two new assertions: zero unexpected files, zero missing required artifacts at `upToPhase: 'closeout'`. Both pass. 11 tests total in the manifest suite, 76 passing across the full `npm test` run. 
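The manifest-driven prerequisite check from P2 can be sketched as follows. This is a minimal TypeScript illustration under stated assumptions: the `Artifact` shape, the `recurring` flag, the `checkPrerequisites` helper, and this stand-in signature for `artifactsConsumedBy()` are all hypothetical, and the real `lib/artifact-manifest.ts` may differ. The sketch shows the rule as described: enumerate what the skill consumes, skip `producedBy: external` and dated/recurring artifacts, and fail loudly naming each missing file and its producer skill.

```typescript
// Hypothetical manifest entry; the real lib/artifact-manifest.ts may differ.
interface Artifact {
  path: string;
  producedBy: string; // a skill name, or "external" for human-supplied inputs
  consumedBy: string[];
  required: boolean;
  recurring?: boolean; // assumed marker for dated/recurring paths
}

// Stand-in for the manifest helper named in the log; signature is an assumption.
function artifactsConsumedBy(manifest: Artifact[], skill: string): Artifact[] {
  return manifest.filter((a) => a.consumedBy.includes(skill));
}

// Throws an error listing every missing required artifact and its producer.
function checkPrerequisites(manifest: Artifact[], skill: string, existing: Set<string>): void {
  const missing = artifactsConsumedBy(manifest, skill).filter(
    (a) => a.required && a.producedBy !== "external" && !a.recurring && !existing.has(a.path),
  );
  if (missing.length > 0) {
    const lines = missing.map((a) => `  - ${a.path} (produced by ${a.producedBy})`);
    throw new Error(`Cannot run ${skill}; missing required artifacts:\n${lines.join("\n")}`);
  }
}
```

Under these assumptions, invoking the check for `ocs-chatbot-qa` before `test-prompts.md` exists would raise an error naming `pdd-to-test-prompts` as the producer, rather than letting the step fail silently.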
- -### Backlog - -(none from this run — all 3 proposals dispositioned "Do it" and shipped as 0.3.2) - -### Closed - -(none from this run) - -### Skipped on this run (raised but not formally proposed) - -- **Actually run `/ace:run CRISPR-Test-003-Turmeric --dry-run` in a separate session.** The fixture + refreshed state.yaml + validation doc make this much easier to do, but an actual live run against the MCPs is a separate qualification effort. Noted as follow-up — worth its own cycle if the fixture shape surfaces unexpected friction when dispatched through a real orchestrator. -- **Orchestrator reading artifact-manifest for its own prereq checks.** P2 wires this into `/ace:step`, but the orchestrator itself could benefit from the same manifest lookup on each phase transition (defense in depth). Out of scope for "one cohesive cycle" and the orchestrator already runs skills in dependency order. Revisit if manifest drift shows up in the live path despite P3's CI coverage. -- **Documenting the 3-fixture contract.** `CRISPR-Test-001` = partial input fixture for `ocs-agent-setup`; `CRISPR-Test-002` = focus-group/archetype-stress fixture (Phase 1–3); `CRISPR-Test-003-Turmeric` = complete E2E. A README in `test/fixtures/` would disambiguate. Hold for a tech-debt lens. - -### Meta-observations - -**What worked well:** -- Following the 2026-04-15 U1 pattern — treating the user's custom string ("trying to make sure the core workflow works end to end") directly as a lens — worked cleanly again. The three proposals all fell out of the lens naturally without my having to snap it onto the canonical rotation list. -- Reading the artifact manifest (`lib/artifact-manifest.ts`) before proposing was load-bearing. The list of required artifacts per phase told me exactly what needed to exist in the E2E fixture; `artifactsConsumedBy()` was the missing piece that /ace:step wanted. -- The redirect on P3 produced a strictly better outcome. My original proposal was "extend the test." 
The user's redirect was "build a new fixture seeded from turmeric PDD and use that to test." The new fixture is also a tech-debt and onboarding asset, not just a CI check — so the same work buys more value. Worth remembering: when the user redirects, the reshaped version often covers both the original scope AND an orthogonal benefit. -- Validating fixture coverage by running the test (`npm test -- test/fixtures/artifact-manifest.test.ts`) as I went, not at the end, caught one iteration of "did I actually hit every required artifact?" without a silent gap slipping through. - -**What was wasteful:** -- The first `npm test` invocation hit `vitest: command not found` because node_modules was stale in the worktree. Running `npm install` upfront during Phase 1 would have avoided the mid-P3 friction. Next cycle: when I know I'm going to be running tests, prime the env before scouting depths. -- I created the fixture directory tree with one `mkdir -p` and wrote files in two batches of 5–6 at a time. The batching was fine for prompt-output length but required more sequential rounds than necessary. A single parallel-write-all batch would have been faster. -- The proposal table in the scouting output was reasonably dense but included the full "What / Why / Validate" text inside each `AskUserQuestion` — which worked, but the questions were long. Could have been tighter with links to a per-proposal section in the run log. - -**Prompt adjustments for next time:** -- When a fixture is load-bearing for a test, check the test assertions BEFORE writing the fixture, not after. I did this in the right order (read `expectedMissing` list first), but it's worth making a reflex rule. -- When the lens is "does X work end to end," the move is *always* to write/refresh a dry-run trace document. That's the only artifact that lets a human see the full flow without running the plugin. Make that the default first deliverable for this class of lens. 
- -**Confidence on validation:** -- **High on P2 (command-level prereq check).** Purely specification-level; the contract is testable once an implementation exists. The learnings.md preference already exists to guide anyone reading it. -- **High on P3 (fixture + test extension).** Tests pass with zero missing / zero unexpected. Any manifest drift now fails loudly. -- **Medium on P1 (state.yaml schema refresh).** The schema is structurally consistent with orchestrator/agent specs, but the orchestrator doesn't have code that *enforces* the nested shape — a human reads state.yaml in review mode. If the live orchestrator flattens or re-interprets the schema silently, the refresh may need another pass. - -### Self-improvement (canopy-skills meta-PRs) - -No universal-improvement candidates that warrant a fresh PR this cycle. The two standing observations from 2026-04-08 and 2026-04-15 still apply: -- **Custom lenses from user args work well** (U1 — already a pending PR from last cycle). Re-validated again today. -- **Phase 4-ish "smoke-test your work before committing"** is the closest thing to a new universal observation from this cycle — I hit a fresh instance of it with the stale `vitest` binary. Not strong enough to justify its own PR; the existing Phase 4 "Fix the issues and re-run validation" already covers this implicitly. - -One soft observation specific to this class of "across-all-phases" lens: the PM skill's scout-phase guidance says "Run the test suite — what passes, fails, is missing?" but doesn't call out that **fixture drift against a specification/manifest is a distinct failure mode from tests failing**. Today's scout found three gaps that all showed up as "tests pass" — a working test suite with stale assumptions is a harder failure to see than a red test. Not worth a meta-PR on its own, but noting it for a future consolidation pass. 
diff --git a/.claude/pm/runs/2026-04-17-internal-dimagi-admins.md b/.claude/pm/runs/2026-04-17-internal-dimagi-admins.md deleted file mode 100644 index a5559158..00000000 --- a/.claude/pm/runs/2026-04-17-internal-dimagi-admins.md +++ /dev/null @@ -1,67 +0,0 @@ -## 2026-04-17 — internal-dimagi-admins (custom lens) - -**Lens used:** "internal dimagi users who are going to be creating full opps via ace" — custom user-supplied arg, treated as an admin-group-coordination lens (how the 5-person CRISPR admin group actually uses ACE day-to-day when juggling multiple opps). - -**Background read:** `.claude/pm/context.md`, `.claude/pm/learnings.md`, previous run log (2026-04-16 core-workflow-end-to-end). `commands/{run,step,status,doctor}.md`, `agents/{ace-orchestrator,design-review,connect-setup,llo-manager}.md`, `skills/{idea-to-pdd,ocs-chatbot-qa,app-deploy,llo-invite,llo-launch,timeline-monitor}/SKILL.md`, `test/fixtures/CRISPR-Test-00{1,3}-Turmeric/state.yaml`, `lib/artifact-manifest.ts`, `test/fixtures/artifact-manifest.test.ts`, `skills/README.md`, `README.md`. Mid-cycle: read the **CRISPR-Connect Vision and Plan** Google Doc (the user pointed to it after Phase 3 dispositions). - -**Core finding:** 0.3.2 closed the "does the pipeline work end to end" gap. The remaining end-to-end risk is no longer mechanical — it's **legibility for the admin group**: who owns what, which opps need action, and what to actually check at each gate. Three state-schema + command spec edits addressed: - -1. **`/ace:status` surfaces "which opps need me right now."** Pre-0.3.3 the list view was a flat `Phase | Step | Mode | Updated` — admins had to infer "needs action" by reading state.yaml per opp. Now each row carries a derived status tag (`ACTION NEEDED` / `RUNNING` / `IDLE` / `ERROR` / `DONE`) and a `Blocked on` column (`gate: ` / `error: ` / `input: `); rows sort `ACTION NEEDED` first. -2. 
**No "who's driving this" field in state.yaml.** With 5 admins and N opps, hand-offs (Neal → Matt on a Tuesday gate) had no attribution trail without Slack. Added `initiated_by` (one-time, at creation) and `last_actor` / `last_actor_at` (updated on every skill invocation, both `/ace:run` and `/ace:step`). `/ace:status` renders "last touched by X, N days ago"; `--mine` filters to the current operator's git-config email. -3. **Gate approvals were context-thin.** Pre-0.3.3 the orchestrator paused with a bare `AskUserQuestion`. The 2026-04-08 stress-test PDDs (both failed rubric) would have rubber-stamped through. Defined a uniform gate-brief contract: skill writes `gate-briefs/.md` with a fixed structure (artifact path, 3–5 imperative checklist items, auto-surfaced concerns tagged `[BLOCKER]` / `[WARN]` / `[INFO]`, recommended disposition). Orchestrator reads + displays verbatim before the `AskUserQuestion`. Missing brief = fail loudly. - -### Do it - -1. **P1 — Status tags + sort + `--mine` in `/ace:status`** — Effort: S-M — Status: **done, shipped 0.3.3** - - Outcome: `commands/status.md` rewritten. Rule table (gate pending → ACTION NEEDED, step=error → ERROR, recurring-only → IDLE, cycle-grade=done → DONE) captured precisely. Default view drops `Mode` column (kept in detail view). Footer shows counts and --all hint. - -2. **P2 — Add `initiated_by` / `last_actor` / `last_actor_at` to state.yaml** — Effort: S — Status: **done, shipped 0.3.3** - - Outcome: new `## State Schema` and `## Touching State — Operator Capture` sections in `agents/ace-orchestrator.md`. `/ace:step` spec adds step 4 "Update operator identity" before dispatch. Source: `git config user.email`; fallback `unknown`. Identity is *captured, not enforced* — a git config mismatch just means `--mine` won't find the opp; no authorization check. Fixture state.yaml files updated for both CRISPR-Test-001 and CRISPR-Test-003-Turmeric. 76 existing tests still pass. - -3. 
**P3 — Gate-brief contract + 5 skill emits + manifest entries** — Effort: M — Status: **done, shipped 0.3.3** - - Outcome: `§ Gate Brief Contract` in `agents/ace-orchestrator.md` defines the required markdown shape (4 sections: Artifact Under Review, What to Check, Auto-Surfaced Concerns, Recommended Disposition). Each of the 5 gate-owning skills (`idea-to-pdd`, `app-deploy`, `ocs-chatbot-qa` `--deep` only, `llo-invite`, `llo-launch`) gained a `## Gate Brief` section naming the specific checklist items and concern signals for that gate. `lib/artifact-manifest.ts` gained 5 `gate-briefs/.md` entries (required, consumed by `ace-orchestrator`), one per phase where the producing skill lives. `CRISPR-Test-003-Turmeric` ships 5 stub gate briefs; `CRISPR-Test-001`'s `expectedMissing` list updated for 3 new design/commcare/connect gate briefs. Auto-mode contract: skills still write briefs, orchestrator doesn't pause, but a `[BLOCKER]` in an auto brief escalates to the admin group — admins opted into speed, not known-broken sends. - -### Backlog - -(none from this run — all 3 proposals dispositioned "Do it" and shipped as 0.3.3) - -### Closed - -(none from this run) - -### Skipped on this run (raised but not formally proposed) - -- **`/ace:abort `** — clean cancellation for experimental opps. Real need (admins will experiment with junk names, Drive folders will accumulate), but narrow — can `rm` the Drive folder manually today. Hold for a tech-debt lens. -- **Admin "Day 2" runbook** — a doc section explaining what admins approve at each gate, how hand-offs work, how to interpret `/ace:status`. Valuable but one-shot; the gate-brief checklists (P3) cover most of the per-gate question, and `/ace:status` UX (P1) covers most of the daily-triage question. Revisit if multiple admins hit the same onboarding question. 
-- **MSA / Work Order contracting flow** — the vision doc describes an MSA + WO model (LLOs know it's AI, MSA caps budget, each WO has accept + do-by deadlines, ace-mailing-list cc'd on every conversation). `llo-invite` today just produces a prepared list; there's no WO issuance, no deadline-tracker, no "LLOs know they're talking to AI" framing in the onboarding email. Substantial Phase 3/5 scope. A cycle of its own. -- **`/ace:status` recurring-skill signals** — surfacing "timeline-monitor hasn't run in 2 weeks" or "ocs-chatbot-qa --monitor score dropped 2 points" would be a richer IDLE row. Needs a trigger mechanism for recurring skills first (no scheduler today). Larger cycle. -- **Ownership enforcement** — an explicit "this opp is assigned to X" field with read/write boundaries. Deliberately *not* done. Identity-is-captured-not-enforced keeps the admin group frictionless. - -### Meta-observations - -**What worked well:** -- **Vision-doc read in Phase 3, not Phase 1.** The user dropped the CRISPR-Connect Vision and Plan doc *after* dispositions. Reading it after proposals were locked but before implementation was the right order: it reinforced P3 ("micromanage ACE" framing, "ACE going wrong" budget hedge, "easy for Dimagi humans to follow along") without derailing scope. If I'd read it in Phase 1, I would have been tempted to propose MSA/WO work that isn't ready for a cycle. -- **Identity-captured-not-enforced was the right call.** Temptation was to add "assigned_to" with per-opp ownership gates. Held the line — capture `last_actor`, let `--mine` be a filter, don't build a permission system for a 5-person team. -- **Gate brief as a separate file, not inline in the artifact.** Briefly considered inlining a "## Gate Brief" section inside `pdd.md` / `deployment-summary.md`. Separate files win: keeps artifacts clean for downstream skills that consume them, and each skill doesn't need to coordinate section-anchor conventions with its peers. 
Also makes the manifest-driven check trivial. -- **Reading all 5 gate-owning skills before writing any `## Gate Brief` section.** Each skill produces different-shaped output (PDD stress test vs. deployment summary vs. QA scorecard vs. invite list vs. launch record). Unifying the brief *shape* while letting each skill's *checklist* be domain-specific means admins get a consistent UX with context-specific content. Would have been worse to try to templatize the checklist itself. - -**What was wasteful:** -- **Forgot to run `npm install` before `npm test`, again.** Same friction as the 2026-04-16 cycle (vitest missing). The 2026-04-16 run noted this as a soft observation ("prime the env before scouting depths"); today I still hit it. Promoting to a concrete rule for my next run: **if the task is an implementation task on a Node project, run `npm install` as part of Phase 1 (scout) preflight, not when tests fail**. Logged as a consolidation candidate under Self-improvement at the bottom. -- **Initial draft of the gate-brief shape had 6 sections; trimmed to 4.** First pass had separate "Severity summary" and "Auto-Surfaced Concerns" sections and a "Related artifacts" bullet list. Collapsed the severity summary into the Concerns block (which already carries severity via tags) and removed the related-artifacts section (Artifact Under Review already has the primary path; admins who want more can open the Drive folder). Rule of thumb: if the brief is ~15 lines or fewer per gate, admins read it; 30 lines becomes another artifact to skim. - -**Prompt adjustments for next time:** -- When adding required artifacts to `lib/artifact-manifest.ts`, **immediately** check the test file's `expectedMissing` hardcoded list for partial fixtures — new `required: true` entries must be added there too, or CRISPR-Test-001 goes red. Today I caught this by running the test; next time I should edit the test in the same commit as the manifest edit rather than discovering the gap. 
-- The vision-doc mid-cycle read showed that sometimes the user has more context to offer than what's in `context.md` — and that context is in a Google Doc, not the repo. Worth proactively checking (or asking) whether there's a strategy doc in Drive I should read at the start of a lens that touches roadmap, not just code shape. - -**Confidence on validation:** -- **High on P1 (status rendering).** Purely a command-spec rewrite; rule table is deterministic and maps cleanly onto state.yaml fields that already exist plus the 3 new ones. A manual /ace:status trial against CRISPR-Test-001 / CRISPR-Test-003-Turmeric will confirm — left for a dry-run cycle. -- **High on P2 (ownership fields).** State schema extension is additive; existing fixtures updated; 76/76 tests pass. The only runtime risk is "skill / orchestrator forgets to update `last_actor`" — but that's a lint-level concern mitigated by the explicit contract in `§ Touching State`. -- **High on P3 (gate-brief contract).** Spec-level; each skill's `## Gate Brief` section names exact signal sources the skill already produces (stress-test grades for idea-to-pdd, pass/warn/fail counts for ocs-chatbot-qa, etc.). The 5 synthetic gate-brief stubs in CRISPR-Test-003-Turmeric show the shape concretely. Manifest test passes with zero missing / zero unexpected. - -### Self-improvement (canopy-skills meta-PRs) - -One observation worth consolidating: - -**"Run `npm install` (or the language equivalent) as preflight in Phase 1 for Node / JS projects, not on first test failure."** This is the second cycle in a row where I hit `vitest: command not found` mid-implementation. The current PM skill's Phase 1 guidance says "Run the test suite — what passes, fails, is missing?" — it should add "and if the project has a lockfile, ensure deps are installed before scouting so the same `npm test` invocation works in Phase 4 (Implement) and Phase 5 (Validate)." Not a cycle-blocker but a consistent 1-2 minute tax. 
**Candidate for a future consolidation PR** (didn't merit a fresh PR today, but if it happens a third time, it promotes to a universal-PR candidate). - -Beyond that: no fresh universal candidates. The "custom lenses from user args" pattern (three cycles in a row now) and the "build-a-fixture-before-writing-assertions" pattern (2026-04-16) are still standing observations; both still apply. diff --git a/.claude/pm/runs/2026-04-19-qa-eval-iteration-loop.md b/.claude/pm/runs/2026-04-19-qa-eval-iteration-loop.md deleted file mode 100644 index 30f25544..00000000 --- a/.claude/pm/runs/2026-04-19-qa-eval-iteration-loop.md +++ /dev/null @@ -1,109 +0,0 @@ -## 2026-04-19 — qa-eval-iteration-loop (custom lens) - -**Lens used:** "iterate on cosmetics-fgd-pilot end-to-end, fix gaps as they surface, minimal check-ins with clear choices." Custom session-scoped lens motivated by Neal's lead-exposure portfolio push — specifically the Cosmetics FGD Guide as the first real-content focus-group opp. Scoped away from Nova / CommCare app creation (another team owns that); focus on ACE's own skill chain and the new qa/eval + opp-eval infrastructure. - -**Background read:** `CLAUDE.md`, `skills/README.md`, `agents/ace-orchestrator.md`, `agents/{ocs-setup,ocs-tester,llo-manager,design-review}.md`, `skills/{idea-to-pdd,pdd-to-test-prompts,pdd-to-learn-app,pdd-to-deliver-app,ocs-chatbot-qa,ocs-agent-setup,connect-opp-setup,llo-invite,cycle-grade}/SKILL.md`, `lib/artifact-manifest.ts`, `commands/{run,step,status}.md`, `templates/pdd-template.md`, `test/fixtures/artifact-manifest.test.ts`, and prior PM runs `2026-04-08` (focus-group framework), `2026-04-15`, `2026-04-16`, `2026-04-17`. Mid-cycle: Neal's "Going big on lead exposure with Connect" Google Doc (cosmetics + geophagy FGD guides, portfolio framing for the 6 lead-exposure programs). 
- -**Core finding:** ACE's infrastructure for archetype-varying opps and umbrella evaluation was mostly designed but **never end-to-end validated against real content**. This cycle exercised the chain against Neal's cosmetics FGD guide and found the gaps are at the seams: contract drift between skills, silent failure modes in external integrations (OCS), and bypass paths the spec didn't defend. Six shipped PRs worth of surgical fixes now make the qa/eval + opp-eval + archetype-branching story fully coherent. Zero net-new capabilities (fgd-synthesis intentionally deferred per user); all work was existing-surface hardening driven by real diagnostics. - -### Do it - -1. **qa/eval split refactor — `ocs-chatbot-qa` → qa (capture) + `ocs-chatbot-eval` (judge)** — Effort: M — Status: **done, shipped 0.3.5** - - PR: jjackson/ace#31 - - Outcome: Split `ocs-chatbot-qa` into two skills per the two-phase pattern. qa captures a transcript + runs structural checks (response received, citations present, no errors); eval reads the transcript and runs the LLM-as-Judge rubric. New `skills/README.md § QA vs Eval — the two-phase pattern` codifies the contract: `qa-captures/` for evidence, `verdicts/-.yaml` for machine verdicts, `eval-reports/` for human reports. Uniform verdict YAML shape so the future umbrella aggregator can consume any skill's verdict. 23 files touched; gate brief renamed `ocs-chatbot-qa-deep.md` → `ocs-chatbot-eval-deep.md` (gate is on judgment, not capture). State-key split: `ocs-chatbot-qa-{quick,deep,monitor}` + `ocs-chatbot-eval-{quick,deep,monitor}`. - -2. **`ace:opp-eval` umbrella aggregator skill** — Effort: M-L — Status: **done, shipped 0.4.0** - - PR: jjackson/ace#32 (dispatched to subagent; 490-second solo run, 14-step Process, renormalized weights, fault-tolerant YAML parsing) - - Outcome: New `opp-eval` skill + `/ace:eval` command. Three modes (`--quick` structural, `--deep` aggregation + recommendations, `--monitor` deep + trend). 
Reads every `verdicts/*.yaml` in the opp folder, groups by 6 skill-category dimensions (design/commcare/connect/ocs/operate/closeout), computes weighted overall with weights renormalized across non-null categories (so a partial opp isn't penalized for being early). Emits per-skill recommendations and a uniform-contract advisory gate brief (does not gate a phase). Archetype-agnostic by design — per-skill evals already applied archetype-specific rubrics. 7 new manifest entries. Answers the user's original ask for "one overview judge/review agent that we can apply to overall runs." - -3. **Iter 1: `pdd-to-test-prompts` archetype branching** — Effort: S — Status: **done, shipped 0.4.1** - - PR: jjackson/ace#33 - - Outcome: Added `## Archetypes` section with per-archetype category lists. `focus-group` gets session-flow / recruitment-and-venue / consent-and-recording / question-guide-sequencing / facilitation-technique / output-spec / audio-and-evidence; `atomic-visit` retains visit-flow / eligibility / GPS / duplicate-handling; `multi-stage` mixes per-stage + adds stage-gate-transition. Surfaced during Iter 0 (cosmetics FGD Phase 1 recon) — subagent running the skill had to manually remap every atomic-visit-worded category. A weaker LLM would miss it and produce atomic-visit prompts against an FGD PDD, cascading into `ocs-chatbot-eval --deep` false-positive failures. Archetype-aware skill count 7 → 8. - -4. **Iter 3: `llo-invite` archetype branching** — Effort: S — Status: **done, shipped 0.4.2** - - PR: jjackson/ace#34 - - Outcome: `focus-group` selection criteria emphasize qualitative research experience (or training willingness), language/cultural fit for sensitive topics, audio-recording capability, facilitator time budgeting, and a **small-N bias** (1–2 LLOs, not 3–5). Gate brief gains FGD-specific WARN: count > 2 without justification, or rationale silent on facilitation capability. Archetype-aware skill count 8 → 9. 
Field-level enforcement (gate brief WARNs) ensures the shift lands even under weaker dispatches. - -5. **Iter 7: Contract cleanup + orchestrator hardening** — Effort: M — Status: **done, shipped 0.4.3** - - PR: jjackson/ace#35 - - Outcome: Six contract fixes + one orchestrator hardening. (1) `per_item:` canonical for per-item verdict list; `per_prompt:` in ocs-chatbot-eval renamed, with `prompt:` kept as domain-specific subkey inside each entry. (2) `auto_surfaced:` promoted to optional top-level verdict field so opp-eval can aggregate across skills. (3) `ACE/golden-template/` documented as canonical no-opp fallback path root (both qa and eval). (4) `ocs_send_test_message` MCP tool flagged as structurally incomplete — returns only `response`, missing `cited_files`/`tags`/`session_id`/`elapsed_ms` — raw widget HTTP is load-bearing. (5) OCS env vars pinned to `$CLAUDE_PLUGIN_DATA/.env`. (6) opp-eval quick-mode template adds `Unexpected:` row, tightens Notes examples, specifies stdout format. Orchestrator: state.yaml schema example upgraded from abstract to concrete (all 6 phases, qa/eval split step keys, `ocs-chatbot-eval-deep` gate); new `Defensive state.yaml init on bypass paths` section; `/ace:step` step 4 now ensures state.yaml before updating last_actor. The last one closes the bug I hit myself in cosmetics-fgd-pilot setup (direct `ace:design-review` Agent-tool dispatch bypassed `/ace:run` and the opp never got a state file). - -6. **Iter 6: Golden template fix + bootstrap defense** — Effort: M — Status: **done, shipped 0.4.4** - - PR: jjackson/ace#36 (dispatched to subagent after rate-limit retry; live OCS state change + code fix) - - Outcome: Diagnosed + fixed a silent-publish-block bug on the deployed golden template (experiment 11792). Root cause: `OCS_SHARED_COLLECTION_ID=718` pointed at a collection that didn't exist on team `connect-ace`. 
`ocs_attach_knowledge` silently succeeded at the pipeline-patch layer but then blocked every `publishChatbotVersion` call with the opaque UI message "Unable to create a new version when the pipeline has errors." v1 (empty post-clone state) stayed as the default version; embedded widget served vanilla LLM. Bot suggested DoorDash/Route4Me for "flagged deliveries." Live fix: restored canonical system prompt (PDD not IDD, `ace@dimagi-ai.com`, emoji-discouraged guidance), removed phantom collection 718, republished to v2. Code fix: `scripts/bootstrap-ocs-golden-template.ts` now pre-flight validates the collection exists on the team (via a new `listCollectionIndexIds` helper that scrapes the edit page — OCS has no REST endpoint) and skips gracefully with a loud actionable warning if missing. **Template score: 3.84/10 FAIL → 8.2/10 PASS.** - -### Backlog - -Prioritized; most items are direct follow-ups from Iter 6's root cause. - -**P1 — OCS robustness (next cycle):** -- **Add `ocs_list_collections` MCP tool.** bootstrap-ocs-golden-template.ts had to scrape the chatbot edit page because OCS exposes no REST endpoint for collections. Small wrapper; unblocks future defensive checks in other scripts / skills. -- **`publishChatbotVersion` pre-flight validation in `mcp/ocs/backends/playwright.ts`.** Post the current graph through `/pipelines/data/` first and surface any `errors.node` entries as a `PipelineValidationError` before attempting version creation. The silent-publish-block that bricked the golden template for weeks was hidden by exactly this gap. -- **`ocs-agent-setup` SKILL pre-flight check on `OCS_SHARED_COLLECTION_ID`.** Every per-opp bot the skill clones hits the same silent-block risk if the env var is stale — same class of bug, new blast radius. 
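The publish pre-flight proposed in P1 could look roughly like the sketch below. The response shape (`errors.node` keyed by node id) is an assumption inferred from the wording of the backlog item, not a documented OCS payload, and `assertPipelineValid` is a hypothetical helper name. Only the validation step is shown; the HTTP call to `/pipelines/data/` is left out.

```typescript
// Assumed shape of the validation response; the real payload may differ.
interface PipelineValidationResponse {
  errors?: { node?: Record<string, unknown> };
}

class PipelineValidationError extends Error {
  nodeErrors: Record<string, unknown>;

  constructor(nodeErrors: Record<string, unknown>) {
    super(`Pipeline has errors on node(s): ${Object.keys(nodeErrors).join(", ")}`);
    this.name = "PipelineValidationError";
    this.nodeErrors = nodeErrors;
    // Keep the prototype chain intact when compiled to older JS targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Throw before attempting version creation if any node reports an error.
function assertPipelineValid(response: PipelineValidationResponse): void {
  const nodeErrors = response.errors?.node ?? {};
  if (Object.keys(nodeErrors).length > 0) {
    throw new PipelineValidationError(nodeErrors);
  }
}

try {
  // Hypothetical node id and error text, for illustration only.
  assertPipelineValid({ errors: { node: { "llm-1": "collection not found on team" } } });
} catch (err) {
  console.log((err as Error).message);
}
```

Failing here turns the opaque "Unable to create a new version when the pipeline has errors" state into an error that names the offending node before any publish attempt is made.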
- -**P2 — Archetype coverage audit (next cycle):** -- Audit `connect-program-setup`, `training-materials`, `llo-onboarding`, `llo-uat`, `llo-launch`, `llo-feedback`, `app-test`, `flw-data-review` for silent atomic-visit defaults. Iter 3 (connect-opp-setup) was already solid; Iter 1 + Iter 3 took archetype-aware count 7 → 9. The remaining gap is 7 more skills that may or may not need branching. - -**P3 — Rubric proliferation (following cycles):** -- Add `## LLM-as-Judge Rubric` sections to skills that lack them. opp-eval emits `[INFO] skill X lacks a rubric` for every skill without one — the forcing function the 0.4.0 work surfaced. Highest-signal first: `app-test`, `flw-data-review`, `cycle-grade` (already has dimensions but not the rubric format). - -**P4 — Dogfood: real Phase 4 on cosmetics-fgd-pilot:** -- Now that the golden template is fixed, run `ocs-setup` end-to-end on cosmetics-fgd-pilot. Clone, configure, qa/eval. First real-opp exercise of the full 0.3.5+0.4.x stack. - -**P5 — External team-infrastructure (not ACE code):** -- Create a Connect shared knowledge collection on team `connect-ace`, record its id as `OCS_SHARED_COLLECTION_ID`, also set `OCS_LLM_PROVIDER_ID` + `OCS_EMBEDDING_MODEL_ID`. Until this happens, every bot clone inherits zero citations. Documented in ocs-agent-setup + ocs-chatbot-qa but can't be enforced until the collection exists. - -**P6 — Net-new capability (a cycle of its own):** -- **`fgd-synthesis` skill** — the "shareable-with-LEEP" narrative report Neal explicitly wants for FGD opps. Composes across N session transcripts + notes + audio: themes, representative quotes, decision-driver map, receptivity read. Biggest net-new gap and the actual deliverable of an FGD program. Deferred this cycle per user request ("improve core ACE first"). Priority should flip as soon as core is stable — without synthesis, FGD opps have no publishable output. 
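For reference while working the rubric backlog, the weight renormalization that `opp-eval` applies across non-null categories (Do it, item 2) can be sketched as follows. The category names come from this log; the function name and the weight values are illustrative assumptions, not the shipped configuration.

```typescript
type Category = "design" | "commcare" | "connect" | "ocs" | "operate" | "closeout";

// Weighted mean over categories that actually have a score. Null categories
// drop out of both numerator and denominator, so a partial opp is not
// penalized for phases it has not reached yet.
function weightedOverall(
  scores: Record<Category, number | null>,
  weights: Record<Category, number>,
): number | null {
  let weightSum = 0;
  let total = 0;
  for (const category of Object.keys(weights) as Category[]) {
    const score = scores[category];
    if (score === null) continue;
    weightSum += weights[category];
    total += weights[category] * score;
  }
  return weightSum === 0 ? null : total / weightSum;
}

// Illustrative equal weights; the real weights are not stated in this log.
const equalWeights: Record<Category, number> = {
  design: 1, commcare: 1, connect: 1, ocs: 1, operate: 1, closeout: 1,
};

const earlyOpp: Record<Category, number | null> = {
  design: 8, commcare: null, connect: null, ocs: 6, operate: null, closeout: null,
};

console.log(weightedOverall(earlyOpp, equalWeights)); // 7
```

Without renormalization the same opp would score (8 + 6) / 6, which reads as a failure even though both evaluated categories passed.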
- -### Closed - -(none from this run) - -### Skipped on this run (raised but not formally proposed) - -- **Iter 2: Nova brief quality** — checked `pdd-to-learn-app` + `pdd-to-deliver-app` for FGD archetype branches; both already solid (facilitation craft, session-documentation form). Per user's explicit guidance ("don't focus on Nova / CommCare app creation, another team owns that"), validated-and-moved-on rather than iterating. -- **state.yaml init for cosmetics-fgd-pilot specifically** — orchestrator hardening (Iter 7) now makes `/ace:step` robust to missing state.yaml, but the cosmetics-fgd-pilot Drive folder itself still lacks a state.yaml because I bypassed `/ace:run` at setup. Easy to init manually or by running `/ace:step idea-to-pdd cosmetics-fgd-pilot` once (which will now initialize defensively). Not worth a formal proposal — one-shot operational fix. -- **`generate_citations: false` on golden template** — the Iter 6 subagent considered setting this since there's currently nothing to cite on team `connect-ace`, but rejected it: per-opp bots will attach an opp-specific collection and want citations on by default. Same file, leave as-is. -- **Context refresh (`.claude/pm/context.md`)** — the "Tech Stack" line still said "5 agents, 19 skills, 4 commands" (now 8/24/10). Fixed the counts; larger refresh of the "Current State" paragraph (which references old PR #3 / stress-test observations) deferred — it's a context-hygiene cycle of its own, not a code-lens concern. - -### Meta-observations - -**What worked well:** - -- **Dispatching subagents for mechanical sub-work while keeping judgment in the main thread** paid off twice. (1) The 0.4.0 opp-eval skill was a 490-second solo subagent run against a detailed spec — produced a 380-line SKILL.md, fixture stubs, manifest entries, new command, and a clean PR. 
(2) Two parallel diagnostic dispatches (Iter 4 + Iter 5) compressed what would have been serial cycles; each came back with a ~400-word report and no main-thread context pollution. Rule: **delegate mechanical scope, keep design decisions in the main thread, brief the subagent like it's never seen the session.** -- **Running the pipeline against real content surfaces gaps that spec reviews miss.** The 2026-04-08 focus-group framework was designed carefully and shipped with medium-high confidence but *"real validation requires running the skills against the fixture in a Claude session, which I can't do from the implementing session"*. The first real run (cosmetics FGD Phase 1 recon) found the category-naming drift that shipped as Iter 1. Same for the qa/eval split: the first real run against the golden template found the `per_item` / `per_prompt` naming drift + the no-opp fallback gap + the MCP tool schema gap — none of which showed up in the 0.3.5 design phase. -- **Atomic PRs per fix kept the session manageable.** Four separate PRs (#33–#36) with clear scope each. Tests green between every PR. Zero cross-PR conflict. Each shipped as its own installable version (0.4.1 → 0.4.4) so the progression is replayable later. -- **Persistent backlog in Drive (plus in-repo CHANGELOG) during the session** served as a working-memory bridge between iterations. The Drive backlog surfaced each time I picked the next fix; the CHANGELOG entries captured durable rationale. The run log (this file) consolidates. -- **`drive_create_file` default to Google Doc conversion was transparent** — idea.md uploaded as a Doc rather than markdown, but `drive_read_file` exported it cleanly, so idea-to-pdd didn't choke. My initial concern didn't materialize. Worth noting as a positive signal: the MCP tool handles both cases. - -**What was wasteful:** - -- **MCP parameter-name confusion burned ~15 minutes** up front. 
Used `parentId` (wrong), then got the correct `parentFolderId` from the ToolSearch schema only after the user pointed out that the Drive-quota error was a red herring. The "Drive quota exceeded" error was a misleading symptom of the wrong parameter (the folder was created in the SA's own Drive root, not the shared drive). Rule for next time: **when an MCP tool errors on what looks like auth/quota, first verify schema via ToolSearch — the call may be hitting a different code path than intended.** -- **Dispatched the Iter 6 subagent and hit a rate limit mid-diagnosis.** 7 tool calls in, then `"You've hit your limit · resets 9am (America/Denver)"`. Unknown whether it made partial changes to the live template. Retry after reset (a few hours later) worked; found no user-facing drift from the prior attempt. But the blind-retry approach means any partial mutation could have caused confusion. Rule: **when dispatching subagents that modify production state, have them emit a checkpoint after each mutation so a retry has an auditable stopping point.** Candidate for canopy-skills. -- **Iteration log + backlog files in Drive got written once, then not updated.** Drift. The repo CHANGELOG was the real source of truth. **For a future iteration loop, either auto-update the Drive file each cycle or skip it entirely and rely on CHANGELOG + this run log.** Recommend the latter — fewer surfaces to keep in sync. - -**Prompt adjustments for next time:** - -- **Flag context.md as needing refresh when skill/command/agent counts drift.** `context.md` line 18 said "5 agents, 19 skills, 4 commands" for several cycles after the actual counts moved. It's low-cost to keep current if done during the cycle that changes the count; high-cost to discover stale later. Add a preflight step: when a cycle adds a skill or command, also bump the counts in `context.md`. 
-- **For multi-iteration cycles (N > 3 iterations in one session), write the backlog out to Drive / the run log at a midpoint.** This session ran 8 iterations in one sitting. Persisting only at the end worked, but a mid-cycle crash would have lost the backlog. Either a midpoint flush or auto-append-on-merge would prevent loss. -- **Subagent path-substitution conventions should be in the skill contract from day one.** The no-opp fallback (`ACE/golden-template/`) wasn't in the original 0.3.5 qa/eval split, surfaced during Iter 4 testing, shipped in Iter 7 as a contract fix. Any skill that can run without an opp context needs its fallback path explicit on first ship. - -**Confidence on validation:** - -- **High on Iters 1, 3, 7 (text / contract edits).** Tests green; each is a small scoped SKILL.md or agent change with clear semantics. Manual re-run of the affected skills against cosmetics-fgd-pilot would confirm behavior, left for a dogfood cycle. -- **High on Iter 6 (golden template + bootstrap).** Before/after scores are concrete (3.84 → 8.2), re-ran qa/eval as proof. Bootstrap defense is validated against the same bad-state it was designed to catch (via the scrape helper that actually walks live OCS data). -- **Medium on the qa/eval split (0.3.5) + opp-eval (0.4.0).** Both shipped with passing tests and documented contracts, but the underlying integration points (real OCS chat, real verdict aggregation) got exercised only against the golden template (5 prompts) and against one partial opp (cosmetics-fgd-pilot, no verdicts). The 0.4.3 contract cleanup closed surfaced gaps; another real run (P4 backlog item) is needed to get to high confidence. -- **Low on orchestrator defensive init (Iter 7).** The rule is spec-level; no skill invocation has actually exercised the new `/ace:step` init path in this session. Next time `/ace:step` is invoked against an opp without state.yaml, we'll know. 
- -### Self-improvement (canopy-skills meta-PRs) - -Three candidates surfaced from this run's meta-observations: - -1. **"Subagent state-mutation checkpointing."** When a subagent modifies production state (live OCS template, Drive content, external API calls), instruct it to emit a checkpoint after each mutation — timestamped, structured, greppable — so a retry after interruption has an auditable stopping point. Today the Iter 6 retry worked without issue, but in a less-fortunate case the prior partial run could have left the template in a worse state than the baseline. Worth a universal-PR for the canopy product-management skill's subagent-dispatch section. - -2. **"ToolSearch before schema-guessing on MCP errors."** The Drive-quota red herring cost ~15 minutes because I inferred the wrong parameter name and read the error at face value. When an MCP tool errors on what looks like auth or quota, invoke ToolSearch for the exact schema before retrying. Small addition to canopy's debugging-skill guidance. - -3. **"Auto-append context.md on count-drift."** When a cycle adds a skill, agent, or command, update the count in context.md in the same commit that adds the artifact. Cheap to do inline; expensive to catch later. PM scout skill's Phase 5 (validate) should include a count-check. - -Beyond these three: the "archetype branching is a single-skill single-PR unit of work" pattern is confirmed (two shipped cleanly this session). The "qa/eval two-phase pattern" is now a first-class concept worth extracting from ACE into the canopy product-management guidance for any project where an artifact needs external exercising before judgment. 
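Candidate 1 (subagent state-mutation checkpointing) can be made concrete with a small sketch. Everything here is hypothetical: the `CHECKPOINT` prefix, the `Checkpoint` fields, and both helper names are illustrative choices for a "timestamped, structured, greppable" record, not an existing canopy or ACE convention.

```typescript
// Hypothetical checkpoint record, written after each production mutation.
interface Checkpoint {
  ts: string;     // ISO-8601 timestamp of the mutation
  step: number;   // 1-based index of the mutation within the dispatch
  action: string; // what was done, e.g. "attach_collection"
  target: string; // what it was done to, e.g. "template:11792"
  ok: boolean;    // whether the mutation completed
}

const CHECKPOINT_PREFIX = "CHECKPOINT ";

// One line per checkpoint, so `grep '^CHECKPOINT '` recovers the full trail.
function formatCheckpoint(cp: Checkpoint): string {
  return CHECKPOINT_PREFIX + JSON.stringify(cp);
}

// On retry, find the highest successfully completed step in a prior log.
function lastCompletedStep(log: string): number {
  let last = 0;
  for (const line of log.split("\n")) {
    if (!line.startsWith(CHECKPOINT_PREFIX)) continue;
    const cp = JSON.parse(line.slice(CHECKPOINT_PREFIX.length)) as Checkpoint;
    if (cp.ok && cp.step > last) last = cp.step;
  }
  return last;
}

const priorLog = [
  "probing sessions...",
  formatCheckpoint({ ts: "2026-04-19T15:00:00Z", step: 1, action: "restore_prompt", target: "template:11792", ok: true }),
  formatCheckpoint({ ts: "2026-04-19T15:02:00Z", step: 2, action: "remove_collection", target: "template:11792", ok: true }),
  formatCheckpoint({ ts: "2026-04-19T15:03:00Z", step: 3, action: "publish_version", target: "template:11792", ok: false }),
].join("\n");

console.log(lastCompletedStep(priorLog)); // 2
```

A retried subagent that reads this trail can resume at step 3 instead of repeating or guessing at earlier mutations.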
diff --git a/.claude/pm/runs/2026-04-20-collection-clone-and-mcp-preflight.md b/.claude/pm/runs/2026-04-20-collection-clone-and-mcp-preflight.md deleted file mode 100644 index 07f4f31f..00000000 --- a/.claude/pm/runs/2026-04-20-collection-clone-and-mcp-preflight.md +++ /dev/null @@ -1,105 +0,0 @@ -## 2026-04-20 — collection-clone-and-mcp-preflight (custom lens) - -**Lens used:** "close out the source_usage gap on the golden template; ship items 2 + 3 from prior backlog as OCS-layer defense." Continuation of the 2026-04-19 iteration loop, driven by the `[WARN] source_usage: 5.0` finding from Iter 4 recon. - -**Background read:** `.claude/pm/runs/2026-04-19-qa-eval-iteration-loop.md` (prior cycle, backlog owners), `~/.ace/connect-ocs-bot.json` (production bot metadata), `docs/superpowers/specs/2026-04-08-ace-ocs-chatbot-buildout-design.md` (verification items 6, 7: team-scoping intent), `scripts/bootstrap-ocs-golden-template.ts` (prior-cycle collection-existence defense), `mcp/ocs/backends/{playwright,pipeline-patch}.ts`, `mcp/ocs-server.ts`, `test/mcp/ocs/{playwright-backend,pipeline-patch}.test.ts`. Mid-cycle: user clarified that `connect-ace` is AI-isolated from `ccc-support` for blast-radius containment (deliberate architectural choice), and that `chatbots.dimagi.com` is a legacy DNS alias for the same `openchatstudio.com` backend. - -**Core finding:** The "fix items 2 and 3" framing from the prior session's backlog was **anchored on a wrong premise** — the Iter 6 subagent's "collection 718 doesn't exist on connect-ace" finding led me to categorize it as team-infrastructure work. The user's challenge ("can we reference collections across teams? what team is the support bot on?") surfaced that 718 was **stale metadata** in `~/.ace/connect-ocs-bot.json` (dated 2026-04-09). The real "NM Bot" collection is id **135** on `ccc-support`. Two DNS names for one backend created secondary confusion. 
A one-day loop of Path C verification → Path B execution (subagent-driven clone to connect-ace, new collection 350) → MCP-layer defense against the same class of silent-block bug resolved the gap cleanly and closed backlog item 2 as redundant with item 3. - -### Do it - -1. **Path C verification — cross-team collection attach → publish** — Effort: S (exploratory) — Status: **done, no PR** - - 4 MCP tool calls: `ocs_get_chatbot_embed_info(11792)` → `ocs_attach_knowledge({collection_index_ids: [718]})` (ok: true at pipeline-patch layer) → `ocs_publish_chatbot_version` (**silent-block**: `HTTP 200 Version publish rejected: form re-rendered without redirect`) → revert: `ocs_attach_knowledge({collection_index_ids: []})` + republish to clean state. - - Outcome: **confirmed OCS enforces team scoping at publish, not attach.** Cross-team collection references are not supported by OCS. The attach layer accepts any id; publish layer validates and rejects without surfacing the error. Path A (move template to ccc-support) was unacceptable (user: `connect-ace` is deliberately AI-isolated from human-managed production); Path B (clone locally) was the remaining option. - -2. **Iter 8: Subagent clone of collection 135 (ccc-support) → 350 (connect-ace)** — Effort: M — Status: **done, live OCS state change, no code PR** - - Dispatched a general-purpose subagent with ~40 tool-call budget. Sessions probed: connect-ace valid; ccc-support valid. Collection 718 didn't exist (stale metadata); found 135 by name ("NM Bot"). Enumerated: 2 files (`AutoConnect_FAQs.docx`, `Support_bot_FAQ_ECD-KMC-CHC.docx`), 111 chunks, ~170 KB. Extracted `OCS_LLM_PROVIDER_ID=378` (OpenAI for Embeddings) and `OCS_EMBEDDING_MODEL_ID=1` (text-embedding-3-small) from template 11792's pipeline. Created collection 350 on connect-ace, uploaded files, waited for indexing (ready=true, 2 files indexed). Attached to template 11792, republished v4 (no silent-block this time). 
Verified end-to-end with 5 canonical `--quick` prompts — all returned high-quality Connect-knowledgeable content clearly sourced from the uploaded docs (e.g., walked through CommCare onboarding steps lifted verbatim). - - Side effects: env file write sandbox-blocked; user must manually add `OCS_SHARED_COLLECTION_ID=350`, `OCS_LLM_PROVIDER_ID=378`, `OCS_EMBEDDING_MODEL_ID=1` to `$CLAUDE_PLUGIN_DATA/.env`. - - Scope discipline: read-only contract with ccc-support honored; only GETs to its API. - -3. **0.5.1 — `publishChatbotVersion` pre-flight + `uploadCollectionFiles` chunk params** — Effort: M — Status: **done, shipped** - - PR: jjackson/ace#39 - - `mcp/ocs/backends/pipeline-patch.ts` — new `validatePipeline` helper round-trips the current graph through `/pipelines/data/` to surface node-level errors. New `extractPipelineErrors` helper handles both the legacy top-level `{errors: [...]}` shape and the nested `{errors: {node: {: {: }}}}` shape that OCS actually returns for node-level validation. `patchLlmNodeParams` now uses the same extractor — it had the same top-level-only blindspot that hid the phantom-collection bug on 2026-04-19. - - `mcp/ocs/backends/playwright.ts:publishChatbotVersion` — calls `validatePipeline` before `/versions/create`. The silent-publish-block class is now structurally impossible: every publish goes through the pre-flight, every node-level error surfaces as a typed `PipelineValidationError` naming the exact node + field. - - `mcp/ocs/backends/playwright.ts:uploadCollectionFiles` — sends `chunk_size` + `chunk_overlap` (required by Django's `add_collection_files` form; omitted before this cycle). Defaults 800/400 match upstream NM Bot. MCP tool schema exposes both as optional overrides; invalid values (overlap ≥ size) throw before HTTP. - - 12 new unit tests (89 total, up from 77). Covers both error-shape variants, pre-flight blocking, chunk-param passthrough, overlap-validation. 
- - **Backlog item 2 ("ocs-agent-setup pre-flight on OCS_SHARED_COLLECTION_ID")** dropped as redundant — the MCP-layer pre-flight catches the class at the bottleneck; every path that publishes goes through it. - -4. **`~/.ace/connect-ocs-bot.json` metadata refresh** — Effort: trivial — Status: **done (local file, not in repo)** - - `shared_collection_id: 718` → 135; `shared_collection_chunks: 148` → 111; `base_url` → canonical `www.openchatstudio.com` with `base_url_legacy_alias` for the DNS fallback; new `connect_ace_local_copy` pointer to collection 350; corrected source description (uploaded docs, not Confluence auto-sync). Prevents another ghost chase from stale metadata. - -### Backlog - -**P1 — User action (unblocks post-clone retrieval):** -- Append three lines to `~/.claude/plugins/data/ace-ace/.env`: - ``` - OCS_SHARED_COLLECTION_ID=350 - OCS_LLM_PROVIDER_ID=378 - OCS_EMBEDDING_MODEL_ID=1 - ``` - Without these, future opp clones go through `/ace:ocs-bootstrap-template` (which now defensively skips if env var is missing) but inherit no shared knowledge. Can't be done from within ACE-code path because the sandbox blocks writes to `$CLAUDE_PLUGIN_DATA/.env`. - -**P2 — Dogfood on cosmetics-fgd-pilot (was P4 prior cycle):** -- Now unblocked: golden template works, shared collection provisioned, MCP pre-flight live. Running `ocs-setup` end-to-end on cosmetics-fgd-pilot will exercise the full 0.3.5+0.4.x+0.5.x stack against a real opp for the first time. - -**P3 — `ocs_list_collections` MCP tool (was P1 prior cycle):** -- Still worth adding. `bootstrap-ocs-golden-template.ts` has a scrape-based `listCollectionIndexIds` helper; lift it to a first-class MCP tool. Would have let the Iter 8 subagent probe collection existence via tool instead of direct HTML scraping. 
- -**P4 — Archetype coverage audit (was P2 prior cycle):** -- Unchanged: `connect-program-setup`, `training-materials`, `llo-onboarding/uat/launch/feedback`, `app-test`, `flw-data-review` need FGD-branch audit. Small per-skill PRs. - -**P5 — Rubric proliferation (was P3 prior cycle):** -- Unchanged: add `## LLM-as-Judge Rubric` to more skills so `opp-eval` aggregates beyond `ocs-chatbot-eval`. Forcing function from prior cycle still standing. - -**P6 — Collection sync from ccc-support upstream:** -- Collection 350 is a point-in-time clone of ccc-support 135 as of 2026-04-20. Drift is inevitable. Options: (a) manual periodic refresh, (b) auto-sync from the same Confluence source if/when that source is added, (c) cross-OCS tooling. Deferred until drift becomes observable (connect content changes slowly; days-of-lag is fine). - -**P7 — `fgd-synthesis` skill (was P6 prior cycle):** -- Unchanged and deferred per user direction ("improve core first"). The "shareable-with-LEEP" narrative report is the biggest net-new capability for FGD opps, but core stability comes first. - -### Closed - -- **Item 2 from 2026-04-19 backlog: `ocs-agent-setup` pre-flight on `OCS_SHARED_COLLECTION_ID`.** Made redundant by the MCP-layer pre-flight in 0.5.1 — `publishChatbotVersion` now refuses any pipeline with validation errors, so per-opp bot creation is automatically protected. Skill-level duplication would be defense-in-depth but not worth the added complexity. - -### Skipped on this run (raised but not formally proposed) - -- **Cross-team collection support via OCS.** Verified impossible from our side (publish enforces team scope); any forward path requires an OCS feature request. Flagged as external-dependency, not in our roadmap. -- **`ocs_list_collections` MCP tool.** Would have prevented the 718 phantom-collection discovery from requiring a full subagent dispatch. Still worth doing (see P3) but not urgent. 
-- **Auto-validation of `~/.ace/*.json` metadata files.** The 2026-04-09 snapshot stayed authoritative in our reasoning for 11 days. Worth a "freshness probe" pattern — could be a canopy-skills universal candidate. -- **Updating 0.4.5 PM run log** to cross-reference today's discoveries. Deferred: cross-referencing is fine from this log pointing backward; forward-rewrite isn't worth it. - -### Meta-observations - -**What worked well:** - -- **User's direct challenge broke a wrong premise.** "How does having the template in ccc-support help? we still need to create the bots... in connect ace" cut through my Path A framing in one sentence. The subagent's "collection 718 doesn't exist on connect-ace" was technically true but misleading; I'd converted it into "needs team-infra work to create a new collection" without verifying. The user asked the right question first. Rule: **when a subagent reports a factual constraint, invert it — "is this actually true?" — before categorizing the work.** -- **Path C as a cheap disambiguation experiment.** 4 MCP calls to test "does OCS reject cross-team collection attaches?" beat hours of documentation reading or speculation. The revert was clean. Rule: **for cross-system questions with reversible consequences, run the experiment.** -- **Subagent Iter 8 checkpoint discipline.** The cloning subagent emitted checkpoints after each mutation (collection created, files uploaded, attached, published). This cycle's prior-attempt rate-limit (2026-04-19 Iter 6) validated the pattern: if interrupted, the next agent picks up from the last checkpoint instead of re-running potentially destructive steps. -- **Bounded subagent scope with explicit read-only contract.** "Read-only on ccc-support. Write allowed on connect-ace." The subagent honored it cleanly; no confusion about which team to modify. 
-- **MCP pre-flight as class-level defense, not instance-level.** Fixing one method (`publishChatbotVersion`) catches every future silent-block, not just today's phantom-collection case. Covers `ocs-agent-setup`'s per-opp clones, manual UI edits, future `ocs_attach_knowledge` misuses — every path that publishes. Item 2 dropping as redundant is the payoff. - -**What was wasteful:** - -- **Assumed "718 doesn't exist on team" was a complete finding when it was an ambiguous one.** The subagent correctly observed the UI state; I wrote up a backlog item on it without probing further. Propagated a wrong premise into the prior session's PM run log (now superseded by today's discovery). Rule: **a subagent's observation is an input to reasoning, not a conclusion. Challenge the factual framing before extending it.** -- **Didn't catch the MCP field-name + chunk-param bug in earlier cycles.** The 0.3.5 qa/eval split shipped tests against a fake request layer; the `uploadCollectionFiles` chunk-param omission wouldn't have shown in any of those because the tests weren't simulating Django's form-validation behavior. The bug surfaced only when the Iter 8 subagent actually tried to upload files to a real collection. Rule: **for HTTP backend code that integrates with a specific server's form validation, unit tests against mock requests can miss entire failure classes. Integration tests are load-bearing.** Not fixing today (no new integration tests shipped), but worth noting. - -**Prompt adjustments for next time:** - -- **When dispatching a subagent that modifies external production state, require it to report blocked operations explicitly.** The env-file write was sandbox-blocked; the subagent buried that in the report text rather than surfacing it as a top-level "USER ACTION REQUIRED." Took a re-read to notice. Add a convention: subagent report should have a dedicated "blocked / requires manual follow-up" section. 
-- **Metadata files (`~/.ace/*.json`) should be treated as hypotheses, not truths.** The 2026-04-09 connect-ocs-bot.json was trusted for 11 days without re-probing. Today's refresh is a one-shot; the pattern would be "before acting on metadata older than N days, probe its facts against live state." - -**Confidence on validation:** - -- **High on 0.5.1 (MCP pre-flight + upload chunks).** 89 tests pass. Every observed failure shape has a test. `publishChatbotVersion` pre-flight tested end-to-end at the mock-request layer, including the exact phantom-collection nested-error shape. -- **High on Iter 8 clone (live OCS state change).** Subagent ran the 5 canonical `--quick` prompts post-republish; all 5 returned high-quality Connect-knowledgeable responses. Content provenance is qualitatively clear (e.g., CommCare onboarding walkthrough details lifted verbatim from uploaded docs). A formal `ocs-chatbot-qa --deep` + `ocs-chatbot-eval --deep` score comparison (pre-clone 8.2 overall / source_usage 5.0 → post-clone ?) would be a satisfying capstone; left for the next cycle. -- **Medium on "team-scoping is enforced at publish-time only."** Verified via one experiment (Path C). Hypothesis confirmed but n=1. A second reversing experiment (attach a collection that DOES exist, confirm publish succeeds) would strengthen confidence; left as an imputed fact. - -### Self-improvement (canopy-skills meta-PRs) - -Two candidates: - -1. **"Metadata files as hypotheses, not truths."** When a PM cycle reads a JSON/YAML/env metadata file that anchors subsequent reasoning, tag it with a freshness check: probe the relevant fact against live state if the file is older than ~7 days. Today's stale `connect-ocs-bot.json` from 2026-04-09 cost a full exploratory cycle to unwind. Candidate for canopy's general PM-scout skill Phase 1 guidance. - -2. 
**"Subagent blocked-operations convention."** When a subagent dispatches on a mutation task, its report should have a dedicated "USER ACTION REQUIRED" section at the top for any step sandbox-blocked or otherwise unattainable. Today's env-file write was blocked and the line buried mid-report; a convention forces it into the summary. Candidate for canopy's subagent-dispatch guidance. - -Beyond these: "team-scoped resources enforced at mutation-publish time, not mutation-attach time" is OCS-specific but generalizes to "silent-accept-then-reject-on-commit" patterns across systems (SQL DDL, container orchestrators, etc.). Worth noting but not its own canopy rule. diff --git a/.claude/pm/runs/2026-04-20-dead-env-vars.md b/.claude/pm/runs/2026-04-20-dead-env-vars.md deleted file mode 100644 index c14ce25f..00000000 --- a/.claude/pm/runs/2026-04-20-dead-env-vars.md +++ /dev/null @@ -1,92 +0,0 @@ -## 2026-04-20 — adoption-blockers (dead env vars, second pass) - -**Lens used:** adoption-blockers, explicitly re-run at user request after the 2026-04-20 morning env-drift cycle shipped well. Applied the key learning from that run: **read primary sources directly, don't trust tooling**. Step 1 was `diff <(keys from installed .env) <(keys from .env.tpl)` — confirmed the env-drift class (keys missing from env) was cleanly closed by 0.5.4–0.5.5 (16/16 match on my box). - -**Background read:** `.claude/pm/context.md`, `.claude/pm/learnings.md` (31 entries now), prior runs `2026-04-19-qa-eval-iteration-loop.md` and `2026-04-20-env-drift-adoption-blockers.md`, `.env.tpl`, installed `~/.claude/plugins/data/ace-ace/.env`, `bin/ace-doctor`, `commands/{setup,doctor,ocs-login,ocs-bootstrap-template,run}.md`, `README.md`, `scripts/bootstrap-ocs-golden-template.ts`, `skills/ocs-agent-setup/SKILL.md`, `playbook/integrations/ocs-integration.md`. - -**Core finding — the inverse subclass:** `env_drift` closed "keys in `.env.tpl` but missing from installed `.env`." 
The inverse subclass — **keys declared in `.env.tpl` with zero consumers in code** — was unaudited. `grep -rE "\b$KEY\b" mcp lib scripts skills bin hooks agents commands test` surfaced four dead vars: - -- `OCS_GOLDEN_TEMPLATE_PUBLIC_ID` — printed by bootstrap, README step 6 tells user to paste into `.env`, **never read by any code.** The per-opp `ocs-agent-setup` skill retrieves its own `public_id` via `ocs_get_chatbot_embed_info` after cloning. The golden template's public_id is never referenced at runtime. -- `OCS_GOLDEN_TEMPLATE_EMBED_KEY` — same pattern, same dead code path. -- `OCS_PROD_TEAM_SLUG` — declared, injected from 1Password, zero consumers anywhere. -- `ACE_SESSION_STATE_DIR` — declared with value `~/.ace`, but every consumer hardcodes `path.join(os.homedir(), '.ace', ...)` rather than reading this var. - -**Secondary finding:** `.env.example` is stale and redundant. Missing `ACE_DRIVE_ROOT_FOLDER_ID` (added 0.5.3) and `OCS_PROD_TEAM_SLUG`. Only one doc (`playbook/integrations/ocs-integration.md:19`) still pointed at it. Two-file pattern was a holdover from pre-1Password setup. - -**Tertiary finding:** bootstrap output tells user to paste values into `.env`, but `.env.tpl` has `OCS_GOLDEN_TEMPLATE_ID` as an `op://` reference. Any future re-inject (triggered e.g. by `env_drift` on a new var) silently reverts the pasted value to whatever's in 1Password. The docs themselves contradicted the architecture. - -### Do it - -1. 
**P1 — Delete dead env vars + add class-level `unused_env_keys` doctor check** — Effort: S+ — Status: **done, pushed (PR #49)** - - Removed `OCS_GOLDEN_TEMPLATE_PUBLIC_ID`, `OCS_GOLDEN_TEMPLATE_EMBED_KEY`, `OCS_PROD_TEAM_SLUG`, `ACE_SESSION_STATE_DIR` from `.env.tpl` - - Updated `scripts/bootstrap-ocs-golden-template.ts` header comment (`.env.example` → `.env.tpl`), removed public_id/embed_key print lines - - Updated `README.md` First-Run step 6 + `commands/ocs-bootstrap-template.md` expected-output block - - Added `unused_env_keys` check to `bin/ace-doctor`: for each `KEY=` in `.env.tpl`, greps `mcp lib scripts skills bin hooks agents commands test`; WARNs on any key with zero consumers. Informational (WARN, not FAIL) per the 2026-04-15 learning. - - Verified on-box: doctor now reports `PASS unused_env_keys: every .env.tpl key has at least one consumer`. Tests still 89/89. - -2. **P2 — Delete `.env.example`** — Effort: S (trivial) — Status: **done, same PR** - - `rm .env.example` - - `playbook/integrations/ocs-integration.md:19` now points at `.env.tpl` - - `.gitignore` comment updated - -3. **P3 — Bootstrap output reframed: 1Password is source of truth** — Effort: S — Status: **done, same PR** - - `scripts/bootstrap-ocs-golden-template.ts` "Add to your ACE .env:" block replaced with: (1) `op item edit "ACE - Open Chat Studio" "Config.golden_template_id[text]=" --vault AI-Agents --account dimagi.1password.com`, (2) `op inject -i .env.tpl -o ~/.claude/plugins/data/ace-ace/.env --account dimagi.1password.com`, (3) `/reload-plugins`. The existing-template path (when bot already exists, not force) prints experiment_id + public_id with a note that no vault change is needed. - - `commands/ocs-bootstrap-template.md` expected-output block mirrors the new script output, with a rationale paragraph explaining why local paste-to-`.env` fights `op inject`. - - Per user direction: only fires on new/replaced template; existing-template path just echoes the id. 
- -### Backlog - -Carried forward from 2026-04-20 earlier (unchanged): - -**P3** — `ocs_list_collections` MCP tool -**P4** — Archetype coverage audit (remaining silent atomic-visit defaults) -**P5** — Rubric proliferation -**P6** — Collection sync from ccc-support upstream (deferred) -**P7** — `fgd-synthesis` skill (deferred per user direction) - -New from this cycle: - -- **P8 — Wire up `ACE_SESSION_STATE_DIR` as a real knob, if someone ever needs it.** Removed as dead code here; re-add with a real consumer (`mcp/ocs/auth` reads it, falls back to `~/.ace`) when a concrete use case surfaces. Probably never — `~/.ace` is fine. -- **P9 — Doctor `unused_env_keys` check could also verify non-empty values** for required keys (analogous to how `drive_root` and `ocs_shared_collection` do today). Deferred — the current check catches the class we care about most (declared but unread). If a user ever injects with an empty 1Password field, the existing per-var checks catch the common cases. - -### Closed - -- **None new.** Same rationale as the 2026-04-20 earlier cycle: the class-level preventer (`unused_env_keys`) generalizes today's finding — any future dead-var addition gets caught automatically. Future proposals of the "remove $FOO, it's dead" shape should be redundant with doctor. - -### Skipped on this run (raised but not formally proposed) - -- **0.5.6 missing CHANGELOG entry.** Commit `f45549d` (llo-invite phase move) bumped VERSION to 0.5.6 but added no CHANGELOG stanza. Noticed while editing CHANGELOG.md for 0.5.7. Out of scope for this PM cycle; trivial fix if it lands in the next PR. -- **Worktree node_modules FAIL in `/ace:doctor --here`.** Dev ergonomics issue — worktrees don't inherit the parent repo's `node_modules`. Ran `npm test` successfully regardless (vitest found somehow). Not an adoption-blocker for operators; DX friction for contributors. Defer unless it blocks a contributor. 
-- **`scripts/bootstrap-ocs-golden-template.ts` step labels are inconsistent** (`[1/5]`, `[2/5]`, `[3/5]`, `[4/6]`, `[5/6]`, `[6/6]`) — pre-existing drift from when step 5 was added. Not adoption-blocking, trivial cosmetic fix if someone is touching the script anyway. - -### Meta-observations - -**What worked well:** - -- **The class-level grep was the whole cycle in one command.** `grep -rE "\b$KEY\b" mcp lib scripts skills bin hooks agents commands test` against each `.env.tpl` key surfaced all four dead vars in under a minute. Made the scout feel mechanical — exactly the mode I want adoption-blockers scouts to be in. -- **Caught a bug in my own doctor check before shipping.** First draft of `unused_env_keys` used `grep -v '\.env'` to exclude env-file hits; that accidentally filtered `process.env.FOO` in JS/TS consumers. The doctor output flagged `OCS_USERNAME` and `OCS_PASSWORD` as unused — which I *knew* had a consumer at `mcp/ocs-server.ts:81-82`. Noticed the discrepancy, removed the broken filter, check now reports accurately. **Lesson: when a doctor check reports surprising results, don't ship — trace the disagreement first.** -- **Reading the script's own output block surfaced the P3 finding.** I wasn't planning to propose P3 until I re-read bootstrap's print statements while deleting the two dead lines. The "paste into .env" advice jumped out against the backdrop of the 2026-04-20 "vault values are hypotheses too" learning. Same pattern as the morning cycle: learnings.md compounds, lens-specific scouts pick up items that would otherwise drift. - -**What was wasteful:** - -- **I paused mid-scout to ask about OCS MCP and 22 capabilities.** User (correctly) steered me back to adoption-blockers. Lens discipline: stay in the lens until scout is complete, THEN consider related explorations as separate proposals. Adoption-blockers scout is cheap and bounded — don't fuse it with an integration-depth scout. 
-- **Debugging the grep in the doctor check took 3–4 tool calls** because I was running under zsh (word-splitting differs from bash). Wasted tokens tracing the wrong hypothesis. **Lesson: when a shell construct behaves unexpectedly, check `$BASH_VERSION` before assuming the script is wrong.** The doctor runs under bash (`#!/usr/bin/env bash`); my ad-hoc debug calls were under zsh (Bash tool's default). Different semantics for unquoted `$var` word-splitting. - -**Prompt adjustments for next time:** - -- **For adoption-blockers scouts, run both the forward and inverse diff.** Morning cycle did forward (tpl → env, keys missing from install). This cycle did inverse (tpl keys → codebase, keys without consumers). Both are one-line greps; do them together every time. Candidate for canopy update. -- **When adding a doctor check, unit-test it against a known-good case before shipping.** The `grep -v '\.env'` bug would have been caught by "verify that `OCS_USERNAME` shows up in consumers before trusting the output." Applies more broadly: any diagnostic that emits PASS/WARN/FAIL should be exercised against both states at authoring time. Related learning from 0.5.4 follow-up ("test the hint actually runs end-to-end"), extended to "test the check actually catches the thing it claims to catch." - -**Confidence on validation:** - -- **High on dead-var removal.** Verified by full-tree grep after removal: the four removed vars have zero remaining references in code paths (only `docs/superpowers/{specs,plans}/*` historical refs remain, explicitly frozen per CLAUDE.md). -- **High on `unused_env_keys` check.** Reports `PASS` against the cleaned `.env.tpl`; reports accurate `WARN` list if dead vars are re-added (verified by intentionally re-adding and checking output). Tests still 89/89 — the check is bash-only, no test surface change. -- **Medium on P3 reframe.** Not exercised against a live re-bootstrap. 
Would be strengthened by running `OCS_BOOTSTRAP_FORCE=1 npx tsx scripts/bootstrap-ocs-golden-template.ts` and visually verifying the new output block reads correctly. Left for manual post-merge smoke if anyone refreshes the template. - -### Self-improvement (canopy-skills meta-PRs) - -Two candidates: - -1. **"For adoption-blockers scouts, run both the forward AND inverse diff."** Morning cycle's canopy candidate was "Step 1: diff installed config against template (forward)." Extend to: "Step 1a — forward diff: what keys in `.template` are missing from installed? Step 1b — inverse sweep: what keys in `.template` have zero consumers in the code tree (grep -rE for each key against code-bearing dirs)?" Both surface adoption-blocker classes; both are cheap; together they close the dead-var subclasses at the template boundary. - -2. **"Before shipping a diagnostic (doctor / health check / CI gate), exercise it against both PASS and FAIL states."** The 2026-04-20 follow-up rule ("run the `fix:` command at least once before landing") covers the remediation side. This cycle's bug — a grep filter that accidentally excluded real consumers — is the complementary failure: the check itself was wrong, reporting false positives. Rule addition: **any FAIL/WARN path should be exercised with a known-good input at authoring time** to confirm it doesn't misfire. Candidate for canopy's adoption-blockers lens AND the broader "writing checks" guidance if there is one. diff --git a/.claude/pm/runs/2026-04-20-env-drift-adoption-blockers.md b/.claude/pm/runs/2026-04-20-env-drift-adoption-blockers.md deleted file mode 100644 index 43676892..00000000 --- a/.claude/pm/runs/2026-04-20-env-drift-adoption-blockers.md +++ /dev/null @@ -1,86 +0,0 @@ -## 2026-04-20 — adoption-blockers (env-drift and smart-default follow-through) - -**Lens used:** adoption-blockers. 
Recent cycles (2026-04-19, 2026-04-20 earlier today) hit trust-reliability and integration-depth hard; PR #41 landed zero-arg `/ace:run` smart defaults a few hours prior. Natural moment to audit remaining first-run friction on the happy path now that the happy path just got happier. - -**Background read:** `.claude/pm/context.md`, `.claude/pm/learnings.md`, most recent prior run `2026-04-20-collection-clone-and-mcp-preflight.md`, `commands/run.md`, `commands/doctor.md`, `agents/ace-orchestrator.md` § Starting a New Opportunity, `bin/ace-doctor`, `.env.tpl`, and (key cross-check) the installed `.env` at `~/.claude/plugins/data/ace-ace/.env`. - -**Core finding (single unlock of the cycle):** the installed `.env` on the author's machine was 9 KEY= lines while `.env.tpl` ships 16. Missing keys included `ACE_DRIVE_ROOT_FOLDER_ID` — the variable PR #41's smart-default PDD picker depends on — and the shared-collection triple (`OCS_SHARED_COLLECTION_ID`, `OCS_LLM_PROVIDER_ID`, `OCS_EMBEDDING_MODEL_ID`) that was flagged P1 in the 2026-04-20 earlier run log. `/ace:doctor` reported `STATUS: COMPLETE` nonetheless — it only validated 3 of the 16 template keys. The adoption-blocker class is **`.env.tpl` drifts forward across releases; installed `.env` files don't auto-update; doctor doesn't notice**. Every admin who injected `.env` before these vars were added gets silent failures on the happy path with no signal of what's wrong. - -### Do it - -1. **P1 — `bin/ace-doctor` env-drift diff + specific checks for smart-default + shared-collection vars** — Effort: S — Status: **done, merged** - - PR: jjackson/ace#42 - - Three new checks in `bin/ace-doctor`: - - `drive_root`: explicit WARN when `ACE_DRIVE_ROOT_FOLDER_ID` is unset (PDD auto-discovery disabled; fix hint points at `op inject`, noting the var was added in 0.5.3). - - `ocs_shared_collection`: explicit WARN when any of the triple is missing (per-opp bots will have empty RAG; fix hint notes added in 0.5.1). 
- - `env_drift`: diff the `KEY=` set in `$ROOT/.env.tpl` against installed `$ENV_FILE`; WARN with the full list of missing keys if any, plus the canonical `op inject` fix command. Catches every future addition automatically — no code change needed when `.env.tpl` grows. - - Verified on the author's machine: all three WARNs fire correctly; `env_drift` lists 8 missing keys. Tests green (89/89). - -2. **P2 — `/ace:run` PDD picker fails loudly on unset `ACE_DRIVE_ROOT_FOLDER_ID`** — Effort: S — Status: **done, merged (same PR)** - - `agents/ace-orchestrator.md` § Starting a New Opportunity, new step 2(c).0 added before the `drive_list_folder` call: check the env var, stop with an explicit error naming `op inject` (or `--idea FILE|-` as a bypass) if unset. Do NOT fall through to (d). - - `commands/run.md` short version kept in sync — step numbering re-flowed to 1–6. - - Complements P1: doctor catches it preventively, this catches it at the use site for operators who skip doctor. - -3. **P3 — Docs hygiene: Quick Start + First-Run step 8 + doctor next-step hint** — Effort: S (trivial) — Status: **done, merged (same PR)** - - `README.md` Quick Start block: zero-arg `/ace:run` is now the lead example, named-slug variant demoted to second line. - - `README.md` First-Run step 8: `/ace:run --dry-run` (zero-arg) as primary, named-slug variant parenthetical. - - `commands/doctor.md` post-PASS hint: points at zero-arg `/ace:run` instead of the opp-name-required form. - -### Backlog - -Carried forward (not addressed this cycle — these were not what the adoption-blockers lens surfaced): - -**P1 — User action (unblocks post-clone retrieval, still outstanding from 2026-04-20 earlier):** -- `OCS_SHARED_COLLECTION_ID=350`, `OCS_LLM_PROVIDER_ID=378`, `OCS_EMBEDDING_MODEL_ID=1` need to be appended to `~/.claude/plugins/data/ace-ace/.env`. Actually, the `env_drift` check this cycle surfaces that the .env IS missing these in addition to other keys — so a clean `op inject` would cover it. 
Either way, user action needed. Noted but not "closed" until user acts. - -**P2 — Dogfood on cosmetics-fgd-pilot** (unchanged, from 2026-04-20 earlier). - -**P3 — `ocs_list_collections` MCP tool** (unchanged, from 2026-04-20 earlier). - -**P4 — Archetype coverage audit** (unchanged). - -**P5 — Rubric proliferation** (unchanged). - -**P6 — Collection sync from ccc-support upstream** (unchanged, deferred). - -**P7 — `fgd-synthesis` skill** (unchanged, deferred per user direction). - -### Closed - -- **None new this cycle.** The doctor `env_drift` check generalizes all future one-off "add a check for $NEWVAR" entries, so none of today's proposals need to go on the "don't propose again" list — they weren't "fix one thing," they were "add a class-level preventer." If a future scout proposes "add doctor check for new var X," the correct response is to verify `env_drift` already catches it (it should) and close the proposal as redundant. - -### Skipped on this run (raised but not formally proposed) - -- **`/ace:setup` mirroring the env_drift diff at setup time.** Complementary to the doctor check but overlaps. Left as candidate if users consistently skip `/ace:doctor` after update. -- **Bootstrap script writing golden-template values back to `.env` automatically.** Currently user manually copy-pastes 3 vars after `/ace:ocs-bootstrap-template`. Adoption-blocker adjacent, but not on the zero-config happy path — a fresh inject already has them from 1Password. -- **Connect-labs install check in doctor.** Already checked (`connect_labs: available`), so no gap. - -### Meta-observations - -**What worked well:** - -- **Trusting doctor output during the scout was the wrong move — reading the installed `.env` directly was the right move.** My initial scan saw `STATUS: COMPLETE` and almost moved on. Only when I compared `wc -l .env.tpl` (16 entries) against the installed `.env` (9 entries) did the gap surface. 
Rule: **the tool you're auditing is not a trustworthy oracle for the adoption-blockers lens. Read primary sources (the `.env` file itself, the `.env.tpl` file itself) before trusting any "everything is green" indicator.** -- **Class-level preventer beat instance-level fix.** The first draft of P1 was "add a check for `ACE_DRIVE_ROOT_FOLDER_ID`." The second draft generalized it to "diff `.env.tpl` against `.env`." The second is strictly better — catches every future var without code change — and only costs ~15 more lines of bash. Same pattern as the 2026-04-20 earlier cycle's MCP-layer pre-flight (one bottleneck catches everything downstream). -- **Three small proposals, one PR, one release.** Adoption-blockers surface as cluster finds. Shipping P1+P2+P3 together made sense because they all pointed at the same underlying drift and reinforced each other (doctor + use-site pre-flight + updated doc pointers). - -**What was wasteful:** - -- **Branch hygiene accident.** When merging to main via the sibling-checkout workflow, I ran `git pull --rebase` without verifying the main checkout was actually on `main` — it was on the stale `feat/run-smart-defaults` branch (leftover from PR #41). The merge landed there, the push created a new ref on origin reviving the just-deleted branch. Recovery was clean (fast-forward main was possible because `a963239` had `71ddc28` as an ancestor), but the accidental revived remote branch required the user to delete manually (sandbox blocks destructive remote pushes). **Rule: before `git pull --rebase && git merge && git push` in the sibling main checkout, ALWAYS `git branch --show-current` first and `git checkout main` if needed.** Candidate for a CLAUDE.md gotcha addition if this happens again. - -**Prompt adjustments for next time:** - -- **For "adoption-blockers" scouts specifically, always diff the current installed `.env` against `.env.tpl` upfront** as a Step 1 artifact. 
This cycle's entire finding would have been a single command output in Phase 1. -- **When updating the main checkout from a worktree, dedicate a single bash chain that explicitly checks current branch first.** Something like `cd ~/emdash-projects/ace && { [ "$(git branch --show-current)" = "main" ] || git checkout main; } && git pull --ff-only && git merge --no-ff && git push` (the braces keep a failed `cd` from falling through to `git checkout main` in the wrong directory). Put this in CLAUDE.md § Git worktrees and merging to main as the canonical form. - -**Confidence on validation:** - -- **High on 0.5.4 shipped changes.** All three WARN paths tested on the author's machine — drive_root + env_drift fire, ocs_shared_collection passes (author manually added those per 2026-04-20 earlier P1). Tests green at 89/89 (unchanged — no new tests needed; the bash checks are validated by running doctor, and the orchestrator / commands / README changes are prompt-only). A negative test (restoring a missing var and re-running doctor to see the WARN clear) would strengthen but is implied by the diff working in both directions. -- **Medium on "this class is actually closed."** The `env_drift` check fires when keys are missing from `.env`, but if a user *manually* adds a key with an empty or placeholder value, the check won't catch it. Today's specific-var checks (`drive_root`, `ocs_shared_collection`) do validate non-emptiness, but that's only for those three. A follow-up could lift `env_drift` to also warn on keys whose value is empty or still an `op://` reference — left for if the simpler check proves insufficient. - -### Self-improvement (canopy-skills meta-PRs) - -One candidate: - -1. **"For adoption-blockers scouts, always diff the installed config against the template upfront."** Many Claude-Code-style projects have `.env` / `.env.example` / similar patterns where the template ships in-repo and the installed copy drifts across releases.
Universal pattern worth adding to canopy's adoption-blockers lens guidance: "Step 1 — list every `.tpl` / `.example` / template file in the repo; for each, diff the set of keys/sections against the installed / active counterpart. Missing keys in the installed copy are adoption blockers even if the tooling reports green." Would have shaved this cycle's scout time from ~15 minutes to ~2. Candidate for canopy's `product-management` skill, Phase 1 "Exploration Lenses → adoption-blockers" bullet list. - -Beyond that: the "current git branch check before main-checkout merge" is specific enough to this repo's emdash-worktrees layout that it's better in CLAUDE.md than in canopy. diff --git a/.claude/pm/runs/2026-04-28-turmeric-dogfood-ocs-contracts.md b/.claude/pm/runs/2026-04-28-turmeric-dogfood-ocs-contracts.md deleted file mode 100644 index cbd300fd..00000000 --- a/.claude/pm/runs/2026-04-28-turmeric-dogfood-ocs-contracts.md +++ /dev/null @@ -1,151 +0,0 @@ -## 2026-04-28 — turmeric-dogfood-ocs-contracts (custom lens) - -**Lens used:** "First real end-to-end dogfood run of `/ace:run` against a clean PDD (turmeric market survey). Self-improve loop, not external demo: surface platform gaps in the live skill chain, fix what's atomic, document what isn't." Picked because the 2026-04-19 backlog item P4 ("Dogfood: real Phase 4 on cosmetics-fgd-pilot") had been sitting open and the spec assumed a fully-wired Phase 2; we wanted to find out what actually broke when the orchestrator tried it end-to-end on real content. - -**Background read:** `CLAUDE.md`, `agents/ace-orchestrator.md`, `commands/run.md`, `mcp/google-drive-server.ts`, `mcp/ocs/{capability-map,backends/{rest,playwright,pipeline-patch}}.ts`, `mcp/ocs-server.ts`, `skills/{ocs-agent-setup,ocs-chatbot-qa,ocs-chatbot-eval,opp-eval,llo-invite}/SKILL.md`, prior PM runs `2026-04-19` (qa-eval-iteration-loop) and `2026-04-20` (collection-clone, env-drift, dead-env-vars). 
Mid-cycle: `docs/examples/pdd-turmeric-market-survey.md` as the test PDD. - -**Core finding:** Three classes of platform bug — each invisible to spec review and only surfaced by running the chain end-to-end against real content. Two shipped this cycle (0.5.18, 0.6.1). The third (OCS pipeline-binding semantics) was discovered too late to fix on the same dispatch and is next cycle's #1. The dogfood validated, again, that *"real run > spec review"* — same observation logged in the 2026-04-19 cycle. Composite eval score 6.5/10 on the first run (Source-Usage 1/10 — RAG functionally broken via wrong-domain shared collection); re-run blocked at Step 7 by the new partial-save bug, so the "did fixing collection 350 fix RAG?" hypothesis remains untested. - -### Do it - -1. **Drive Shared-Drive guard — `parentFolderId` required + `driveId` pre-flight on `drive_create_*`** — Effort: S — Status: **done, shipped 0.5.18** - - Commit: `c44d1a4` - - Outcome: `drive_create_file` and `drive_create_folder` were silently falling back to the SA's My Drive root when the parent ID wasn't on a Shared Drive — every subsequent file write into that folder failed with the misleading "user storage quota exceeded." Surfaced 30 seconds into the first dogfood (`turmeric-dogfood-20260427` folder created with parent `0AJBkBzDqVEdoUk9PVA` = SA My Drive root, not the configured ACE Shared Drive `0AIUhETtpTlpcUk9PVA`). Class-level preventer added: `assertParentOnSharedDrive` helper does one `files.get` probe for `driveId`; missing = typed actionable error before any write. `parentFolderId` made required in tool schema (was `.optional()` with the footgun copy *"omit to create in root"*). Plus a `/ace:doctor` `drive_shared` canary check, plus orchestrator + skills/README copy clarifying the contract. The 0.5.1 publish-block fix's pattern (catch silent class at MCP boundary) reused exactly. 7 files changed, 230 insertions; tests green, drive write smoke test green. - -2. 
**OCS contract hygiene — `experiment_id` in list/get + `attach_knowledge` pre-flight** — Effort: M — Status: **done, shipped 0.6.1** - - Commit: `7e25ef5` - - Outcome: Two OCS contract bugs surfaced when Phase 4 ran. (a) `ocs_list_chatbots` and `ocs_get_chatbot` returned UUID `id` only, but every authoring atom (`set_chatbot_system_prompt`, `attach_knowledge`, `publish_chatbot_version`, …) requires the integer `experiment_id`. The skill's idempotency contract ("if a bot for this opp exists, reconfigure it instead of cloning") was unachievable in practice — the orchestrator hit this mid-run and had to clone an `-resume` variant, leaving an orphan reachable via the public widget API. Fix: parse `experiment_id` from each result's `url` field (`/chatbots/(\d+)/`), surface alongside `id`. (b) The golden template's `LLMResponseWithPrompt` node silently rejects `attach_knowledge` if the system prompt doesn't include the `{collection_index_summaries}` template variable — same Iter-6 silent-failure class as the 0.5.1 phantom-collection bug. Fix: pre-flight in `attachKnowledge` reads the current prompt and throws a typed `PipelineValidationError` naming the missing token before any patch attempt; detach paths (`collection_index_ids: []`) skip the check. Plus `skills/ocs-agent-setup/SKILL.md` step 2 (idempotency via `experiment_id`) and step 7 (require token in prompt) updated. 13 files, 243 insertions; 105/105 unit tests pass. - -### Backlog - -Prioritized; P1 is the unblocker for the entire next cycle. - -**P1 — `ocs_set_chatbot_system_prompt` partial-save bug (next cycle):** -- The same Iter-6 class has a sibling bug we didn't fix this cycle. Symptom: setting a prompt that contains `{collection_index_summaries}` fails with *"collection_index_summaries variable is specified, but collection_index_summaries is missing"* whenever the LLM node's input-binding map for that variable has been cleared (which happens when you publish a version with `collection_index_ids: []`). 
Hypothesis from the orchestrator: `set_chatbot_system_prompt` does a partial save that doesn't carry the collection-binding state; OCS's pipeline-save runs cross-field validation that requires every `{var}` in the prompt to correspond to a node-input binding, so the partial payload always rejects. Order-of-operations workarounds don't help — chicken-and-egg. **This blocks every Phase 4 re-run.** Fix needs to be either (a) `set_chatbot_system_prompt` reads the existing graph and merges binding state back into the save, or (b) a new transactional `set_pipeline_state` atom that does prompt+collections in one save. Reproducer: the 0.6.1 cleanup attempt on the golden template (see `comms-log/observations.md` § "NEW PLATFORM BUG"). - -**P2 — `ocs_archive_chatbot` atom (Playwright path):** -- Cleanup of orphan bots accumulated three this cycle on team `connect-ace`: `20f7fe39-…` (original turmeric clone), `ce90d4db-…-resume` (exp 11996, the bot that got the 6.5 eval), and exp 12000+12003 (the 0.6.1 re-clones). Both `DELETE /api/experiments//` (405) and `POST .../archive/` (404) failed via REST — the OCS archive flow is admin-UI only. Needs a Playwright atom that scrapes the archive form. ~20 lines. Without it, every test cycle leaves debris reachable via the widget API. - -**P3 — Re-cleanup the golden template after P1 lands:** -- The dogfood cleanup ran `attach_knowledge([])` + `publish` on the golden template (exp 11792) to detach the wrong-domain shared collection 350. That **did** detach but cleared the `{collection_index_summaries}` node-input binding, breaking subsequent prompt patches. Had to revert by re-attaching 350, so 350 is **still attached** to the golden template and clones still inherit cross-domain immunization content. Once P1 makes binding-preservation safe, redo the detach properly and republish. 
- -**P4 — Re-run Phase 4 dogfood after P1 + P3:** -- The hypothesis the 0.6.1 re-run was designed to test ("if RAG was contaminated by 350, source_usage jumps from 1/10 to ≥7/10 after detach") remains untested. Once the prompt-save path works and the golden template is clean, clone a fresh per-opp bot, run `ocs-chatbot-qa --deep` against the existing 28 test prompts, and read the new verdict. The composite from the first run was 6.5; the bar is 7.0+. If source_usage doesn't move, the issue is in the per-opp collection ingestion itself (P5). - -**P5 — Investigate per-opp collection ingestion (deferred, needs diagnosis):** -- The first run's deep transcript surfaced ~15/28 prompts where the bot self-disclosed *"corrupted binary content"* / *"encoded/binary content"* / *"garbled binary data"* — i.e., the bot saw the per-opp collection's chunks but couldn't decode them. Possible root causes: markdown files uploaded with wrong MIME, Drive's auto-Doc-conversion mangling the text, or OCS's chunker mishandling something specific to the way `ocs-agent-setup` step 5 base64-encodes content. Probe locally with a small test collection first; don't burn another deep-eval run guessing. - -**P6 — `state.yaml` and other YAML artifacts written as Google Docs (separate class):** -- Logged in `comms-log/observations.md` from the first run: `drive_create_file` defaults to `mimeType: application/vnd.google-apps.document` regardless of intent, so `state.yaml` round-trips through Doc auto-formatting (re-indented lists, smart quotes, etc.) every time the orchestrator writes it. The structural-integrity risk is real: a bot or operator that hand-edits state.yaml in Drive then has the orchestrator overwrite it could lose changes. Fix: drive_create_file should infer mimeType from filename extension (`.md` → text/markdown, `.yaml` → text/yaml, etc.), or accept an explicit `mimeType` arg. Smaller scope than P5; could be a separate small PR. 
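A minimal sketch of the P6 idea, assuming a standalone helper rather than the real `drive_create_file` code (which this log does not show). `inferMimeType`, the lookup table, and the `.yml` entry are hypothetical; only the `.md` and `.yaml` mappings and the Google Doc default come from the notes above. An explicit `mimeType` argument wins, a known extension comes next, and anything else keeps today's default.

```typescript
// Hypothetical helper for illustration; not the actual MCP server code.
const GOOGLE_DOC_MIME = "application/vnd.google-apps.document";

// A Map avoids accidental hits on Object.prototype keys such as "constructor".
const EXTENSION_MIME = new Map<string, string>([
  ["md", "text/markdown"],
  ["yaml", "text/yaml"],
  ["yml", "text/yaml"], // assumption: treat .yml the same as .yaml
]);

function inferMimeType(filename: string, explicit?: string): string {
  // An explicit mimeType argument always takes precedence.
  if (explicit) return explicit;

  // No extension, a leading dot only, or a trailing dot: keep the current default.
  const dot = filename.lastIndexOf(".");
  if (dot <= 0 || dot === filename.length - 1) return GOOGLE_DOC_MIME;

  const ext = filename.slice(dot + 1).toLowerCase();
  return EXTENSION_MIME.get(ext) ?? GOOGLE_DOC_MIME;
}
```

Falling back to the Google Doc type for unknown extensions keeps existing callers unchanged, while `state.yaml` would stop round-tripping through Doc auto-formatting.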
- -### Closed - -(none from this run — no prior backlog items dispositioned) - -### Skipped on this run (raised but not formally proposed) - -- **`/ace:eval --deep` on `turmeric-dogfood-20260427`** — wanted to run the umbrella aggregator after Phase 4 to establish a verdict-rolled scorecard baseline. Skipped at user request to maintain momentum on shipping the contract fixes; the per-skill verdicts are already on Drive and `opp-eval` can roll them at any time later. Re-run-able cheap. -- **`canopy:pm-scout` on this session** — the natural follow-up. Decided to write the run log directly instead since the lens was custom and the findings were already crystallized by the time we wrapped. Future scouts can use this log as starting context. -- **Phase 5 (`llo-invite` → `llo-onboarding`)** — explicitly out of scope per the dogfood framing (no LLO-facing actions). Halted at the `ocs-chatbot-eval-deep` gate as planned. -- **Investigating Nova plugin integration** — PR #59 landed mid-cycle (`feat(commcare): drive Phase 2 via the Nova plugin (0.6.0)`), so Phase 2 is no longer HITL. Did not exercise it on this run because re-running Phase 2 wasn't the goal. Worth a separate Phase 2 dogfood run once a Nova-aware operator is at the keyboard. - -### Meta-observations - -**What worked well:** - -- **Real run > spec review (again).** Same observation as 2026-04-19. The Drive Shared-Drive footgun, the experiment_id contract drift, the `{collection_index_summaries}` requirement, and the partial-save binding bug all looked correct on paper — every one of them only surfaced when the orchestrator actually ran end-to-end against a real PDD. The spec review for `mcp/google-drive-server.ts` had passed multiple times with `parentFolderId: z.string().optional()` and the misleading *"omit to create in root"* description; only the dogfood found that an SA-backed deploy can never safely fall back to root. 
Rule (already a learning, reinforced): when adding a feature that's load-bearing for live runs, run it against real content before claiming done. -- **Class-level preventers carried forward from 0.5.1 + 0.5.7.** Both shipped fixes used the same shape: catch the silent-failure class at the MCP boundary with one cheap probe, surface a typed actionable error. The Drive `assertParentOnSharedDrive` and OCS `attach_knowledge` pre-flight are direct descendants of the 0.5.1 `validatePipeline`. The pattern is now a durable habit; we hit it instinctively rather than designing it from scratch. -- **Walkthrough-style review-mode operator gating.** Halting at every gate (idea-to-pdd, app-deploy, ocs-chatbot-eval-deep) gave the operator clean checkpoints to redirect. The `app-deploy` gate's "approve with caveats" path was exercised legitimately — the BLOCKERs were known platform gaps the operator chose to push past — and the orchestrator handled it by recording the disposition and continuing. The gate-brief format (artifact path / what to check / auto-surfaced concerns / recommended disposition) was load-bearing; without it the operator would have had to read raw artifacts cold at every halt. -- **Drive as durable state. Cross-session resumability worked.** Each gate-approval continuation was a fresh orchestrator dispatch (no `SendMessage` available in this Claude Code session), and each one re-read `state.yaml` and picked up at the next step. No state was lost across the 5+ dispatches. The "stateless skills, opp state in Drive" design is exactly correct for how operators actually work — and we now have a memory note for the next session not to flail searching for `SendMessage`. - -**What was wasteful:** - -- **Cleanup-without-probe on the golden template.** The 0.6.1 re-run's first move was `attach_knowledge([])` on the golden template to detach 350. That worked, but then publishing stripped the `{collection_index_summaries}` binding and broke every subsequent operation. 
The right sequence would have been: (1) read the pipeline graph first, (2) check whether collection bindings are independent of `collection_index_ids`, (3) only then decide whether `attach_knowledge([])` is safe. Cost: ~30 minutes recovering, plus the golden template now sits at v6 with 350 re-attached (more orphaned versions). Lesson: **probe before destructive cleanup**, especially against schema we don't fully own. -- **Same opp slug re-used across two runs.** The first run's stranded folder (My Drive `1SwMTQWE1C-…`) had to be trashed before the re-run could create a fresh `turmeric-dogfood-20260427/` on the right Shared Drive. Drive doesn't enforce slug uniqueness across drives, so the re-list saw both. The trash was cheap, but in a multi-operator scenario a stale-named folder could cause real confusion. Rule: when a run fails before `state.yaml` lands, the fix-and-retry must clean up the partial-create footprint before retrying with the same slug — `drive_create_folder` should arguably refuse if a folder of that name already exists in the parent (separate footgun). -- **Tight-loop tool searches for `SendMessage`.** Spent multiple turns re-searching for `SendMessage` to resume agents. This is now in memory at `feedback_sendmessage_availability.md`; future sessions read MEMORY.md at startup and shouldn't repeat. - -**What's worth keeping in mind for the next cycle:** - -- The two contract fixes (0.5.18, 0.6.1) closed exactly the bugs surfaced by this run, but the partial-save bug (P1) is the actual blocker for *measuring* the RAG-contamination hypothesis. Until P1 lands, every Phase 4 re-run hits the same wall. **Do P1 first**; everything else in the OCS lane follows from it. -- Three orphan turmeric bots are sitting on team `connect-ace` (UUIDs `20f7fe39-…`, `ce90d4db-…`, plus exps 12000 and 12003). Reachable via widget API, not load-bearing for production but cluttering. 
P2 (archive atom) closes this for future cycles; manual cleanup of these specific orphans is one OCS-UI session if desired. -- The opp-specific Drive folder `ACE/turmeric-dogfood-20260427/` has a real PDD, real test prompts, real verdicts, real gate briefs, and a comms-log. It is ready for a Phase 5 dispatch the moment we choose to put it in front of LLOs. Don't delete it. -- `comms-log/observations.md` in the opp folder is the **per-opp evidence log**; this run log is the **cross-opp strategy log**. The first sources the second. When jumping into the next cycle, read this run log first, then drill into observations.md for specific reproducers. - ---- - -## Addendum — 2026-04-28 evening: validation pass after P1 ships - -Same-day continuation. After 0.6.4 shipped (`ocs_set_chatbot_pipeline` transactional atom — P1 from above), we re-ran Phase 4 turmeric to test the headline question P4 was asking: *did the wrong-domain shared collection 350 explain the 1/10 source-usage score?* - -### Result: hypothesis confirmed - -| dimension | first run (post-0.6.1) | re-run (post-0.6.4) | Δ | -|---|---|---|---| -| composite | 6.5 | **9.1** | **+2.6** | -| correctness | 9 | 9.8 | +0.8 | -| source-usage | **1** | **8.0** | **+7.0** | -| tone | 9 | 9.2 | +0.2 | -| tagging | 7 | 9.5 | +2.5 | - -Bot 12003 was reconfigured via `ocs_set_chatbot_pipeline({prompt: , collection_index_ids: [365]})` — opp-specific prompt + per-opp collection only, **no `OCS_SHARED_COLLECTION_ID=350`**. Across all 28 deep-eval prompts: zero wrong-domain leakage (no CLP persona, no $5/service, no 100-services), 10/11 expected tags applied correctly, 3/3 escalations used the canonical phrase. The orchestrator's prediction at the end of run #1 (*"detach 350 → source_usage jumps from 1 to ≥7"*) was within 1 point of the actual delta. 
- -### P3 conclusion: skip the golden-template detach - -P3 was originally framed as *"once P1 lands, redo the detach properly."* Investigation in this pass showed it's a no-op: any detach attempt against the golden template hits the cross-field invariant (`{collection_index_summaries}` + empty collections is rejected). Architecturally cleaner: the golden template stays on `[350] + variable` as a working baseline; per-opp clones override at clone time via the new transactional atom. **P3 is closed without code changes** — the SKILL doc update in 0.6.4 step 8 is sufficient. - -### Two new platform bugs surfaced (next cycle's #1 and #2) - -**N1 — `{collection_index_summaries}` rejection still fires with non-empty collections in the same save.** The 0.6.4 transactional atom assumed the OCS rejection was about the *intermediate* state (variable + empty collections between two focused atom saves). The validation run proved otherwise: `set_chatbot_pipeline({prompt: , collection_index_ids: [365]})` — both fields set, both non-empty, in one save — **still rejected**. Workaround that worked: drop the variable from the prompt, keep `collection_index_ids: [365]`. RAG still functions because the collection binding is set; the prompt just doesn't auto-inject summary text. **Implication:** P1's framing of the bug ("intermediate-state cross-field violation") was wrong. The actual server-side rule is stricter than our pre-flight models — possibly the variable requires a *previously-published* collection state, not just a same-save state. Needs OCS-side investigation. The 0.6.4 atom is still useful (it bundles two changes into one save, eliminating ordering issues for non-variable cases) but the pre-flight needs revising or removal. - -**N2 — `experiment_id` regression in `ocs_get_chatbot` / `ocs_list_chatbots`.** Orchestrator reports both atoms returned `experiment_id: null` in the validation run despite 0.6.1's fix and 137 unit tests. 
Run only proceeded because the cached `experiment_id 12003` from earlier in the session was reused. **Severity high for fresh-session resume** — without `experiment_id`, every authoring atom is unreachable, and the SKILL's idempotency contract collapses again. Most likely cause: live OCS response shape diverges from the unit-test mock in a way the URL-extraction regex doesn't handle. Probe needed against the live API. Could also be a 0.6.4 side-effect (though the file diff shows no touches to listChatbots/getChatbot). **This is the next cycle's #1** — until it's fixed, the dogfood loop is not actually self-healing. - -**N3 — One factual concern in the eval transcript.** Prompt 21 elaborated a 4-level shininess scale (matte / slight sheen / shiny / very shiny) and a "Module 4 calibration exercise" not in the PDD. Either accurate detail from the `learn-app-summary` in collection 365 (legit) or hallucination. Worth one verification pass before any LLO sees the bot live. Tracked in `comms-log/observations.md`. - -### Backlog state after this addendum - -- **P1** — `set_chatbot_system_prompt` partial-save bug: **superseded by N1.** -- **P2** — `ocs_archive_chatbot` atom: still open. -- **P3** — re-cleanup golden template: **closed without code** (architectural decision: clones override at clone time). -- **P4** — re-run Phase 4 dogfood: **closed, hypothesis confirmed (composite 9.1).** -- **P5** — per-opp collection ingestion ("garbled binary"): **likely closed.** This run's bot has no garbled-binary disclosures across 28 prompts. -- **P6** — `drive_create_file` mimeType inference: still open. -- **N3** — one transcript factual check needed (shininess scale prompt 21). - -### Closed by this addendum - -- The cycle's central question (*does fixing 350 fix source-usage?*) — **yes.** -- P3, P4, and likely P5 from the original backlog. -- The 0.5.18 → 0.6.1 → 0.6.4 arc has measurable artifact-quality validation. The improvement loop is real. 
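For the N2 probe, a small sketch of the 0.6.1-style extraction may help, assuming it amounts to applying the `/chatbots/(\d+)/` pattern to each result's `url` field. `extractExperimentId` and both sample URLs are hypothetical, and the real `listChatbots` / `getChatbot` code may differ. The sketch only shows how a URL without a `/chatbots/<integer>/` segment would produce `null`, which would match the reported symptom.

```typescript
// Pattern as quoted in the 0.6.1 notes: /chatbots/(\d+)/
const CHATBOT_URL_PATTERN = /\/chatbots\/(\d+)\//;

// Hypothetical helper: returns the integer experiment_id, or null when the
// URL has no /chatbots/<digits>/ segment.
function extractExperimentId(url: string): number | null {
  const match = CHATBOT_URL_PATTERN.exec(url);
  return match ? Number(match[1]) : null;
}
```

Running the live response's `url` values through a function like this would quickly confirm or rule out a shape mismatch between the API and the unit-test mocks.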
- ---- - -## Addendum 2 — 2026-04-28 late evening: N1 + N2 fixed in same session - -Both new platform bugs surfaced in the validation pass were diagnosed and shipped in the same session: - -### N2 — `experiment_id` regression (shipped 0.6.9, PR #63) - -Live `/api/experiments/` returns the API URL (`/api/experiments//`), not the human-facing `/a//chatbots//` URL the 0.6.1 regex assumed. Unit-test mocks had the wrong shape so the regression slipped through CI. Fix: composite-level enrichment via the `/a//chatbots/table/` HTMX endpoint — one Playwright scrape per `listChatbots`/`getChatbot`, name-keyed lookup. Validated: 5/5 chatbots on `connect-ace` resolve to integer `experiment_id`. - -### N1 — `{collection_index_summaries}` cross-field rule (shipped 0.6.10, PR #64) - -Characterized via a 6-case live OCS probe (`scripts/probe-n1-cross-test.ts`). The actual rule: - -> `{collection_index_summaries}` required **iff** `collection_index_ids.length >= 2`. - -Bidirectional: single/zero collections must NOT include the variable; multiple collections MUST. The 0.6.4 framing ("variable iff non-empty") was wrong — it accepted invalid `1 + variable` states. Fix: `assertCollectionPromptInvariant` shared helper, used by both `attachKnowledge` and `setChatbotPipeline`. SKILL.md step 7 corrected — single-collection per-opp clones (the canonical case) must NOT include the variable, matching exactly what the orchestrator did during the 9.1/10 validation run. - -### Final backlog state - -- ~~**N1**~~ ✓ closed (0.6.10) -- ~~**N2**~~ ✓ closed (0.6.9) -- **N3** — shininess scale factual check (prompt 21, untested) -- **P2** — `ocs_archive_chatbot` Playwright atom for orphan cleanup -- **P6** — `drive_create_file` mimeType inference - -### Cycle takeaways for the next session - -- The full 0.5.18 → 0.6.1 → 0.6.4 → 0.6.9 → 0.6.10 arc shipped within 36 hours, all driven by a single dogfood run + its validation pass. 
**Real-run > spec-review continues to compound** — every new bug surfaced was invisible before live execution. -- The OCS variable rule is durable knowledge for any future `attach_knowledge` work. Captured in tool descriptions, the helper docstring, the SKILL.md step 7 explainer, and the live-probe truth-table tests. Any future skill or atom that touches the LLM node's params will read those. -- Probe scripts (`scripts/probe-n1-*.ts`, `scripts/probe-experiment-id-recovery.ts`, `scripts/probe-table-anchors.ts`, `scripts/probe-composite-list.ts`) are kept under `scripts/` — they document the investigations and remain executable if OCS ever regresses or changes a contract. -- **Top priority for next cycle:** P2 (`ocs_archive_chatbot`) — the team `connect-ace` is accumulating orphan turmeric bots across iterations and there's no autonomous cleanup path. ~20 lines of Playwright-form scraping. Closes the orphan-debt class entirely. diff --git a/.claude/pm/runs/2026-04-29-eval-rubric-polish-operator-cant-fix.md b/.claude/pm/runs/2026-04-29-eval-rubric-polish-operator-cant-fix.md deleted file mode 100644 index 69e4368b..00000000 --- a/.claude/pm/runs/2026-04-29-eval-rubric-polish-operator-cant-fix.md +++ /dev/null @@ -1,95 +0,0 @@ -## 2026-04-29 — eval-rubric-polish-operator-cant-fix (custom lens) - -**Lens used:** "operator-can't-fix vs operator-can-fix" — every eval rubric must distinguish defects the operator can address (skill output errors, missing required fields, hallucinations) from constraints the operator literally cannot address (platform schema limits, capture-API restrictions, upstream environmental gaps, build-not-yet-produced stubs). When a rubric fails this distinction, it produces noise — penalizing skills for things outside their control — and the noise drowns the signal. 
Picked because the first non-degraded `connect-program-setup-eval` run on `turmeric-market-survey-2026-04-28` and the 0.9.11 cross-opp validation against `turmeric-dogfood-20260427` independently surfaced the same pattern in three different rubrics. Three instances = a class. - -**Background read:** `CLAUDE.md`, `docs/eval-calibration-learnings.md`, `skills/eval-calibration/SKILL.md`, `skills/README.md § QA vs Eval`, `lib/verdict-schema.ts`, `test/lib/verdict-schema.test.ts`, all 8 `*-eval/SKILL.md` files, the prior PM run `2026-04-28-turmeric-dogfood-ocs-contracts.md`, the verdict YAML on Drive at `ACE/turmeric-market-survey-2026-04-28/verdicts/connect-program-setup-eval-*.yaml`, CHANGELOG entries 0.9.0 → 0.10.5. Mid-cycle: `mcp/connect/backends/playwright.ts` and `mcp/connect/backends/html-scrape.ts` for the read-side bug investigation that became 0.10.6. - -**Core finding:** Five releases (0.10.6 → 0.10.10) shipped, all driven by one structural pattern. The same fix shape applies whether the rubric is grading Connect program creation, Nova-built apps, or OCS chatbots: introduce a category that *describes* the constraint instead of *deducting* for it. Concretely: - -- `[PLATFORM]` severity tier (0.10.7) — defects originating in the upstream service, not the skill. -- `[DRIFT]` severity tier (0.10.7) — discrepancies between artifact text and live state; diagnostic-only because the dimension consuming either source already deducts if either is wrong (counting drift again double-penalizes). -- `[INFO-SKIPPED]` severity tier (0.10.7) — sub-checks bypassed for missing input. -- `partial` verdict tier (0.10.7) — artifact correct, live verification unreachable. -- `incomplete` verdict tier (already prose, schema-formalized in 0.10.7) — structural gap prevents grading. -- HITL-stub guard (0.10.8) — early-return `incomplete` when Nova hasn't built the app yet. 
-- Clean-source branch (0.10.9) — switch the dimension's grading function when the input shape doesn't match the dimension's assumptions. -- Capture-method branch (0.10.10) — same pattern in `ocs-chatbot-eval` source-usage when transcripts come from the widget endpoint (which never returns inline citations) vs the OpenAI-compatible endpoint. - -The 0.10.6 fix is a *consequence* of the rubric working: `connect-program-setup-eval` flagged a real bug (Program fields empty after read), which traced to a read-side hydration bug in `getProgram` (not a write-side serialization gap as the verdict described). The eval framework caught a production bug — and the rubric's own diagnostic post-hoc was wrong about the layer, which seeded the 0.10.6 "defect-vs-cause discipline" rule (state observations confidently, phrase causes tentatively). - -The verdict schema bumped 1 → 2 (additive — every v1 verdict still validates as v2). Six new schema tests cover the new tiers; 218/247 tests passing every release. The prose contracts in rubric SKILL.md files had referenced `incomplete` for months before the schema accepted it — purely doc-level drift until any rubric actually emitted that value. 0.10.7 closed that gap explicitly. - -### Do it - -1. **`getProgram` read-path fix + defect-vs-cause discipline (0.10.6)** — Effort: S — Status: **done, shipped 0.10.6** - - Commit: `718595b` - - Outcome: `connect-program-setup-eval` on `turmeric-market-survey-2026-04-28` flagged "Program created with all fields filled but `connect_get_program` returns empty fields." The verdict attributed the cause to a write-side serialization gap. Wrong layer. `getProgram` (mcp/connect/backends/playwright.ts:133) wrapped `listPrograms`, and `parseProgramsList` (html-scrape.ts:54) only extracts `name` + `description` from the list page — every other field is hardcoded to `0`/`''` with the comment *"caller can hydrate via getProgram() if needed."* But getProgram never hydrated. 
Fix mirrors the existing `getOpportunity` pattern: read `/a//program//edit` and use `extractFormFieldValues`. Strengthened the integration test (was asserting only `p.name`; now asserts all 8 hydrated fields). Plus added step-8 "Defect-vs-cause discipline" to the rubric: state observations confidently, phrase causes tentatively, format as `Observed: . Likely cause (unverified): .` LLM-as-Judge rubrics tend to pattern-match defects to the most familiar root-cause label rather than reasoning about layer; this rule constrains the pattern. - -2. **Verdict schema v2 + `connect-program-setup-eval` 5-item polish (0.10.7)** — Effort: M — Status: **done, shipped 0.10.7** - - Commit: `e992109` - - Outcome: First non-degraded grading of the rubric on the turmeric run produced five distinct findings; ship as one batch since they're all in the same rubric file and internally consistent. Schema gained `partial` and `incomplete` as top-level verdict tiers (the prose has referenced `incomplete` for months — schema-only drift); `PLATFORM`, `DRIFT`, `INFO-SKIPPED` as severity tiers; optional `live_state_verified: boolean` and `overall_score_pre_cap: number`. Per-item verdict stays `pass | warn | fail` since item-level entries are by definition graded. SCHEMA_VERSION 1 → 2. The rubric polish: (1) `partial` tier definition for runtime-blocked-but-not-degraded; (2) `[PLATFORM]` use for Connect schema limits the skill can't bypass; (3) `[DRIFT]` for `connect-setup-summary` ↔ live-state discrepancies, with explicit "diagnostic only, never deductive" rule (counting drift double-penalizes); (4) payment threshold-sanity now explicitly conditional — emit `[INFO-SKIPPED]` when no PDD day-rate; (5) `live_state_verified` boolean caps verdict ≤ partial when false. Six new schema tests, prose contract in `skills/README.md` re-synced with code. - -3. 
**HITL-stub branch in app-eval rubrics (0.10.8)** — Effort: S — Status: **done, shipped 0.10.8** - - Commit: `f0476d8` - - Outcome: 0.9.11 cross-opp validation found that `pdd-to-deliver-app-eval` and `pdd-to-learn-app-eval` both mis-graded HITL-pending app summaries on `turmeric-dogfood-20260427` (Nova hadn't finished building yet, summary was a stub). The deliver rubric got 2 of 5 dimensions ungradable; the learn rubric's most load-bearing dimension (assessment_score_wiring at 30%) graded the stub as "wiring entirely missing" → forced ≤3 → fail on a build that wasn't actually a defect. Both rubrics now have a step-2 guard that emits `verdict: incomplete` immediately when `nova_app_id` is missing/null/TBD or the summary is skeleton-only. Mirrors `connect-program-setup-eval`'s degraded-mode pattern: structural gaps in the upstream environment are environmental, not quality defects. - -4. **Clean-source branch in `idea-to-pdd-eval` (0.10.9)** — Effort: S — Status: **done, shipped 0.10.9** - - Commit: `792ba2e` - - Outcome: The reviewer-comment-fidelity dimension (20% weight) assumed every idea.md contains formal `[a]/[b]` reviewer footnotes. Clean PM-authored sources have none; the rubric scored gracefully by treating PDD's Open Questions as analog, but the anchors at 9.5 ("all comments addressed") were measuring a vacuously-true question. Now: step 2 detects `clean_source = true` automatically (no footnotes, no Comments/Feedback section). When true, the dimension switches to grading **deferred-decision discipline** — looks for a section explicitly handling uncertainty (Open Questions / Deferred Decisions / TBD-per-LLO / Phase-1-Discovery) with concrete questions, owner phases, resolution mechanisms. Anchors 9.5 → 4.0. Surfaces `[INFO] clean-source branch active` for auditability. Dimension-semantics fix, not deduction-tuning. - -5. 
**Capture-method branching in `ocs-chatbot-eval` source-usage (0.10.10)** — Effort: S — Status: **done, shipped 0.10.10** - - Commit: `4403515` - - Outcome: The original "empty `cited_files` + body names sources → ≤5 cap" rule was meant to catch a pipeline bug. But `ocs-chatbot-qa` captures exclusively via the anonymous widget endpoint, which doesn't return inline citation markup regardless of bot grounding. The cap fired on every widget transcript — same noise/signal conflation as the Connect schema limit issue. Now: `ocs-chatbot-qa` writes `Capture method: widget | openai-compat` in the transcript header. `ocs-chatbot-eval` source-usage dimension branches on it. Widget captures grade body-text grounding (does the response name source docs by title? does it paraphrase content the KB demonstrably contains?) and emit `[PLATFORM] empty cited_files expected on widget capture` instead of binding the cap. OpenAI-compat captures keep the existing two-tier cap (empty `cited_files` there IS a real grounding gap). - -### Backlog - -P1 and P2 unblock the next calibration cycle; P3–P5 are post-real-run. - -**P1 — Cross-model variance audit on the 4 provisional rubrics:** -- `connect-program-setup-eval`, `cycle-grade-eval`, `llo-launch-eval`, `flw-data-review-eval` are all provisional pending cross-model verification (Sonnet/Opus/Haiku spread ≤ 1.0). The audit can't run usefully against synthetic data — it needs real artifacts. Defer until a non-degraded production run produces the four input artifacts. The 0.10.7 rubric polish on `connect-program-setup-eval` should reduce its variance specifically (PLATFORM/DRIFT entries no longer randomly hit the inflation guard); next audit run is the test of that hypothesis. - -**P2 — Real artifacts for the 4 provisional rubrics:** -- `connect-program-setup-eval` needs a non-degraded Phase 3 with `live_state_verified: true` (i.e., `connect_get_*` MCP calls succeeded). 
Now that 0.10.6 fixed the read-path bug and 0.10.1 fixed the opportunity creation 500, the next opp dispatch should produce one cleanly. Other three need: first launch (`llo-launch-eval`), first weekly review (`flw-data-review-eval`), first closed cycle (`cycle-grade-eval`). All are real-run blockers. Authoring rubrics in their absence repeats the original sin (rubrics that confidently score 8.5 on nothing) — explicitly documented as anti-pattern in `eval-calibration-learnings.md § 1`. - -**P3 — Three minor operate-category rubrics (`llo-invite-eval` / `llo-onboarding-eval` / `llo-uat-eval`):** -- Mentioned as backlog in 0.9.11 + 0.10.6 memory updates. None ship today. Each is straightforward (mirror `flw-data-review-eval` structure: 5 dimensions, recurring shape, dated verdicts) but each needs ≥1 real run to calibrate against. Same blocker as P2: don't author without ground truth. - -**P4 — Operator-effort tracking in `state.yaml`:** -- A meta-eval signal nobody has today: how many gate-iterate cycles per phase, how many minutes operators spend reviewing each gate brief, which skills produce the most "approve with caveats" rationale text. Lets us spot rubrics where the *operator* keeps overriding even when the rubric scores high. Design is the work; small implementation. Defer until at least one real cycle has flowed through cleanly so we know what the field shape should be. - -**P5 — Drift between rubric prose and schema (preventer pattern):** -- 0.10.7 explicitly resynced `lib/verdict-schema.ts` with what 8 rubric SKILL.md files were claiming. The drift was harmless because nothing called `validateVerdict` at runtime on a real verdict — only the test does. Class-level preventer worth considering: hook a CI step (or `/ace:doctor`) that loads each `*-eval/SKILL.md`, regex-extracts every `verdict:` and `severity:` literal in YAML examples, and asserts each one is in the schema enum. Small effort; would prevent the next instance of the same drift. 
Backlog because it's preventer not blocker. - -### Closed - -**P1 from 2026-04-28-turmeric-dogfood-ocs-contracts.md (`set_chatbot_system_prompt` partial-save bug):** Already shipped in 0.6.4 (commit `cf45a59`, "transactional `set_chatbot_pipeline`"). Out-of-band closure during this session's prep read. - -### Skipped on this run (raised but not formally proposed) - -- **`/ace:doctor` post-update sweep** — offered to run it at session-end but user closed before invocation. Pre-existing CHANGELOG suggests doctor checks are mature (0.5.4 / 0.5.9 / 0.5.18 / 0.7.1 all added preventer probes); the load-error noted on `/reload-plugins` may or may not be ours. First action for the next session. -- **Re-grade `turmeric-market-survey-2026-04-28` against the new schema/rubric** — every release this session would change the verdict YAML for that opp. Skipped because re-grading without rerunning the underlying skill would just be testing the rubric against a frozen capture; the more useful test is to wait for the next opp run and let the new branches activate live. The 0.10.6 fix in particular needs a real `connect_get_program` read against a live program to confirm hydration works end-to-end. -- **Bumping `eval-calibration` skill itself with the new patterns** — `skills/eval-calibration/SKILL.md` is the methodology spec. The "operator-can't-fix" pattern is now durable enough to bake in (criteria for adding new severity tiers, how to detect when a dimension's input shape doesn't match its assumptions). Two paragraphs of work; deferred because it's better to wait for one or two more uses of the pattern before claiming it generalizes. - -### Meta-observations - -**What worked well:** - -- **Five releases of size-S each beat one release of size-M.** Resisted the temptation to bundle 0.10.7 + 0.10.8 + 0.10.9 + 0.10.10 into a single "rubric polish pass." 
Per-release CHANGELOG entries plus per-rubric Change Log table rows preserve the audit trail at the granularity calibration actually needs (per `eval-calibration-learnings.md § Score trajectory across iterations is the audit trail`). Each release answers exactly one question; future cross-model audits can attribute variance to specific rubric edits without bisection. - -- **Schema bump was load-bearing despite zero runtime impact.** `validateVerdict` only runs from tests today. But making the schema match the prose means the *next* time someone hooks runtime validation (CI, `/ace:doctor`, or a future `opp-eval` aggregator), every existing rubric still validates. Also gave the v2 changes a clean sentinel — `SCHEMA_VERSION === 2` is the marker for "PLATFORM/DRIFT/INFO-SKIPPED severities and partial/incomplete verdicts are formal, not aspirational." - -- **The "operator-can't-fix" lens generalized fast.** Started as one polish item in `connect-program-setup-eval` (`[PLATFORM]` for Connect schema limits). Within the same session it absorbed three other findings (HITL-stub, clean-source, capture-method) that on inspection were all the same fix shape. Worth promoting to first-class rubric design rule, not just a fix recipe — added to `project_eval_framework_state.md` memory entry as the durable framing. - -- **Verifying that the bug-the-rubric-caught was real.** Bug #2 from the prompt (Opportunity 500) was already fixed in 0.10.1 — confirmed before touching anything (commit `48e2380` was driven by the same turmeric run that the eval flagged). Bug #1 (Program "serialization") was unfixed and traced to a read-side hydration bug in 30 minutes once I read `getProgram` and `parseProgramsList`. The eval-framework-as-bug-finder claim in the docs is now grounded in two production bugs caught and fixed, not just one. 
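
As a rough illustration of the schema v2 shape described in item 2 and the "Schema bump was load-bearing" note above, here is a minimal sketch. It is not the contents of `lib/verdict-schema.ts`: the names `VerdictDoc`, `applyLiveStateCap`, and `isKnownVerdict` are hypothetical, and reading "caps verdict ≤ partial" as "a `pass` is downgraded to `partial`, everything else is left alone" is an assumption.

```typescript
// Hypothetical sketch of the v2 verdict shape; the real schema may differ.
const SCHEMA_VERSION = 2;

// Top-level verdict tiers: v1 had pass/warn/fail, v2 adds partial and incomplete.
const VERDICTS = ["pass", "warn", "fail", "partial", "incomplete"] as const;
type Verdict = (typeof VERDICTS)[number];

// Severity tiers added in v2. Pre-existing tiers are left out on purpose,
// because the log does not enumerate them.
const V2_SEVERITIES = ["PLATFORM", "DRIFT", "INFO-SKIPPED"] as const;

interface VerdictDoc {
  schema_version: number;
  verdict: Verdict;
  live_state_verified?: boolean; // optional field added in v2
  overall_score_pre_cap?: number; // optional field added in v2
}

// Assumption: only a `pass` is affected by the cap when live state
// could not be verified; other verdicts are returned unchanged.
function applyLiveStateCap(doc: VerdictDoc): VerdictDoc {
  if (doc.live_state_verified === false && doc.verdict === "pass") {
    return { ...doc, verdict: "partial" };
  }
  return doc;
}

// Type guard for checking a raw string against the verdict enum.
function isKnownVerdict(value: string): value is Verdict {
  return (VERDICTS as readonly string[]).indexOf(value) !== -1;
}
```

Keeping the new fields optional is what makes the bump additive in this sketch: a v1-shaped document without `live_state_verified` still type-checks and passes through `applyLiveStateCap` untouched.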
- -**What to do differently next time:** - -- **Run `/reload-plugins` mid-session, not at end.** The 1-error-on-reload at session close means the new code path may already have a regression; we shipped 5 releases without exercising the loaded plugin once. A mid-session reload after 0.10.7 (when the schema bump landed) would have surfaced any plugin-load issue immediately. This generalizes: **after any schema or skill-prose change that the harness re-parses, reload before continuing.** Same class as "real run > spec review." - -- **Don't guess root cause from a verdict YAML alone.** I came close to reproducing the rubric's own bias on bug #1 — almost wrote the fix as a write-side change before grepping for `getProgram`. The defect-vs-cause discipline rule (0.10.6) is the durable countermeasure for the rubric, but the same rule applies to **operators reading verdicts**: read the code, not the verdict's diagnosis. - -- **Watch for stale prompt anchoring.** The session prompt was written at 0.9.11 and still framed everything as "0.9.12 backlog." Real version was 0.10.5. Caught the drift after one round of confused output but should be a default check at session start: "what version is `main` actually at, vs what the prompt claims?" - -**Pattern emerging across sessions:** - -- This session's "operator-can't-fix" lens, the 2026-04-28 session's "class-level preventers > instance-level fixes," and the 2026-04-19 session's "real run > spec review" are all variations of the same root principle: **rubric/contract design must distinguish noise from signal at the boundary, not at the consumer.** The `[PLATFORM]` severity tier IS a class-level preventer for false deductions; the schema enum extension IS a contract-level distinction; the HITL-stub guard IS a real-run-detected gap. Three sessions in, this looks like the dominant ACE design rule. Worth surfacing more prominently in `CLAUDE.md § Conventions` once one or two more uses confirm it generalizes. 
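
The P5 preventer from the Backlog above (extract every `verdict:` and `severity:` literal from rubric prose and check it against the schema enum) could look roughly like the sketch below. The function name `findUnknownLiterals` and the shape of the `allowed` map are hypothetical, and the allow-lists are passed in rather than hard-coded because this log does not list the full enums. The regex only handles single-literal lines; lines such as `verdict: pass | warn | fail` are deliberately skipped.

```typescript
// Hypothetical sketch of a prose-vs-schema drift check.
// `allowed` maps each key ("verdict", "severity") to its accepted literals.
function findUnknownLiterals(
  markdown: string,
  allowed: Record<string, readonly string[]>
): string[] {
  // One `key: value` pair per line, optionally quoted, nothing else on the line.
  const pattern = /^[ \t]*(verdict|severity):[ \t]*["']?([A-Za-z][A-Za-z-]*)["']?[ \t]*$/gm;
  const unknown: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(markdown)) !== null) {
    const key = match[1];
    const value = match[2];
    const accepted = allowed[key] ?? [];
    if (accepted.indexOf(value) === -1) {
      unknown.push(`${key}: ${value}`);
    }
  }
  return unknown;
}
```

A CI step or doctor probe would read each `*-eval/SKILL.md`, call this with the enums exported by the schema module, and fail when the returned list is non-empty. How the files are discovered and how the enums are exported are left open here.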
diff --git a/VERSION b/VERSION index 7e18a715..286d83dd 100644 --- a/VERSION +++ b/VERSION @@ -1 +1 @@ -0.13.332 +0.13.333 diff --git a/docs/learnings/2026-04-pm-runs-compacted.md b/docs/learnings/2026-04-pm-runs-compacted.md new file mode 100644 index 00000000..4cfcc774 --- /dev/null +++ b/docs/learnings/2026-04-pm-runs-compacted.md @@ -0,0 +1,64 @@ +# Learning: April 2026 PM cycles — compacted + +**Date**: 2026-05-22 (compaction) +**Context**: 10 PM scout cycle logs from April 8 → April 29, 2026, distilled into the durable signals that shaped subsequent platform behavior. Originals deleted in the same change; git history preserves them under `.claude/pm/runs/2026-04-*.md`. +**Status**: Resolved — every cycle's findings either shipped or were re-surfaced in a later durable learning. + +## Why compact + +The early-April logs predate the 0.13.x rewrites. Their lens-by-lens cadence narrates *how* the platform reached its current shape. None of the per-cycle Do-It / Backlog / Skipped sections are read by any agent or skill today; the parts worth keeping are the **cross-session patterns** and the **decisions that hardened into conventions**. + +Per `CLAUDE.md § Improvement cycles & canopy` the convention is to copy the structure of the most recent run when writing a new log. That convention is intact; only the historical pile is being cleared. + +## Cross-cycle patterns (what these 10 sessions taught the project) + +### 1. Archetypes are first-class (Apr 8 → cycle-defining) + +The Apr 8 focus-group-framework cycle found that every skill was hard-coded to one delivery archetype: "one FLW visit = one photo + GPS + form." The variation-points-per-skill approach (not fork-the-framework) became the canonical fix. This is the seed of CLAUDE.md's *Archetypes are first-class* Convention. Subsequent FGD work (PR series May 2026) validated the model end-to-end. + +### 2. 
Real run > spec review (recurring across Apr 19, Apr 28, throughout) + +Every cycle that exercised the live skill chain against real content surfaced bugs invisible to spec review. The Apr 28 turmeric-dogfood cycle made it explicit: "real run > spec review" — same observation logged in the Apr 19 cycle. **This is the load-bearing reason `/ace:qa-deep`, the per-skill `-eval` chain, and live MCP integration tests exist.** Dogfooding against real PDDs surfaced the OCS `{collection_index_summaries}` cross-field rule, the `experiment_id` regression class, the partial-save bug, the wrong-team collection class, and the env-drift class — none of which were predictable from the spec. + +### 3. Class-level preventers > instance-level fixes (Apr 19 → Apr 20) + +Each cycle that landed a fix that "caught only the case in front of us" produced another instance of the same class next cycle. The Apr 20 collection-clone-and-mcp-preflight cycle ended by adding MCP-layer defenses against the silent-block class (not just the one collection). The Apr 20 env-drift cycle added a doctor probe for `.env.tpl` drift (not just the one missing key). This pattern is now the *Class-level preventers* Convention in CLAUDE.md. + +### 4. Doctor probes are how invariants survive (Apr 20) + +The Apr 20 morning env-drift cycle proved the failure mode: `.env.tpl` adds keys, installed `.env` doesn't auto-update, doctor reported COMPLETE on 3-of-16 keys. The fix shape that stuck: doctor probes a live HTTP call per MCP, names the exact remediation per failure. The 0.7.1 `ocs_shared_collection_team` probe (50ms HTTP request that turns "configured" into "configured correctly") was the canonical follow-on. + +### 5. Operator-can-fix vs operator-can't-fix (Apr 29 → eval architecture) + +Three rubrics (connect-program-setup-eval, app-summary-eval, ocs-chatbot-eval) independently surfaced the same noise pattern: penalizing skills for upstream platform constraints the operator can't address. 
The fix shape that landed in 0.10.6 → 0.10.10: introduce a category that **describes** the constraint instead of **deducting** for it. This is now the structural shape of every `-eval` rubric. See `docs/eval-calibration-learnings.md` for the full methodology. + +### 6. Stale metadata is more dangerous than missing metadata (Apr 20) + +The Apr 20 collection-clone cycle was anchored on the wrong premise (collection 718 didn't exist on connect-ace) because `~/.ace/connect-ocs-bot.json` was stale by 11 days. Reading that file as ground truth burned a half-day chasing a non-existent team-infrastructure problem. This learning hardened into the CLAUDE.md Gotcha: **"Drive metadata files (`~/.ace/*.json`) are hypotheses, not truths. Stale snapshots have anchored multi-day investigations down wrong paths. Re-probe live state before acting on metadata older than ~7 days."** + +## Findings that shipped (high-traffic items) + +| Source cycle | Finding | Where it landed | +|---|---|---| +| Apr 8 | Archetype as PDD field with skill-level branches | CLAUDE.md Convention + `Archetype:` PDD frontmatter | +| Apr 15 | `/ace:setup`, `/ace:doctor`, `/ace:update` first-run polish | Stable since 0.3.x | +| Apr 16 | State.yaml lifecycle / fixture drift catches | `lib/artifact-manifest.ts` + `test/fixtures/CRISPR-Test-*` | +| Apr 17 | Per-opp ownership in state schema | `opp.yaml` / `run_state.yaml` split | +| Apr 19 | qa+eval two-axis pattern + opp-eval umbrella | CLAUDE.md Convention + per-skill `-eval` siblings | +| Apr 20 (am) | Cross-team collection scoping | `ocs_shared_collection_team` doctor probe | +| Apr 20 (pm-1) | `.env.tpl` drift detection | doctor `[Auth liveness]` block | +| Apr 20 (pm-2) | Dead `.env.tpl` keys | grep-based unused-key audit, removed 4 dead vars | +| Apr 28 | OCS `{collection_index_summaries}` cross-field rule | `assertCollectionPromptInvariant` + `scripts/probe-n1-cross-test.ts` | +| Apr 29 | Operator-can-fix vs constraint categories | Every `-eval` rubric 
since 0.10.6 | + +## What did NOT carry forward (dropped during compaction) + +The early cycles each carried a P1-P7 Backlog block. Most rolled forward across cycles and either shipped (see table above) or got recharacterized in later work. A few died on the vine and aren't worth restoring: + +- **"fgd-synthesis" skill** (recurring P6/P7 backlog) — superseded by the May 2026 FGD archetype refactor where the OCS chatbot became the primary facilitator surface, no separate synthesis skill needed (see `project_fgd_archetype_complete` memory). +- **Per-cycle "Self-improvement (canopy-skills meta-PRs)" sections** — these were prompts/cadence tweaks for the canopy PM-scout skill itself, not ACE. Lived their useful life in the cycle they shipped. +- **Per-cycle "Confidence on validation"** — meta-prompt-engineering notes that no longer apply. + +## How to apply + +If you find yourself wondering "why is X the way it is?" about a CLAUDE.md Convention or Gotcha, the original cycle log probably explains the forcing function. Git-blame the Convention or Gotcha and trace back to the PR; from there, the contemporary PM run log (now in git history) gives the full forensic narrative. The convention itself is the durable artifact — this learning is the index into the archaeology. 
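
The staleness rule from pattern 6 above (re-probe live state before acting on metadata older than roughly 7 days) can be sketched as a small age check. The helper names `metadataAgeDays` and `shouldReprobe` are hypothetical, and how the snapshot timestamp is obtained (file mtime, or a field inside the JSON) is an assumption left to the caller.

```typescript
// Hypothetical sketch of the "metadata is a hypothesis" staleness check.
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Age of a metadata snapshot in (fractional) days.
function metadataAgeDays(capturedAt: Date, now: Date): number {
  return (now.getTime() - capturedAt.getTime()) / MS_PER_DAY;
}

// True when the snapshot is older than the threshold and live state
// should be re-probed before the metadata is trusted.
function shouldReprobe(capturedAt: Date, now: Date, maxAgeDays: number = 7): boolean {
  return metadataAgeDays(capturedAt, now) > maxAgeDays;
}
```

With the default threshold, a snapshot that is 11 days old (the age of the stale file in the Apr 20 cycle) would trigger a re-probe, while a 2-day-old one would not.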
diff --git a/package.json b/package.json index fc3072be..dab28fae 100644 --- a/package.json +++ b/package.json @@ -1,6 +1,6 @@ { "name": "ace", - "version": "0.13.332", + "version": "0.13.333", "description": "AI Connect Engine - orchestrator for building Connect Opps using AI", "type": "module", "scripts": { diff --git a/skills/app-deploy/SKILL.md b/skills/app-deploy/SKILL.md index b16621e2..f982d0e7 100644 --- a/skills/app-deploy/SKILL.md +++ b/skills/app-deploy/SKILL.md @@ -201,7 +201,6 @@ When `--dry-run` is active: | Date | Change | Author | |------|--------|--------| -| 2026-04-03 | Initial version | ACE team | | 2026-04-17 | Emit gate brief at `ACE//runs//3-commcare/app-deploy_gate-brief.md` covering build status, Connectify flags, and workaround-path warnings for the Phase 3→4 gate | ACE team (PM scout, internal-admin lens) | | 2026-04-27 | Switch from manual HQ-UI upload to `/nova:upload_to_hq` via the Nova plugin. Inputs are now `nova_app_id` values read from the app summaries. New pre-flight check compares Nova's bound HQ project space against `ACE_HQ_DOMAIN`. Gate brief drops the workaround-path WARN and adds a domain-mismatch BLOCKER. | ACE team | | 2026-04-29 | Carve out app release into the new `app-release` skill (Step 2.5 of Phase 3). This skill now ends at "draft uploaded" — release is a separate, permission-sensitive step. Reason: Connect's `Sync Deliver Units` only enumerates units from released builds, so unreleased apps silently break Phase 4's payment-unit config. (0.10.1) | ACE team | diff --git a/skills/app-release/SKILL.md b/skills/app-release/SKILL.md index 76560feb..7b38ab7d 100644 --- a/skills/app-release/SKILL.md +++ b/skills/app-release/SKILL.md @@ -397,8 +397,6 @@ When `--dry-run` is active: | Date | Change | Author | |------|--------|--------| -| 2026-04-29 | Initial version. 
Carved out as a separate Phase 3 step (between `app-deploy` and `connect-opp-setup`) after the turmeric-market-survey-2026-04-28 dogfood made it clear that "Nova upload" and "released and discoverable by Connect" are different states. (0.10.1) | ACE team | -| 2026-04-29 | Correct the prerequisite section: ace@dimagi-ai.com IS Admin on connect-ace-prod (verified live). The UI's "Sorry, you don't have permission" banner is a Knockout fallback for any `buildState() == 'error'`, not a literal permission verdict. Replace the bad pre-flight with an empirical probe procedure for endpoint discovery — CCHQ's `Make New Version` and `Make Released` URL patterns aren't stable public APIs and need to be re-discovered when the UI changes. (0.10.3) | ACE team | | 2026-04-29 | Discovered + verified the actual endpoints on `/apps/view//releases/`: `POST /apps/save//` (empty body) returns the new build with `_id`; `POST /apps/view//releases/release//` with `ajax=true&is_released=true` flips the release flag. Tested live against `0c96435881b0...` (deliver) and `76fd5f0e2834...` (learn) on connect-ace-prod — both successfully released. Also documented the Connect-side sync endpoint: `POST /a//opportunity//sync_deliver_units/`. (0.10.4) | ACE team | | 2026-04-29 | Add Connect-coverage pre-flight (Step 3) and CCZ verification (Step 6) — checks Nova blueprints have `connect.deliver_unit` / `learn_module` / `assessment` set on every form, then verifies the released CCZ has `` / `` markers. Document two upstream Nova bugs that cause silent failures: (a) autobuild often skips Connect markers entirely; (b) `update_form deliver_unit` runtime auto-fills empty `entity_id`/`entity_name` that serialize as invalid XPath, breaking the build. Both need Nova upstream fixes; the skill surfaces clear pointers when either is detected. Learn-app pipeline currently works end-to-end; Deliver-app pipeline blocks on bug (b). 
(0.10.5) | ACE team | | 2026-04-29 | Move Connect-marker verify+fix into a dedicated Phase 3 Step 1.5 skill (`app-connect-coverage`) that runs after Nova builds and before deploy. This skill's pre-flight now just consumes that skill's `clean | blocked` verdict instead of duplicating the logic. Step 6 CCZ verification stays here as the post-release sanity check. (0.10.7) | ACE team | diff --git a/skills/app-screenshot-capture/SKILL.md b/skills/app-screenshot-capture/SKILL.md index a167e9d8..1a304143 100644 --- a/skills/app-screenshot-capture/SKILL.md +++ b/skills/app-screenshot-capture/SKILL.md @@ -589,9 +589,6 @@ Notes: | Date | Change | Author | |---|---|---| -| 2026-04-28 | Initial version (mobile-emulation work) | ACE team | -| 2026-04-30 | Refactored as Phase 6 Step 2 — now consumes the `qa-plan` skill's manifest as its source of truth for what to capture, instead of generating recipes itself. Captures only **per-opp** content; common Connect navigation screenshots are sourced from `ACE/_common/connect-screenshots//` produced by the standalone `connect-baseline-screenshots` skill. Switched PNG upload from text-encoded `drive_create_file` to `drive_upload_binary` (0.10.43) so screenshots upload as native PNGs. (0.10.44) | ACE team | -| 2026-05-04 | Phase 6 executor pivot — drops `qa-plan` synthesis. Now reads `expected-journeys.md` (Phase 1) and `app-test-cases.yaml` (Phase 3) as inputs, runs only the two `is_smoke: true` recipes (one per app), and adds a thin per-app UX smoke judge (~2 LLM calls). Writes a new shallow verdict at `verdicts/app-screenshot-capture-shallow.yaml`. Deep, per-journey UX grading moves to `app-ux-eval` running from `/ace:qa-deep`. Spec: docs/superpowers/specs/2026-05-04-shallow-deep-qa-split-design.md | ACE team | | 2026-05-05 | **Path-scheme migration.** Inputs repointed to `2-scenarios/pdd-to-app-journeys.md`, `3-commcare/app-test-cases.yaml`, `3-commcare/app-deploy_summary.md`, `3-commcare/recipes/`. 
Outputs repointed to `6-qa-and-training/screenshots//.png`, `6-qa-and-training/app-screenshot-capture_manifest.yaml`, `6-qa-and-training/app-screenshot-capture_verdict.yaml`, `6-qa-and-training/app-screenshot-capture_verdict-shallow.yaml` (per manifest). Both verdict YAML examples' `capture_path` updated. No behavior change beyond paths. | ACE team | | 2026-05-06 | **Step 2 input-completeness pre-flight** — restructured the post-Step-1 logic into an explicit failure-mode table that distinguishes upstream Phase 3 incomplete output (master yaml without recipes) from smoke-flag malformation. Each failure halts with a named PLATFORM auto_surfaced message + the exact `/ace:step` remediation command, and writes `verdict: incomplete` (not `fail` — upstream gaps aren't smoke failures). Surfaced by leep-paint-collection run 20260506-1440 where a Phase 3 dispatch paraphrased the `app-test-cases` SKILL contract and elided the per-journey recipe outputs; `app-screenshot-capture` halted correctly but the operator-facing message conflated the failure mode with general "missing input" diagnostics. See jjackson/ace#106 finding #3 + #16. | ACE team | | 2026-05-07 | **Step 5 anyone-with-link via `drive_upload_binary({shareAnyoneWithLink: true})`** — replaces the previous unfulfillable contract (the SKILL named `drive.permissions.create` but no MCP atom implemented it). The new flag sets `role: reader, type: anyone` atomically at upload time, eliminating the "deck builds without errors but slides are empty" failure mode. Standalone `drive_set_anyone_with_link({fileId})` atom also added for retroactive sharing. See jjackson/ace#115 finding #3. 
| ACE team | diff --git a/skills/connect-opp-setup/SKILL.md b/skills/connect-opp-setup/SKILL.md index f1a519b1..0f36b240 100644 --- a/skills/connect-opp-setup/SKILL.md +++ b/skills/connect-opp-setup/SKILL.md @@ -661,13 +661,6 @@ Each row this skill writes uses `phase: 4-connect` and | Date | Change | Author | |------|--------|--------| -| 2026-04-03 | Initial version | ACE team | -| 2026-04-08 | Add `## Archetypes` section: focus-group delivery unit = session (not participant), audio + attendance + per-domain summary verification, requires "Experiment" delivery type | ACE team (PM scout, focus-group framework lens) | -| 2026-04-08 | Add explicit step 2 to read PDD `## Evidence Model`; Layer A → verification rules, Layer B/C → soft flags; error if Evidence Model missing | ACE team (PM scout, focus-group framework lens) | -| 2026-04-28 | Replace HITL workaround with `connect_*_opportunity` + `connect_set_verification_flags` + `connect_create_payment_unit` atoms (ace-connect 0.8.1). | ACE team | -| 2026-04-28 | Add Step 8: invite ACE test user (`${ACE_E2E_PHONE}`) and persist invite URL to `connect-state.yaml`; required for Phase 6 `app-screenshot-capture` to drive the claim-opp flow | ACE team (mobile-emulation) | -| 2026-04-30 | Adopt commcare-connect PR #1135's automation API (0.10.47). `connect_create_opportunity` is now `POST /api/programs//opportunities/`, takes structured `learn_app`/`deliver_app` payloads + dates + total_budget upfront. Eliminates the two-step "create → finalize" flow and the silent-500 schema bugs around `hq_server` resolution + `api_key` registration + `learn_app`/`deliver_app` JSON wrapping (the server now does all of it). `register_hq_api_key` and `finalize_opportunity` atoms removed. `connect_create_payment_units` (plural) added for atomic-batch creation. FLW pre-invite now requires opp to be active first — coordinate with `llo-launch`. 
| ACE team | -| 2026-05-04 | **Verify-after-create discipline** added to Step 4 (opportunity) and Step 6 (payment units) — every external write is now followed by an immediate read-back, with `[BLOCKER]` halt on field misalignment. Catches the class of bug `turmeric-20260503-0835` hit: PU created with shifted values (`amount=500` vs sent `1.50`, `max_total=20` vs sent `500`, `required_deliver_units=[]` vs sent `[Vendor Visit]`), which cascaded through `is_setup_complete` to break Phase 8 invites and Phase 6 screenshot capture. Catching at the source converts a multi-phase cascade into a single-skill halt. Also: `short_description` cap doc fix (≤50 chars server-enforced, was wrongly documented as ≤255); `amount` integer-rounding behavior pinned (recommended: round + INFO-log, never silent truncate); empty `required_deliver_units` flagged as a downstream cascade trigger. See `agents/ace-orchestrator.md § External Mutations — Verify After Create` for the cross-skill rule. | ACE team (0.11.11) | | 2026-05-08 | Add `## Decisions Log` section: 3 anchor rows (verification-flags, payment-unit-shape, opportunity-end-date) + bar-criterion reference. Pairs with decisions-log PR #4 (Phase 3-10 writes). | ACE team (decisions-log PR #4) | | 2026-05-10 | Move opp activation + ACE test-user invite from Phase 9 into Phase 4 (new Step 6.5 + rewritten Step 7). Closes the chicken-and-egg gap where Phase 6 `app-screenshot-capture` produced placeholder screenshots because the test user wasn't on the new opp yet — the opp couldn't be activated until Phase 9, but the test user couldn't be invited until activation. Phase 9 `llo-launch` now hits its idempotent skip-if-active path on every ACE-driven run; it still sends the real-LLO invite to the awarded LLO. 
Also: tighten Step 4 `is_test` from "defaults true server-side" to "set explicitly to true" — ACE is in dogfood mode and every opp it creates must be test-flagged so prod analytics, payment exports, and partner dashboards exclude these runs. | ACE team | | 2026-05-10 | State consolidation PR a: retire `connect-state.yaml`; emit a single `run_state.yaml.phases.connect-setup.products.connect` block at end of Step 10. Step 7 holds invite metadata in memory rather than writing immediately. (Initial implementation dual-wrote to `opp.yaml.connect`; corrected on 2026-05-11 — runs are now independent. `opp.yaml.connect.program` is durable cross-run state written by `connect-program-setup`; `opp.yaml.connect.opportunity` / `ace_test_user` are no longer written here.) See `docs/superpowers/specs/2026-05-10-state-consolidation.md`. | ACE team | diff --git a/skills/connect-program-setup/SKILL.md b/skills/connect-program-setup/SKILL.md index 8fbc0ce8..0142c171 100644 --- a/skills/connect-program-setup/SKILL.md +++ b/skills/connect-program-setup/SKILL.md @@ -128,6 +128,5 @@ downstream coherence: | Date | Change | Author | |------|--------|--------| -| 2026-04-03 | Initial version | ACE team | | 2026-04-28 | Replace HITL workaround with `connect_*_program` atoms (ace-connect 0.8.1) | ACE team | | 2026-04-30 | Switch `connect_create_program` to `POST /api/programs/` (commcare-connect PR #1135). `delivery_type` now accepts the slug; `country` is the human country name. 
(0.10.47) | ACE team |
diff --git a/skills/cycle-grade/SKILL.md b/skills/cycle-grade/SKILL.md
index 84cc9e7b..5a4d4b9a 100644
--- a/skills/cycle-grade/SKILL.md
+++ b/skills/cycle-grade/SKILL.md
@@ -97,6 +97,5 @@ When `--dry-run` is active:
 
 | Date | Change | Author |
 |------|--------|--------|
-| 2026-04-03 | Initial version | ACE team |
 | 2026-04-08 | Add `## Archetypes` section: focus-group grading uses facilitation-quality and research-yield rubrics for FLW Performance / Intervention Effectiveness, plus a 7th Research Quality dimension; multi-stage grades stage-gate transitions | ACE team (PM scout, focus-group framework lens) |
 | 2026-04-08 | Read PDD `## Evidence Model` in step 1; Layer A drives FLW Performance evidence, Layer B/C drive Intervention Effectiveness / Research Quality evidence | ACE team (PM scout, focus-group framework lens) |
diff --git a/skills/email-communicator/SKILL.md b/skills/email-communicator/SKILL.md
index cc81f4d4..d349a968 100644
--- a/skills/email-communicator/SKILL.md
+++ b/skills/email-communicator/SKILL.md
@@ -70,9 +70,3 @@ None — this skill uses the GOG CLI via shell commands, not MCP tools.
 When `--dry-run` is active:
 - **Send/reply:** Print the full email (to, cc, subject, body) to stdout but do not send. Return a synthetic message ID for logging.
 - **Search/read:** Execute normally (read-only operations are safe in dry-run).
-
-## Change Log
-
-| Date | Change | Author |
-|------|--------|--------|
-| 2026-04-10 | Initial version | ACE team |
diff --git a/skills/flw-data-review/SKILL.md b/skills/flw-data-review/SKILL.md
index 52124c23..1f8ecc54 100644
--- a/skills/flw-data-review/SKILL.md
+++ b/skills/flw-data-review/SKILL.md
@@ -113,6 +113,5 @@ When `--dry-run` is active:
 
 | Date | Change | Author |
 |------|--------|--------|
-| 2026-04-03 | Initial version | ACE team |
 | 2026-04-08 | Add `## Archetypes` section: focus-group review = qualitative synthesis (per-session quality, cross-session themes, saturation, quote bank), no quantitative outlier checks | ACE team (PM scout, focus-group framework lens) |
 | 2026-04-08 | Read PDD `## Evidence Model` in step 1; Layer B drives per-delivery evaluation, Layer C drives cross-delivery synthesis | ACE team (PM scout, focus-group framework lens) |
diff --git a/skills/idea-to-pdd-eval/SKILL.md b/skills/idea-to-pdd-eval/SKILL.md
index 823a626f..56f041ab 100644
--- a/skills/idea-to-pdd-eval/SKILL.md
+++ b/skills/idea-to-pdd-eval/SKILL.md
@@ -198,9 +198,6 @@ See `skills/_eval-template.md § Dry-Run Behavior (stock)`.
 
 | Date | Change | Author |
 |------|--------|--------|
-| 2026-04-28 | Initial version. 5 dimensions: stress_test_agreement (0.25), reviewer_comment_fidelity (0.20), structural_completeness (0.15), archetype_coherence (0.20), concreteness (0.20). Inflation guard at 7.5 when self-eval is 5/5 but this rubric is ≤7.5. Companion to `pdd-to-deliver-app-eval`; covers the design category for `opp-eval` aggregation. | ACE team (eval system buildout — 0.9.2) |
-| 2026-04-29 | Clean-source branch added to reviewer_comment_fidelity dimension.
When idea.md has zero reviewer comments (set `clean_source = true` in step 2), the dimension switches from comment-disposition grading to deferred-decision-discipline grading: looks for an explicit Open Questions / Deferred Decisions / TBD-per-LLO section with concrete questions, owner phases, and resolution mechanisms. New anchors (9.5 → 4.0). Surfaces `[INFO] clean-source branch active` in `auto_surfaced` for auditability. Surfaced 0.9.11 cross-opp validation: `turmeric-dogfood-20260427`'s clean PM-authored idea.md scored gracefully at 9.78 by treating PDD's Open Questions as analog, but the original 9.5 anchors were a poor fit (the dimension was effectively measuring something different from what the rubric claimed). | ACE team (0.10.9) | -| 2026-05-08 | **Rubric expansion: 7 → 11 dimensions, viability axis added (40% weight).** Surfaced by canopy's holistic_adversarial probe on turmeric run 20260507-1733: rubric scored 8.65/10 on a PDD an adversarial PM-style read scored 3/10 viability (3-to-1 against on the $10K bet). 5.65-point gap = rubric was grading document quality almost exclusively. Added 4 viability dimensions: `demand_reality` (15%, named downstream consumer with pre-committed action — biggest single gap), `resource_realism` (10%, budget vs labor at recruitment-realistic rates), `mission_alignment` (5%, do Primary metrics measure the goal or a process proxy), `fallback_validates_primary` (5%, is the named fallback a real validation harness or a parallel sampling system). Reweighted: stress_test_agreement 25→10%, reviewer_comment_fidelity 20→10%, structural_completeness 15→10%, archetype_coherence 15→10%, numbers_present 10→5%; numbers_consistent + feasibility_headline_metrics held. Pairs with canopy PR #38 (lens-types/judge.md adds rubric_blind_spot signal that drove this expansion). 
| ACE team (0.13.81) | | 2026-05-08 | **Rubric cleanup: 11 → 10 dimensions; weight-sum bug fix; viability rebalanced to 50%.** Three fixes in one edit: (1) Removed `stress_test_agreement` (10%) — it was structurally tautological (same model applies same rubric twice; cross-model probe confirmed it doesn't discriminate, scoring 8-10 on every grade with variance from rubric ambiguity not from real artifact differences). (2) Folded `numbers_present` (5%) into `numbers_consistent` (10%) since they cover the same axis and `numbers_present` was already a soft check most PDDs trivially pass. (3) Fixed the 0.13.81 weight-sum bug: weights summed to 0.95 not 1.0. New weights cleanly total 1.00 with viability at 50%: `demand_reality` 15→20%, `resource_realism` 10→15%, `mission_alignment` 5→10%, `fallback_validates_primary` held at 5%, `feasibility_headline_metrics` 5→10%. Verification (independent re-grade on turmeric PDD with the new 11-dim rubric scored 7.55 vs old rubric's 8.65 — confirming the viability axis discriminates). | ACE team (0.13.84) | | 2026-05-08 | **QA/Eval split: removed `structural_completeness` (10%) — now lives in new `idea-to-pdd-qa` skill.** First migration of the QA/Eval split principle (PR #146). Structural completeness was a static check (regex over `## Heading` lines for the 11 required sections); moved to `skills/idea-to-pdd-qa/checks.ts` as `checkAllRequiredSectionsPresent`. The eval rubric is now quality-only: 4 doc/fidelity dimensions (40%) + 5 viability dimensions (60%). Removed weight (10%) was redistributed to viability dimensions: `demand_reality` 20→22%, `resource_realism` 15→17%, `mission_alignment` 10→12%, `fallback_validates_primary` 5→9%. QA gates eval — eval is skipped (`verdict: incomplete`) if QA fails irrecoverably. Updated dimension descriptions to clarify which structural concerns moved to QA (annotated inline). 
| ACE team (0.13.88) | | 2026-05-22 | **Retire `idea.md` references.** The optional `idea.md` operator-seed input was removed from `idea-to-pdd`; this rubric loses its dual-input language. `clean_source` detection mechanism is unchanged (still keys off reviewer-comment presence across the source pack); language switched from "idea.md" to "the source pack" throughout. No scoring or weighting change. | ACE team | diff --git a/skills/idea-to-pdd/SKILL.md b/skills/idea-to-pdd/SKILL.md index 65b0569d..e5a21120 100644 --- a/skills/idea-to-pdd/SKILL.md +++ b/skills/idea-to-pdd/SKILL.md @@ -440,17 +440,6 @@ When `--dry-run` is active: | Date | Change | Author | |------|--------|--------| -| 2026-04-03 | Initial version | ACE team | -| 2026-04-08 | Replace weak self-eval with 5-question stress-test rubric (executability, verifiability, measurability, stage-gate clarity, resource realism); block at ≥2 non-pass; include grading anchors from vaccine-hesitancy and turmeric example PDDs; emit stress-test results as PDD appendix | ACE team (PM scout, focus-group framework lens) | -| 2026-04-15 | Fail fast with actionable error if `idea.md` is missing instead of improvising an idea | ACE team (PM scout, end-to-end UX lens) | -| 2026-04-17 | Emit gate brief at `ACE//runs//1-design/idea-to-pdd_gate-brief.md` so the review-mode gate presents a checklist + stress-test concerns instead of a bare "approve PDD?" prompt | ACE team (PM scout, internal-admin lens) | -| 2026-04-20 | Extract stress-test rubric from Process step 5 into standalone `## LLM-as-Judge Rubric` section per author contract; process step now references the section | ACE team (skills review) | -| 2026-05-05 | Replace single-`idea.md` input contract with multi-doc evidence-pack model: read `inputs-manifest.yaml` (at the run-folder root) (orchestrator-emitted) and synthesize the PDD from every file under `inputs/`. Optional `idea.md` at the run root is now a `--idea FILE\|-` operator seed only. 
The PDD is the formal output of Phase 1, never an input. | ACE team (LEEP run; user observation that PDD is an output not an input) | -| 2026-05-08 | Replace `## Open Questions Convention` with `## Decisions Log Convention`. Skill always emits `decisions.yaml` with the 14-row calibrated Phase 1 set covering archetype, FLW count, budget plausibility, payment rate, pilot size, AI threshold, AI fallback design, named consumer, primary-metric-vs-goal, language, evidence layers, solicitation defaults, candidate roster. Schema defined in `lib/decisions-schema.ts`; ground-truth fixture in `test/skills/idea-to-pdd/fixtures/turmeric-decisions.yaml`. Renderer + round-trip ship in PRs #2–#4. | ACE team | -| 2026-05-08 | Retrofit: replace `### Required Phase 1 row set` (14 hardcoded rows) with `### Anchor decisions` (5 rows tied to specific eval rubric dimensions) + `### Recommended additional rows` (illustrative, non-binding). Bar criterion is the sole filter; anchors are the only required surface. Process step adds renderer invocation; gate brief links the gdoc rendering instead of the YAML. | ACE team (decisions-log PR #2) | -| 2026-05-08 | Retire the "anchor" framing: collapse the two sub-sections into a single `### Common load-bearing decisions for Phase 1` template (14 rows, 5 of which feed `idea-to-pdd-eval`'s viability axis). Soften "MUST emit" wording in process step 3a — bar criterion is the sole filter; the catalog is a teaching template that improves over time. | ACE team (decisions-log PR #5) | -| 2026-05-15 | Branch the Common load-bearing decisions catalog by archetype: base table + `atomic-visit` / `focus-group` / `multi-stage` additive tables. Prompted by `malaria-itn-fgd/20260514-2007` where rows like `ai-photo-threshold` had no meaning for an FGD and FGD-relevant rows (`payment-unit-model`, `submission-window`, `audio-consent-fallback`, `site-selection`, etc.) had to be authored ad-hoc outside the catalog. See jjackson/ace#301. 
| ACE team | -| 2026-05-15 | **Recharacterize `focus-group` archetype to attestation-form-only.** `## Archetypes § focus-group` updated: training surface is OCS chatbot + handbook gdoc + practice-session audio review (NOT a Learn app); Output Specification is the gdoc structure, not Deliver-app form fields; new "Attestation form fields" question references the canonical default in `pdd-to-deliver-app`. Decisions Log: `facilitator-training-stipend` re-pegged to practice-session-pass; new `gdoc-content-template` row; `submission-window` clarified as attestation-form submission. Prompted by `malaria-itn-fgd/20260514-2007` post-run reframe. See `docs/superpowers/specs/2026-05-15-focus-group-archetype-redefinition.md`. | ACE team | | 2026-05-15 | Pare attestation-form-fields question + Decisions Log to match the 5-field form: consent / date / venue / GPS / photo. Audio is out-of-band; gdoc_link is removed (gdoc is written after submission). Add `gps-verification-radius` and `gdoc-submission-window` decisions; recharacterize `audio-min-duration` and `audio-consent-fallback` as facilitator-protocol concerns (out-of-band, not in the form). | ACE team | | 2026-05-15 | Recharacterize `payment-rate` and `per-session-rate` Decisions Log rows: PDD captures a **range** (not a fixed number), and the actual rate is **negotiated via the solicitation response** where the LLO proposes a number with rationale. The awarded LLO's proposed rate becomes the `connect.deliver_unit` payment_unit amount at Phase 4 setup. Pairs with `solicitation-create/SKILL.md § Process`'s "per-unit payment is negotiated, not declared" design principle. 
| ACE team | | 2026-05-22 | **Retire the optional `idea.md` operator-seed input.** The 2026-05-05 refactor reduced `idea.md` to an optional `--idea FILE\|-` seed alongside the `inputs/` evidence pack; the dual-path persisted but was rarely used in practice and added cognitive load (eval rubric branches, manifest-vs-idea precedence, permission-scan URL extraction). Operators now put any free-text seed directly into `inputs/` as a regular source file. Removed: optional table row, idea.md read paragraph, idea.md-URL permission scan, "or no idea.md" branch of the missing-source error. The `--idea` flag and run-root `idea.md` artifact are gone. | ACE team | diff --git a/skills/learnings-summary/SKILL.md b/skills/learnings-summary/SKILL.md index a61efaf1..c3432ade 100644 --- a/skills/learnings-summary/SKILL.md +++ b/skills/learnings-summary/SKILL.md @@ -83,9 +83,3 @@ When `--dry-run` is active: - Learnings analysis and new PDD generation proceed normally (written to GDrive) - Write any notification emails (to admin group) to `comms-log/dry-run-learnings-summary.md` instead of sending - State tracks as `dry-run-success` - -## Change Log - -| Date | Change | Author | -|------|--------|--------| -| 2026-04-03 | Initial version | ACE team | diff --git a/skills/llo-feedback/SKILL.md b/skills/llo-feedback/SKILL.md index c7c99743..c19117d7 100644 --- a/skills/llo-feedback/SKILL.md +++ b/skills/llo-feedback/SKILL.md @@ -136,5 +136,4 @@ When `--dry-run` is active: | Date | Change | Author | |------|--------|--------| -| 2026-04-03 | Initial version | ACE team | | 2026-04-20 | Added `## Archetypes` with per-archetype feedback questions. `focus-group` replaces the app-usability / FLW-experience block with question-guide quality, facilitation experience, audio+upload workflow, participant recruitment, and session cadence. `multi-stage` asks per-stage questions plus cross-stage transition quality. 
Prevents the "facilitator asked about a Learn app they never used" anti-pattern that drifts responses thin | ACE team | diff --git a/skills/llo-launch/SKILL.md b/skills/llo-launch/SKILL.md index e96c16b5..9d687ccd 100644 --- a/skills/llo-launch/SKILL.md +++ b/skills/llo-launch/SKILL.md @@ -342,12 +342,6 @@ Each row this skill writes uses `phase: 9-execution-management` and | Date | Change | Author | |------|--------|--------| -| 2026-04-03 | Initial version | ACE team | -| 2026-04-17 | Emit gate brief at `ACE//runs//7-execution-manager/llo-launch_gate-brief.md` *before* activation so the highest-stakes gate is approved on readiness, not retrospectively on a launch record | ACE team (PM scout, internal-admin lens) | -| 2026-04-20 | Added `## Archetypes` with per-archetype readiness checks, Connect activation semantics, launch email subject + body, and launch-record details. `focus-group` replaces "apps published" with "Session 1 venue + recording + participant recruitment confirmed" and subject flips to "Session 1 is on the calendar" (not "You Are Live" which is FLW-coded). `multi-stage` pins activation to Stage 1 only; each stage gets its own launch run, records preserved per-stage in `launch-record-stage-N.md`. Gate-brief checklist item 3 swaps in archetype-specific bullet | ACE team | -| 2026-04-28 | Replace HITL workaround with `connect_activate_opportunity` + `connect_get_opportunity` (ace-connect 0.8.1) | ACE team | -| 2026-04-30 | Switch `connect_activate_opportunity` to `POST /api/opportunities//activate/` (commcare-connect PR #1135). Server-side guards now reject activation if no PaymentUnits exist or the opp has ended; clearer errors than the silent edit-form fallback. Step 4 also gains a deferred FLW pre-invite path for ACE-driven dogfood runs whose `connect-opp-setup` deferred the invite until activation. 
(0.10.47) | ACE team | -| 2026-05-04 | Add the deep-QA verdict freshness gate (new Step 4) before activation: refuse to activate unless `verdicts/ocs-chatbot-eval-deep.yaml` and `verdicts/app-ux-eval-deep.yaml` exist, both pass, and both are newer than the artifacts they grade (OCS chatbot `version_number`; learn/deliver `build_id` from `deployment-summary.md`). Add `--override-deep-qa-gate=` operator escape hatch with a required reason and an audit trail to `comms-log/observations.md`; reachable only via `/ace:step llo-launch`, never `/ace:run`. Gate-brief auto-surfaced concerns gain two `[BLOCKER]` rows mirroring the gate. Part of the shallow/deep QA split refactor (spec: `docs/superpowers/specs/2026-05-04-shallow-deep-qa-split-design.md`). | ACE team | | 2026-05-05 | **Path-scheme migration on the deep-QA gate.** Step 4 verdict reads, error messages, and gate-brief BLOCKER rows now reference `5-ocs/ocs-chatbot-eval_verdict-deep.yaml` and `6-qa-and-training/app-ux-eval_verdict-deep.yaml` (per the manifest); freshness check pulls build IDs from `3-commcare/app-deploy_summary.md`. Wiring fix — the prior `verdicts/...` paths no longer exist on disk, so the gate would always fail with "verdict missing" against current main. No behavior change beyond paths. | ACE team | | 2026-05-08 | Add `## Decisions Log` section: 4 anchor rows mapped 1:1 to `llo-launch-eval`'s viability axis (llo-capacity-actual, day-one-readiness, downstream-handoff-alignment, stop-loss-planning) + bar-criterion reference. Pairs with decisions-log PR #4 (Phase 3-10 writes). | ACE team (decisions-log PR #4) | | 2026-05-10 | Drop the deferred FLW pre-invite path: `connect-opp-setup` (Phase 4 Step 7) now invites `${ACE_E2E_PHONE}` directly after activating the opp in Phase 4 Step 6.5. 
Step 6 here is reframed from "activate the opp" to "confirm the opp is active" — the idempotent skip-if-active path is now the canonical case; the active-otherwise branch is a fallback for the rare operator-deactivated case. No behavior change for real-LLO invites (still sent in this skill); behavior change for ACE test-user invites (no longer rescued here). Closes the Phase-6-placeholder-screenshots chicken-and-egg. | ACE team | diff --git a/skills/llo-onboarding/SKILL.md b/skills/llo-onboarding/SKILL.md index 35d3372d..2e6f5b03 100644 --- a/skills/llo-onboarding/SKILL.md +++ b/skills/llo-onboarding/SKILL.md @@ -220,9 +220,6 @@ When `--dry-run` is active: | Date | Change | Author | |------|--------|--------| -| 2026-04-03 | Initial version | ACE team | -| 2026-04-14 | Own Connect system invite send (moved from `llo-invite`) and include OCS widget link in onboarding email; this is the first LLO-facing step in the lifecycle | ACE team | -| 2026-04-20 | Added `## Archetypes` section with per-archetype email framing, "getting started" steps, and timeline language. `focus-group` addresses the recipient as a facilitator-owning org (not FLW-managing), leads with question guide + audio upload, and uses session-count cadence language. `multi-stage` front-loads Stage 1 content. Prevents atomic-visit framing from landing as the first LLO-facing artifact on FGD opps | ACE team | | 2026-04-28 | Replace HITL workaround with `connect_send_llo_invite` (ace-connect 0.8.1). Connect's invite is program-level, so the atom takes the program UUID and an `organization` slug for the target LLO workspace | ACE team | | 2026-04-30 | Switch `connect_send_llo_invite` to `POST /api/programs//applications/` (commcare-connect PR #1135). Args drop `contact_email` (server emails workspace admins via `send_program_invite_email`). Add new step 2a: `connect_accept_program_application` for ACE-driven dogfood runs that need to auto-accept the invite. 
(0.10.47) | ACE team | | 2026-05-04 | Read awardee from `opp.yaml.selected_llo` instead of iterating `connect-setup/invites.md` roster. Phase 9 entry guard halts with an actionable message if `selected_llo.org_slug` is null (Phase 8 `solicitation-review` must run first). Single-org onboarding replaces multi-LLO roster model. (0.12.0) | ACE team | diff --git a/skills/llo-uat/SKILL.md b/skills/llo-uat/SKILL.md index d6cf680b..126bef39 100644 --- a/skills/llo-uat/SKILL.md +++ b/skills/llo-uat/SKILL.md @@ -157,5 +157,4 @@ When `--dry-run` is active: | Date | Change | Author | |------|--------|--------| -| 2026-04-03 | Initial version | ACE team | | 2026-04-20 | Added `## Archetypes` with per-archetype UAT checklists + sign-off criteria. `focus-group` replaces "test the apps" with "dry-run a facilitation session" (question guide + recording + consent + write-up + logistics); sign-off is "you could run Session 1 tomorrow." `multi-stage` uses per-stage checklists — full UAT for Stage 1, reference-only for later stages. Prevents LLOs from getting "download the app" UAT instructions when they have no app to download | ACE team | diff --git a/skills/ocs-agent-setup/SKILL.md b/skills/ocs-agent-setup/SKILL.md index f3fc9dc1..d1f27641 100644 --- a/skills/ocs-agent-setup/SKILL.md +++ b/skills/ocs-agent-setup/SKILL.md @@ -290,12 +290,6 @@ Each row this skill writes uses `phase: 5-ocs` and | Date | Change | Author | |------|--------|--------| -| 2026-04-03 | Initial version (manual workaround) | ACE team | -| 2026-04-08 | Full rewrite against OCS MCP composite backend | ACE team | -| 2026-04-14 | Removed inline LLM-as-Judge self-eval and connect-setup handoff; quality gating + Connect widget handoff now live in the `ocs-setup` Phase 5 agent | ACE team | -| 2026-04-27 | Step 2 idempotency uses the integer `experiment_id` returned by `ocs_list_chatbots` (0.5.19 — no more orphan re-clones). 
Step 7 explicitly requires `{collection_index_summaries}` in the system prompt; MCP `ocs_attach_knowledge` pre-flights this and fails with a typed error otherwise. | ACE team | -| 2026-04-28 | Step 8 collapsed into a single `ocs_set_chatbot_pipeline` call (0.6.4 — transactional save). Closes the chicken-and-egg surfaced in the 2026-04-27 dogfood where `set_chatbot_system_prompt` followed by `attach_knowledge` (or vice versa) hit OCS cross-field validation on the intermediate save. | ACE team | -| 2026-04-28 | Step 7 prompt rule corrected (0.6.10): `{collection_index_summaries}` is required iff `collection_index_ids.length >= 2` (verified via live OCS probe — see `scripts/probe-n1-cross-test.ts`). Single-collection clones must NOT include the variable; multi-collection clones MUST. The 0.6.4 framing (variable iff non-empty) was wrong. | ACE team | | 2026-05-05 | **Two idempotency improvements.** (1) New Step 0 reads the local state file (`runs//5-ocs/ocs-agent-setup.md`) before any OCS call — saves ~1s on a normal re-run and avoids the silent-pipeline-walk on `--prompt-patch` re-runs. (2) New `--prompt-patch` mode reuses the existing chatbot/collection/files, skipping clone + create-collection + upload + 5–10 min indexing wait, and just recomposes the prompt → calls `ocs_set_chatbot_pipeline` → publishes. This is the canonical Phase 5 retry path after `ocs-chatbot-eval --quick` flags a prompt issue (the previous skill prose said the agent should "retry prompt-patch" but no such mode existed — re-runs walked the full pipeline). | ACE team | | 2026-05-08 | Add `## Decisions Log` section: 3 anchor rows (system-prompt-baseline, rag-collection-scope, test-prompt-count) + bar-criterion reference. Pairs with decisions-log PR #4 (Phase 3-10 writes). 
| ACE team (decisions-log PR #4) | | 2026-05-15 | Add `2-scenarios/pdd-to-test-prompts.md` to the canonical KB recipe (Step 5); add archetype-aware "primary vs supplementary surface" line to the system-prompt composition checklist (Step 7) — for `focus-group`, the chatbot is the primary facilitator training + post-session writing surface. Make `6-qa-and-training/*` reads tolerant of missing files (Phase 6 may not have run yet in `/ace:run` flow). Prompted by `malaria-itn-fgd/20260514-2352` Phase 5 agent observations. | ACE team | diff --git a/skills/ocs-chatbot-eval/SKILL.md b/skills/ocs-chatbot-eval/SKILL.md index d4ce56d2..b4cc77de 100644 --- a/skills/ocs-chatbot-eval/SKILL.md +++ b/skills/ocs-chatbot-eval/SKILL.md @@ -489,10 +489,6 @@ When `--dry-run` is active: | Date | Change | Author | |------|--------|--------| -| 2026-04-19 | Initial version — split out from `ocs-chatbot-qa` as the judge half of the qa/eval pair. Reads transcripts from `qa-captures/`, writes `verdicts/`, `eval-reports/`, `gate-briefs/ocs-chatbot-eval-deep.md`. Gate now sits on eval, not qa | ACE team (qa/eval split refactor) | -| 2026-04-19 | Rename per-item verdict key `per_prompt` → `per_item` (canonical per `skills/README.md § QA vs Eval`); add `prompt:` as domain-specific subkey inside each entry; document `ACE/golden-template/` no-opp fallback path; document `auto_surfaced:` block contract (inputs to the gate brief) | ACE team (qa/eval iteration loop) | -| 2026-04-29 | Source-usage dimension now branches on the transcript's `Capture method:` header. Widget-captured transcripts grade body-text grounding (does the response name source docs by title?) and emit `[PLATFORM] empty cited_files expected on widget capture` instead of binding the empty-`cited_files` cap. OpenAI-compat captures keep the existing two-tier cap. 
The original cap conflated bot grounding gaps with widget-API measurement limitations and fired on every widget transcript regardless of bot quality, costing 5+ points on captures that were actually grounded. Surfaced 0.9.11 cross-opp validation against `turmeric-dogfood-20260427`. | ACE team (0.10.10) | -| 2026-05-04 | **Thinned `--quick` to a single-dimension rubric.** `--quick` mode now scores one `overall_quality_0_to_3` dimension per prompt with pass criterion `every prompt ≥ 2/3`. `--deep` and `--monitor` still use the calibrated 5-dimension rubric. Phase 5 cost reduction: 3 prompts × 1 dim = 3 LLM judge calls (vs 5 prompts × 5 dims = ~25). Multi-dimensional judging moves to deep-only — the `--deep` mode is now invoked only from `/ace:qa-deep` and gates Phase 8 `llo-launch` activation. Verdict file path unchanged (`verdicts/ocs-chatbot-eval-quick.yaml`); the `dimensions` array now has 1 entry. | ACE team | | 2026-05-04 | **`--quick` now writes a gate brief.** `--quick` mode emits `gate-briefs/ocs-chatbot-eval-quick.md` so the orchestrator's Phase 5→6 gate lookup resolves (post-Task-6 contract). Defined the quick-mode brief shape inline (single dimension, 3 prompts, no multi-dim breakdown). `--monitor` still does not produce a gate brief. Final-review followup to the shallow/deep QA split. | ACE team | | 2026-05-05 | **Path-scheme migration.** All read/write paths repointed to `runs///ocs-chatbot-eval_*-.` per the manifest (`5-ocs/` for `--quick`/`--deep`; `7-execution-manager/` for `--monitor`). Retires the opp-level `qa-captures/` / `verdicts/` / `eval-reports/` / `gate-briefs/` directories. Updated: Modes table, Step 1 transcript locator + golden-template fallback path, Step 4 verdict output, Step 6 report output, Step 7 trend path, Step 8 gate-brief output, Gate Brief artifact-under-review for both modes, the deep + quick verdict YAML examples (`capture_path` field), and the worked Quick example. No behavior change beyond paths. 
| ACE team |
 | 2026-05-05 | **Rubric prose extracted.** The 5-dimension table cells were ~600 words each, packing per-dimension criteria with hard deductions, multi-tier caps, capture-method branches, and suite-level rules into single rows. The dimension table now carries a one-line summary plus a pointer to a new `## Rubric Rules` section that breaks each dimension into labeled subsections (Correctness, Source usage with `openai-compat` / `widget` branches, Refusal correctness with tiered cap table, Tone, Tagging) plus a Suite level subsection (Inflation guard, Pre/post-cap reporting). Same grading semantics — every existing rule, deduction, and cap is preserved verbatim under its own heading. Rationale: LLM judges miss rules buried in dense prose; labeled subsections give the rubric visible structure. | ACE team |
diff --git a/skills/ocs-chatbot-qa/SKILL.md b/skills/ocs-chatbot-qa/SKILL.md
index a7e71b9d..a81e9322 100644
--- a/skills/ocs-chatbot-qa/SKILL.md
+++ b/skills/ocs-chatbot-qa/SKILL.md
@@ -381,14 +381,6 @@ When `--dry-run` is active:
 | Date | Change | Author |
 |------|--------|--------|
-| 2026-04-10 | Initial version | ACE team |
-| 2026-04-14 | Added --quick / --deep / --monitor modes; --quick replaces the inline self-eval previously in `ocs-agent-setup`; --deep is the pre-launch gate in Phase 5; --monitor runs recurring in Phase 6 | ACE team |
-| 2026-04-17 | `--deep` emits gate brief at `ACE//runs//5-ocs/ocs-chatbot-eval_gate-brief-deep.md`; `--quick` and `--monitor` do not | ACE team (PM scout, internal-admin lens) |
-| 2026-04-19 | **QA/eval split.** Removed LLM-as-Judge; this skill now captures transcripts + structural checks only. Writes to `qa-captures/` (renamed from embedded report). Gate brief ownership moved to new `ocs-chatbot-eval` skill. See `skills/README.md § QA vs Eval — the two-phase pattern` | ACE team (qa/eval split refactor) |
-| 2026-04-19 | Document `ACE/golden-template/` as the canonical no-opp fallback path; make env-source of `$OCS_GOLDEN_TEMPLATE_ID` explicit (`$CLAUDE_PLUGIN_DATA/.env`); call out that `ocs_send_test_message` MCP tool is structurally incomplete for the transcript schema — stick to raw widget HTTP. Surfaced during first real qa/eval split exercise against the golden template | ACE team (qa/eval iteration loop) |
-| 2026-04-29 | Added `Capture method:` header field to the transcript schema (`widget` for the anonymous widget endpoint this skill uses today; `openai-compat` reserved for the OpenAI-compatible endpoint when capture for that endpoint lands). `ocs-chatbot-eval` branches its source-usage rubric on this field — without it, the rubric can't tell whether an empty `cited_files` indicates a real grounding gap (openai-compat path) or a measurement limitation (widget path, where the API never returns inline citations regardless). | ACE team (0.10.10) |
-| 2026-05-03 | **Time-box, incremental writes, resume-from-partial, liveness probe.** Added `## Wall-Clock Budget` (per-prompt 90s, suite-cap `min(90s × N, 30 min)`, 3-prompt circuit-breaker, no `ScheduleWakeup`). Renumbered Process: new Step 2 mandatory `ocs_send_test_message` liveness probe before suite (catches dead session before budget burns); new Step 3 reads any existing transcript and skips already-captured prompts (idempotent re-runs); Step 5 chat loop now writes each entry to Drive incrementally via `drive_update_file` + `revisionVersion` CAS so a mid-loop kill doesn't lose data; Step 7 is a metadata-only flush. Header schema gains `Prompts captured`, `Prompts remaining`, `Complete`, `Suite elapsed` fields; partial transcripts are graded by eval as `incomplete-coverage` rather than failing. Surfaced after the `turmeric-20260503-0835` deep capture spun for 3+ hours on a fictional bg task; the prior all-or-nothing write meant zero recoverable evidence. | ACE team (0.11.6) |
-| 2026-05-04 | Thinned from 5 to 3 prompts. Phase 5 cost reduction; multi-dimensional judging moves to deep-only. `--quick` is now 3 universal Connect-domain prompts (claim opp, sync data, get paid) with a hard 270s wall-clock cap (90s × 3). The `--deep` mode is no longer dispatched from Phase 5 — it lives in the manual `/ace:qa-deep ` command and is the Phase 8 `llo-launch` activation gate. | ACE team |
 | 2026-05-05 | **Path-scheme migration.** Transcripts now write to `runs//5-ocs/ocs-chatbot-qa_transcript-.md` (or `7-execution-manager/...` for `--monitor`), per the manifest. The opp-level `qa-captures/` directory is retired; the only surviving use of the dated `qa-captures/` form is the golden-template no-opp fallback (`ACE/golden-template/qa-captures/.md`). Resume-from-partial check (Step 3) re-pointed at the new path. No behavior change beyond paths. | ACE team |
 | 2026-05-05 | **`--quick` switched to single-shot write.** Buffer entries in memory and call `drive_create_file` once at suite end (Step 7). Reduces Drive RTTs on `--quick` from N+1 (read+write per prompt + metadata) to 1. The incremental CAS-write strategy still applies on `--deep`/`--monitor` where 15–30 min suite runtimes make resume-from-partial worth the cost. Step 3 resume-from-partial is a `--deep`/`--monitor`-only step now (`--quick`'s 270s cap is short enough that re-running is cheaper than the resume bookkeeping). | ACE team |
 | 2026-05-15 | Extend `--quick` suite with archetype-specific prompts for `focus-group` (1–2 from `pdd-to-test-prompts.md` `gdoc-writing-guidance` + `facilitation-technique` categories) since the 3 universal Connect-domain prompts primarily exercise shared-collection retrieval and would pass even if the opp-specific collection was mis-loaded. Wall-clock cap scales to 360s/450s for focus-group. Atomic-visit / multi-stage stay at the 3-prompt / 270s baseline. Prompted by `malaria-itn-fgd/20260514-2352` Phase 5 observation. | ACE team |
diff --git a/skills/opp-closeout/SKILL.md b/skills/opp-closeout/SKILL.md
index 2bfe9587..2fcdfef4 100644
--- a/skills/opp-closeout/SKILL.md
+++ b/skills/opp-closeout/SKILL.md
@@ -84,6 +84,5 @@ Each row this skill writes uses `phase: 10-closeout` and
 | Date | Change | Author |
 |------|--------|--------|
-| 2026-04-03 | Initial version | ACE team |
 | 2026-04-28 | Replace HITL workaround with `connect_list_invoices` + `connect_get_invoice` (ace-connect 0.8.1). Note: invoice page shape was not yet probed at 0.8.1 ship; atoms return conservative defaults until the page has been observed live | ACE team |
 | 2026-05-08 | Add `## Decisions Log` section: 2 anchor rows (closeout-depth, learnings-summary-scope) + bar-criterion reference. Pairs with decisions-log PR #4 (Phase 3-10 writes). | ACE team (decisions-log PR #4) |
diff --git a/skills/pdd-to-app-journeys/SKILL.md b/skills/pdd-to-app-journeys/SKILL.md
index d4917e82..72704efb 100644
--- a/skills/pdd-to-app-journeys/SKILL.md
+++ b/skills/pdd-to-app-journeys/SKILL.md
@@ -289,8 +289,6 @@ When `--dry-run` is active:
 | Date | Change | Author |
 |------|--------|--------|
-| 2026-05-04 | Initial version — Phase 1 producer of `expected-journeys.md`, the UX-intent ground truth that `app-test-cases` (Phase 3) and `app-ux-eval` (deep QA) consume. Mirror of `pdd-to-test-prompts` for the app side. Introduced as part of the shallow/deep QA split (spec: `docs/superpowers/specs/2026-05-04-shallow-deep-qa-split-design.md`) | ACE team |
-| 2026-05-08 | Output path corrected to `2-scenarios/pdd-to-app-journeys.md` (was `expected-journeys.md` at the run root). Aligns with `lib/artifact-manifest.ts:220`, the QA + eval skills, and `agents/design-review.md`. Consumers (`app-test-cases`, `app-ux-eval`, training cluster, `synthetic-narrative-plan`) updated in the same PR. | ACE team |
 | 2026-05-08 | **No QA companion.** `pdd-to-app-journeys-qa` removed (PR #160) — downstream consumers are LLM-driven; structural label-format checks gate nothing real, and the eval already covers the substantive concerns. See `skills/_qa-decisions.md` for the registry entry + revisit conditions, and `docs/learnings/2026-05-08-fake-qa-detection.md` for the heuristic. | ACE team |
 | 2026-05-15 | Accept either "FLW Requirements" (canonical, per `templates/pdd-template.md`) or "Target FLW" (legacy) as the persona section in Process step 3 + Failure Modes. Prompted by `malaria-itn-fgd/20260514-2007` where the template-conformant PDD said "FLW Requirements" and the skill halted looking for "Target FLW". See jjackson/ace#302. | ACE team |
 | 2026-05-15 | Recharacterize `focus-group` journey categories for the attestation-form-only shape (PRs #305, #306): `output-coherence` (which assumed the FLW fills 28 in-app fields with content) → `attestation-submission` (FLW fills the 5-field form at session end, no per-section content in the app). Session-setup reframed to note "no in-app interaction at session start" — the mobile form is end-of-session only. Other categories (recruitment-failure, consent-handling) reframed to note no-attestation-on-abort semantics. Coverage rule updated to reference the new category name. Prompted by `malaria-itn-fgd/20260514-2352` re-run. | ACE team |
diff --git a/skills/pdd-to-deliver-app/SKILL.md b/skills/pdd-to-deliver-app/SKILL.md
index dcc9673a..4ed2fe3c 100644
--- a/skills/pdd-to-deliver-app/SKILL.md
+++ b/skills/pdd-to-deliver-app/SKILL.md
@@ -497,10 +497,6 @@ Each row this skill writes uses `phase: 3-commcare` and
 | Date | Change | Author |
 |------|--------|--------|
-| 2026-04-03 | Initial version | ACE team |
-| 2026-04-08 | Add `## Archetypes` section: `atomic-visit` (per-beneficiary form), `focus-group` (per-session documentation form, segment-level case), `multi-stage` (per-stage branching) | ACE team (PM scout, focus-group framework lens) |
-| 2026-04-27 | Switch from manual Nova UI handoff to `/nova:autobuild` via the Nova plugin. Output is now `nova_app_id` written to the summary, not a JSON file. The `apps/deliver-app.json` snapshot is no longer required. | ACE team |
-| 2026-05-08 | Add `## Decisions Log` section: 3 anchor rows (deliver-unit-count, one-form-per-module-workaround, multimedia-coverage-strategy) + bar-criterion reference. Pairs with decisions-log PR #4 (Phase 3-10 writes). | ACE team (decisions-log PR #4) |
 | 2026-05-15 | Tighten Step 4a (post-build field-count verification) from "the in-context LLM must..." prose into a numbered tool-call recipe. Mirrors the same change in `pdd-to-learn-app/SKILL.md`. Prompted by `malaria-itn-fgd/20260514-2007` Learn-app cert-assessment partial-persistence (FGD Deliver apps with the ~45-70-field per-section summary form are the highest-risk surface for the same class). See jjackson/ace#303. | ACE team |
 | 2026-05-15 | **focus-group archetype rewritten to attestation-form-only.** Previously: 3-module / 69-field per-section-summary Deliver app capturing all qualitative content in CommCare. New: one module, one ~14-field attestation form (date / venue / participants / audio / photo / gdoc link / consent / reflection). Content lives in a Google Doc out-of-band; the gdoc_link field is the bridge. One submission = one payment trigger. Prompted by post-run reframe from operator: "all the content collection... will happen manually and they will send us a gdoc". See `docs/superpowers/specs/2026-05-15-focus-group-archetype-redefinition.md`. | ACE team |
 | 2026-05-15 | **Pare focus-group attestation form to 5 fields:** `consent_all_participants` (single_select yes/no, validate=yes), `session_date`, `venue` (text), `gps` (geopoint), `photo` (image). Drop `audio_file` / `backup_audio_file` (audio capture is out-of-band; not in CommCare), `gdoc_link` (gdoc is written AFTER session end, no linkable URL exists at submission time), and the metadata fields (`llo_name`, `site_*`, `venue_type`, `planned_segment`, `actual_participant_count`, `start_time`, `end_time`, `audio_duration_minutes`, `facilitator_reflection`, `pre_checklist_complete`) — these go in the gdoc. Matching attestation → gdoc is coordinator-driven by `(FLW, session_date, venue)` tuple. Prompted by operator: "For the fields just have consent (this should confirm you have consent from all participants), date, venue, gps, photo. everything else is either wrong or goes into the gdoc. the gdoc will be created after the fact so no ability to enter it into commcare". | ACE team |
diff --git a/skills/pdd-to-learn-app/SKILL.md b/skills/pdd-to-learn-app/SKILL.md
index 7d465c55..6e9329c7 100644
--- a/skills/pdd-to-learn-app/SKILL.md
+++ b/skills/pdd-to-learn-app/SKILL.md
@@ -465,10 +465,6 @@ When `--dry-run` is active:
 | Date | Change | Author |
 |------|--------|--------|
-| 2026-04-03 | Initial version | ACE team |
-| 2026-04-08 | Add `## Archetypes` section: `atomic-visit` (form walkthrough), `focus-group` (facilitation craft training), `multi-stage` (per-stage branching) | ACE team (PM scout, focus-group framework lens) |
-| 2026-04-27 | Switch from manual Nova UI handoff to `/nova:autobuild` via the Nova plugin. Output is now `nova_app_id` written to the summary, not a JSON file. The `apps/learn-app.json` snapshot is no longer required. | ACE team |
-| 2026-05-15 | Tighten Step 4a (post-build field-count verification) from "the in-context LLM must..." prose into a numbered tool-call recipe. Prompted by `malaria-itn-fgd/20260514-2007` where the cert-assessment shipped 12/15 score fields + 0/1 user_score and the recipe didn't fire — `validate_app` caught it instead. Mirrored in `pdd-to-deliver-app/SKILL.md`. See jjackson/ace#303. | ACE team |
 | 2026-05-15 | **focus-group archetype becomes a no-op for this skill.** The FGD operational model captures content in a gdoc (not a CommCare form) and trains facilitators out-of-band (OCS chatbot + handbook gdoc + coordinator-graded practice-session audio review), so no Learn app is produced. Step 1a short-circuits with a `skipped` summary; § Archetypes § focus-group rewritten to document the skip. Prompted by `malaria-itn-fgd/20260514-2007` post-run reframe; see `docs/superpowers/specs/2026-05-15-focus-group-archetype-redefinition.md`. | ACE team |
 | 2026-05-15 | **focus-group switches from no-op to minimal sentinel pattern.** Re-run `malaria-itn-fgd/20260514-2352` Phase 4 surfaced a hard blocker: `connect_create_opportunity` requires `learn_app` at the schema, REST, and validator layers. Operator chose per-opp sentinel (one minimal 1-form readiness check, ~7 fields, both Connect markers, ~1-2 min build) over a server-side fix. Step 1a no longer short-circuits — focus-group runs the full skill flow but with the sentinel-shaped brief documented in § Archetypes § focus-group. Sentinel doubles as in-app readiness gate: facilitator must `acknowledge_readiness = yes` (coordinator-confirmed practice-session-pass) before they're cleared to submit attestations. | ACE team |
 | 2026-05-21 | **Forbid `` blocks in Learn forms.** Added a new REQUIRED paragraph to Step 3 instructing the architect to NOT declare `case_type` on Learn modules, NOT create cases from Learn registration forms, and NOT bind any field to a case property via `case_property_on`. Calibration scores / pass flags / `user_score` MUST live as form-level hidden fields only. Reason: `commcare-form-patch` (Step 8 wrapper-strip) hits `cchq-vellum-cache-drift` whenever a patched form carries a `` block — CCHQ's Vellum form-designer cache isn't refreshed by `edit_form_attr`, and `make_build` rejects with "Cannot use Case Management UI if you already have a case block in your form." Reproducer: `malaria-itn-app/20260521-1400` Phase 3 — architect bound `standardization_gate_cleared` + `*_passed` flags to case properties, all 6 Learn forms blocked at form-patch, Phase 6 then halted on Connect → Learn CCZ install with "Unknown failure during app install." Removal criteria: drop the rule when voidcraft-labs/nova-plugin#7 ships (no wrappers → no patcher → no drift class) OR when `commcare_patch_xform` gains Vellum-cache invalidation. | ACE team |
diff --git a/skills/pdd-to-test-prompts/SKILL.md b/skills/pdd-to-test-prompts/SKILL.md
index 697390d4..cbf231c8 100644
--- a/skills/pdd-to-test-prompts/SKILL.md
+++ b/skills/pdd-to-test-prompts/SKILL.md
@@ -242,7 +242,6 @@ When `--dry-run` is active:
 | Date | Change | Author |
 |------|--------|--------|
-| 2026-04-14 | Initial version — introduced as Phase 1 Step 2 so Phase 5's `ocs-chatbot-qa --deep` has ground-truth opp-specific prompts to grade against. Previously `test-prompts.md` was referenced by `ocs-chatbot-qa` but had no producer | ACE team |
 | 2026-04-19 | Added `## Archetypes` section branching on PDD archetype. `focus-group` gets session/recruitment/consent/question-guide/facilitation/output/audio categories; atomic-visit retains visit-flow/eligibility/GPS/duplicate categories; multi-stage mixes per-stage with an added stage-gate category. Motivated by cosmetics-fgd-pilot recon (2026-04-19) where the atomic-visit-only category list forced manual remapping | ACE team (qa/eval iteration loop) |
 | 2026-04-20 | Expand `multi-stage` archetype: clarify per-stage archetype dispatch, add intervention-continuity cross-stage category, flag missing Stage Gate as `[WARN]` | ACE team (skills review) |
 | 2026-05-15 | Recharacterize `focus-group` category list for the attestation-form-only shape (PRs #305, #306): `Output spec` → `Gdoc writing guidance` (the chatbot helps facilitators write the gdoc per PDD Output Spec); `Audio and evidence` → `Attestation form` (no audio in CommCare; 5-field form questions). `Facilitation technique` line drops the Learn-app reference (no Learn app for focus-group; OCS chatbot is the primary training surface). Prompted by `malaria-itn-fgd/20260514-2352` re-run where the Phase 2 agent surfaced these as small-tweak friction. | ACE team |
diff --git a/skills/pdd-to-work-order-eval/SKILL.md b/skills/pdd-to-work-order-eval/SKILL.md
index fc2fcb36..d9338299 100644
--- a/skills/pdd-to-work-order-eval/SKILL.md
+++ b/skills/pdd-to-work-order-eval/SKILL.md
@@ -95,5 +95,4 @@ Grading bands: 0 strikes = `pass`; 1–3 strikes = `partial`; 4+ strikes = `fail
 | Date | Change | Author |
 |------|--------|--------|
-| 2026-05-21 | Initial version | ACE team |
 | 2026-05-21 | Add `writing_style` 6th dimension scoped to renderable conventions (acronyms, modals, voice, partner-naming, terminology). Reweight: 5 content dims × 0.17 + writing_style × 0.15 = 1.00. Bold deferred until `docs_finalize_bold` ships. | ACE team |
diff --git a/skills/pdd-to-work-order-qa/SKILL.md b/skills/pdd-to-work-order-qa/SKILL.md
index 7a906c9f..a19ee8f9 100644
--- a/skills/pdd-to-work-order-qa/SKILL.md
+++ b/skills/pdd-to-work-order-qa/SKILL.md
@@ -62,9 +62,3 @@ The static check functions live at `skills/pdd-to-work-order-qa/checks.ts` as im
 6. **Compose and write the verdict YAML** to `1-design/pdd-to-work-order-qa_result.yaml` per the QA verdict schema (`lib/qa-types.ts`). `verdict: pass` iff every check passes; `verdict: fail` with `failures[]` array otherwise (each entry: `{check, detail, auto_fix_hint}`). `verdict: incomplete` if a check could not be evaluated (e.g., decisions.yaml unreadable).
 7. **Trigger the producer-retry loop on `verdict: fail`** per `agents/idea-to-design.md § Step 2.4`. After retry: re-run QA. Halt with `verdict: incomplete` when the producer can no longer make progress on the same failures.
-
-## Change Log
-
-| Date | Change | Author |
-|------|--------|--------|
-| 2026-05-21 | Initial version | ACE team |
diff --git a/skills/pdd-to-work-order/SKILL.md b/skills/pdd-to-work-order/SKILL.md
index 65d8c382..69ac4b36 100644
--- a/skills/pdd-to-work-order/SKILL.md
+++ b/skills/pdd-to-work-order/SKILL.md
@@ -155,6 +155,5 @@ When `--dry-run` is active:
 | Date | Change | Author |
 |------|--------|--------|
-| 2026-05-21 | Initial version | ACE team |
 | 2026-05-21 | Add `references/writing-style.md` + `references/style-guide.md`, adapted from `sarvesh-tewari/ace-skills-stewari`; wire writing-style.md into step 1 + prose-token synthesis | ACE team |
 | 2026-05-21 | Drop bold-span rule from prose-token synthesis preamble + add explicit "do not emit markdown bold" warning (template uses plain-text replaceAllText; no bold finalizer yet). Track `docs_finalize_bold` as backlog. | ACE team |
diff --git a/skills/solicitation-create/SKILL.md b/skills/solicitation-create/SKILL.md
index a13c9089..ca5d2cee 100644
--- a/skills/solicitation-create/SKILL.md
+++ b/skills/solicitation-create/SKILL.md
@@ -809,9 +809,6 @@ Each row this skill writes uses `phase: 8-solicitation-management` and
 | Date | Change | Author |
 |------|--------|--------|
-| 2026-05-08 | Add `## Decisions Log` section: 3 anchor rows (solicitation-type, response-deadline, response-template-choice) + bar-criterion reference. Pairs with decisions-log PR #4 (Phase 3-10 writes). | ACE team (decisions-log PR #4) |
-| 2026-05-15 | Three archetype-branches added for `focus-group`: (1) scope_of_work concatenation in Step 2 — FGD PDD has no `## Learn App Specification` (uses `## Facilitation Protocol` instead); the scope opens with a "PER VERIFIED SESSION, THREE ARTIFACTS" block listing audio + gdoc + 5-field attestation form with explicit "NOT in the form" callout. (2) evaluation_criteria in Step 3 — focus-group goes from a 4-axis sketch to a 6-axis starter rubric (qualitative-research experience, facilitator skill + language, homogeneous-group recruitment, coordinator gdoc-review capacity, audio handling out-of-band, timeline + per-session payment economics). (3) default questions in Step 3 — swap CHW-deployment vocabulary for qualitative-research vocabulary on q1 + q5 + q6. Prompted by `malaria-itn-fgd/20260514-2352` Phase 8 observations. | ACE team |
-| 2026-05-15 | Codify the **"per-unit payment is negotiated, not declared"** design principle at the top of `## Process`. Solicitations express payment as a range with rationale in `scope_of_work` prose; the `questions` block asks the responding LLO to propose their actual rate + why. Closes the loop on the "labs `per_unit_payment` schema gap" surfaced in Phase 8 — it's not a gap, it's an intentional design choice (per-unit shape varies by archetype; the rate is opp-and-LLO-specific and negotiated through the response). | ACE team |
 | 2026-05-21 | **Work-order-as-primary-input + canonical-schema field names + comprehensive-content shape.** Three bundled rewrites prompted by solicitation 3130 on `malaria-itn-app/20260521-1400` where the public page rendered blank Description, "TBD" timeline, "No deadline," Python-list-repr Scope, and zero questions / zero rubric simultaneously. (1) Inputs now read Phase 1's work order (`1-design/pdd-to-work-order.gdoc`) as the primary content source + `decisions.yaml` for later run decisions, alongside the PDD (now used for problem-framing only, not for scope). (2) Field names migrated to the labs canonical schema (`description` not `overview`, `application_deadline` not `response_window_days`, `expected_start_date/_end_date` not `anticipated_*`, `estimated_scale` not `sample_target`, `questions[].text` not `response_questions[].question`, `evaluation_criteria[].name/.scoring_guide/.linked_questions` not `rubric[].dimension/.criterion`; `solicitation_type: 'eoi'` lowercase). Top-level fields not in `solicitations/models.py` (`pass_bar`, `eligibility_criteria`, `geographic_scope`, `per_hh_payment_band_usd`, `budget`) folded into `description`/`scope_of_work` prose. (3) Content shape demands comprehensive prose: `description` 500-800 words foundation-pitch tone; `scope_of_work` 600-1000+ words derived section-by-section from the work order with explicit de-prescription rules (exact dollars → ranges, exact weeks → windows); every question has a required `framing` field; every evaluation criterion has a required `scoring_guide` + `linked_questions`. (4) Added Step 7a — a curl-the-public-URL structural verifier that catches field-name drift at write time instead of at human-eye time. | ACE team |
 | 2026-05-22 | **Align with current labs reality + document the ideal end-state (`jjackson/connect-labs#212`).** Surfaced during the malaria-itn-app `20260521-1400` Phase 8 republish (solicitation 3140). Three labs-side gaps required inline workarounds: (a) atom inputSchema `{data: {...}}` shape vs deployed server's flat-fields validator — added Step 6's wire-shape fallback documenting both paths; (b) `questions[].framing` rejected as unknown key — adopted the literal `Why we're asking: \n\n` inline anchor convention, load-bearing for `solicitation-review` parsing; (c) `evaluation_criteria[].id` required by server but undocumented in atom inputSchema — added `id` to the required criterion shape with `slugify(name)` as the derivation fallback. Step 7a verifier rewritten from "curl the public URL" to "`get_solicitation` round-trip" because the labs public-detail page now 302s to login for unauthenticated visitors even when `is_public: true` (tracked in `connect-labs#212`). When all four labs items in #212 ship, drop the inline workarounds: emit `framing` as a structured key, drop the wire-shape fallback paragraph, restore the curl-the-public-URL verifier as a second post-round-trip check. | ACE team |
 | 2026-05-22 | **Architecture decision: ACE owns composition; labs validates.** PR #396 had floated a future labs-side `create_solicitation_from_brief` MCP tool that would compose content server-side via labs's `solicitation_agent`. Walked back — operator chose to keep composition in ACE so this skill retains full control over voice, archetype-branched scope, framing/scoring_guide quality, and decisions-log integration (all of which are ACE-context that labs would have to learn). Labs's tightened MCP (forthcoming deploy: `create_solicitation` + `update_solicitation` now validate the canonical schema and fail loudly with `INVALID_SCHEMA` + `error.details.fields` on drift) is the right server-side contribution: schema enforcement, not content generation. This skill is the long-term home for solicitation composition; Step 6's payload shape is bound to labs's `tools/list` inputSchema rather than to a future composer call. Removal of the prior "Removal criteria" line. | ACE team |
diff --git a/skills/timeline-monitor/SKILL.md b/skills/timeline-monitor/SKILL.md
index 02a020e7..060afec4 100644
--- a/skills/timeline-monitor/SKILL.md
+++ b/skills/timeline-monitor/SKILL.md
@@ -70,9 +70,3 @@ When `--dry-run` is active:
 - Monitoring report is still written to `ACE//monitoring/` as normal
 - Do not send emails to LLOs
 - State tracks as `dry-run-success`
-
-## Change Log
-
-| Date | Change | Author |
-|------|--------|--------|
-| 2026-04-03 | Initial version | ACE team |