4 changes: 2 additions & 2 deletions .claude-plugin/marketplace.json
@@ -6,13 +6,13 @@
"url": "https://github.com/jjackson"
},
"metadata": {
"version": "0.13.277"
"version": "0.13.285"
},
"plugins": [
{
"name": "ace",
"source": "./",
"version": "0.13.277",
"version": "0.13.285",
"description": "AI Connect Engine — orchestrates the CRISPR-Connect lifecycle from idea through app building, Connect setup, LLO management, and closeout"
}
]
2 changes: 1 addition & 1 deletion .claude-plugin/plugin.json
@@ -1,6 +1,6 @@
{
"name": "ace",
"version": "0.13.277",
"version": "0.13.285",
"description": "AI Connect Engine — orchestrates the CRISPR-Connect lifecycle from idea through app building, Connect setup, LLO management, and closeout",
"author": {
"name": "Jonathan Jackson",
14 changes: 14 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,20 @@ All notable changes to the ACE plugin will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and the plugin follows [semantic versioning](https://semver.org/spec/v2.0.0.html).

## 0.13.285 — 2026-05-19

**Add `file_path` mode to `ocs_upload_collection_files` — close the b64 context wedge that stalled Phase 5 twice.**

Two consecutive `ace:ocs-setup` dispatches on `leep-paint-collection/20260517-1515` hit stream-idle timeouts (one at ~30 min / 49 tool calls, second at ~114 min / 44 tool calls) without writing any Drive artifacts. Session-log bisect (see `docs/learnings/2026-05-19-ocs-upload-b64-context-wedge.md`) pinned the cause: the agent built its RAG content pack on disk (~67 KB), `base64`-encoded it via Bash, then **`Read` the resulting `.b64` chunks back into its own context** so it could emit them as the `ocs_upload_collection_files` tool_use `input.files[].content` field. Generating 100s of KB of b64 as output tokens stalls model generation either mid-emit or on the next turn. No OCS slowness, no auth churn, no QA loop — pure output-token budget exhaustion.

Fix: `ocs_upload_collection_files` now accepts `file_path` as an alternative source per file. The MCP reads the file server-side, so no b64 ever crosses the agent's context. Each file MUST supply EXACTLY ONE of `content` (legacy inline b64) or `file_path` (absolute filesystem path); files that supply both or neither fail fast with a named error citing the offending file.

Refactor: the file-decoding logic moved into `decodeUploadCollectionFileSource`, exported for unit-testability. 7 new vitest cases (UTF-8 text via file_path, arbitrary binary via file_path, inline content legacy mode, missing source, both sources, ENOENT propagation, error names the offending file). All pass.

Skill-side guidance: Phase 5 `ocs-content-pack` and any future skill calling this atom SHOULD use `file_path` for any payload > ~1KB. For files on Drive, `drive_download_binary` into a tmp path first, then pass that as `file_path` — keeps the b64 entirely out of agent context.

`docs/learnings/2026-05-12-boundary-probe-registry.md` updated with the new Shipped probe + a new pending row generalizing the audit ("every MCP atom whose input schema takes a `string` that may carry > ~10KB of payload should have a `_path` companion"; existing examples: `commcare_upload_multimedia.file_bytes_path`, `commcare_patch_xform.new_xform_xml_path`).

## 0.13.277 — 2026-05-18

**Mirror Vellum's slug/name separation in the Nova architect brief (follow-up to 0.13.274).**
2 changes: 1 addition & 1 deletion VERSION
Original file line number Diff line number Diff line change
@@ -1 +1 @@
0.13.277
0.13.285
2 changes: 2 additions & 0 deletions docs/learnings/2026-05-12-boundary-probe-registry.md
@@ -21,6 +21,7 @@ ACE has a pattern called **boundary probes**: load-bearing client-side pre-fligh
| `mobile_probe_maestro_driver` | `mcp/mobile-server.ts:143` (atom) → `mcp/mobile/client.ts:169` (impl); auto-invoked by `mobile_ensure_avd_running` | Maestro driver gRPC health check + auto-heal. Catches the "AVD up, Maestro driver wedged" case where every recipe times out without a recipe-side error. | PR #233 — commit `8b6e4f0` ("auto-heal Maestro driver in mobile_ensure_avd_running"). |
| `connect_preflight_learn_app_user` | `mcp/connect/backends/commcare-preflight.ts` (atom impl) + `mcp/connect-server.ts` (wiring); recommended caller `skills/connect-opp-setup/SKILL.md` Step 7.5 | Auth / domain / user-conflict failures on the CCHQ side of `POST /users/start_learn_app/` — rotated API key, archived domain, CCHQ outage, already-linked-to-different-ConnectID user. Surfaces as structured `{ok, action, reason}` outcome before Phase 6 boots the AVD instead of as a runtime client-side noop. | PR #249 (commit `8677225`). |
| `app-release` CCZ slug-length projection | `mcp/connect/backends/commcare.ts` (`SLUG_LENGTH_LIMIT`, `simulateConnectSync.oversized_slugs`, `max_slug_length`) + `skills/app-release/SKILL.md` § Step 6 (BLOCKER gate) + `skills/pdd-to-{learn,deliver}-app/SKILL.md` (architect-brief REQUIRED clause) | Connect's `LearnModule.slug` / `DeliverUnit.slug` are `SlugField()` with the default `max_length=50`. Nova's `compile_app` derives slugs as `module_<index>_<slugified-name>`; module names ≥ ~40 chars overflow. The DB INSERT raises Postgres `DataError: value too long for type character varying(50)`, which falls through `program/api/views.py:102`'s narrow except and surfaces as HTTP 500 with empty body from `connect_create_opportunity`. Same shape as the 2026-05-12 `short_description` 50-char trap but at the CCZ extract path rather than the serializer — so the *generalized serializer-vs-model length probe* (still pending below) would NOT have caught it. This is a sibling probe at a different boundary. | `docs/learnings/2026-05-17-connect-slug-length-50-char-trap.md` + reproducer in `leep-paint-collection/20260517-1515` Phase 4 (module name "Stage 2: Sample Preparation, Drying, Bagging, Shipment" → slug `module_6_stage_2_sample_prep_drying_bagging_shipment`, 52 chars). |
| `ocs_upload_collection_files` `file_path` mode | `mcp/ocs-server.ts` (`decodeUploadCollectionFileSource`) + tests at `test/mcp/ocs/unit/upload-collection-files-decoder.test.ts` | Output-token budget exhaustion: caller-supplied `content` (base64) inputs > ~10KB stall model generation mid-tool_use-emit, surfacing as `API Error: Stream idle timeout - partial response received` with no actionable diagnostic. `file_path` mode lets the MCP read + b64-encode server-side so the agent never holds the payload as output tokens. Exactly-one-source-per-file invariant enforced (rejects both / neither). Companion examples already shipped at this same class boundary: `commcare_upload_multimedia.file_bytes_path` and `commcare_patch_xform.new_xform_xml_path`. | `docs/learnings/2026-05-19-ocs-upload-b64-context-wedge.md` + reproducer in `leep-paint-collection/20260517-1515` Phase 5 (two consecutive `ace:ocs-setup` dispatches at ~30min/49 calls + ~114min/44 calls both stalled mid-b64-emit on a ~67KB PDD payload; no Drive artifacts written). |

## Pending probes

@@ -31,6 +32,7 @@ Class-level preventers we know are needed but haven't shipped:
| **Selector-map currency probe** — `bin/ace-doctor --preflight` cross-checks recipes in `mcp/mobile/recipes/static/` vs `mcp/mobile/selectors/<APK>.yaml` for the deployed APK version | Recipes go stale when Connect APK ships UI changes; current symptom is silent `btn_start` no-op at recipe runtime. Same class as `cloud_emu` but for selector-map vs deployed-APK skew. | Implied by `CLAUDE.md` ("`REPLACE_*` selectors that must be filled via `maestro studio` against the Connect APK before live runs") + the 2026-04-30 `btn_start` noop refuted in commit `caba0b8`. |
| **`mobile_resolve_selectors` at Phase 2 authoring gate** — shift-left of the Phase 5 selector-resolution gate into `app-test-cases` | Same selector-currency class as above but a *producer-side* preventer (catch at authoring time, not at runtime). Currently the only check is the Phase 5 recipe-execution gate; an authoring-gate probe would fail closed before any mobile run. | Sibling of selector-map currency; surfaces when authoring touches a recipe whose selector map hasn't been re-resolved for the current APK. |
| **Generalized serializer-vs-model length probe** — pattern-match across commcare-connect's `CharField` definitions, surface mismatches at MCP startup | The `short_description` 50-char trap is one instance; any other field where DRF serializer `max_length` exceeds the model `max_length` is the same bug class. A static scan over commcare-connect's `models.py` + `serializers.py` would surface all candidates as Zod caps. **Note 2026-05-17:** the slug-length trap (now Shipped above) was the SAME class but at a *different* boundary — the slug isn't sent through any serializer, it's derived server-side from CCZ XML. A truly generalized probe should walk every `Char`/`SlugField` in commcare-connect's models AND check both serializer-fed paths and CCZ-extracted paths (the latter is what `app-release` Step 6 now does for the slug case specifically). | Generalization of `docs/learnings/2026-05-12-connect-opp-short-description-50-char-trap.md § Generalization` + `docs/learnings/2026-05-17-connect-slug-length-50-char-trap.md`. |
| **Generalized "MCP atom accepts large payload as string param" audit** — pattern-match every MCP atom whose input schema takes a `string` field that may carry > ~10KB of base64 / XML / JSON, and add a `_path` companion field where missing | The 2026-05-19 b64-context wedge (now Shipped above) is one instance. Companion atoms that already follow the pattern correctly: `commcare_upload_multimedia.file_bytes_path`, `commcare_patch_xform.new_xform_xml_path`. Still missing: `drive_upload_binary.content` (inline b64 only); `drive_create_file.content` and `drive_update_file.content` (text, but unbounded — > 100KB Drive docs could stall). A static scan of MCP server tool schemas would enumerate every remaining wedge candidate. | Generalization of `docs/learnings/2026-05-19-ocs-upload-b64-context-wedge.md § Generalization`. |

## Pattern characteristics

100 changes: 100 additions & 0 deletions docs/learnings/2026-05-19-ocs-upload-b64-context-wedge.md
@@ -0,0 +1,100 @@
# `ocs_upload_collection_files` Inline-Base64 Context Wedge (Phase 5 Stream-Idle Timeout)

**Status:** Mitigated in ACE v0.13.285 via `file_path` mode on `ocs_upload_collection_files`. The legacy inline `content` mode is preserved for small payloads (back-compat) but discouraged in the tool description.

**Origin:** `leep-paint-collection` run `20260517-1515` Phase 5 — two consecutive `ace:ocs-setup` subagent dispatches hit stream-idle timeouts (one at ~30 min / 49 tool calls, second at ~114 min / 44 tool calls) without writing any Drive artifacts. Both stalled at the same point.

## What was framed as the bug (early hypotheses, in order of refutation)

1. **RAG indexing wedge** — `ocs_wait_for_collection_indexing` polling forever. Refuted: the wedge happened before any indexing call.
2. **Per-prompt QA loop** — `ocs_send_test_message` taking minutes each. Refuted: Phase 5 never reached QA in either dispatch.
3. **Auth / re-login churn** — Playwright session expired, atoms retrying. Refuted: no auth errors in either transcript.
4. **OCS server slowness** — Refuted: every atom that ACTUALLY ran completed in seconds. `ocs_clone_chatbot` ~13s, `ocs_create_collection` ~2s, `ocs_upload_collection_files` ~4s.

## What the bug actually is

**Model-generation stall caused by inflated agent context.** Both dispatches followed the same pattern:

1. Agent built the RAG content pack on disk (PDD + summaries + test prompts ≈ 50-67 KB combined).
2. Agent ran `base64 <file >file.b64` via Bash to encode it (correct).
3. **Agent then `Read` the resulting `.b64` files in quarters back into its own context** so it could emit the b64 string as part of the `ocs_upload_collection_files` tool_use `input.files[].content` field.
4. The next assistant turn stalled with `API Error: Stream idle timeout - partial response received`. The stall happened either WHILE emitting the b64 (dispatch 2 had a 15-minute mid-stall before one upload landed) OR AFTER the upload returned cleanly but with the b64 still in context (dispatch 2 ran for another 90 minutes after a successful upload before terminating).

The root cause: the `ocs_upload_collection_files` MCP atom's `content` field required a base64 string the agent had to generate as output tokens. For non-trivial RAG payloads (10s of KB → 100s of KB of b64 ASCII), generating that many output tokens in a single tool_use input either stalls outright or accumulates enough context to stall the next turn.
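To make the size of the problem concrete, here is a small back-of-envelope sketch. The base64 expansion (4 output characters per 3 input bytes) is standard; the characters-per-token ratio is purely an illustrative assumption, not a measured value for any model, and `estimateOutputTokens` is a hypothetical helper name.

```typescript
// Base64 encodes every 3 input bytes as 4 ASCII characters (padded to a multiple of 4).
function base64Length(byteCount: number): number {
  return 4 * Math.ceil(byteCount / 3);
}

// ASSUMPTION: charsPerToken is a caller-supplied guess. Base64 tends to tokenize
// poorly, but the real ratio depends on the tokenizer and is not stated in this doc.
function estimateOutputTokens(chars: number, charsPerToken: number): number {
  return Math.ceil(chars / charsPerToken);
}

const payloadBytes = 67 * 1024; // the ~67 KB content pack from the incident
const encodedChars = base64Length(payloadBytes); // 91480 characters of b64

console.log({
  payloadBytes,
  encodedChars,
  // Illustrative only: at an assumed 2 chars/token this is 45740 output tokens.
  estimatedTokens: estimateOutputTokens(encodedChars, 2),
});
```

Whatever the true ratio is, every one of those characters has to be generated as output inside a single tool_use input when `content` mode is used, which is the budget that `file_path` mode avoids spending.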

## Proof (session-log evidence)

Source: `~/.claude/projects/-Users-jjackson-emdash-worktrees-ace-emdash-e2e-leep-paint-vsvc9/10b0a209-02b1-48ac-9c49-1a4a0309db96/subagents/agent-{a9539456f8738dccc,aabc8f9d0efda8e30}.jsonl`.

Dispatch 1 (`a9539456f8738dccc`): last 10 tool calls all `Read` calls on b64-chunk tmp files (`/tmp/b64_0_q1..q4.txt`). Final assistant text reads "Excellent. Now I have all 4 b64 chunks... The cleanest path: build the JSON in...". Next token never arrives. Stream-idle terminator. **22m 42s silent gap** between the last tool call and the timeout.

Dispatch 2 (`aabc8f9d0efda8e30`): same prefix shape (b64 chunks, Read calls), then one `ocs_upload_collection_files` succeeded with 3 files in 4s. Last assistant text reads "3 of 4 uploaded. Now upload the 4th." Next token never arrives. **1h 30m silent gap** before stream-idle. The successful upload's b64 was still in context, sufficient to stall the next turn.
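The silent gaps quoted above can be found mechanically. The sketch below assumes each transcript entry carries an ISO-8601 timestamp (the actual `.jsonl` schema is not documented here), and the timestamps in the usage example are synthetic values chosen to reproduce a 22m 42s gap, not copied from the session logs. `largestGapSeconds` is a hypothetical helper.

```typescript
// Returns the largest interval, in seconds, between consecutive timestamps.
// ASSUMPTION: entries expose ISO-8601 timestamps; extracting them from the
// transcript is left to the caller because the log schema is not shown here.
function largestGapSeconds(isoTimestamps: string[]): number {
  const millis = isoTimestamps
    .map((t) => Date.parse(t))
    .sort((a, b) => a - b);
  let largest = 0;
  for (let i = 1; i < millis.length; i++) {
    largest = Math.max(largest, millis[i] - millis[i - 1]);
  }
  return largest / 1000;
}

// Synthetic example: two quick tool calls, then silence until the timeout.
const syntheticTimestamps = [
  "2026-05-19T10:00:00Z",
  "2026-05-19T10:00:05Z",
  "2026-05-19T10:22:47Z",
];
const gapSeconds = largestGapSeconds(syntheticTimestamps);
console.log(`${Math.floor(gapSeconds / 60)}m ${gapSeconds % 60}s`); // prints "22m 42s"
```

A gap of many minutes with no tool call and no error is the signature that distinguishes this wedge from a slow or failing atom.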

OCS atoms never reached `ocs_wait_for_collection_indexing`, `ocs_set_chatbot_system_prompt`, `ocs_set_chatbot_pipeline`, or any QA step.

## Fix shipped (ACE v0.13.285)

`mcp/ocs-server.ts`: `ocs_upload_collection_files` was extended to accept `file_path` as an alternative source per file. The MCP reads the file server-side, so no b64 ever crosses the agent's context. A new exclusivity rule is enforced server-side: each file MUST supply EXACTLY ONE of `content` (legacy inline b64) or `file_path` (absolute filesystem path). Files that supply both or neither fail fast with a named error citing the offending file.

Refactor: the file-decoding logic moved into a standalone exported helper `decodeUploadCollectionFileSource` so it's unit-testable in isolation.
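The exactly-one-source rule can be sketched roughly as follows. This is not the shipped `decodeUploadCollectionFileSource`; the interface, the `UploadSourceError` class, and the function name `decodeSourceSketch` are all hypothetical, and only the behavior described above (one source per file, errors that name the file, filesystem errors propagating) is taken from this document.

```typescript
import { mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical input shape; the real schema lives in mcp/ocs-server.ts.
interface UploadFileSource {
  name: string;
  content?: string; // legacy inline base64
  file_path?: string; // absolute path, read server-side
}

// Hypothetical error type whose message always names the offending file.
class UploadSourceError extends Error {
  constructor(fileName: string, reason: string) {
    super(`file "${fileName}": ${reason}`);
    this.name = "UploadSourceError";
  }
}

function decodeSourceSketch(file: UploadFileSource): Buffer {
  const { name, content, file_path } = file;
  if (content !== undefined && file_path !== undefined) {
    throw new UploadSourceError(name, "supply exactly one of `content` or `file_path`, not both");
  }
  if (file_path !== undefined) {
    // Filesystem errors such as ENOENT propagate to the caller unchanged.
    return readFileSync(file_path);
  }
  if (content !== undefined) {
    return Buffer.from(content, "base64");
  }
  throw new UploadSourceError(name, "missing source: supply `content` or `file_path`");
}

// Usage: one file by path, one tiny file inline.
const sketchDir = mkdtempSync(join(tmpdir(), "ocs-upload-sketch-"));
const sketchPath = join(sketchDir, "pdd.md");
writeFileSync(sketchPath, "# PDD summary\n");

const fromPath = decodeSourceSketch({ name: "pdd.md", file_path: sketchPath });
const fromInline = decodeSourceSketch({
  name: "tiny.txt",
  content: Buffer.from("hi").toString("base64"),
});
console.log(fromPath.toString("utf8").trim(), "|", fromInline.toString("utf8"));
```

Checking the both-sources case before reading anything keeps the failure cheap and makes the error independent of whether the path happens to exist.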

7 new vitest cases in `test/mcp/ocs/unit/upload-collection-files-decoder.test.ts` covering:

- `file_path` reads UTF-8 text bytes verbatim
- `file_path` reads arbitrary binary bytes verbatim
- `content` (legacy) decodes inline b64
- Missing source → typed error naming the file
- Both sources → typed error naming the file
- ENOENT on missing file_path propagates cleanly
- Error message names the offending file

## Skill / agent-side guidance

Phase 5 `ocs-content-pack` + any future skill that calls `ocs_upload_collection_files` SHOULD use `file_path` for any payload > ~1KB. The pattern:

```ts
// Write the content to a tmp file via Bash. Never Read it back.
await Bash(`echo "$content" > /tmp/leep-rag/pdd-summary.md`);
// Or: drive_download_binary into a tmp path for files already on Drive.
await Bash(`drive_download_binary ... | base64 -d > /tmp/leep-rag/pdd.md`);

// Then upload by reference:
await ocs_upload_collection_files({
  collection_id: 123,
  files: [{
    name: 'pdd.md',
    file_path: '/tmp/leep-rag/pdd.md', // absolute path; MCP reads + b64s server-side
    mime_type: 'text/markdown',
  }],
});
```

DO NOT `Read` the `.b64` files. DO NOT `Read` the original markdown files into context if all you're going to do is re-emit them through the upload tool — that's the wedge.

## Generalization

This is a different shape from the 50-char slug trap (#347/#1195) and the `short_description` 50-char trap (`docs/learnings/2026-05-12-connect-opp-short-description-50-char-trap.md`):

| | short_description / slug trap | b64-context wedge |
|---|---|---|
| Layer | Connect DB column / serializer | Agent context / model generation |
| Failure shape | Opaque HTTP 500 with empty body | Stream-idle timeout (no error response, just stall) |
| Pre-fix preventer | Column width / serializer validation | None |
| Post-fix preventer | Zod cap / CCZ projection gate | MCP atom accepts file_path (caller never holds payload) |
| Class | Postgres column overflow | Output-token budget exhaustion |

**Generalized boundary-probe candidate (for the registry):** any MCP atom whose input schema accepts large binary-or-encoded content as a string parameter is a wedge candidate. The systemic fix is to give every such atom a `file_path` (or `drive_file_id`) alternate source so the agent never holds the payload as output tokens. Audit candidates today:

- `commcare_upload_multimedia` — already has `file_bytes_path` (correct pattern; this PR's `file_path` adoption matches it)
- `drive_upload_binary` — currently inline `content` only; same wedge class
- `ocs_upload_collection_files` — fixed by this PR
- `commcare_patch_xform` — has both `new_xform_xml` and `new_xform_xml_path`; correct
- `drive_create_file` / `drive_update_file` — content is typically text (markdown, YAML), so the wedge bound is higher but still real for >100KB docs

The registry entry under § Shipped probes would name "input-payload size at MCP atom boundary" as the class, with this PR + `commcare_upload_multimedia`'s `file_bytes_path` + `commcare_patch_xform`'s `new_xform_xml_path` as the existing instances.
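The audit described above could start as a simple name-based scan. The sketch below is one possible heuristic, not an existing tool: `ToolSchemaSketch`, `PAYLOAD_FIELD_HINT`, and `findWedgeCandidates` are hypothetical, and the fixture lists only the field names this document mentions, not the atoms' full schemas.

```typescript
// Hypothetical, simplified view of an MCP tool's input schema:
// just the tool name and the names of its string-typed fields.
interface ToolSchemaSketch {
  tool: string;
  stringFields: string[];
}

// ASSUMPTION: payload-carrying fields can be recognized by name. A real audit
// would also need size hints or a manual review pass.
const PAYLOAD_FIELD_HINT = /(^|_)(content|bytes|xml)$/;

// Flags tools that take a payload-like string but offer no `*_path` alternative.
function findWedgeCandidates(tools: ToolSchemaSketch[]): string[] {
  return tools
    .filter((t) => {
      const hasPayloadField = t.stringFields.some((f) => PAYLOAD_FIELD_HINT.test(f));
      const hasPathCompanion = t.stringFields.some((f) => f.endsWith("_path"));
      return hasPayloadField && !hasPathCompanion;
    })
    .map((t) => t.tool);
}

// Illustrative fixture built from the field names cited in this document.
const auditFixture: ToolSchemaSketch[] = [
  { tool: "ocs_upload_collection_files", stringFields: ["content", "file_path"] },
  { tool: "commcare_patch_xform", stringFields: ["new_xform_xml", "new_xform_xml_path"] },
  { tool: "drive_upload_binary", stringFields: ["content"] },
  { tool: "drive_create_file", stringFields: ["content"] },
];

const wedgeCandidates = findWedgeCandidates(auditFixture);
console.log(wedgeCandidates); // [ 'drive_upload_binary', 'drive_create_file' ]
```

The check is deliberately per tool rather than per field, because the companion is not always named `<field>_path` (here `content` pairs with `file_path`).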

## See also

- `docs/learnings/2026-05-12-boundary-probe-registry.md` — registry update will add this as Shipped probe.
- `mcp/ocs-server.ts` § `ocs_upload_collection_files` — the fix.
- `test/mcp/ocs/unit/upload-collection-files-decoder.test.ts` — the tests.
- Session log subagent transcripts (above) — the bisect evidence.