jjackson · jjackson · May 19, 2026
diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json
@@ -6,13 +6,13 @@
     "url": "https://github.com/jjackson"
   },
   "metadata": {
-    "version": "0.13.277"
+    "version": "0.13.285"
   },
   "plugins": [
     {
       "name": "ace",
       "source": "./",
-      "version": "0.13.277",
+      "version": "0.13.285",
       "description": "AI Connect Engine — orchestrates the CRISPR-Connect lifecycle from idea through app building, Connect setup, LLO management, and closeout"
     }
   ]

diff --git a/.claude-plugin/plugin.json b/.claude-plugin/plugin.json
@@ -1,6 +1,6 @@
 {
   "name": "ace",
-  "version": "0.13.277",
+  "version": "0.13.285",
   "description": "AI Connect Engine — orchestrates the CRISPR-Connect lifecycle from idea through app building, Connect setup, LLO management, and closeout",
   "author": {
     "name": "Jonathan Jackson",

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -5,6 +5,20 @@ All notable changes to the ACE plugin will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and the plugin follows [semantic versioning](https://semver.org/spec/v2.0.0.html).
 
+## 0.13.285 — 2026-05-19
+
+**Add `file_path` mode to `ocs_upload_collection_files` — close the b64 context wedge that stalled Phase 5 twice.**
+
+Two consecutive `ace:ocs-setup` dispatches on `leep-paint-collection/20260517-1515` hit stream-idle timeouts (the first at ~30 min / 49 tool calls, the second at ~114 min / 44 tool calls) without writing any Drive artifacts. Session-log bisect (see `docs/learnings/2026-05-19-ocs-upload-b64-context-wedge.md`) pinned the cause: the agent built its RAG content pack on disk (~67 KB), `base64`-encoded it via Bash, then **`Read` the resulting `.b64` chunks back into its own context** so it could emit them as the `ocs_upload_collection_files` tool_use `input.files[].content` field. Emitting tens to hundreds of KB of b64 as output tokens stalls model generation, either mid-emit or on the next turn. No OCS slowness, no auth churn, no QA loop: pure output-token budget exhaustion.
+
+Fix: `ocs_upload_collection_files` extended to accept `file_path` as an alternative source per file. The MCP reads the file server-side, so no b64 ever crosses the agent's context. Each file MUST supply EXACTLY ONE of `content` (legacy inline b64) or `file_path` (absolute filesystem path); mixed or missing sources fail fast with a named error citing the offending file.
+
+Refactor: the file-decoding logic moved into `decodeUploadCollectionFileSource`, exported for unit-testability. 7 new vitest cases (UTF-8 text via file_path, arbitrary binary via file_path, inline content legacy mode, missing source, both sources, ENOENT propagation, error names the offending file). All pass.
+
+Skill-side guidance: Phase 5 `ocs-content-pack` and any future skill calling this atom SHOULD use `file_path` for any payload > ~1KB. For files on Drive, `drive_download_binary` into a tmp path first, then pass that as `file_path` — keeps the b64 entirely out of agent context.
+
+`docs/learnings/2026-05-12-boundary-probe-registry.md` updated with the new Shipped probe + a new pending row generalizing the audit ("every MCP atom whose input schema takes a `string` that may carry > ~10KB of payload should have a `_path` companion"; existing examples: `commcare_upload_multimedia.file_bytes_path`, `commcare_patch_xform.new_xform_xml_path`).
+
 ## 0.13.277 — 2026-05-18
 
 **Mirror Vellum's slug/name separation in the Nova architect brief (follow-up to 0.13.274).**
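
Reviewer sketch (not part of the diff): the exactly-one-source rule described in the 0.13.285 entry above could look roughly like the following. `UploadFileSpec`, `UploadSourceError`, and `decodeUploadSourceSketch` are hypothetical names chosen for illustration; the shipped helper is `decodeUploadCollectionFileSource` in `mcp/ocs-server.ts`, and its actual signature and error type are not shown in this diff.

```typescript
import { readFile } from "node:fs/promises";

// Hypothetical per-file input shape; the real schema may carry more fields.
interface UploadFileSpec {
  name: string;
  content?: string;   // legacy mode: inline base64
  file_path?: string; // new mode: absolute path, read server-side
}

// Hypothetical named error whose message cites the offending file.
class UploadSourceError extends Error {
  constructor(fileName: string, reason: string) {
    super(`ocs_upload_collection_files: file "${fileName}" ${reason}`);
    this.name = "UploadSourceError";
  }
}

// Resolve one file entry to raw bytes, enforcing exactly one source.
async function decodeUploadSourceSketch(file: UploadFileSpec): Promise<Buffer> {
  const { name, content, file_path } = file;
  if (content !== undefined && file_path !== undefined) {
    throw new UploadSourceError(name, "supplies both `content` and `file_path`; supply exactly one");
  }
  if (file_path !== undefined) {
    // Filesystem errors such as ENOENT propagate to the caller unchanged.
    return readFile(file_path);
  }
  if (content !== undefined) {
    // Note: Node's base64 decoder is lenient and skips invalid characters.
    return Buffer.from(content, "base64");
  }
  throw new UploadSourceError(name, "supplies neither `content` nor `file_path`; supply exactly one");
}
```

With this shape, the caller passes only a short path string in `file_path` mode, so the payload bytes never have to appear in the tool call input.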

diff --git a/VERSION b/VERSION
@@ -1 +1 @@
-0.13.277
+0.13.285
diff --git a/docs/learnings/2026-05-12-boundary-probe-registry.md b/docs/learnings/2026-05-12-boundary-probe-registry.md
@@ -21,6 +21,7 @@ ACE has a pattern called **boundary probes**: load-bearing client-side pre-fligh
 | `mobile_probe_maestro_driver` | `mcp/mobile-server.ts:143` (atom) → `mcp/mobile/client.ts:169` (impl); auto-invoked by `mobile_ensure_avd_running` | Maestro driver gRPC health check + auto-heal. Catches the "AVD up, Maestro driver wedged" case where every recipe times out without a recipe-side error. | PR #233 — commit `8b6e4f0` ("auto-heal Maestro driver in mobile_ensure_avd_running"). |
 | `connect_preflight_learn_app_user` | `mcp/connect/backends/commcare-preflight.ts` (atom impl) + `mcp/connect-server.ts` (wiring); recommended caller `skills/connect-opp-setup/SKILL.md` Step 7.5 | Auth / domain / user-conflict failures on the CCHQ side of `POST /users/start_learn_app/` — rotated API key, archived domain, CCHQ outage, already-linked-to-different-ConnectID user. Surfaces as structured `{ok, action, reason}` outcome before Phase 6 boots the AVD instead of as a runtime client-side noop. | PR #249 (commit `8677225`). |
 | `app-release` CCZ slug-length projection | `mcp/connect/backends/commcare.ts` (`SLUG_LENGTH_LIMIT`, `simulateConnectSync.oversized_slugs`, `max_slug_length`) + `skills/app-release/SKILL.md` § Step 6 (BLOCKER gate) + `skills/pdd-to-{learn,deliver}-app/SKILL.md` (architect-brief REQUIRED clause) | Connect's `LearnModule.slug` / `DeliverUnit.slug` are `SlugField()` with the default `max_length=50`. Nova's `compile_app` derives slugs as `module_<index>_<slugified-name>`; module names ≥ ~40 chars overflow. The DB INSERT raises Postgres `DataError: value too long for type character varying(50)`, which falls through `program/api/views.py:102`'s narrow except and surfaces as HTTP 500 with empty body from `connect_create_opportunity`. Same shape as the 2026-05-12 `short_description` 50-char trap but at the CCZ extract path rather than the serializer — so the *generalized serializer-vs-model length probe* (still pending below) would NOT have caught it. This is a sibling probe at a different boundary. | `docs/learnings/2026-05-17-connect-slug-length-50-char-trap.md` + reproducer in `leep-paint-collection/20260517-1515` Phase 4 (module name "Stage 2: Sample Preparation, Drying, Bagging, Shipment" → slug `module_6_stage_2_sample_prep_drying_bagging_shipment`, 52 chars). |
+| `ocs_upload_collection_files` `file_path` mode | `mcp/ocs-server.ts` (`decodeUploadCollectionFileSource`) + tests at `test/mcp/ocs/unit/upload-collection-files-decoder.test.ts` | Output-token budget exhaustion: caller-supplied `content` (base64) inputs > ~10KB stall model generation mid-tool_use-emit, surfacing as `API Error: Stream idle timeout - partial response received` with no actionable diagnostic. `file_path` mode lets the MCP read + b64-encode server-side so the agent never holds the payload as output tokens. Exactly-one-source-per-file invariant enforced (rejects both / neither). Companion examples already shipped at this same class boundary: `commcare_upload_multimedia.file_bytes_path` and `commcare_patch_xform.new_xform_xml_path`. | `docs/learnings/2026-05-19-ocs-upload-b64-context-wedge.md` + reproducer in `leep-paint-collection/20260517-1515` Phase 5 (two consecutive `ace:ocs-setup` dispatches at ~30min/49 calls + ~114min/44 calls both stalled mid-b64-emit on a ~67KB PDD payload; no Drive artifacts written). |
 
 ## Pending probes
 
@@ -31,6 +32,7 @@ Class-level preventers we know are needed but haven't shipped:
 | **Selector-map currency probe** — `bin/ace-doctor --preflight` cross-checks recipes in `mcp/mobile/recipes/static/` vs `mcp/mobile/selectors/<APK>.yaml` for the deployed APK version | Recipes go stale when Connect APK ships UI changes; current symptom is silent `btn_start` no-op at recipe runtime. Same class as `cloud_emu` but for selector-map vs deployed-APK skew. | Implied by `CLAUDE.md` ("`REPLACE_*` selectors that must be filled via `maestro studio` against the Connect APK before live runs") + the 2026-04-30 `btn_start` noop refuted in commit `caba0b8`. |
 | **`mobile_resolve_selectors` at Phase 2 authoring gate** — shift-left of the Phase 5 selector-resolution gate into `app-test-cases` | Same selector-currency class as above but a *producer-side* preventer (catch at authoring time, not at runtime). Currently the only check is the Phase 5 recipe-execution gate; an authoring-gate probe would fail closed before any mobile run. | Sibling of selector-map currency; surfaces when authoring touches a recipe whose selector map hasn't been re-resolved for the current APK. |
 | **Generalized serializer-vs-model length probe** — pattern-match across commcare-connect's `CharField` definitions, surface mismatches at MCP startup | The `short_description` 50-char trap is one instance; any other field where DRF serializer `max_length` exceeds the model `max_length` is the same bug class. A static scan over commcare-connect's `models.py` + `serializers.py` would surface all candidates as Zod caps. **Note 2026-05-17:** the slug-length trap (now Shipped above) was the SAME class but at a *different* boundary — the slug isn't sent through any serializer, it's derived server-side from CCZ XML. A truly generalized probe should walk every `Char`/`SlugField` in commcare-connect's models AND check both serializer-fed paths and CCZ-extracted paths (the latter is what `app-release` Step 6 now does for the slug case specifically). | Generalization of `docs/learnings/2026-05-12-connect-opp-short-description-50-char-trap.md § Generalization` + `docs/learnings/2026-05-17-connect-slug-length-50-char-trap.md`. |
+| **Generalized "MCP atom accepts large payload as string param" audit** — pattern-match every MCP atom whose input schema takes a `string` field that may carry > ~10KB of base64 / XML / JSON, and add a `_path` companion field where missing | The 2026-05-19 b64-context wedge (now Shipped above) is one instance. Companion atoms that already follow the pattern correctly: `commcare_upload_multimedia.file_bytes_path`, `commcare_patch_xform.new_xform_xml_path`. Still missing: `drive_upload_binary.content` (inline b64 only); `drive_create_file.content` and `drive_update_file.content` (text, but unbounded — > 100KB Drive docs could stall). A static scan of MCP server tool schemas would enumerate every remaining wedge candidate. | Generalization of `docs/learnings/2026-05-19-ocs-upload-b64-context-wedge.md § Generalization`. |
 
 ## Pattern characteristics
 

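
Reviewer sketch (not part of the diff): the pending "large payload as string param" audit row above could start from a static scan along these lines. Everything here is an assumption made for illustration: the `SchemaNode` and `ToolDescriptor` shapes, the field-name heuristic in `PAYLOAD_NAME`, the companion rule (`<field>_path` or `file_path`), and the example schemas, which only mirror the field names mentioned in this PR and not the real tool definitions.

```typescript
// Minimal JSON-Schema-like subset (hypothetical; real MCP schemas are richer).
interface SchemaNode {
  type?: string;
  maxLength?: number;
  properties?: Record<string, SchemaNode>;
  items?: SchemaNode;
}

interface ToolDescriptor {
  name: string;
  inputSchema: SchemaNode;
}

const LARGE_PAYLOAD_BYTES = 10 * 1024; // the "~10KB" threshold from the registry row
const PAYLOAD_NAME = /(content|bytes|xml|json|b64|base64)/i; // heuristic, an assumption

// Return "tool.field" for payload-like string fields with no path companion.
function findWedgeCandidates(tools: ToolDescriptor[]): string[] {
  const hits: string[] = [];
  const walk = (tool: string, node: SchemaNode, prefix: string): void => {
    const props = node.properties ?? {};
    const names = Object.keys(props);
    for (const [field, child] of Object.entries(props)) {
      const where = prefix ? `${prefix}.${field}` : field;
      const unbounded = child.maxLength === undefined || child.maxLength > LARGE_PAYLOAD_BYTES;
      const hasCompanion = names.includes(`${field}_path`) || names.includes("file_path");
      if (
        child.type === "string" &&
        PAYLOAD_NAME.test(field) &&
        !field.endsWith("_path") &&
        unbounded &&
        !hasCompanion
      ) {
        hits.push(`${tool}.${where}`);
      }
      if (child.type === "object") walk(tool, child, where);
      if (child.type === "array" && child.items) walk(tool, child.items, `${where}[]`);
    }
  };
  for (const t of tools) walk(t.name, t.inputSchema, "");
  return hits;
}

// Illustrative schemas only; field lists are guesses based on names in this PR.
const exampleTools: ToolDescriptor[] = [
  {
    name: "drive_upload_binary",
    inputSchema: { type: "object", properties: { name: { type: "string" }, content: { type: "string" } } },
  },
  {
    name: "commcare_patch_xform",
    inputSchema: {
      type: "object",
      properties: { new_xform_xml: { type: "string" }, new_xform_xml_path: { type: "string" } },
    },
  },
  {
    name: "ocs_upload_collection_files",
    inputSchema: {
      type: "object",
      properties: {
        collection_id: { type: "number" },
        files: {
          type: "array",
          items: {
            type: "object",
            properties: {
              name: { type: "string" },
              content: { type: "string" },
              file_path: { type: "string" },
              mime_type: { type: "string" },
            },
          },
        },
      },
    },
  },
];

console.log(findWedgeCandidates(exampleTools));
```

On these illustrative schemas the scan reports only `drive_upload_binary.content`, which matches the "still missing" example in the registry row. A name-based heuristic will miss payload fields with unusual names, so a real audit would likely pair it with a manual pass.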
diff --git a/docs/learnings/2026-05-19-ocs-upload-b64-context-wedge.md b/docs/learnings/2026-05-19-ocs-upload-b64-context-wedge.md
@@ -0,0 +1,100 @@
+# `ocs_upload_collection_files` Inline-Base64 Context Wedge (Phase 5 Stream-Idle Timeout)
+
+**Status:** Mitigated in ACE v0.13.285 via `file_path` mode on `ocs_upload_collection_files`. Old inline `content` mode preserved for tiny strings (back-compat) but discouraged via tool description.
+
+**Origin:** `leep-paint-collection` run `20260517-1515` Phase 5 — two consecutive `ace:ocs-setup` subagent dispatches hit stream-idle timeouts (one at ~30 min / 49 tool calls, second at ~114 min / 44 tool calls) without writing any Drive artifacts. Both stalled at the same point.
+
+## What was framed as the bug (early hypotheses, in order of refutation)
+
+1. **RAG indexing wedge** — `ocs_wait_for_collection_indexing` polling forever. Refuted: the wedge happened before any indexing call.
+2. **Per-prompt QA loop** — `ocs_send_test_message` taking minutes each. Refuted: Phase 5 never reached QA in either dispatch.
+3. **Auth / re-login churn** — Playwright session expired, atoms retrying. Refuted: no auth errors in either transcript.
+4. **OCS server slowness** — Refuted: every atom that ACTUALLY ran completed in seconds. `ocs_clone_chatbot` ~13s, `ocs_create_collection` ~2s, `ocs_upload_collection_files` ~4s.
+
+## What the bug actually is
+
+**Model-generation stall caused by inflated agent context.** Both dispatches followed the same pattern:
+
+1. Agent built the RAG content pack on disk (PDD + summaries + test prompts ≈ 50-67 KB combined).
+2. Agent ran `base64 <file >file.b64` via Bash to encode it (correct).
+3. **Agent then `Read` the resulting `.b64` files in quarters back into its own context** so it could emit the b64 string as part of the `ocs_upload_collection_files` tool_use `input.files[].content` field.
+4. A subsequent assistant turn stalled with `API Error: Stream idle timeout - partial response received`. The stall came either WHILE emitting the b64 (dispatch 2 had a 15-minute mid-stall before one upload landed) OR AFTER an upload returned cleanly with the b64 still in context (dispatch 2 ran for another 90 minutes after a successful upload before terminating).
+
+The root cause: the `ocs_upload_collection_files` MCP atom's `content` field required a base64 string the agent had to generate as output tokens. For non-trivial RAG payloads (10s of KB → 100s of KB of b64 ASCII), generating that many output tokens in a single tool_use input either stalls outright or accumulates enough context to stall the next turn.
+
+## Proof (session-log evidence)
+
+Source: `~/.claude/projects/-Users-jjackson-emdash-worktrees-ace-emdash-e2e-leep-paint-vsvc9/10b0a209-02b1-48ac-9c49-1a4a0309db96/subagents/agent-{a9539456f8738dccc,aabc8f9d0efda8e30}.jsonl`.
+
+Dispatch 1 (`a9539456f8738dccc`): last 10 tool calls all `Read` calls on b64-chunk tmp files (`/tmp/b64_0_q1..q4.txt`). Final assistant text reads "Excellent. Now I have all 4 b64 chunks... The cleanest path: build the JSON in...". Next token never arrives. Stream-idle terminator. **22m 42s silent gap** between the last tool call and the timeout.
+
+Dispatch 2 (`aabc8f9d0efda8e30`): same prefix shape (b64 chunks, Read calls), then one `ocs_upload_collection_files` succeeded with 3 files in 4s. Last assistant text reads "3 of 4 uploaded. Now upload the 4th." Next token never arrives. **1h 30m silent gap** before stream-idle. The successful upload's b64 was still in context, sufficient to stall the next turn.
+
+OCS atoms never reached `ocs_wait_for_collection_indexing`, `ocs_set_chatbot_system_prompt`, `ocs_set_chatbot_pipeline`, or any QA step.
+
+## Fix shipped (ACE v0.13.285)
+
+`mcp/ocs-server.ts`: `ocs_upload_collection_files` extended to accept `file_path` as an alternative source per file. The MCP reads the file server-side, so no b64 ever crosses the agent's context. New exclusivity rule enforced server-side: each file MUST supply EXACTLY ONE of `content` (legacy inline b64) or `file_path` (absolute filesystem path). Mixed or missing sources fail fast with a named error citing the offending file.
+
+Refactor: the file-decoding logic moved into a standalone exported helper `decodeUploadCollectionFileSource` so it's unit-testable in isolation.
+
+7 new vitest cases in `test/mcp/ocs/unit/upload-collection-files-decoder.test.ts` covering:
+
+- `file_path` reads UTF-8 text bytes verbatim
+- `file_path` reads arbitrary binary bytes verbatim
+- `content` (legacy) decodes inline b64
+- Missing source → typed error naming the file
+- Both sources → typed error naming the file
+- ENOENT on missing file_path propagates cleanly
+
+## Skill / agent-side guidance
+
+Phase 5 `ocs-content-pack` + any future skill that calls `ocs_upload_collection_files` SHOULD use `file_path` for any payload > ~1KB. The pattern:
+
+```ts
+// Write the content to a tmp file via Bash. Never Read it back.
+await Bash(`echo "$content" > /tmp/leep-rag/pdd-summary.md`);
+// Or: drive_download_binary into a tmp path for files already on Drive.
+await Bash(`drive_download_binary ... | base64 -d > /tmp/leep-rag/pdd.md`);
+
+// Then upload by reference:
+await ocs_upload_collection_files({
+  collection_id: 123,
+  files: [{
+    name: 'pdd.md',
+    file_path: '/tmp/leep-rag/pdd.md',  // absolute path; MCP reads + b64s server-side
+    mime_type: 'text/markdown',
+  }],
+});
+```
+
+DO NOT `Read` the `.b64` files. DO NOT `Read` the original markdown files into context if all you're going to do is re-emit them through the upload tool — that's the wedge.
+
+## Generalization
+
+This is a different shape from the 50-char slug trap (#347/#1195) and the `short_description` 50-char trap (`docs/learnings/2026-05-12-connect-opp-short-description-50-char-trap.md`):
+
+| | short_description / slug trap | b64-context wedge |
+|---|---|---|
+| Layer | Connect DB column / serializer | Agent context / model generation |
+| Failure shape | Opaque HTTP 500 with empty body | Stream-idle timeout (no error response, just stall) |
+| Pre-fix preventer | Column width / serializer validation | None |
+| Post-fix preventer | Zod cap / CCZ projection gate | MCP atom accepts file_path (caller never holds payload) |
+| Class | Postgres column overflow | Output-token budget exhaustion |
+
+**Generalized boundary-probe candidate (for the registry):** any MCP atom whose input schema accepts large binary-or-encoded content as a string parameter is a wedge candidate. The systemic fix is to give every such atom a `file_path` (or `drive_file_id`) alternate source so the agent never holds the payload as output tokens. Audit candidates today:
+
+- `commcare_upload_multimedia` — already has `file_bytes_path` (correct pattern; this PR's `file_path` adoption matches it)
+- `drive_upload_binary` — currently inline `content` only; same wedge class
+- `ocs_upload_collection_files` — fixed by this PR
+- `commcare_patch_xform` — has both `new_xform_xml` and `new_xform_xml_path`; correct
+- `drive_create_file` / `drive_update_file` — content is typically text (markdown, YAML), so the wedge bound is higher but still real for >100KB docs
+
+The registry entry under § Shipped probes would name "input-payload size at MCP atom boundary" as the class, with this PR + `commcare_upload_multimedia`'s `file_bytes_path` + `commcare_patch_xform`'s `new_xform_xml_path` as the existing instances.
+
+## See also
+
+- `docs/learnings/2026-05-12-boundary-probe-registry.md` — registry update will add this as Shipped probe.
+- `mcp/ocs-server.ts` § `ocs_upload_collection_files` — the fix.
+- `test/mcp/ocs/unit/upload-collection-files-decoder.test.ts` — the tests.
+- Session log subagent transcripts (above) — the bisect evidence.
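
Reviewer sketch (not part of the diff): the size reasoning behind the "> ~1KB" guidance in the learnings doc above can be made concrete with standard base64 arithmetic, where every 3 input bytes become 4 output characters (with padding). `base64Length`, `shouldUseFilePath`, and `INLINE_LIMIT_BYTES` are hypothetical names, and treating "~1KB" as 1024 bytes is an assumption.

```typescript
// Length in characters of the padded base64 encoding of `byteLength` bytes.
function base64Length(byteLength: number): number {
  return 4 * Math.ceil(byteLength / 3);
}

// Assumed reading of the "> ~1KB" guidance as a hard 1024-byte cutoff.
const INLINE_LIMIT_BYTES = 1024;

// True when a payload should be uploaded by path rather than inline.
function shouldUseFilePath(byteLength: number): boolean {
  return byteLength > INLINE_LIMIT_BYTES;
}

// A 67 KiB content pack (68,608 bytes) encodes to 91,480 base64 characters,
// all of which the agent would have had to emit in inline `content` mode.
console.log(base64Length(67 * 1024), shouldUseFilePath(67 * 1024));
```

The encoded size grows by about a third over the raw size, which is why a pack of several tens of KB on disk lands near 90 KB once inlined as b64.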