Merge upstream GBrain v0.36.5 while preserving Eva OpenClaw defaults #104
Open
100yenadmin wants to merge 19 commits into
… placement (garrytan#1053) * refactor(mcp): centralize ParamDef→JSON Schema via shared paramDefToSchema Three duplicate inline mappers existed across the MCP surface: - src/mcp/tool-defs.ts (stdio MCP buildToolDefs) - src/commands/serve-http.ts:837 (live HTTP MCP tools/list) - src/core/minions/tools/brain-allowlist.ts:84 (subagent tool registry) Each had subtly different items propagation. The HTTP MCP variant dropped items entirely, leaving extract_facts.entity_hints broken for OAuth- authenticated remote agents even after a buildToolDefs-only patch. The subagent variant propagated one level of items but used the same shallow shape so nested arrays would silently drop. Extract a single recursive paramDefToSchema helper exported from src/mcp/tool-defs.ts and have all three mappers consume it. Closes the bug class at the architecture level instead of patching one site at a time. The helper copies type, description, enum, default, and recursively rebuilds items so array-of-arrays preserves inner shape. Key ordering (type, description, enum, default, items) matches the pre-v0.34 inline mappers so JSON.stringify output stays byte-stable for every existing operation that does not use nested arrays. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(schema): add items to extract_facts.entity_hints and handle-to-tweet candidates Two array fields shipped without the items property required by JSON Schema. Strict-mode validators (Gemini Pro structured outputs, OpenAI strict tool definitions) reject the entire schema when any type:'array' lacks items. Downstream agents on those providers couldn't use extract_facts or the x_handle_to_tweet resolver. extract_facts.entity_hints — declared items: { type: 'string' } matching the handler at src/core/operations.ts:2733 which already coerces the runtime value to string[]. handle_to_tweet outputSchema.candidates — full XTweetCandidate spec including required + additionalProperties: false. 
The XTweetCandidate TypeScript interface declares all five fields as required; without required in the JSON Schema, a validator would accept {} as a valid candidate. additionalProperties: false closes the OpenAI strict-mode contract. 19 community PRs (garrytan#1028 garrytan#999 garrytan#980 garrytan#979 garrytan#910 garrytan#904 garrytan#847 garrytan#832 garrytan#863 garrytan#862 garrytan#812 for entity_hints; garrytan#910 caught candidates) converged on these locations. This wave cherry-picks the deepest variant (garrytan#910 surfaced both bugs) and centralizes via the paramDefToSchema helper from the preceding commit so the live HTTP MCP tools/list path is also fixed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: DmitryBMsk (PR garrytan#910) * fix(git-remote): move --no-recurse-submodules after the subcommand verb Git CLI accepts two flag positions: git [global -c flags] <subcommand> [subcommand flags] [args] Global -c config flags belong before the verb. Subcommand-specific flags (like --no-recurse-submodules) belong after. Pre-v0.34 GIT_SSRF_FLAGS spliced both kinds before the verb, so cloneRepo invoked: git -c http.followRedirects=false ... --no-recurse-submodules clone URL DIR Real git rejects this with exit 129 ("unknown option: --no-recurse-submodules") because --no-recurse-submodules is a clone subcommand flag, not a global config flag. Every remote-source clone broke in production from v0.28 onward. The fake-git harness in test/git-remote.test.ts exits 0 regardless of argv shape, which is why CI never caught it. Split GIT_SSRF_FLAGS (3 -c config flags, spread BEFORE the verb) from GIT_SSRF_SUBCOMMAND_FLAGS (--no-recurse-submodules, spread AFTER the verb). cloneRepo and pullRepo both spread the new constant after their respective verbs. The constant names signal the position rule so future additions land in the right place. 
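The flag-position rule described above can be sketched as follows. This is a minimal illustration, not the repository's code: the two constant names come from the commit message, while `buildGitArgv` and `subcommandFlagsFollowVerb` are hypothetical helpers. The message says `GIT_SSRF_FLAGS` holds three global `-c` entries; only the one it quotes is reproduced here.

```typescript
// Sketch only. Constant names are from the commit message; the helpers are hypothetical.
// The real GIT_SSRF_FLAGS reportedly has three -c entries; only the quoted one is shown.
const GIT_SSRF_FLAGS: readonly string[] = ["-c", "http.followRedirects=false"];
const GIT_SSRF_SUBCOMMAND_FLAGS: readonly string[] = ["--no-recurse-submodules"];

// Build argv (without the leading "git") as:
//   [global -c flags] <verb> [subcommand flags] [args]
function buildGitArgv(verb: "clone" | "pull", args: readonly string[]): string[] {
  return [...GIT_SSRF_FLAGS, verb, ...GIT_SSRF_SUBCOMMAND_FLAGS, ...args];
}

// Position guard in the spirit of the regression test: every subcommand flag
// must appear after the verb's index. A missing flag or verb also fails.
function subcommandFlagsFollowVerb(argv: readonly string[], verb: string): boolean {
  const verbIndex = argv.indexOf(verb);
  if (verbIndex < 0) return false;
  return GIT_SSRF_SUBCOMMAND_FLAGS.every((flag) => argv.indexOf(flag) > verbIndex);
}

const cloneArgv = buildGitArgv("clone", ["https://example.com/repo.git", "dest"]);
console.log(cloneArgv.join(" "));
```

Asserting on the verb index rather than on a pinned argv prefix is what keeps the test from baking the wrong order into a snapshot.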
7 community PRs converged on this location (garrytan#1023 garrytan#1020 garrytan#985 garrytan#963 garrytan#846 garrytan#842 — garrytan#800 doesn't exist). This wave cherry-picks the semantic- constant approach from garrytan#846's GIT_SSRF_SUBCOMMAND_FLAGS name (the clearest signal of the position rule). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(mcp+git+resolvers): structural array-items + subcommand-position guards Three new tests / test groups close the bug classes the wave fixes: test/mcp-tool-defs.test.ts — recursive structural guard walks every operation's inputSchema and fails with a property path if any type:'array' lacks items.type. Explicit fixture assertions for extract_facts.entity_hints.items.type and a synthetic nested-array ParamDef pinning items.items.type recursion. Without the explicit fixtures the legacyInlineMap byte-equality test is mirror-theater — mirroring both sides of the equality preserves the blind spot. test/git-remote.test.ts — split snapshot test into GIT_SSRF_FLAGS (3 global -c entries) and GIT_SSRF_SUBCOMMAND_FLAGS (--no-recurse-submodules). cloneRepo + pullRepo argv tests now assert the subcommand flag appears AFTER the verb index. Pre-v0.34 the pinned argv slice prefix included --no-recurse-submodules, which baked the bug into the test suite (codex catch). test/resolvers.test.ts — recursive walk over both inputSchema AND outputSchema for builtin resolvers (xHandleToTweetResolver, urlReachableResolver). Explicit imports rather than getDefaultRegistry(), which starts empty until commands/resolvers.ts runs — codex catch on a hollow-walk failure mode. Dedicated case pins candidates items shape including required + additionalProperties. Reference legacyInlineMap in mcp-tool-defs.test.ts mirrors the new recursive paramDefToSchema helper. No current op uses nested arrays so the byte-equality test stays green for every existing operation. 
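A rough sketch of the recursive helper and the structural guard described in these commits. The `ParamDef` shape here is an assumption (the real type lives in the repository and may carry more fields), and `findArraysWithoutItems` is a hypothetical name for the guard; only the key order and the recursion into `items` are taken from the commit text.

```typescript
// Hypothetical ParamDef shape; the repository's real type may differ.
interface ParamDef {
  type: string;
  description?: string;
  enum?: readonly string[];
  default?: unknown;
  items?: ParamDef;
}

type JsonSchema = Record<string, unknown>;

// Copy fields in the order the commit names (type, description, enum, default, items)
// and recurse into items so an array of arrays keeps its inner shape.
function paramDefToSchema(def: ParamDef): JsonSchema {
  const schema: JsonSchema = { type: def.type };
  if (def.description !== undefined) schema.description = def.description;
  if (def.enum !== undefined) schema.enum = [...def.enum];
  if (def.default !== undefined) schema.default = def.default;
  if (def.items !== undefined) schema.items = paramDefToSchema(def.items);
  return schema;
}

// Structural guard: return the property path of every array that lacks items.type.
function findArraysWithoutItems(schema: JsonSchema, path = "$"): string[] {
  const missing: string[] = [];
  if (schema.type === "array") {
    const items = schema.items as JsonSchema | undefined;
    if (!items || typeof items.type !== "string") missing.push(path);
    else missing.push(...findArraysWithoutItems(items, `${path}.items`));
  }
  const props = schema.properties as Record<string, JsonSchema> | undefined;
  for (const [key, child] of Object.entries(props ?? {})) {
    missing.push(...findArraysWithoutItems(child, `${path}.${key}`));
  }
  return missing;
}

const entityHints = paramDefToSchema({ type: "array", items: { type: "string" } });
const nested = paramDefToSchema({
  type: "array",
  items: { type: "array", items: { type: "string" } },
});
const broken: JsonSchema = {
  type: "object",
  properties: { entity_hints: { type: "array" } },
};
console.log(JSON.stringify(entityHints), findArraysWithoutItems(broken));
```

Returning property paths instead of a boolean means a failing guard names the offending field directly.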
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(e2e): raise rerank timeouts for ZE live cold-start The first rerank call of a CI run hits ZeroEntropy's cold-start latency (observed ~5-6s on Tier 2 LLM Skills runners; subsequent calls < 500ms). Two timeouts fired simultaneously at ~5s: 1. bun:test's default 5000ms per-test timeout caused (fail). 2. gateway.rerank's DEFAULT_RERANK_TIMEOUT_MS = 5000 fired right after, reported as "Unhandled error between tests". The next rerank test (top_n=2) ran in 409ms because the API was already warm. Cold-start is the only issue. Pass explicit timeoutMs to each rerank() call and a longer per-test timeout (30s) on both ZE rerank tests. Production DEFAULT_RERANK_TIMEOUT_MS stays at 5s for the search hot path — these E2E tests bypass it locally without changing the default that protects user latency. Unrelated to the fix-wave in this PR (mcp-tool-defs + git-remote + resolver guards). Lands here to keep Tier 2 LLM Skills green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v0.35.2.0) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: sync for v0.35.2.0 Update CLAUDE.md Key files annotations for the v0.35.2.0 fix wave: - src/mcp/tool-defs.ts: document new exported recursive paramDefToSchema helper and the three-consumer centralization (stdio MCP, HTTP MCP tools/list, subagent registry). - src/core/minions/tools/brain-allowlist.ts: paramsToInputSchema now consumes the shared helper. - src/commands/serve-http.ts: tools/list handler now consumes the shared helper (closes the HTTP MCP items-dropped bug class). - src/core/git-remote.ts: new entry. Documents the GIT_SSRF_FLAGS (global config, pre-verb) vs GIT_SSRF_SUBCOMMAND_FLAGS (subcommand-scoped, post-verb) split, the 7-month silent regression, and the position-anchored regression guard in test/git-remote.test.ts. Regenerated llms-full.txt to match. 
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore: rebump version to v0.35.3.0

Queue moved while this PR was open: v0.35.2.0 was claimed by master's v0.35.1.0 sibling work. Advancing one slot. No code changes; only:

- VERSION + package.json: 0.35.2.0 → 0.35.3.0
- CHANGELOG.md: rewritten header + inline references
- CLAUDE.md: rewritten 4 key-file annotations
- llms-full.txt + llms.txt: regenerated to mirror CLAUDE.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…um (garrytan#1052) * rfc: temporal axis for contradiction probe Field report on residual HIGH findings from gbrain eval suspected-contradictions and proposal for a 4-phase fix (Phase 1 = judge prompt + verdict enum is the recommended starting point). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(eval): pass effective_date to judge prompt; bump PROMPT_VERSION Lane A1 of the temporal-contradiction-probe wave. Threads page-level effective_date through the search projection into the contradiction judge so the LLM can reason about supersession instead of treating every dated pair as a contradiction. Changes: - SearchResult interface adds optional effective_date + effective_date_source fields; rowToSearchResult populates them from the row data with date-only YYYY-MM-DD normalization (handles both postgres.js Date and PGLite string). - 8 SELECT projection sites (3 in postgres-engine, 5 in pglite-engine) now carry p.effective_date + p.effective_date_source through their inner CTEs and outer SELECTs so search results expose the field on both engines. - PairMember (eval-contradictions/types.ts) gets the two fields as required (string | null) so the type forces every constructor to think about temporal anchoring. Runner's searchResultToMember + takeToMember handle the normalization; takes inherit the chunk's page-level date. - buildJudgePrompt emits `Statement A (from: YYYY-MM-DD)` when effective_date is non-null, else `(date unknown)`. Prompt instructions explain the tag so the model knows what to do with it. - PROMPT_VERSION bumps '1' → '2'. Cache-key tuple shape unchanged; old rows miss naturally on first run against the new prompt. Test fixtures in 5 files updated to include the new required fields. All 205 eval-contradictions unit tests + 101 search-related tests pass. Typecheck clean. 
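The date handling in Lane A1 might look roughly like the sketch below. Both function names are hypothetical, and the exact wording of the undated label is an assumption; the commit only states that a dated statement renders as `Statement A (from: YYYY-MM-DD)`, that the fallback is `(date unknown)`, and that the normalizer must accept both a `Date` and a string.

```typescript
// Hypothetical helpers. Only the label format and the Date-or-string inputs
// come from the commit text.
function normalizeEffectiveDate(value: Date | string | null | undefined): string | null {
  if (value == null || value === "") return null;
  if (value instanceof Date) {
    // Assumes the driver hands back a UTC instant; toISOString() always renders UTC.
    return Number.isNaN(value.getTime()) ? null : value.toISOString().slice(0, 10);
  }
  // Accept strings that start with YYYY-MM-DD, such as a full ISO timestamp.
  const match = /^(\d{4}-\d{2}-\d{2})/.exec(value);
  return match?.[1] ?? null;
}

// Render the statement header the judge sees.
function statementLabel(name: "A" | "B", effectiveDate: string | null): string {
  return effectiveDate
    ? `Statement ${name} (from: ${effectiveDate})`
    : `Statement ${name} (date unknown)`;
}

console.log(statementLabel("A", normalizeEffectiveDate(new Date("2025-03-04T00:00:00Z"))));
console.log(statementLabel("B", normalizeEffectiveDate(null)));
```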
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(eval): replace contradicts:boolean with verdict:enum (6 members)

Lane A2 of the temporal-contradiction-probe wave. Expands the judge's classification vocabulary from a binary contradicts:bool to a six-member verdict enum so the probe can distinguish "this changed" from "this is wrong".

Verdict taxonomy:

  no_contradiction: drop from findings
  contradiction: genuine conflict at same point in time
  temporal_supersession: newer claim updates/replaces older; not an error
  temporal_regression: metric/status went backwards over time (signal)
  temporal_evolution: legitimate change, neither supersession nor regression
  negation_artifact: judge misread an explicit negation

Changes:

- types.ts: Verdict union (6 members); Severity gains 'info'; ResolutionKind extended with temporal_supersede, flag_for_review, log_timeline_change; JudgeVerdict.contradicts → verdict; ContradictionFinding now carries verdict; ProbeReport adds queries_with_any_finding + verdict_breakdown (additive).
- judge.ts: parseResolutionKind + parseVerdict guards; normalizeVerdict reads the new field and applies the C1 confidence floor only to verdict='contradiction' (the new verdicts are informational classifications, no floor). Prompt rubric rewritten to ask for verdict + extended severity scale.
- severity-classify.ts: 'info' joins the rank with value 0; defaultSeverityForVerdict maps each verdict to its baseline severity (D7: supersession=info, regression=high, etc.). parseSeverity gains a fallback param so consumers can override the 'low' default.
- auto-supersession.ts: classifyResolution + renderResolutionCommand handle the three new resolution kinds. Probe still NEVER auto-mutates; the new kinds render paste-ready commands or informational lines.
- cache.ts: isJudgeVerdict shape check matches the new verdict field; old v1 rows fail the guard and are treated as misses.
- runner.ts: emit predicate at cache-hit and judge-success branches changes from `verdict.contradicts` to `verdict.verdict !== 'no_contradiction'`. Without this, the new verdicts vanish from the report. Added per-verdict tally + queriesWithAnyFinding alongside the strict queriesWithContradiction. - trends.ts: latest run verdict breakdown surfaces in the trend chart. Test fixtures updated across 8 test files. All 210 eval-contradictions unit tests pass. Typecheck clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(eval): relax date-filter rule 3 when both sides dated Lane B of the temporal-contradiction-probe wave. The v1 date pre-filter skipped pairs whose chunk-text-extracted dates differed by >30 days as a cost-saving heuristic. That heuristic silently killed exactly the cases the new verdict taxonomy exists to surface — role transitions across years (e.g. a 2017 historical record vs. a 2025 current state), MRR claims years apart, status changes recorded over time. Lane A1+A2 made temporal supersession explicit and cheap to classify. The filter no longer needs to skip these pairs; the judge can label them. Changes: - date-filter.ts: shouldSkipForDateMismatch accepts optional effectiveDateA and effectiveDateB. When BOTH are non-null, returns skip=false with the new 'both_have_effective_date' reason — the judge will see the dates via the (from: YYYY-MM-DD) prompt tag from Lane A1. Other rules (same-paragraph dual-date override, missing-date fallback) preserved verbatim and still run first. - runner.ts: threads pair.{a,b}.effective_date into the date-filter call. Pairs that previously vanished into the skip bucket now reach the judge. 
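The Lane B relaxation can be sketched as below, under several stated assumptions. The signature is hypothetical (it takes already-extracted chunk dates), rule 1 (the same-paragraph dual-date override) is omitted, rule 2 is assumed to mean "do not skip when a chunk date is missing", and every reason string except `both_have_effective_date` is invented for illustration.

```typescript
interface SkipDecision {
  skip: boolean;
  reason: string;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;
const MAX_GAP_DAYS = 30;

// Only non-empty, parseable strings count as real dates.
function isRealDate(value: string | null | undefined): value is string {
  return typeof value === "string" && value !== "" && !Number.isNaN(Date.parse(value));
}

// Hypothetical signature: chunk dates are assumed to be extracted already.
function shouldSkipForDateMismatch(
  chunkDateA: string | null,
  chunkDateB: string | null,
  effectiveDateA?: string | null,
  effectiveDateB?: string | null,
): SkipDecision {
  // Assumed rule-2 behavior: without two chunk-text dates there is nothing to compare.
  if (!isRealDate(chunkDateA) || !isRealDate(chunkDateB)) {
    return { skip: false, reason: "missing_date" };
  }
  // Lane B relaxation: both sides carry a page-level date, so the judge classifies the pair.
  if (isRealDate(effectiveDateA) && isRealDate(effectiveDateB)) {
    return { skip: false, reason: "both_have_effective_date" };
  }
  // Rule 3 (v1 heuristic): skip pairs whose chunk-text dates differ by more than 30 days.
  const gapDays = Math.abs(Date.parse(chunkDateA) - Date.parse(chunkDateB)) / MS_PER_DAY;
  return gapDays > MAX_GAP_DAYS
    ? { skip: true, reason: "date_gap" }
    : { skip: false, reason: "within_window" };
}

console.log(shouldSkipForDateMismatch("2017-05-01", "2025-01-15", "2017-05-01", "2025-01-15"));
```

Treating an empty string as missing matches the back-compat case listed in the regression suite: only real dates on both sides enable the relaxation.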
Tests (R1 IRON RULE regression suite, 6 new cases): - both sides effective_date → not skipped - both sides effective_date overrides >30d chunk-text rule - rule 1 (same-paragraph dual-date) still wins over effective_date relaxation - rule 2 (missing chunk dates) still applies when effective_date partially present - undefined effective_dates fall through to v1 behavior (back-compat) - empty-string effective_date treated as missing (only real dates enable the relaxation) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(cli): cost-estimate prompt + --budget-usd + Haiku routing Lane C of the temporal-contradiction-probe wave. Three layers of cost guardrail, all stacked: (a) cost-estimate prompt at probe-run-time. Before the runner spends any tokens after a PROMPT_VERSION change, eval-suspected-contradictions reads the most recent persisted prompt_version from eval_contradictions_runs and compares. When they differ: - TTY: prints an upper-bound estimate + Ctrl-C window (default 10s, override via GBRAIN_PROBE_PROMPT_GRACE_SECONDS). - non-TTY: prints the estimate + auto-proceeds (autopilot path). - --yes override or GBRAIN_NO_PROBE_PROMPT=1: skip entirely. Mirrors the v0.32.7 runPostUpgradeReembedPrompt pattern. (b) --budget-usd N hard cap (pre-existing; PreFlightBudgetError surfaces when the estimate alone exceeds the cap, and CostTracker halts the run mid-flight when cumulative cost exceeds it). Documented in the help text alongside (a). (c) Judge model now routes through resolveModel() with configKey 'models.eval.contradictions_judge', tier 'utility' (Haiku-class default), and env var GBRAIN_CONTRADICTIONS_JUDGE_MODEL. The legacy --judge CLI flag still wins as the highest-precedence override. Doctor's model touchpoint registry (src/commands/models.ts:50) carries the new key so `gbrain models` and `gbrain models doctor` surface it. Also in this lane: - CLI: --severity accepts 'info' (the new Severity member from Lane A2). 
- CLI: --severity output shows [verdict] tag alongside slug pairs so operators distinguish genuine contradictions from temporal classifications.
- Human summary: prints the new queries_with_any_finding metric and the per-verdict breakdown table.
- Help text: explains the cost-prompt + budget-cap + model-routing interactions in one paragraph.

New tests (9 cases on the cost-prompt helper):

- --yes override skips
- GBRAIN_NO_PROBE_PROMPT=1 skips
- prompt_version unchanged → skips
- non-TTY auto-proceeds with stderr note
- TTY proceeds after grace
- TTY aborts on Ctrl-C
- fresh brain (no prior runs) fires the prompt
- GBRAIN_PROBE_PROMPT_GRACE_SECONDS override honored
- estimate banner contains query count + judge model + dollar amount

All 225 eval-contradictions tests + 25 model-config tests pass. Typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(eval): R4/R5/R6 IRON-RULE regressions for the verdict-enum wave

Lane D of the temporal-contradiction-probe wave. Lanes A1/A2/B/C landed the behavior; this lane pins the regressions that protect the wave against future drift.

R4 (runner emit predicate): five new tests, one per non-no_contradiction verdict, prove the runner.ts emit rule surfaces each one as a finding with the correct verdict tag, and that:

- queries_with_contradiction (Wilson-CI denominator) ONLY counts verdict='contradiction', so the strict metric is preserved
- queries_with_any_finding counts every non-no_contradiction verdict
- verdict_breakdown tallies correctly

Plus one negative case: verdict='no_contradiction' produces zero findings. Without R4, a future runner refactor could collapse the new verdicts back to /dev/null and the report would silently shrink.

R5 (cache key shape): direct shape assertion on buildCacheKey output. The key tuple is exactly 5 fields (chunk_a_hash, chunk_b_hash, model_id, prompt_version, truncation_policy).
Adding a 6th field would silently break every operator's brain (no migration path). R6 (contradiction severity unchanged): four tests on normalizeVerdict pin the legacy semantics — judge-supplied severity wins (whether 'high' or 'low'), and on garbage severity input the fallback is 'medium' (per defaultSeverityForVerdict('contradiction')) NOT 'low'. The contradiction verdict's severity must never default to 'low', which would silently mask genuine conflicts as cosmetic naming issues. The temporal_regression case is included for parity (garbage → 'high' since regressions are real investor red flags). 236 eval-contradictions tests pass (211 + 6 R4 + 1 R5 + 4 R6 + 9 cost-prompt from Lane C). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(ci): privacy lint for docs/proposals/*.md Captures the residual TODO from the temporal-contradiction-probe wave's plan: prevent the bug class where an RFC lands in docs/proposals/ with PII that should never appear in a public technical artifact. The original RFC had to be scrubbed at force-push time (Step 0); this lint catches the same patterns at CI time so the next one can't slip through. Sibling to scripts/check-privacy.sh: - check-privacy.sh: bans the literal "Wintermute" repo-wide. - check-proposal-pii.sh: focuses on docs/proposals/*.md and the OTHER PII classes — personal-relationship vocabulary, private repo refs. Design contract: the denylist names PATTERNS, not real people. Naming specific real names (deceased relatives, therapist first names, dealflow contacts) inside this script would leak PII into the repo just by appearing here. The structural patterns below catch the SURROUNDING vocabulary that always accompanies such content in personal RFC prose. Trade-off: a future RFC that names a real person without any contextual markers won't be caught — accepted as residual risk handled by human review. 
Patterns flagged in docs/proposals/*.md:

- garrytan/brain (private repo reference)
- trial separation, permanent separation
- couples session, couples therapist
- divorce attorney(s)
- grandmother's funeral, aunt's funeral
- wintermute (also caught by check-privacy.sh; listed here for proposal-scoped clarity)

Bare common words (separation, funeral) are NOT banned; only the combined personal-context phrases are. "Separation of concerns" and other software vocabulary survive.

Wired into:

- `bun run verify` (gates every push)
- `bun run check:all`
- `bun run check:proposal-pii` (standalone)

Tests: 15 cases in test/scripts/check-proposal-pii.test.ts.

- Each pattern flagged when present, plus exit-code + stderr signal.
- Two negative cases (separation-of-concerns, funeral metaphor) prove the lint doesn't false-positive on legitimate software prose.
- No-proposals-dir → exit 0 (not a failure).
- Multi-hit case proves all patterns surface together with a summary count.
- The two test fixtures that name "Wintermute" / "WINTERMUTE" as sentinel literals are allowlisted in check-test-real-names.sh per the same meta-rule-enforcement exception as check-privacy.sh itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(privacy): allowlist new privacy-guard files in check-privacy.sh

check-privacy.sh bans the literal Wintermute repo-wide. The two new files from the v0.34 privacy lint (scripts/check-proposal-pii.sh and its test) necessarily name the token to do their job. Same meta-rule-enforcement exception as scripts/check-privacy.sh itself, scripts/check-test-real-names.sh, test/recency-decay.test.ts, and the existing entries: describing what the rule forbids requires naming it. Without this allowlist, `bun run verify` fails on check:privacy.
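The phrase-level denylist above could be expressed along these lines. The actual lint is a shell script, so this TypeScript version is only an illustration of the "combined phrases, not bare words" idea; the regexes, the case-insensitive matching, and the `findProposalPii` name are all assumptions.

```typescript
// Illustrative only: the real check is scripts/check-proposal-pii.sh.
// Each pattern targets a combined phrase, never a bare common word.
const PROPOSAL_PII_PATTERNS: readonly RegExp[] = [
  /garrytan\/brain\b/i,
  /\b(trial|permanent) separation\b/i,
  /\bcouples (session|therapist)\b/i,
  /\bdivorce attorneys?\b/i,
  /\b(grandmother|aunt)'s funeral\b/i,
  /\bwintermute\b/i,
];

// Return one "line: phrase" entry per pattern hit, using 1-based line numbers.
function findProposalPii(text: string): string[] {
  const hits: string[] = [];
  text.split("\n").forEach((line, index) => {
    for (const pattern of PROPOSAL_PII_PATTERNS) {
      const match = pattern.exec(line);
      if (match) hits.push(`${index + 1}: ${match[0]}`);
    }
  });
  return hits;
}

const cleanProse = "Separation of concerns keeps modules small.\nThe old parser got a funeral.";
const flaggedProse = "Notes from after the trial separation.\nSee garrytan/brain for context.";
console.log(findProposalPii(cleanProse).length, findProposalPii(flaggedProse));
```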
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.35.1.0)

Temporal-contradiction-probe wave: Phase 1 of the RFC at docs/proposals/temporal-contradiction-probe.md.

Headline: the contradiction probe now classifies pairs into a 6-member verdict enum (no_contradiction, contradiction, temporal_supersession, temporal_regression, temporal_evolution, negation_artifact) and sees the page-level effective_date for each chunk via a (from: YYYY-MM-DD) tag in the prompt. The pre-judge date filter no longer skips dated wide-gap pairs, so the role-transition class (e.g. a 2017 historical record vs. a 2025 current state) reaches the judge and gets classified as temporal_supersession instead of vanishing into the skip bucket.

PROMPT_VERSION bumped 1 → 2 (cache fully invalidated). Three-layer cost guardrail: TTY-only cost-estimate prompt with Ctrl-C window, --budget-usd hard cap, Haiku-tier routing via new models.eval.contradictions_judge config key.

Also adds a CI privacy lint (scripts/check-proposal-pii.sh) wired into bun run verify that catches PII patterns in docs/proposals/*.md so future RFCs can't ship with personal-context vocabulary the way this wave's source RFC did at draft time.

Phases 2-4 deferred to follow-up RFCs per the plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: garrytan-agents <garrytan-agents@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
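To make the two report metrics concrete, here is a minimal sketch of the emit predicate and the per-verdict tally. The six verdict names and the three report field names come from the commits; the input shape (one verdict list per query) and the choice to count `no_contradiction` in the breakdown are assumptions.

```typescript
type Verdict =
  | "no_contradiction"
  | "contradiction"
  | "temporal_supersession"
  | "temporal_regression"
  | "temporal_evolution"
  | "negation_artifact";

interface ProbeTally {
  queries_with_contradiction: number; // strict metric: verdict === "contradiction" only
  queries_with_any_finding: number;   // any verdict other than "no_contradiction"
  verdict_breakdown: Record<Verdict, number>;
}

// Emit predicate from the runner change: everything except no_contradiction is a finding.
const isFinding = (verdict: Verdict): boolean => verdict !== "no_contradiction";

// Hypothetical input shape: one array of judge verdicts per query.
function tallyVerdicts(verdictsPerQuery: readonly (readonly Verdict[])[]): ProbeTally {
  const tally: ProbeTally = {
    queries_with_contradiction: 0,
    queries_with_any_finding: 0,
    verdict_breakdown: {
      no_contradiction: 0,
      contradiction: 0,
      temporal_supersession: 0,
      temporal_regression: 0,
      temporal_evolution: 0,
      negation_artifact: 0,
    },
  };
  for (const verdicts of verdictsPerQuery) {
    for (const verdict of verdicts) tally.verdict_breakdown[verdict] += 1;
    if (verdicts.some(isFinding)) tally.queries_with_any_finding += 1;
    if (verdicts.includes("contradiction")) tally.queries_with_contradiction += 1;
  }
  return tally;
}

const sampleTally = tallyVerdicts([
  ["no_contradiction"],
  ["temporal_supersession", "no_contradiction"],
  ["contradiction", "temporal_regression"],
]);
console.log(sampleTally);
```

Keeping the strict counter separate from the any-finding counter is what lets the older contradiction metric stay comparable across prompt versions.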
…e-name resolver + 58x perf + stub guard observability (garrytan#1085) * fix(doctor,entities): supervisor crash classification + bare-name resolver + stub guard - doctor.ts/jobs.ts: classify worker exits with code !== 0 as real crashes vs code === 0 clean restarts (separate counter); fixes false-positive WARN on healthy supervisors - entities/resolve.ts: prefix-expansion step between fuzzy match and slugify fallback catches bare first names that score too low on pg_trgm; picks highest-connection candidate as tiebreaker - facts/fence-write.ts: stub-creation guard refuses to spawn unprefixed entity pages at brain root - facts/backstop.ts: routes stubGuardBlocked facts to engine.insertFact so the fact still persists even when no markdown file is created - docs/issues/doctor-auto-heal-and-scoring.md: spec for follow-up doctor health-score improvements - .gitignore: guard reports/network-intelligence/ (private brain exports) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(privacy): scrub real names from entity-resolve test fixtures and JSDoc Replace YC partner names with placeholders per CLAUDE.md privacy rule: alice-example, bob-example, charlie-example, dave-example. Stripe and Stripe Atlas retained (allowed household brands; exercises the two-word company-prefix case). Test semantics preserved: - Alice / Dave: single-match cases - Bob / Charlie: multi-match tiebreaker cases (winner has more chunks) All 13 entity-resolve cases pass with the scrubbed fixtures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(supervisor): extract classifyWorkerExit() helper (DRY) Three call sites were inline-classifying worker exits: supervisor's restart policy (child-worker-supervisor.ts:291), doctor's supervisor check (doctor.ts:1016), and jobs supervisor status (jobs.ts:806). Same rule, three copies — drift risk if one is updated without the others. Extract to src/core/minions/exit-classification.ts as a pure function. 
Signature consumes audit-JSON shape ({ code: number | null }) so doctor and jobs (which read serialized events from JSONL) and supervisor (which reads Node's exit callback) call the same function. Helper's classification rule: code === 0 → clean_exit, everything else (non-zero, null, undefined, missing) → crash. Default-to-crash prevents corrupted rows from silently demoting into the clean-restart bucket. 5 hermetic unit tests (test/exit-classification.test.ts) pin all edge cases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(facts): audit + sunset comment for stub-guard fires Wire telemetry into the v0.34.5 stub-guard at fence-write.ts:190. Every guard fire now appends a JSONL line to ~/.gbrain/audit/stub-guard-YYYY-Www.jsonl with {ts, slug, source_id, fact_count}. Operator visibility for the sunset criterion: when the new audit log reads <5 hits/week for 3 consecutive weeks on production brains, the prefix-expansion in resolveEntitySlug is sufficient and the guard can be removed in v0.36. Reader (readRecentStubGuardEvents) deliberately diverges from supervisor-audit.ts:readSupervisorEvents — it reads BOTH the current AND previous ISO-week file before filtering by ts. supervisor-audit's reader only reads the current week, which loses 24h-window correctness across Monday 00:00 UTC (a Sunday 23:55 event lives in last week's file). The 2-file read costs nothing and makes the window actually 24h. 9 hermetic unit tests pin filename math, the writer's swallows-errors contract, the cross-week-boundary read, sort order, missing-file behavior, and malformed-row tolerance. The cross-week test is the regression guard: if a future refactor copies the supervisor's single-file pattern, that test fails. Follow-up TODO (not in this PR): fix readSupervisorEvents to use the same 2-file pattern. The new stub-guard reader becomes the canonical template to copy back. 
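The cross-week read can be illustrated with the sketch below. It is not the repository's reader: `isoWeekLabel`, `stubGuardAuditFiles`, `parseJsonl`, and `withinLast24h` are hypothetical names, file I/O is left out, and only the `stub-guard-YYYY-Www.jsonl` naming, the `ts` and `slug` fields, and the "read current and previous week, then filter by ts" idea come from the commit.

```typescript
const MS_PER_DAY = 86_400_000;

// ISO-8601 week label (YYYY-Www) for a UTC instant; the year is the ISO week-year.
function isoWeekLabel(date: Date): string {
  const d = new Date(Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate()));
  const day = d.getUTCDay() || 7;         // Monday = 1 ... Sunday = 7
  d.setUTCDate(d.getUTCDate() + 4 - day); // move to the Thursday of this ISO week
  const yearStart = Date.UTC(d.getUTCFullYear(), 0, 1);
  const week = Math.ceil(((d.getTime() - yearStart) / MS_PER_DAY + 1) / 7);
  return `${d.getUTCFullYear()}-W${String(week).padStart(2, "0")}`;
}

// Both files a 24h window can touch: the previous ISO week and the current one.
function stubGuardAuditFiles(now: Date): string[] {
  const lastWeek = new Date(now.getTime() - 7 * MS_PER_DAY);
  return [isoWeekLabel(lastWeek), isoWeekLabel(now)].map((w) => `stub-guard-${w}.jsonl`);
}

interface StubGuardEvent {
  ts: string;
  slug: string;
}

// Tolerant JSONL parse: blank and malformed rows are skipped rather than thrown.
function parseJsonl(text: string): StubGuardEvent[] {
  const events: StubGuardEvent[] = [];
  for (const line of text.split("\n")) {
    if (!line.trim()) continue;
    try {
      const row = JSON.parse(line);
      if (typeof row?.ts === "string") events.push(row as StubGuardEvent);
    } catch {
      // malformed row: ignore
    }
  }
  return events;
}

// Keep only events whose timestamp falls inside the trailing 24 hours.
function withinLast24h(events: readonly StubGuardEvent[], now: Date): StubGuardEvent[] {
  const cutoff = now.getTime() - MS_PER_DAY;
  return events.filter((e) => Date.parse(e.ts) >= cutoff);
}

const mondayJustAfterMidnight = new Date("2024-01-01T00:05:00Z");
console.log(stubGuardAuditFiles(mondayJustAfterMidnight));
```

With `now` at Monday 00:05 UTC, a Sunday 23:55 event sits in the previous week's file but is still inside the 24h window, which is the case a single-file reader misses.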
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(doctor): stub_guard_24h check surfaces resolver gaps Adds a new doctor check that reads ~/.gbrain/audit/stub-guard-YYYY-Www.jsonl (via the dual-week-aware reader from T8) and surfaces the 24h fire count. WARN at >10 fires — at that rate the prefix-expansion in resolveEntitySlug is probably missing a case (typo prefix, alias, non-Latin script) and operators should grep the audit log for the offending slugs. Below the threshold but non-zero shows as OK with a count, so operators can watch the v0.36 sunset criterion (<5/week for 3 weeks → guard can be removed). Zero hits emits no check, keeping the doctor output clean on healthy brains. 5 source-grep regression tests pin the contract: check name, WARN threshold, fix hint mentions the audit log + the resolver function name, reader is the dual-week-aware variant (NOT the supervisor-audit single- week pattern), and zero-hits stays silent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(facts): pin stub-guard contract at writeFactsToFence + backstop layers - fence-write.test.ts: 3 new cases for the v0.34.5 stub guard. Bare slugs return {inserted: 0, stubGuardBlocked: true, ids: []} and create no file/.tmp at brain root. Prefixed slugs bypass the guard (regression guard against accidentally inverting the slug.includes('/') check). Empty facts array short-circuits before the guard fires. - facts-backstop.test.ts: 1 new case for the end-to-end routing. A bare-name LLM extraction resolves through to a bare slug, hits the guard, and lands in the facts table via engine.insertFact (DB-only). No phantom .md file; entity_slug stores the bare slug; source_markdown_slug is null. This is the routing contract Codex flagged as a "split-brain" data shape — the test pins the by-design behavior so a future refactor can't silently drop these facts. 
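The three-way outcome of the `stub_guard_24h` check (silent, OK with a count, WARN) is small enough to sketch directly. The check name, the "more than 10" threshold, and the zero-hits-stays-silent rule come from the commit; the `DoctorCheck` shape and the message wording are assumptions.

```typescript
// Hypothetical result shape; the real doctor check type may differ.
interface DoctorCheck {
  name: "stub_guard_24h";
  status: "ok" | "warn";
  message: string;
}

const STUB_GUARD_WARN_THRESHOLD = 10;

// Zero fires emits no check at all; 1..10 is OK with a count; above 10 is WARN.
function stubGuardCheck(firesLast24h: number): DoctorCheck | null {
  if (firesLast24h <= 0) return null;
  if (firesLast24h > STUB_GUARD_WARN_THRESHOLD) {
    return {
      name: "stub_guard_24h",
      status: "warn",
      message:
        `${firesLast24h} stub-guard fires in 24h; grep the stub-guard audit log ` +
        `for the offending slugs and check resolveEntitySlug prefix expansion`,
    };
  }
  return {
    name: "stub_guard_24h",
    status: "ok",
    message: `${firesLast24h} stub-guard fires in 24h`,
  };
}

console.log(stubGuardCheck(0), stubGuardCheck(10)?.status, stubGuardCheck(11)?.status);
```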
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(supervisor): pin classifyWorkerExit consumer wire-up + regressions 12 new cases on top of the 5 helper unit tests: - doctor.ts / jobs.ts / child-worker-supervisor.ts each import the helper - All three call classifyWorkerExit at least once - doctor.ts and jobs.ts no longer carry the pre-T7 inline filter - supervisor uses the helper result to choose the clean_exit branch - audit-event shape round-trip: code=0 → clean_exit, code=1 → crash, code=null+SIGKILL → crash (catches future shape changes) The regression guards (3) and the wire-up checks (6) close the gap that motivated T7 in the first place: if a future change accidentally re-inlines the filter or shifts the audit event shape, the test fails before production sees the silent divergence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * perf(entities): correlated subqueries scoped to slug-LIKE candidates Replace the derived-table JOIN shape in tryPrefixExpansion with correlated subqueries. The pre-fix SQL did LEFT JOIN (SELECT to_page_id, COUNT(*) FROM links GROUP BY to_page_id) li ON ... which forced the planner to aggregate the entire links + content_chunks tables on every prefix-expansion call — O(N) per call where N is total links/chunks in the brain. On a 100K-link / 50K-chunk brain that's slow enough to bottleneck fact-extraction. New shape uses correlated subqueries: (SELECT COUNT(*) FROM links WHERE to_page_id = p.id) + (SELECT COUNT(*) FROM links WHERE from_page_id = p.id) + (SELECT COUNT(*) FROM content_chunks WHERE page_id = p.id) The slug LIKE filter is already selective (typical brain has 0-5 pages per prefix), so the three subqueries run N≈3 times per matched row against the existing indexes on links.to_page_id, links.from_page_id, and content_chunks.page_id. Behavior preserved: 13/13 entity-resolve tests pass (single-match + multi-match tiebreaker + edge cases). 
Codex's outside-voice review caught the dead-end design that an earlier draft of this plan proposed (a CTE with `LIMIT 50` candidate cap — would have excluded correct high-connection candidates if their slug sorted late). Correlated subqueries without a candidate cap are the cleaner shape that lets the LIKE filter do the bounding work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(entities): perf regression guard for prefix-expansion (58x speedup) Hermetic PGLite benchmark with 5K pages + 50K links + 25K chunks. Runs the pre-T12 derived-table shape and the new correlated-subquery shape side-by-side against the same fixture, asserts NEW >= 5x faster than OLD. Baseline-ratio, not absolute wall-clock — different machines / Bun versions / CI load can shift absolute timings by 10x without indicating a real regression, but the SHAPE difference between "aggregate the full tables" and "correlated subquery per candidate" is what we care about. Measured: old_median=18.16ms, new_median=0.31ms, speedup=58.22x. The 5x assertion has plenty of headroom. The OLD SQL is embedded verbatim as the regression baseline. If a future refactor re-introduces full-table aggregation (LEFT JOIN against SELECT...GROUP BY over the whole links or content_chunks table), the test fails. PGLite-only — Postgres planner can shape derived-table JOINs differently enough that the 5x ratio could be noise on a 5K-page fixture. The structural correctness of the rewrite is the same on both; this is purely a planner-shape regression guard. .slow.test.ts suffix keeps it out of the fast loop (run via `bun run test:slow`). 
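The baseline-ratio idea behind the perf guard can be sketched without any database: compare medians of two timing samples and assert on their ratio instead of on wall-clock time. The helper names and the sample values below are hypothetical; only the 5x floor and the use of medians come from the commit.

```typescript
// Median of a non-empty sample; sorts a copy so the input is left untouched.
function median(samples: readonly number[]): number {
  if (samples.length === 0) throw new Error("median of empty sample");
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? (sorted[mid] as number)
    : ((sorted[mid - 1] as number) + (sorted[mid] as number)) / 2;
}

// Ratio of old to new median: values above 1 mean the new shape is faster.
function speedupRatio(oldMs: readonly number[], newMs: readonly number[]): number {
  return median(oldMs) / median(newMs);
}

const MIN_SPEEDUP = 5;

// Fail on the ratio, not on absolute timings, so slow CI machines do not trip the guard.
function assertSpeedup(oldMs: readonly number[], newMs: readonly number[]): void {
  const ratio = speedupRatio(oldMs, newMs);
  if (ratio < MIN_SPEEDUP) {
    throw new Error(`speedup ${ratio.toFixed(2)}x is below the ${MIN_SPEEDUP}x floor`);
  }
}

// Made-up timings in milliseconds, purely for illustration.
const oldShapeMs = [10, 30, 20];
const newShapeMs = [1, 3, 2];
assertSpeedup(oldShapeMs, newShapeMs);
console.log(speedupRatio(oldShapeMs, newShapeMs));
```

Using the median rather than the mean keeps one slow outlier run from dominating either side of the ratio.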
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.35.2.0)

Wave content:
- Privacy scrub: PII rebuilt out of branch history; real names → placeholders
- Bug fix: doctor + jobs no longer count clean worker exits as crashes
- Bug fix: entity resolver prefix-expansion catches bare first names
- DRY refactor: classifyWorkerExit() helper (one rule, 3 call sites)
- Observability: stub_guard_24h doctor check + ISO-week audit log
- Perf: 58x speedup on tryPrefixExpansion query shape

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: rebump v0.35.2.0 → v0.35.4.0 + scrub TODOS.md privacy violation

VERSION/package.json/CHANGELOG header rebumped to v0.35.4.0 per user request (queue allocation). TODOS.md rephrased to not literally name the banned private-agent string; that was the CI failure root cause on the v0.35.2.0 push. CHANGELOG.md is on check-privacy.sh's allow-list (meta-documentation exception); TODOS.md is not. CI re-runs against this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er (garrytan#1111) * fix(bootstrap): extend probes for files/oauth_clients/sources.archived* + add MIGRATIONS introspection guard Adds 7 new forward-reference probes to applyForwardReferenceBootstrap on both engines, closes the column-only forward-ref class via a new MIGRATIONS-source introspection contract test. New probes: - files.source_id + files.page_id (v18 forward refs) - oauth_clients.source_id + oauth_clients.federated_read (v60+v61+v65) - sources.archived + archived_at + archive_expires_at (v34 promoted from JSONB) The sources.archived* columns are the codex-flagged class: they're added inline in v34's CREATE TABLE definition but `CREATE TABLE IF NOT EXISTS sources` is a no-op on pre-v34 brains, so downstream visibility filters (search/list_pages) trip on old brains. needsPagesBootstrap now folds archive columns into its CREATE TABLE so pre-v0.18 brains get a v34-shape sources in one go; needsSourcesArchive then only fires on the pre-v34 case (sources exists, archive cols don't). Closes the structural bug class via test/helpers/extract-added-columns.ts: reads src/core/migrate.ts as text and extracts every ALTER TABLE ADD COLUMN. The new contract test asserts every (table, column) pair is covered by EITHER the bootstrap's ALTER TABLE statements, the bootstrap's CREATE TABLE definitions, OR the schema blob's CREATE TABLE bodies. The column-only class (no index, no FK; just an inline CREATE TABLE column the schema blob can't add to existing tables) is now caught at PR time. Source-text introspection catches all three migration shapes uniformly: - top-level `sql:` field - `sqlFor.postgres` / `sqlFor.pglite` overrides - handler-body `engine.runMigration(N, \`ALTER TABLE ...\`)` (v34 shape) Pre-existing parseBaseTableColumns parser bug fixed: now strips `--` line comments and `/* ... */` blocks before identifying column names. Without this, a column preceded by a comment was silently dropped. Catches pages.page_kind and others that were silently uncovered. 
13 columns added by migrations but not in PGLITE_SCHEMA_SQL are exempted with a unified rationale: they have no schema-blob forward reference; migration handles all upgrade paths cleanly. Refreshing the schema blob is a separate concern. Issues closed: garrytan#1018 (v60 oauth_clients), garrytan#974 (files.source_id/page_id), garrytan#820 (v0.13.0 migration files.page_id cascade); pre-empts the sources.archived class before any pre-v34 brain trips on it. Tests: - 9 cases in test/schema-bootstrap-coverage.test.ts (5 existing + 4 new) - helper-level unit tests cover SQL shape variants (IF NOT EXISTS, quoted identifiers, ALTER TABLE IF EXISTS ONLY, multi-statement) - planted-bug regression verifies the gate actually catches new uncovered columns Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(orphans): filter soft-deleted pages on both candidate and link-source sides Closes garrytan#1021. The v0.26.5 soft-delete invariant requires that findOrphanPages exclude both: 1. Candidate pages that are themselves soft-deleted 2. Inbound links from soft-deleted source pages Pre-fix, findOrphanPages had no deleted_at filter at all. Soft-deleted pages with no inbound links were counted as orphans (inflating counts). Pre-codex-tension-D11, only the candidate-side filter was planned. Codex C11 caught the second case: a live page that has ONE inbound link from a soft-deleted source page was hidden from orphan results — the link still existed in the links table, the EXISTS subquery saw it, the page looked "linked." Now the inner JOIN on pages enforces src.deleted_at IS NULL. Three regression tests pin the contract: - soft-deleted page with no inbound → NOT orphan - live page with ONLY inbound link from soft-deleted source → IS orphan - live page with live inbound → NOT orphan (smoke check that the new filters don't break unchanged behavior) Engine parity: same SQL shape on both Postgres and PGLite engines. 
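The two-sided soft-delete rule for orphans can be modelled in memory as below. This is only a sketch of the contract the three regression tests pin; the real fix is SQL inside findOrphanPages, and `PageRow`, `LinkRow`, and `findOrphanIds` are hypothetical names.

```typescript
// In-memory model of the orphan rule (assumed shapes, not the engine's types).
interface PageRow { id: number; deleted_at: string | null }
interface LinkRow { from_page_id: number; to_page_id: number }

function findOrphanIds(pages: PageRow[], links: LinkRow[]): number[] {
  // Candidate side: only live pages can be orphans.
  const liveIds = pages.filter((p) => p.deleted_at === null).map((p) => p.id);
  const live = new Set(liveIds);

  // Link-source side: an inbound link counts only if its source page is live.
  const hasLiveInbound = new Set<number>();
  for (const link of links) {
    if (live.has(link.from_page_id)) hasLiveInbound.add(link.to_page_id);
  }

  return liveIds.filter((id) => !hasLiveInbound.has(id)).sort((a, b) => a - b);
}
```

A live page whose only inbound link comes from a soft-deleted page is reported as an orphan, which is the case the candidate-side filter alone would have missed.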
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(think): route runThink through gateway.chat adapter (closes garrytan#952) Pre-fix, runThink instantiated `new Anthropic()` directly and read ANTHROPIC_API_KEY from process.env. Claude Desktop's stdio MCP launch doesn't inherit shell env, so `gbrain config set anthropic_api_key sk-...` (writes to ~/.gbrain/config.json) never reached the SDK and every MCP think call degraded to "no LLM available." The adapter routes through gateway.chat() — the canonical seam per CLAUDE.md. Gateway reads the API key from gbrain config OR env, picks up prompt caching, rate-leases, retry, and the test seam (__setChatTransportForTests) that v0.31.12 established. Per plan-eng-review D10 (cross-model tension with codex C7+C8+C9+C10), four spec points landed: 1. Drop `new Anthropic()` direct path entirely. Every non-stub LLM call from runThink routes through gateway. 2. Real availability check (NOT a false-positive `getChatModel()` truthy). `tryBuildGatewayClient` probes both the recipe (resolveRecipe throws AIConfigError on unknown providers) AND the API key (reads process.env + loadConfig at the gbrain config layer for parity with gateway's own auth resolution). Returns null on miss; runThink takes the graceful "no LLM available" early-return preserving the legacy NO_ANTHROPIC_API_KEY warning signal. 3. Model-id normalization. resolveModel returns bare anthropic ids (claude-opus-4-7); gateway.chat needs provider:model. Adapter auto-prefixes anthropic: when the id is bare. Provider:model strings pass through unchanged. 4. Response-shape conversion. ChatResult → Anthropic.Message via chatResultToMessage. mapStopReason translates gateway's provider-neutral stop reasons (end / length / tool_calls / refusal / content_filter / other) to Anthropic's stop_reason ('end_turn' / 'max_tokens' / 'tool_use'); refusal/content_filter/other fall through to end_turn (no Anthropic equivalent). Usage tokens pass through. 
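Spec points 3 and 4 can be illustrated with a small sketch. The function names carry a `Sketch` suffix because they are not the adapter's real code. The `length` → `max_tokens` and `tool_calls` → `tool_use` pairings are inferred from the names, and detecting a bare id by the absence of `:` is an assumption.

```typescript
// Provider-neutral stop reasons listed in the commit message.
type GatewayStopReason =
  | "end" | "length" | "tool_calls" | "refusal" | "content_filter" | "other";
// Anthropic-side values named in the commit message.
type AnthropicStopReason = "end_turn" | "max_tokens" | "tool_use";

function mapStopReasonSketch(reason: GatewayStopReason): AnthropicStopReason {
  switch (reason) {
    case "length":
      return "max_tokens"; // assumed pairing
    case "tool_calls":
      return "tool_use"; // assumed pairing
    default:
      // "end", plus refusal / content_filter / other, which have no
      // Anthropic equivalent and fall through to end_turn.
      return "end_turn";
  }
}

// Bare ids get an "anthropic:" prefix; provider:model strings pass through.
function normalizeModelIdSketch(id: string): string {
  return id.includes(":") ? id : `anthropic:${id}`;
}
```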
`opts.client` injection preserved (test seam — see ThinkLLMClient). `opts.stubResponse` preserved (pure-test escape). Tests: - test/think-gateway-adapter.test.ts (9 cases): response shape, stop reason mapping, model-id normalization (bare + prefixed), provider unknown returns null, ANTHROPIC_API_KEY absent returns null (regression for legacy graceful degradation), hasAnthropicKey reads process.env correctly. Uses withEnv per the test-isolation contract. - test/think-pipeline.serial.test.ts (17 existing cases): unchanged; the graceful-degradation case at line 213 still produces the NO_ANTHROPIC_API_KEY warning because tryBuildGatewayClient returns null when no key is configured, taking the legacy early-return path. Closes garrytan#952. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(sync): distinguish git worktree from submodule via path-segment match (closes garrytan#889) Pre-fix, `manageGitignore` treated every `.git`-as-file as a submodule and skipped gitignore management. Both submodules AND worktrees use `.git` as a file (not a directory), so the legacy `statSync.isFile()` check couldn't discriminate. Worktrees got misclassified as submodules and their .gitignore wasn't managed. Per plan-eng-review D4 (chose path-segment match over absolute-vs- relative path heuristic): the gitdir path contains: - `/modules/<name>` for submodules (skip — managed by parent repo) - `/worktrees/<name>` for worktrees (MANAGE — first-class repo) Both are documented Git internal layouts, stable across all 4 {relative, absolute} × {modules, worktrees} combinations including the absorbed-submodule edge case from `git submodule absorbgitdirs` (where the submodule's gitdir flips to an absolute path). Malformed `.git` file (no `gitdir:` prefix, IO error) → MANAGE, preserving the pre-garrytan#889 catch{} fail-closed-toward-managing semantics. 
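A minimal sketch of the path-segment discriminator, assuming the caller has already read the `.git` file and passes `null` on an IO error. `classifyGitFile` is a hypothetical name, and treating a path that contains both markers as a worktree is an assumption chosen to match the fail-toward-managing default.

```typescript
type GitFileDecision = "manage" | "skip";

// `contents` is the text of a `.git` FILE (not directory), or null if it
// could not be read.
function classifyGitFile(contents: string | null): GitFileDecision {
  if (contents === null) return "manage"; // IO error: keep managing

  const firstLine = (contents.split(/\r?\n/)[0] ?? "").trim();
  if (!firstLine.startsWith("gitdir:")) return "manage"; // malformed file

  // Normalize separators so the segment match works for either slash style.
  const gitdir = firstLine.slice("gitdir:".length).trim().replace(/\\/g, "/");

  const isWorktree = /(^|\/)worktrees\/[^/]+/.test(gitdir);
  const isSubmodule = /(^|\/)modules\/[^/]+/.test(gitdir);

  // Submodules are managed by the parent repo; everything else is managed here.
  return isSubmodule && !isWorktree ? "skip" : "manage";
}
```

The match does not care whether the gitdir is relative or absolute, which is why the absorbed-submodule case needs no special handling.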
Tests (5 new + 1 regression renamed): - REGRESSION: submodule relative gitdir/modules/ → skip (D49 contract) - absorbed submodule absolute gitdir/modules/ → skip (edge case) - CRITICAL: worktree absolute gitdir/worktrees/ → MANAGE (closes garrytan#889) - worktree relative gitdir/worktrees/ → MANAGE - malformed .git file → MANAGE (preserves catch behavior) - regular .git directory → MANAGE (existing smoke) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(walkers): pruneDir helper + descent-time exclusion + transcript predicate (closes garrytan#923, garrytan#202) Per plan-eng-review D12 (cross-model tension with codex C12+C13), three structural changes: 1. Extract `pruneDir(name)` helper in src/core/sync.ts. Returns false for directory names walkers must NEVER descend into: `node_modules` (latent bug — no leading dot), dot-prefix dirs (`.git`, `.obsidian`, `.raw`, `.cache`, etc.), `ops`, and `*.raw` sidecar dirs (gbrain convention — `people/pedro.raw/` holds raw source for pedro.md). Walkers consult it at descent time BEFORE recursion, saving the IO cost of walking entire vendor / hidden / sidecar subtrees only to filter them at file-emit time. 2. `isSyncable` itself gains the same exclusion set (via pruneDir on each path segment). Closes the latent bug where node_modules markdown files slipped through: `node_modules/some-pkg/README.md` returned true pre-fix because the legacy dot-prefix check only blocked `.node_modules` (with a leading dot), not the actual `node_modules`. CRITICAL regression test in test/sync.test.ts pins the contract per IRON RULE. 3. Two walkers rewritten to use pruneDir at descent + per-walker file predicate at emit: - `walkMarkdownFiles` (src/commands/extract.ts): pruneDir + isSyncable ({strategy:'markdown'}). Pre-fix this walker had ONLY an ad-hoc dot-prefix exclusion and didn't call isSyncable at all — descended into node_modules, emitted markdown files from there, ignored README/ ops/.raw filters. 
- `listTextFiles` (src/core/cycle/transcript-discovery.ts): pruneDir + own .txt/.md predicate. DOES NOT use isSyncable({strategy:'markdown'}) because transcripts accept .txt and don't share markdown sync's README/ops exclusions (codex C12). Also made RECURSIVE — pre-fix it walked only the top dir, so transcripts in `corpus/2026/` were invisible (codex C14 — descent-time pruning is the right shape but the test would have passed vacuously on a non-recursive walker). Verified blast radius before adding node_modules: every existing isSyncable caller (sync.ts:558-561 sync filter, frontmatter.ts:264 validate, brain-writer.ts:305 reverse-write, import.ts:454 import filter) wants node_modules excluded — this is a latent-bug fix, not a behavior change for any legitimate caller. Tests: - 7 new isSyncable cases including the node_modules CRITICAL regression - 6 new pruneDir cases (node_modules, dot-prefix, ops, *.raw, content dirs that should pass, empty-string default) - Existing extract.test.ts + extract-fs.test.ts unchanged and passing Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(todos): file v0.36.x follow-ups for runThink rewrite + Supabase bootstrap parity Two follow-up TODOs filed during the v0.36 dreamy-thompson wave: 1. runThink full rewrite (D5+D7 from plan-eng-review): drop the ThinkLLMClient indirection now that v0.36 routes through gateway.chat. 12+ tests need migration to __setChatTransportForTests. Blocked by this wave landing. 2. Supabase parity test for applyForwardReferenceBootstrap (codex C6 residual): real Docker Postgres E2E catches schema correctness but not Supabase pooler/direct-pool routing. The probe uses this.sql but PostgresEngine.initSchema chooses a DDL connection; the divergence has caused multiple historical wedges (garrytan#699, garrytan#820 lineage). Both entries include full context per the CLAUDE.md TODOS-format spec (what, why, pros, cons, blocked-by, plan reference). 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(bootstrap): thread DDL connection through applyForwardReferenceBootstrap Codex adversarial review during /ship caught a P1: initSchema selected a DDL connection, took pg_advisory_lock(42) on it, but applyForwardReferenceBootstrap used `this.sql` (the instance pool) inside. Bootstrap probes ran outside the lock scope on a different connection. Failure mode: two concurrent gbrain instances could BOTH enter the bootstrap block on Supabase transaction-pooler setups because the advisory lock was held on a different connection than the one running ALTER TABLE. The pooler's statement_timeout could also kill the probes mid-flight without affecting the lock-holder, leaving an inconsistent schema state. Fix: applyForwardReferenceBootstrap now accepts an optional connection parameter. initSchema passes the DDL conn (the one holding the lock). this.sql remains the fallback for any unit-test path that calls bootstrap directly. PGLite engine doesn't need this change — single connection, no pooler. This was pre-existing (every prior probe used this.sql), but the v0.36 wave is explicitly about fixing the Supabase upgrade-wedge class. Codex's position was correct: don't ship the wave with the underlying connection mismatch still there. The Supabase parity TEST FIXTURE follow-up remains on TODOS.md (test infra needed to PROVE the fix works under real pooler topology), but the bug itself is closed. 15/15 bootstrap tests pass. Typecheck clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v0.35.5.0) Six-correctness-fix wave: bootstrap forward-ref class (4 issues + 1 pre-empt), orphans soft-delete leak (both sides), runThink → gateway.chat adapter, git worktree vs submodule discriminator, walker pruneDir + descent-time exclusion, plus a Codex-P1 catch during /ship that threaded the DDL connection through applyForwardReferenceBootstrap. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: update CLAUDE.md for v0.35.5.0 backend correctness wave Fold v0.35.5.0 file-level annotations into CLAUDE.md: - postgres-engine.ts + pglite-engine.ts: 7 new applyForwardReferenceBootstrap probes (files.source_id/page_id, oauth_clients.source_id/federated_read, sources.archived/archived_at/archive_expires_at) + DDL connection threading - test/schema-bootstrap-coverage.test.ts: new MIGRATIONS-source introspection guard + parseBaseTableColumns comment-stripping fix - src/core/sync.ts: new pruneDir helper + manageGitignore worktree discriminator - src/core/think/index.ts (new entry): runThink gateway adapter for MCP stdio key resolution - src/core/operations.ts (new entry): findOrphanPages soft-delete filter Regenerate llms-full.txt via bun run build:llms. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
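For orientation, the pruneDir exclusion set from the walkers commit might look like the sketch below. The names are hypothetical and the real helper in src/core/sync.ts may differ, for example in how it treats an empty string, which this sketch does not model.

```typescript
// Returns false for directory names a walker should never descend into.
function pruneDirSketch(name: string): boolean {
  if (name === "node_modules" || name === "ops") return false;
  if (name.startsWith(".")) return false; // .git, .obsidian, .raw, .cache, ...
  if (name.endsWith(".raw")) return false; // sidecar dirs such as pedro.raw
  return true;
}

// Applies the same rule to every directory segment of a relative path,
// which is the shape of the per-segment check described for isSyncable.
function pathSurvivesPruning(relPath: string): boolean {
  const dirSegments = relPath.split("/").slice(0, -1); // drop the file name
  return dirSegments.every(pruneDirSketch);
}
```

Calling the directory check before recursion is what saves the IO: a pruned `node_modules` subtree is never read at all.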
garrytan#1108) * feat(supervisor-audit): shared isCrashExit + summarizeCrashes classifier Adds the read-side foundation for reading `likely_cause` off `worker_exited` audit events. Denylist semantics — only `clean_exit` and `graceful_shutdown` are non-crashes. Future unrecognized causes surface by default. `isCrashExit(event)` classifies a single audit event with legacy `code !== 0` fallback for pre-v0.34 entries lacking `likely_cause`. `summarizeCrashes(events)` aggregates a 24h window into a `CrashSummary` with per-cause counts (runtime_error, oom_or_external_kill, unknown, legacy) and a `clean_exits` total. Both helpers live next to `readSupervisorEvents` so the producer (the JSONL writer) and the consumers (doctor + jobs CLI) share one regression point. Test matrix pins all 9 isCrashExit branches plus 5 summarizeCrashes aggregation cases including the future-cause denylist regression guard. * fix(doctor,jobs): wire supervisor check to summarizeCrashes `gbrain doctor` and `gbrain jobs supervisor status` both counted every `worker_exited` audit event as a crash, regardless of `likely_cause`. After v0.34.3.0 added RSS-watchdog drains (code=0), the count inflated to 120+/day on a healthy brain — the alarm pattern users reported. Both surfaces now go through `summarizeCrashes(events)` (single regression point, can't drift). The warn threshold drops from `>3` to `>=1` now that the counter is calibrated; the per-cause breakdown (runtime=N oom=M unknown=K legacy=L) gives operators triage context in the message without grep'ing the JSONL audit. `gbrain jobs supervisor status --json` adds `crashes_by_cause` and `clean_exits_24h` fields so monitoring dashboards bind to the named buckets. 4 source-grep wiring assertions in doctor.test.ts pin both call sites against drift. 
* chore: bump version and changelog (v0.35.5.0)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs: document v0.35.5.0 supervisor-audit crash classifier

Add CLAUDE.md entry for src/core/minions/handlers/supervisor-audit.ts covering the new isCrashExit/summarizeCrashes/CrashSummary/CLEAN_EXIT_CAUSES exports. Extend doctor.ts and jobs.ts entries with the v0.35.5.0 wire-up: shared helper, denylist semantics, >=1 warn threshold, per-cause breakdown in messages, crashes_by_cause + clean_exits_24h in JSON. Regenerate llms-full.txt to match.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
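The denylist classifier from this PR can be sketched roughly as follows. The event and summary shapes, and every name with a `Sketch` suffix, are assumptions for illustration. Bucketing unrecognized future causes under `unknown` is also an assumption; the source only says they surface as crashes.

```typescript
interface WorkerExitEventSketch {
  code: number | null;
  likely_cause?: string; // absent on pre-v0.34 audit entries
}

interface CrashSummarySketch {
  total: number;
  by_cause: { runtime_error: number; oom_or_external_kill: number; unknown: number; legacy: number };
  clean_exits: number;
}

// Denylist: only these two causes are non-crashes.
const CLEAN_CAUSES_SKETCH = new Set(["clean_exit", "graceful_shutdown"]);

function isCrashExitSketch(e: WorkerExitEventSketch): boolean {
  if (e.likely_cause !== undefined) return !CLEAN_CAUSES_SKETCH.has(e.likely_cause);
  return e.code !== 0; // legacy fallback when likely_cause is missing
}

function summarizeCrashesSketch(events: WorkerExitEventSketch[]): CrashSummarySketch {
  const summary: CrashSummarySketch = {
    total: 0,
    by_cause: { runtime_error: 0, oom_or_external_kill: 0, unknown: 0, legacy: 0 },
    clean_exits: 0,
  };
  for (const e of events) {
    if (!isCrashExitSketch(e)) {
      summary.clean_exits++;
      continue;
    }
    summary.total++;
    if (e.likely_cause === undefined) summary.by_cause.legacy++;
    else if (e.likely_cause === "runtime_error") summary.by_cause.runtime_error++;
    else if (e.likely_cause === "oom_or_external_kill") summary.by_cause.oom_or_external_kill++;
    else summary.by_cause.unknown++; // includes causes added in future releases
  }
  return summary;
}
```

With a denylist, a cause name introduced later is counted as a crash until someone explicitly marks it clean, so new failure modes are visible by default.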
…loses garrytan#1091) (garrytan#1129)

* v0.35.6.0 feat(search): floor-ratio gate for metadata boost stages

Opt-in score-based gate on the three metadata-axis boost stages (backlink, salience, recency) inside `runPostFusionStages`. When `SearchOpts.floorRatio` or `search.floor_ratio` config is set, each stage skips results whose post-cosine-rescore score is below `floorRatio * topScore`. Default undefined preserves prior behavior bit-for-bit. Prevents weak-overlap candidates from accumulating metadata boosts and leapfrogging the legitimate primary hit on dense-embedder corpora.

Built on the contributor PR from @jayzalowitz (PR garrytan#1091, SkyTwin twin-memory layer). Refactored on top: threshold is computed ONCE at runPostFusionStages entry instead of per-stage (single-baseline semantic, order-independent); knobsHash bumped 2->3 so a no-floor cache write can't be served to a floor-enabled lookup; NaN scores skip the boost instead of bypassing the gate; SearchOpts/config/MODE_BUNDLES integration replaces the PR's PostFusionOpts-only surface; no env var (resolveSearchMode is pure by design).

Three correctness issues that the codex outside-voice review caught, all fixed before this landed:
- Cache contamination via knobsHash() (same bug class as v0.32.3 CDX-4 hotfix for the other search-lite knobs)
- NaN scores would have bypassed the gate (NaN < threshold is false in JS); realistic on Voyage flexible-dim / zembed-1 Matryoshka dim drift
- Negative top scores would have broken the "single result trivially eligible" claim; gate now disables on no-positive-signal inputs

Scope: gates metadata stages only. Exact-match boost (applyExactMatchBoost) runs independently as a lexical-relevance signal by design. Cross-source floor stays global (per-source deferred to v0.36 if federated-read users hit the suppression). Default-on for any mode bundle deferred until gbrain-side ablation against longmemeval / whoknows / suspected-contradictions / BrainBench-Real (TODOS.md).
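The gate semantics described in this commit can be sketched as below, under stated assumptions: `floorThresholdSketch` and `eligibleForBoostSketch` are hypothetical names, scores are plain numbers, and the threshold is computed once and shared by all three boost stages.

```typescript
// Computes the single baseline threshold, or null when the gate is off.
function floorThresholdSketch(scores: number[], floorRatio?: number): number | null {
  if (floorRatio === undefined) return null; // opt-in: unset means no gate

  let top = -Infinity;
  for (const s of scores) {
    if (!Number.isNaN(s) && s > top) top = s;
  }
  // No positive signal (empty input, all NaN, or a non-positive top score):
  // disable the gate rather than compute a meaningless floor.
  if (!(top > 0)) return null;

  return floorRatio * top;
}

// Decides whether one result may receive a metadata boost.
function eligibleForBoostSketch(score: number, threshold: number | null): boolean {
  if (threshold === null) return true; // gate off: prior behavior
  // Explicit NaN check, because `NaN < threshold` is false and a plain
  // comparison would let NaN scores slip past the gate.
  if (Number.isNaN(score)) return false;
  return score >= threshold;
}
```

With `floorRatio = 0.85` and a top score of 1.0, a result at 0.9 stays eligible for boosts while one at 0.5 does not.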
Plan + 9-decision review trail (D1-D9): ~/.claude/plans/swift-sniffing-nygaard.md. Empirical motivation, failure-mode framing, dense-embedder targeting, and the 0.85 starting value all from @jayzalowitz's labeled-retrieval ablation. Integration shape is gbrain-side. Test surface: 30+ new cases (computeFloorThreshold edge cases including T1a NaN / T1b negative top, three boost-function gate parity tests including T6 IRON-RULE applyRecencyBoost regression, runPostFusionStages single-baseline composition pin, KNOBS_HASH_VERSION bump from 2 to 3, floor-ratio-changes-hash cache-contamination prevention, loadOverridesFromConfig coverage for search.floor_ratio config key). bun run verify clean; full unit suite 6753 pass / 0 fail. Co-Authored-By: Jay Zalowitz <jayzalowitz@gmail.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: rewrite v0.35.6.0 CHANGELOG ELI10-lead-first; codify the rule in CLAUDE.md CHANGELOG entry for v0.35.6.0 was readable only by someone who already understood gbrain's internals (RRF, knobsHash, MODE_BUNDLES, runPostFusionStages, Matryoshka, CDX-4). Rewrote it so the first ~150 words explain what shipped in everyday English, with a concrete worked example, before any file paths or function names appear. Itemized changes section keeps the technical precision for engineers who need it. Then codified the rule in CLAUDE.md so future release entries land the same way. The "Release-summary template" section now has an iron rule: "lead ELI10, get precise after." No file paths or internal constants in the first 150 words; user-visible behavior change first; everyday-language column headers in any tables. Technical precision is required (the entry is still the technical record) but lives BELOW the plain-English lead, never before it. Smell test: if a reader who has never opened gbrain can walk away from the first 150 words knowing what shipped and whether they care, the entry passes. 
Ran `bun run build:llms` to regenerate the bundles and pick up the CLAUDE.md change (CI guard test/build-llms.test.ts pins committed bundles against fresh generator output).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jay Zalowitz <jayzalowitz@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…arrytan#1131) * feat(facts): typed-claim substrate + cycle correctness fixes (v0.35.6 wave 1/3) Schema (migration v67): - Add four optional typed-claim columns to facts: claim_metric TEXT, claim_value DOUBLE PRECISION, claim_unit TEXT, claim_period TEXT - Partial index facts_typed_claim_idx ON (entity_slug, claim_metric, valid_from) WHERE claim_metric IS NOT NULL - All nullable, metadata-only on both engines Fence layer: - ParsedFact (facts-fence.ts) gains optional claimMetric/Value/Unit/Period - Parser tolerates both 10-cell (legacy) and 14-cell (widened) rows - Renderer emits 14 cells iff any row has typed data; otherwise stays 10-cell so existing fences don't widen on unrelated edits - Numeric value cell tolerates comma thousand separators (50,000 -> 50000) Extract pipeline (D-CDX-2, D-ENG-1): - src/core/facts/extract.ts (the actual Haiku call site, NOT extract-facts.ts cycle phase) extends its system prompt to emit typed fields for metric-shaped claims - extractFactsFromFenceText gains optional pageEffectiveDate. Precedence: fence-row validFrom > pageEffectiveDate > undefined (engine defaults to now) - normalizeMetricLabel: 15-entry seed map for common founder metrics (mrr, arr, runway, headcount, team_size, cac, ltv, gross_margin, burn_rate, cash, users, mau, dau, churn_rate, revenue); unknown labels lowercase + space->_ Engine extensions: - NewFact + insertFact + insertFacts in both engines accept the four typed columns (all nullable) - Cycle phase extract-facts.ts threads page.effective_date through AND batch-embeds via gateway.embed() before insertFacts (D-CDX-3 fix for cycle-inserted facts arriving with embedding=NULL) Consolidate fix (D-CDX-4 — Codex F4): - Replace MAX(row_num)+1 INSERT with semantic upsert on (page_id, claim, since_date). 
Re-running the full cycle on stable input produces zero new takes — fixes the pre-existing duplicate-takes bug after extract_facts wipes consolidated_at - Chronological valid_until writeback per cluster: sort by (valid_from ASC, id ASC), walk pairs, set older.valid_until = newer.valid_from Tests: - test/migrate.test.ts +6 cases for v67 shape + materialization + nullable backward compat - test/facts-fence-typed.test.ts (new, 17 cases): parser+renderer round-trip, normalization seed map coverage, valid_from precedence three-branch - test/consolidate-valid-until.test.ts (new, 4 cases): chronological writeback (R4a), same-day id tiebreaker, cycle re-run zero duplicates (R4b/R7), valid_until idempotency - test/schema-bootstrap-coverage.test.ts: add four typed-claim columns to COLUMN_EXEMPTIONS (migration co-defines the partial index, no forward reference to bootstrap) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(trajectory): find_trajectory MCP op + eval/founder CLIs (v0.35.6 wave 2/3) Engine method (D-CDX-1, D-CDX-6): - BrainEngine.findTrajectory(opts) on both Postgres and PGLite - TrajectoryOpts: scalar sourceId fast path + sourceIds federated array (mirrors v0.34.1.0 search* dual pattern) - opts.remote: when true, SQL adds AND visibility='world' so OAuth read clients see only world-visibility facts (mirrors recall's posture — closes the F7 privacy regression Codex caught in plan review) - Single SQL query, ORDER BY valid_from ASC, id ASC for deterministic output (R3 pin). Returns TrajectoryPoint[] including raw embedding so the caller can compute drift without a second round-trip Pure function library (src/core/trajectory.ts, new): - detectRegressions(points, threshold): walks consecutive (metric, value) pairs per metric; emits when newer drops >= threshold below older. 
10% default, override via GBRAIN_TRAJECTORY_REGRESSION_THRESHOLD - computeDriftScore(points): 1 - mean(cosine(emb[i], emb[i-1])) over embedded points; clamped [0,1]; null when <3 embedded points (D-ENG-3 graceful degradation) - computeTrajectoryStats(points): composed shape returning both - TRAJECTORY_SCHEMA_VERSION = 1 — additive-only across releases (R5) MCP op (src/core/operations.ts): - find_trajectory: scope read, NOT localOnly. Routes through sourceScopeOpts(ctx) for federated isolation AND threads ctx.remote for visibility filtering. Strips raw Float32Array embeddings from the wire shape; converts valid_from to YYYY-MM-DD string - Registered in operations array after find_experts - FIND_TRAJECTORY_DESCRIPTION in operations-descriptions.ts CLIs: - gbrain eval trajectory <entity> [--metric M] [--since D] [--until D] [--limit N] [--json] — chronological human view with [REGRESSION] inline annotation; thin-client routing via callRemoteTool(find_trajectory). Dispatched in src/commands/eval.ts sub-subcommand block - gbrain founder scorecard <entity> [--since D] [--until D] [--json] — pure aggregation over Phase 2's substrate. Four signals: claim_accuracy (over resolved takes), consistency, growth_trajectory, red_flags. computeFounderScorecard exported for tests. 
Registered as top-level command in cli.ts; added to CLI_ONLY set Tests (45 cases across 5 files): - test/engine-find-trajectory.test.ts: 18 cases — chronological order, source scoping (scalar + federated), visibility filter on remote=true, metric + since/until filters, regression detection at threshold boundaries, drift score with various embedding states - test/operations-find-trajectory.test.ts: 9 cases — op registration, param validation, JSON envelope shape, R5 schema_version: 1, embedding stripped from wire, R6 visibility filter, source scoping - test/eval-trajectory.test.ts: 7 cases — arg parsing, --help, --json envelope, regression annotation, --metric filter, empty entity - test/founder-scorecard.test.ts: 9 cases — empty inputs no-NaN (G2), claim_accuracy math, consistency math, growth_trajectory math, red_flags fire for regression / narrative_drift / missed_prediction - test/eval-contradictions/no-valid-until-write.test.ts: 4 cases — R1 (probe never writes valid_until under eval-contradictions/) + R8 (only allow-listed files write valid_until anywhere in src/) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: v0.35.6.0 — CHANGELOG + VERSION + docs + migration note Bumps to v0.35.6.0 (next-minor after master's v0.35.5.1 — typed-claim substrate + trajectory + founder scorecard is a new user-facing feature surface, not a fix). - VERSION + package.json synced - CHANGELOG.md release-summary block in the wave-style voice, lead with what the user can now DO. Sections: typed metric claims in the fence, chronological metric trajectories, founder scorecard, MCP find_trajectory op, cycle re-run idempotency fix, embedding-on-insert fix, valid_from precedence fix. To-take-advantage-of block with verification + opt-in fence syntax example - CLAUDE.md Key Files entry consolidating the wave across eval-trajectory.ts + founder-scorecard.ts + trajectory.ts. 
Names every D-ENG / D-CDX decision and the Codex outside-voice F-numbers - skills/migrations/v0.35.6.md agent-readable migration note. Includes fence-syntax example for typed-claim rows so downstream agents start emitting them. Iron-rule contracts called out (R1 + R8 + R7 + visibility) - llms-full.txt regenerated to reflect the new CLAUDE.md entry Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: post-ship sync for v0.35.7.0 — trajectory + founder scorecard - README.md: add `gbrain eval trajectory` to EVAL section, add new TEMPORAL block covering `gbrain founder scorecard` + the GBRAIN_TRAJECTORY_REGRESSION_THRESHOLD env override; add v0.35.7 "What's new" paragraph below the v0.28.8 LongMemEval blurb - AGENTS.md: new bullet under Common tasks teaching agents to reach for `gbrain eval trajectory` / `gbrain founder scorecard` / the `find_trajectory` MCP op when asked to evaluate a founder/company over time - docs/contradictions.md: append "Temporal axis follow-on (v0.35.3.1 + v0.35.7)" subsection under See also, cross-linking the trajectory substrate and naming the auto-supersession.ts:4 invariant preserved by both the verdict enum (probe side) and consolidate's valid_until writeback (cycle side) - CLAUDE.md: fix stale (v0.35.4) tag on the trajectory entry to (v0.35.7) — version got rebumped twice during the merge wave - skills/migrations/v0.35.7.md renamed to v0.35.7.0.md for consistency with the v0.35.0.0.md / v0.14.0.md / etc naming convention - llms-full.txt regenerated to reflect the CLAUDE.md edit Coverage map (Diataxis): /eval trajectory CLI ✅ ref (README, AGENTS) ✅ how-to (CHANGELOG) ❌ tutorial /founder scorecard CLI ✅ ref (README, AGENTS) ✅ how-to (CHANGELOG) ❌ tutorial find_trajectory MCP op ✅ ref (CLAUDE.md, AGENTS, contradictions.md) typed-claim fence cols ✅ ref (skills/migrations/v0.35.7.0.md, CHANGELOG) Migration v67 ✅ ref (CLAUDE.md, CHANGELOG) No tutorial / explanation gaps worth filling in this PR — the migration note's 
fence-syntax example already covers the "first typed claim" walkthrough. ARCHITECTURE diagrams not drifted (the trajectory work extends existing facts/takes infrastructure; no new component boxes). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
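The two pure functions described for src/core/trajectory.ts can be approximated as below. This is a sketch under several assumptions: points arrive already sorted by valid_from, "drops >= threshold below older" is read as a relative drop, non-positive older values are skipped, embeddings are plain `number[]` rather than Float32Array, and the environment-variable override is not modelled. All names are hypothetical.

```typescript
interface TrajectoryPointSketch {
  metric: string | null;
  value: number | null;
  embedding?: number[] | null;
}

interface RegressionSketch { metric: string; from: number; to: number; drop: number }

// Walks consecutive values per metric and reports relative drops >= threshold.
function detectRegressionsSketch(points: TrajectoryPointSketch[], threshold = 0.1): RegressionSketch[] {
  const lastByMetric = new Map<string, number>();
  const found: RegressionSketch[] = [];
  for (const p of points) {
    if (p.metric === null || p.value === null) continue;
    const prev = lastByMetric.get(p.metric);
    if (prev !== undefined && prev > 0) {
      const drop = (prev - p.value) / prev;
      if (drop >= threshold) found.push({ metric: p.metric, from: prev, to: p.value, drop });
    }
    lastByMetric.set(p.metric, p.value);
  }
  return found;
}

function cosineSketch(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// 1 - mean cosine similarity of consecutive embeddings, clamped to [0, 1];
// null when fewer than 3 points carry an embedding.
function computeDriftScoreSketch(points: TrajectoryPointSketch[]): number | null {
  const embs = points
    .map((p) => p.embedding)
    .filter((e): e is number[] => Array.isArray(e) && e.length > 0);
  if (embs.length < 3) return null;
  let sum = 0;
  for (let i = 1; i < embs.length; i++) sum += cosineSketch(embs[i], embs[i - 1]);
  const drift = 1 - sum / (embs.length - 1);
  return Math.min(1, Math.max(0, drift));
}
```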
…rrytan#1138)

* feat(cycle): phantom-page redirect inside extract_facts (v0.35.8.0)

Drains the existing pile of unprefixed entity pages (alice.md, acme.md) that pre-PR-garrytan#1010 routing left behind. Folds the cleanup into the existing extract_facts cycle phase via two new lossless engine primitives so the v0.32.2 reconciliation contract owns drift handling instead of a parallel implementation duplicating it.

Layers:
- engine: refreshPageBody + migrateFactsToCanonical on Postgres + PGLite
- resolver: resolvePhantomCanonical + findPrefixCandidates (codex #1/#11)
- orchestrator: src/core/cycle/phantom-redirect.ts + phantom-audit JSONL
- cycle: sourceId/brainDir threaded; 3 new totals counters
- tests: 38 unit + 6 parity + 4 E2E (48 total) pinning all 12 codex findings

* fix(test): pin clock in sync_freshness boundary tests (CI flake)

CI test (1) failed: `sync_freshness check > exact 72h boundary → warn`. The test set `last_sync_at = Date.now() - 72h`, then checkSyncFreshness called Date.now() again to compute ageMs. Between the two reads the clock advanced (0.43ms in this CI run, microseconds locally) which pushed ageMs above the strict 72h fail threshold and flipped the status from warn to fail. Same shape latent in the 24h boundary test; fixed both.

Fix:
- checkSyncFreshness gains an optional `opts.nowMs` test-only seam. Production callers omit it and get live wall-clock semantics.
- Both boundary tests now capture nowMs once and thread it through both `last_sync_at` and the check, eliminating drift between reads.

Verified deterministic: 10 consecutive runs of the 72h boundary test pass on this machine (was occasionally failing before).
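The clock seam can be illustrated with a small sketch. `checkSyncFreshnessSketch` is a hypothetical stand-in: the real function's signature and return shape are not shown in this PR, and the 24h warn boundary being strict is an assumption. Only the strict 72h fail threshold is stated in the commit message.

```typescript
type FreshnessSketch = "ok" | "warn" | "fail";

const HOUR_MS = 60 * 60 * 1000;

// `opts.nowMs` is the test-only seam; production callers omit it and get
// the live wall clock.
function checkSyncFreshnessSketch(
  lastSyncAtMs: number,
  opts: { nowMs?: number } = {},
): FreshnessSketch {
  const nowMs = opts.nowMs ?? Date.now();
  const ageMs = nowMs - lastSyncAtMs;
  if (ageMs > 72 * HOUR_MS) return "fail"; // strictly older than 72h
  if (ageMs > 24 * HOUR_MS) return "warn"; // assumed strict boundary
  return "ok";
}

// A boundary test captures the clock once and threads it through both sides,
// so no time can elapse between computing last_sync_at and the check.
const pinnedNowMs = 1_700_000_000_000;
const exactlyAtBoundary = checkSyncFreshnessSketch(pinnedNowMs - 72 * HOUR_MS, { nowMs: pinnedNowMs });
```

Without the seam, the second `Date.now()` read inside the check lands a fraction of a millisecond later, which is enough to tip an exact-boundary age over the strict threshold.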
…aged-block install) (garrytan#1130) * feat(skillpack): extract copyArtifacts shared helper (T1) Pure file-copy primitive for scaffold (gbrain→host) and harvest (host→gbrain). Atomic-refusal contract: symlink-reject + canonical-path containment validate every item before any write. Used by both directions of the v0.33 loop. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(skillpack): scaffold subcommand + SKILL.md frontmatter sources (T2) New scaffold.ts replaces the managed-block installer. One-time additive copy into the user's repo via copyArtifacts; refuses to overwrite existing files (user owns them). Partial-state policy: copies missing paired sources even when the skill dir already exists. bundle.ts extended with loadSkillSources + enumerateScaffoldEntries — paired source files declared in each SKILL.md's frontmatter sources: array, not in openclaw.plugin.json. Single source of truth, co-located with the skill. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(skillpack): reference command + apply-clean-hunks (T4 + T15) reference is the read-only diff lens with an agent-readable framing line. Pure-JS unified-diff producer + parser + applier (no patch(1) dependency). Two-way merge with documented limitation: without scaffold-time base tracking, applied hunks align everything to gbrain. The agent dry-runs reference first, then decides. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(skillpack): migrate-fence + scrub-legacy-fence-rows (T5 + T16) migrate-fence is the one-shot transition from the pre-v0.36 managed-block model. Strips begin/end markers and the cumulative-slugs receipt comment; preserves fence rows verbatim as user-owned routing during the transition to frontmatter discovery. Receipt-then-row fallback (F-CDX-8) covers stale/missing receipts. scrub-legacy-fence-rows is the opt-in cleanup after migrate-fence. 
Two-condition gate: removes a row only when skills/<slug>/ exists AND that skill's frontmatter declares non-empty triggers (proof frontmatter discovery covers it). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(skillpack): harvest + privacy linter (T6 + T7) The inverse loop: lift a proven skill from a host repo (~/git/wintermute, etc.) back into gbrain so other clients can scaffold it. --from <host-repo-root> is symmetric with scaffold's --workspace. Security: symlink rejection + canonical-path containment (mirrors validateUploadPath). Privacy: default-on linter scans harvested files against ~/.gbrain/harvest-private-patterns.txt plus built-in defaults (Wintermute, email, Slack channel patterns). Any match rolls back the copy and exits non-zero. --no-lint bypasses for the editorial workflow after a manual scrub. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(repo-root): cwd_walk_up tier for non-OpenClaw hosts (T9 + D3) autoDetectSkillsDir now walks up from cwd looking for any skills/ directory, ahead of the implicit ~/.openclaw/workspace fallback. cd ~/git/wintermute && gbrain skillpack scaffold ... finds wintermute automatically without requiring a RESOLVER.md/AGENTS.md to exist yet. R5 regression preserved: $OPENCLAW_WORKSPACE still wins when explicitly set. +5 test cases in test/repo-root.test.ts pin the new tier order and the R5 guard. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(skillpack): rewrite CLI dispatch, drop install + uninstall (T3 + T10) skillpack.ts dispatcher rewritten for the v0.36 contract: scaffold, reference (+ --apply-clean-hunks), migrate-fence, scrub-legacy-fence-rows, harvest, plus the existing list / diff / check. install and uninstall are gone — both exit non-zero with a hint pointing at scaffold / migrate-fence. Clean break, no deprecated alias. skillpack-check gains --strict for CI gating. 
When invoked as the subcommand `gbrain skillpack check`, default is informational (exit 0 even with drift); --strict opts back into the cron-friendly exit-1-on-issues behavior. Top-level gbrain skillpack-check preserves its existing exit semantics for backwards compat. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(skills): skillpack-harvest editorial workflow + resolver wiring (T8) The companion editorial skill for the gbrain skillpack harvest CLI. Walks the genericization checklist (scrub fork names, generalize triggers, lift fork- specific conventions to references) before the CLI runs. Routing-eval fixtures use paraphrased intents to avoid the intent_copies_trigger lint. Wires the new slug into openclaw.plugin.json#skills, skills/manifest.json, and skills/RESOLVER.md. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * test(skillpack): 9-case real-subprocess E2E flow (T11) Spawns gbrain as a subprocess against tempdir workspaces. Covers: scaffold first-run + re-run no-op, reference diff + --apply-clean-hunks, migrate-fence, scrub-legacy-fence-rows, harvest privacy-lint catch + --no-lint bypass, and the install removed-error path. No DATABASE_URL needed — skillpack is filesystem-only. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * chore: docs + VERSION + CHANGELOG for v0.36.0.0 (T13 + T14) Skillpacks as scaffolding, not amber. v0.36 retires the managed-block install model. Six new subcommands replace install + uninstall: scaffold, reference (with --apply-clean-hunks), migrate-fence, scrub-legacy-fence-rows, harvest, plus the existing list / diff / check (check gains --strict for CI gating). Routing comes from each skill's frontmatter triggers — gbrain does not touch your RESOLVER.md or AGENTS.md. Companion editorial skill skillpack-harvest drives the genericization checklist; default-on privacy linter catches Wintermute / email / Slack references before they leak into gbrain core. 
New docs guide at docs/guides/skillpacks-as-scaffolding.md walks the model and the migration path for pre-v0.36 installs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(ci): privacy checks — allow-list harvest-lint tests, scrub user-facing fork-name references CI's check-privacy.sh and check-test-real-names.sh both flagged the literal fork name across the v0.36 skillpack diff. Two failure modes, two fixes: 1. **Meta-rule-enforcement files** added to both allow-lists. The harvest privacy linter's whole job is to catch the banned literal leaking into gbrain; its source has the regex pattern, its tests verify the linter fires by feeding it the banned string, and the skill markdown documents the substitution policy. Same exception status as check-privacy.sh and check-proposal-pii.sh themselves. Files allow-listed: - src/core/skillpack/harvest-lint.ts - test/skillpack-harvest-lint.test.ts - test/skillpack-harvest.test.ts - test/e2e/skillpack-flow.test.ts - skills/skillpack-harvest/SKILL.md 2. **User-facing references** swapped for canonical phrasing per CLAUDE.md's responsible-disclosure rule. README + new docs guide + 4 src docstrings + 1 test now say 'your OpenClaw' / 'host agent repo' / 'agentRepo' var name. Behavior unchanged — only documentation strings touched. Verify gate (the script CI runs) passes locally: EXIT=0. Tests still pass: 60/60 across the affected files. llms-full.txt regenerated. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(test): update check-resolvable-cli expectation for cwd_walk_up tier Sister fix to the test/repo-root.test.ts update in commit a31418e. The new v0.33 cwd_walk_up tier fires before repo_root when running from inside the gbrain repo — same skills/ dir matched, different source label. Behavior unchanged; the legacy repo_root tier is now functionally subsumed (kept in the type union for back-compat). CI shard 3 failure: test/check-resolvable-cli.test.ts:171. 
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(test): pin clock in sync_freshness boundary tests (CI flake) The 24h and 72h exact-boundary tests scheduled last_sync_at relative to Date.now() at construction time, then let the check call Date.now() again internally. CI scheduler jitter between the two reads pushed ageMs past the strict > thresholds by microseconds, dropping the 72h-boundary case into the fail branch instead of warn. Fix: add an optional `opts.now` test seam to checkSyncFreshness. The two boundary tests now capture t0 once and pass it both to the timestamp constructor and to the check, making ageMs deterministically equal to the boundary. The non-boundary tests (4d, 30h, 2h, etc.) don't need pinning — they're comfortably away from the > comparison. CI shard 1 flake: test/doctor.test.ts:479. Locally 48/48 doctor tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(skillpack): agent-onboarding readme + next-action hints on every CLI surface (DX review) DX audit of the v0.36 scaffold model surfaced one structural gap and four output gaps. When scaffolded files land on a downstream agent's disk, the agent had no agent-facing manifest telling it what to do — no routing contract, no upgrade flow, no two-way merge warning at the right surface. Fixes: 1. **New shared dep: skills/_AGENT_README.md.** Lands on every scaffold + migrate-fence alongside the existing _brain-filing-rules.md and _output-rules.md. Short, agent-readable contract: walk *.SKILL.md frontmatter triggers: for routing, gbrain is reference not law on upgrade, no managed-block fence anymore, two-way merge has known limitations. Single source of truth for the agent operating contract. 2. **scaffold stdout** prints a next-action hint pointing at the readme (with absolute path) and the reference --all upgrade-sweep command. 3. **reference stdout** adds per-category decision policy: - missing → scaffold again - differs → was edit intentional? keep it. 
Accidental? patch by hand or apply-clean-hunks after reading the two-way warning. 4. **reference --apply-clean-hunks** prints the two-way merge WARNING BEFORE the apply (to stderr, survives stdout redirect). Spells out that gbrain has no scaffold-time base and local edits in differing sections WILL be aligned to gbrain. Skipped in --json mode for machine consumers. On conflicts, prints how to inspect and patch. 5. **migrate-fence stdout** tells the agent its routing model just changed (fence gone, walk frontmatter now) and points at scrub-legacy-fence-rows as the eventual cleanup. References the new _AGENT_README for fresh-install agents. Smoke verified end-to-end: 16 files land (was 15, +1 for _AGENT_README), hint prints with absolute path, readme lands on disk. Tests + verify gate pass clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(skillpack): upgrade-time reference sweep + reference --since version filter (DX deferred items) Closes the last two DX gaps from the v0.36 audit: 1. **Post-upgrade reference sweep.** New `postUpgradeReferenceSweep` helper called at the end of `gbrain post-upgrade`. After migrations apply, auto-runs `reference --all` against the detected host workspace and prints a one-line-per-skill summary of drift. Five gates: GBRAIN_SKIP_REFERENCE_SWEEP env-var bypass, no detected workspace (silent), workspace IS gbrain repo (dev-mode silent), zero drift (silent), and pure-missing skills the host never scaffolded are filtered out as noise. All errors swallowed — never blocks post-upgrade. Helper accepts test-seam opts (gbrainRoot, targetWorkspace) for unit testability. 2. **`reference --all --since <version>`.** Filters the sweep to skills whose source actually changed in gbrain between <version> and HEAD, using a new `changedSlugsSinceVersion` helper in bundle.ts. Pure-JS git wrapper (spawnSync), no deps. Accepts bare '0.X.Y.Z' or 'v0.X.Y.Z' or commit SHA. 
Falls back loudly to full sweep when git can't resolve the ref (tarball install, missing tag). Test coverage added — total +32 new test cases: UNIT (15 cases): - test/skillpack-changed-since-version.test.ts (9 cases): git-aware filter against a fixture git repo. Covers null on non-repo, null on bad tag, empty array on no changes, single + multi-slug drift (deduped + sorted), bare + v-prefix version forms, non- skills/ path filtering, SHA-prefix ref form. - test/upgrade-reference-sweep.test.ts (6 cases): gate logic. Covers env-var bypass, zero drift, empty-host suppression, drift-detected output shape, dev-mode workspace==gbrain guard, error-swallowing contract. E2E (8 new cases in test/e2e/skillpack-flow.test.ts): - 10: scaffold lands skills/_AGENT_README.md - 11: scaffold stdout prints the Next: hint - 12: scaffold re-run (skipped-existing) suppresses the hint - 13: reference stdout prints per-category decision policy - 14: --apply-clean-hunks WARNING on stderr, not stdout - 15: --apply-clean-hunks --json suppresses the WARNING (bug fix surfaced here: code originally printed unconditionally, now gated on !json) - 16: migrate-fence stdout points at the new routing model - 17: --since with a bad tag falls back to full sweep with warn Local sweep: 579/579 pass across 18 affected test files, verify gate EXIT=0, llms regenerated. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(README): zero-base rewrite — 921 → 422 lines, refreshed catalog, MECE structure The README had drifted into a changelog dumping ground. Four 'New in vX.Y' paragraphs competed for the lead, 16 version tags scattered through headings, the production-numbers hook (17,888 pages, 4,383 people) was six months stale, and skills were described in three places (Skills section, Commands section, inline marketing prose). 
Zero-based rewrite: **Refreshed catalog** (surveyed live brain + live agent fork, broad strokes per CLAUDE.md privacy rules): - ~100K total brain items (was 17,888 in the old README — 6x stale) - ~16K people (was 4,383) - ~5K companies (was 723) - ~8K concepts, ~4K originals, ~3.5K daily notes - ~31K media (30K tweets, 179 books, papers/films/games/interviews) - 108 cron jobs running (was 21) - 273 skills in the live agent fork (35 bundled + 238 user-built) **Structure** — MECE, single source of truth per concept: 1. Hook + at-a-glance table (refreshed numbers) 2. Install (3 paths, terse) 3. What it does (5 capability areas — replaces 12 scattered sections) 4. Skills (categorized one-liners — 35 lines, was ~200) 5. How it works (one coherent flow — replaces 4 overlapping sections: Architecture, Knowledge Model, Knowledge Graph, Search, Why It Works) 6. Commands (terse cheatsheet — every command, one line each) 7. Docs (link map — points to docs/ for the heavy stuff) 8. Origin / Contributing / License **Cut entirely** (moved or deleted): - 4 'New in vX.Y' leads (→ CHANGELOG.md is the changelog) - 16 (vX.Y) version tags in section headings - Minions stats subsection (subsumed into hook + 'durable background work') - Voice section (was 12 lines of brand prose) - Engine Architecture detail (→ docs/architecture/) - File Storage section (→ docs/guides/storage-tiering.md) - Per-skill marketing prose (one-liner per skill in the table) The README is no longer the changelog. Future releases append to CHANGELOG.md; the README only changes when a structural capability does. llms-full.txt regenerated. Privacy check + verify gate pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(README): fix line-start '+' rendering bug + lead with eval evidence Two fixes in one: 1. 
**Markdown bug fix.** The OAuth 2.1 paragraph had `+ PKCE,` on a line start (column 1), which GitHub-flavored markdown interprets as a list marker — the line break before it broke the paragraph and rendered as an orphan first line followed by a bullet. Rewrote the OAuth 2.1 capabilities as inline-comma-separated, escaped the `+` semantics. Swept the whole file for the same bug class — no other instances. 2. **Maximum-sell mode for evals.** Surveyed every published benchmark in both this repo and ~/git/gbrain-evals. Strongest evidence pulled to the top: - **97.60% R@5 on the public LongMemEval _s (500 questions).** No LLM in the retrieval loop. $0.50 per 1000 queries. Beats MemPalace raw by a point on the same dataset, beats every academic dense retriever (Stella, Contriever, BM25). Mastra/Supermemory measure a different metric (QA accuracy with LLM judge) — flagged honestly. - **+31.4 points P@5 from the self-wiring knowledge graph** on BrainBench v0.20.0 (240-page rich-prose corpus, 145 relational gold queries). Separable, measured, load-bearing. Zero retrieval regression across seven releases (v0.16 → v0.20). New '## Benchmarks' section after Install: - Public benchmark table with cross-system comparison - In-house BrainBench scorecard with per-adapter Δ vs gbrain - Source-swamp resistance result (93.3% top-1 vs 80% grep-only) - Skill/prompt compression: 25KB → 13KB AGENTS.md, +13-17pp accuracy across Opus 4.7 / Sonnet 4.6 / Haiku 4.5 - 'Run your own evals' subsection with copy-pasteable commands for every eval surface (longmemeval, cross-modal, eval capture/replay, BrainBench) Tightened the lead's cost-comparison claim to what's defensible per the underlying eval doc (MemPal LLM-rerank $0.001/q vs gbrain $0.0005/q; dropped the overstated '6x' I'd written initially). Privacy + verify gate + build-llms test all pass. 
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(README): integrate the eval story into the lead, move jargon into 'Receipts on the evals' Previous lead dumped metric acronyms (R@5, P@5, P@5 deltas, MemPalace, Stella, Contriever, BM25) before the reader knew what gbrain does. A 'somewhat technical' reader hits the wall of jargon and bounces. Rewritten: **Lead (jargon-free, 3 paragraphs)** — describes the value in plain English, with two anchor numbers: - 'right answer in top 5 results 97.6% of the time' (not 'R@5 97.60%') - 'roughly 4x more relevant than plain vector RAG' (not '+31.4 pts P@5') - 'better than every comparable system that doesn't pay for a language- model call on every retrieval' (the load-bearing honest framing, without naming the competitors mid-hook) - ends with '[Receipts on the evals →]' linking down **'## Benchmarks' renamed '## Receipts on the evals'** with a glossary at the top defining R@5, P@5, and 'no LLM in the loop' in one line each. Then the full tables: LongMemEval cross-system (with the metric-mismatch flag for Mastra/Supermemory), in-house BrainBench scorecard, source-swamp resistance, and prompt compression. The competitor names + metrics stay here where readers who want the receipts can find them, with the glossary so the acronyms don't tax cold readers. Net: lead reads as 'here's what it does and the proof' instead of 'here are the benchmark numbers, figure out what they mean.' Comparison facts unchanged. Privacy + verify gate + build-llms test all pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(README): name LongMemEval explicitly + first-person voice in lead Two specific edits from user feedback: 1. 'the standard public benchmark for AI memory systems' → 'LongMemEval' (linked to the HuggingFace dataset). The benchmark has a name; use it. 2. 
'Built by the President and CEO of Y Combinator to run his own AI agents' (passive third-person) → 'I'm the President and CEO of Y Combinator, and I use this 16 hours a day' (active first-person). Carried the voice change through the rest of the README — the downstream 'Garry's personal agent' line and the Origin section's 'Garry Tan needed... he'd ever drafted... so he built one' all flip to first person ('my personal agent', 'I needed', 'I'd ever drafted', 'so I built one'). The README is now consistently first-person from the author's voice instead of a hagiographic third-person framing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(README): add Multi-player and company brains section Three deployment patterns documented: 1. Single GBrain server + thin MCP clients (recommended). Tailscale private networking, OAuth scope, source-scoped clients, exhaustive what-clients-can/cannot-do lists. 2. Local PGLite + GStack for per-worktree code search. 3. Federated repos (advanced) — multiple servers indexing the same brain repo. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(README): tighten install path + add tech-orientation block + visceral query example Self-eval as a cold reader surfaced four gaps blocking a 10/10 first read: 1. Lead never says WHAT it is technically — CLI? service? cloud? local? Added a "What it is, technically" block right after the hook: open-source MIT, Bun CLI + MCP server, local-first, data stays on disk, MCP-native. 2. Install path optimized for committed users not evaluators. The old "recommended" path (deploy OpenClaw on Render, 8GB RAM) blocked anyone trying gbrain for the first time. Reordered into 3 paths by commitment: 60-second standalone CLI first, MCP for Claude Code / Cursor second, full agentic install third. 3. No example output showing what success looks like. 
Added a real sample `gbrain query` invocation with the hybrid-search result format so a reader can feel the experience before they install. 4. Privacy / data-locality unaddressed in lead. Now stated up front: embedding calls only hit external APIs if you configure them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
… wrong (garrytan#1139) * schema: v0.36.0.0 Hindsight calibration tables (migrations v67-v71) Foundation commit for the Hindsight-inspired calibration wave. Adds four new tables + one perf index, all source-scoped from day 1 per v0.34.1 discipline: - calibration_profiles (v67): per-holder LLM-narrative aggregation of TakesScorecard data. published BOOL gates E8 cross-brain mount sharing (default false). grade_completion REAL surfaces partial-grade state to the dashboard. active_bias_tags TEXT[] with GIN index feeds E3 (calibration- aware contradictions) and E7 (real-time nudge matching). - take_proposals (v68): propose_takes phase queue. Idempotency cache via (source_id, page_slug, content_hash, prompt_version) unique index mirrors the v0.23 dream_verdicts pattern. proposal_run_id supports --rollback by run. dedup_against_fence_rows JSONB audit column records what canonical takes the LLM was told to dedupe against at proposal time. - take_grade_cache (v69): grade_takes verdict cache. Composite PK on (take_id, prompt_version, judge_model_id, evidence_signature) — prompt edits OR evidence changes cleanly invalidate prior verdicts. applied=false default + auto-resolve-off-by-default (D17) means every fresh install needs operator opt-in before grade verdicts mutate the takes table. - take_nudge_log (v70): E7 nudge cooldown state. Polymorphic FK — a nudge fires on either a canonical take OR a pending proposal (CDX-5 fix). CHECK constraint enforces exactly-one-set. channel column lets future routing (webhook, admin SPA toast) reuse the same cooldown semantics. - takes_resolved_at_idx (v71): partial index for the Brier-trend aggregation queries. Engine-aware handler — Postgres uses CONCURRENTLY to avoid the ShareLock; PGLite uses plain CREATE. Every table carries wave_version TEXT NOT NULL DEFAULT 'v0.36.0.0' so the v0.36.0.0 calibration --undo-wave command (lands later in the wave) can reverse just this wave's writes. 
Plan: ~/.claude/plans/system-instruction-you-are-working-rippling-knuth.md covers the design rationale (D17/D18/D21 + CDX findings). Schema parity: - src/schema.sql for fresh Postgres installs - src/core/pglite-schema.ts for fresh PGLite installs - src/core/schema-embedded.ts auto-regenerated from schema.sql - src/core/migrate.ts for upgrade-in-place from older brains VERSION bumped to 0.36.0.0 for the wave. CHANGELOG entry lands at /ship. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * core: BaseCyclePhase abstract class enforces source-scope + budget contracts D21 from the eng review. Three new v0.36.0.0 cycle phases (propose_takes, grade_takes, calibration_profile) share enough structure that the duplication-vs-abstraction trade tips toward a shared base. Without this scaffold, source-isolation discipline would drift exactly the way it drifted in v0.34.1 — except this time across three new surfaces at once. What this enforces: 1. Phase signature is uniform: run(ctx, opts) → PhaseResult. 2. ctx.sourceId / ctx.auth.allowedSources MUST be threaded through every engine call. The base class surfaces a scope() helper that wraps sourceScopeOpts(ctx) and is the only sanctioned way to read source- scoped data. Forgetting to thread source scope becomes a TypeScript compile error, not a runtime leak. Closes the v0.34.1 leak class structurally for every new phase. 3. Budget meter wraps run() automatically. Subclass declares budgetUsdKey + budgetUsdDefault; base reads the resolved cap from config and creates the BudgetMeter. Subclass calls this.checkBudget() before each LLM submit; budget-exhausted phase still returns status='ok' (clean abort) so the cycle report shows partial completion, not failure. 4. Error envelope is uniform. Thrown errors get caught and converted to status='fail' with a phase-specific error.code via the subclass's mapErrorCode() hook. 5. Progress reporter integration. 
Base accepts the reporter via opts; subclasses call this.tick() instead of touching the reporter directly, so the phase name in the progress stream is always correct.

Tests: 13 cases in test/core/base-phase.test.ts cover source-scope threading (5 cases including the empty-allowedSources-MUST-NOT-widen-scope regression), PhaseResult shape including the error envelope path (3 cases), dry-run propagation (2 cases), and budget meter construction (3 cases including config-key override).

Synthesize.ts / patterns.ts (existing pre-v0.36 phases) deliberately do NOT retrofit to this base in v0.36.0.0: too much churn for a refactor that doesn't pay off until v0.37+. Future phases use this by default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* cycle: propose_takes phase + take_proposals queue write path (T3)

LLM-based take extraction from markdown prose. Walks pages updated since last cycle, sends each page's body to a tuned extractor, writes the extracted gradeable claims to the take_proposals queue. User accepts / rejects via `gbrain takes propose --review` (lands in Lane C).

Cycle wiring:
lint → backlinks → sync → synthesize → extract → extract_facts → resolve_symbol_edges → patterns → recompute_emotional_weight → consolidate → propose_takes (NEW) → grade_takes (NEW; T4) → calibration_profile (NEW; T6) → embed → orphans → purge

CyclePhase enum extended with 3 new entries; ALL_PHASES + NEEDS_LOCK_PHASES updated. All three new phases acquire the cycle lock (writes to take_proposals / take_grade_cache / calibration_profiles).

Idempotency contract:
The (source_id, page_slug, content_hash, prompt_version) composite unique index on take_proposals means an unchanged page never re-spends LLM tokens. Bumping PROPOSE_TAKES_PROMPT_VERSION cleanly invalidates the cache so a tuned prompt re-runs proposals on every page. Mirrors the v0.23 dream_verdicts pattern.
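As a rough sketch of the idempotency contract above: an in-memory set stands in for the composite unique index, and SHA-256 is assumed for the content hash (the commit message names a `contentHash` helper but does not say which algorithm it uses). The class and method names are hypothetical.

```typescript
import { createHash } from "node:crypto";

// Assumption: content hash is SHA-256 over the page body.
function contentHash(body: string): string {
  return createHash("sha256").update(body, "utf8").digest("hex");
}

// Hypothetical in-memory stand-in for the unique index on
// (source_id, page_slug, content_hash, prompt_version).
class ProposalCache {
  private readonly seen = new Set<string>();

  // Returns true when the extractor should run, false on a cache hit.
  shouldExtract(
    sourceId: string,
    pageSlug: string,
    body: string,
    promptVersion: string,
  ): boolean {
    const key = JSON.stringify([
      sourceId,
      pageSlug,
      contentHash(body),
      promptVersion,
    ]);
    if (this.seen.has(key)) return false; // unchanged page, same prompt
    this.seen.add(key);
    return true;
  }
}

const cache = new ProposalCache();
const firstRun = cache.shouldExtract("default", "people/alice", "body v1", "p1");
const rerun = cache.shouldExtract("default", "people/alice", "body v1", "p1");
const afterEdit = cache.shouldExtract("default", "people/alice", "body v2", "p1");
const afterPromptBump = cache.shouldExtract("default", "people/alice", "body v1", "p2");
console.log({ firstRun, rerun, afterEdit, afterPromptBump });
```

Because the prompt version is part of the key, bumping it changes every key at once, which is what lets a retuned prompt re-run on pages whose content has not changed.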
F2 fence dedup:
The phase reads the page's existing `<!-- gbrain:takes:begin -->` fence (when present) and passes the canonical take rows to the extractor as "things you have already captured." Prevents duplicate proposals when prose is appended to a page that already has takes. Records the fence rows the LLM was told to dedupe against on the take_proposals row for audit (dedup_against_fence_rows JSONB).

Auto-resolve posture:
propose_takes only WRITES proposals to the queue. Nothing in this phase mutates the canonical takes table. Operator opt-in via the queue review CLI (Lane C) is the only path from queue to canonical fence (D17).

Prompt tuning status (v0.36.0.0 ship state):
The default extractor prompt is annotated `v0.36.0.0-stub`. The real tuned prompt arrives via T19 synthetic corpus build (50 anonymized pages, 3-model parallel extraction, user reviews disagreement set, F1 ≥ 0.85 on training corpus + F1 ≥ 0.8 on ground-truth holdout). Until T19 lands, propose_takes runs but produces best-effort candidates the user reviews manually.

Architecture:
ProposeTakesPhase extends BaseCyclePhase (T2). Inherits source-scope threading via scope(), budget metering via this.checkBudget(), error envelope wrapping. budgetUsdKey: cycle.propose_takes.budget_usd (default $5/cycle). Budget exhaustion mid-page returns status='warn' with details.budget_exhausted=true, giving clean partial-completion semantics.

Test seam: opts.extractor injection so the phase can run hermetically without touching the gateway. defaultExtractor (production path) calls gateway.chat with the EXTRACT_TAKES_PROMPT and parses the JSON array output via parseExtractorOutput.

parseExtractorOutput defends against common LLM output sins: markdown code fence wrapping, leading prose, single-object instead of array, unknown kind values, weight out of [0,1], rows missing claim_text or exceeding 500 chars.
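A defensive parser along the lines just described might look like the sketch below. This is not the actual parseExtractorOutput: the row shape, the allowed kind values, and the choice to drop bad rows while clamping out-of-range weights are all assumptions made for illustration.

```typescript
// Hypothetical row shape and kind list; the real values are not in the source.
interface ProposedTake {
  claim_text: string;
  kind: string;
  weight: number;
}
const ALLOWED_KINDS = new Set(["fact", "take", "bet", "hunch"]);
const MAX_CLAIM_CHARS = 500;
const FENCE = "`".repeat(3); // built at runtime to avoid a literal fence here

// Return the text inside the first markdown fence, or the input unchanged.
function stripFence(text: string): string {
  const open = text.indexOf(FENCE);
  if (open === -1) return text;
  const bodyStart = open + FENCE.length;
  const close = text.indexOf(FENCE, bodyStart);
  return close === -1 ? text.slice(bodyStart) : text.slice(bodyStart, close);
}

function parseExtractorOutput(raw: string): ProposedTake[] {
  const text = stripFence(raw);
  // Skip leading prose (and a "json" info string) by starting at the first
  // bracket, and skip trailing prose by ending at the last closing bracket.
  const start = text.search(/[\[{]/);
  const end = Math.max(text.lastIndexOf("]"), text.lastIndexOf("}"));
  if (start === -1 || end < start) return [];

  let parsed: unknown;
  try {
    parsed = JSON.parse(text.slice(start, end + 1));
  } catch {
    return [];
  }

  // A single object is treated as a one-element array.
  const rows: unknown[] = Array.isArray(parsed) ? parsed : [parsed];
  const out: ProposedTake[] = [];
  for (const row of rows) {
    if (typeof row !== "object" || row === null) continue;
    const r = row as Record<string, unknown>;
    const claim = r.claim_text;
    const kind = r.kind;
    const weight = r.weight;
    if (typeof claim !== "string") continue;
    const trimmed = claim.trim();
    if (trimmed.length === 0 || trimmed.length > MAX_CLAIM_CHARS) continue;
    if (typeof kind !== "string" || !ALLOWED_KINDS.has(kind)) continue;
    if (typeof weight !== "number" || !Number.isFinite(weight)) continue;
    // Assumption: out-of-range weights are clamped rather than rejected.
    const clamped = Math.min(1, Math.max(0, weight));
    out.push({ claim_text: trimmed, kind, weight: clamped });
  }
  return out;
}
```

Returning an empty array on unparseable output keeps the caller's loop simple; a real phase would likely also surface a warning so parse failures stay visible.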
Tests: 25 cases in test/propose-takes.test.ts cover the 4 pure helpers (parseExtractorOutput, contentHash, hasCompleteFence, extractExistingTakesForDedup) + 7 phase integration scenarios (happy path, cache hit, fence dedup, extractor failure, empty pages, skipPagesWithFence, proposal_run_id stability). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * cycle: grade_takes phase + take_grade_cache verdict pipeline (T4) Walks unresolved takes that are old enough to have outcome data, retrieves evidence from the brain, asks a judge model to verdict each one. Writes verdicts to take_grade_cache. Optionally — only when operator has flipped the opt-in config flag — auto-applies high-confidence verdicts to the canonical takes table via engine.resolveTake. Auto-resolve posture (D17 — DISABLED by default): On a fresh install, grade_takes runs and writes verdicts to the cache, but applied=false on every row. Operator reviews the queue, then flips `cycle.grade_takes.auto_resolve.enabled: true` once trust is earned. Mirrors the propose_takes review-queue posture: queue exists, mutation requires explicit opt-in. Conservative threshold (D12): When auto_resolve.enabled is true, a verdict auto-applies only when confidence >= 0.95 (single-judge path). T5 ensemble path lands next, tightening this further with 3/3 unanimous requirement. 'unresolvable' verdict NEVER auto-applies even at confidence=1.0 — there's no canonical column for "we tried and there's no evidence yet." Evidence retrieval status (v0.36.0.0 ship state): The default evidence retriever returns an "evidence-retrieval not yet wired" placeholder. Most verdicts produced by the stub-judge against the stub-evidence will be 'unresolvable'. Real retrieval (hybrid search over pages newer than the take's since_date, optionally augmented by a gateway web-search recipe in v0.37+) lands as a follow-up. 
Documented limitation per CDX-8 + D17 — the phase ships now so the wiring is real and the cache table accumulates verdicts even if early ones are conservative. Cache key: Composite primary key on take_grade_cache is (take_id, prompt_version, judge_model_id, evidence_signature). Prompt edits OR evidence changes OR judge swap cleanly invalidate prior verdicts. Mirrors the v0.32.6 eval_contradictions_cache pattern. evidence_signature = SHA-256 of (judge_model_id + '|' + evidence_text) so identical evidence under a different judge does NOT collide. Architecture: GradeTakesPhase extends BaseCyclePhase. Inherits source-scope threading, budget metering (cycle.grade_takes.budget_usd, default $3/cycle), error envelope. Test seam: opts.judge + opts.evidenceRetriever injection so the phase runs hermetically. parseJudgeOutput defends against fence-wrapping, leading prose, out-of-range confidence (clamps to [0,1]), invalid verdict labels, oversized reasoning (truncated at 400 chars). Returns null on unrecoverable parse — caller treats null as "judge_output_parse_failed / unresolvable at confidence 0.0" so the row still lands in cache with the parse failure surfaced via warnings. takeIsOldEnough gates on since_date (default 6 months). Tolerates YYYY-MM-DD and YYYY-MM formats. Returns false on null/unparseable since_date so takes without dates never get graded (we'd be hallucinating temporal context). Tests: 23 cases covering parseJudgeOutput (7 cases), evidenceSignature (3), takeIsOldEnough (5), and 8 phase integration scenarios — happy path, D17 auto-resolve-off default, D12 above-threshold auto-apply, below- threshold cache-only, unresolvable-NEVER-applies, cache hit, too-recent gate, judge-throw warning. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * cycle: grade_takes ensemble tiebreaker for borderline verdicts (T5 / E2) Multi-judge ensemble tiebreaker, additive on top of T4's single-judge foundation. 
Reuses gateway.chat as the per-model judge interface; runs three judges in parallel via Promise.allSettled. Pure aggregation logic in aggregateEnsemble() — no SQL, no LLM, hermetically testable. When ensemble fires (T5 trigger band): Only when ALL of: - opts.useEnsemble === true (default false) - opts.ensembleJudges array is non-empty - single-model confidence in [0.6, 0.95) (configurable via opts.ensembleTriggerBand) - single-model verdict !== 'unresolvable' Above 0.95 the single judge is already sufficient (T4 path). Below 0.6 the verdict is clearly review-only — ensemble wouldn't change the posture. 'unresolvable' from single-judge means no evidence yet; calling three more judges on the same evidence won't manufacture some. Conservative auto-apply (D12): Ensemble verdict auto-applies via engine.resolveTake only when ALL of: - autoResolve === true (operator opt-in per D17) - ensemble.agreement === 3 (3/3 unanimous) - ensemble.minConfidence >= ensembleThreshold (default 0.85) - winning verdict !== 'unresolvable' Schema-level monotonic-tightening guard for ensembleThreshold lives in the takes resolution layer. Cache identity: When ensemble fires, the cache row's judge_model_id becomes 'ensemble:<modelA>+<modelB>+<modelC>' — a future re-run with different ensemble membership doesn't collide with prior verdicts. evidence_signature is recomputed because it includes the judge_model_id. 
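The T5 trigger band and the ensemble auto-apply rule above are both conjunctions of simple conditions, so they can be sketched as two pure predicates. This is an illustrative sketch under assumptions: the function names, the `EnsembleOpts` and `EnsembleResult` shapes, and the element type of `ensembleJudges` are hypothetical; the thresholds and conditions are the ones listed in the commit message.

```typescript
// Hypothetical shapes; conditions and default values follow the commit message.
type Verdict = "correct" | "incorrect" | "partial" | "unresolvable";

interface EnsembleOpts {
  useEnsemble?: boolean; // default false
  ensembleJudges?: readonly unknown[]; // element type is an assumption
  ensembleTriggerBand?: [number, number]; // [inclusive low, exclusive high)
}

interface EnsembleResult {
  verdict: Verdict;
  agreement: number; // number of judges agreeing with the winning verdict
  minConfidence: number;
}

const DEFAULT_BAND: [number, number] = [0.6, 0.95];

function shouldRunEnsemble(
  single: { verdict: Verdict; confidence: number },
  opts: EnsembleOpts,
): boolean {
  const [low, high] = opts.ensembleTriggerBand ?? DEFAULT_BAND;
  return (
    opts.useEnsemble === true &&
    (opts.ensembleJudges?.length ?? 0) > 0 &&
    single.confidence >= low &&
    single.confidence < high &&
    single.verdict !== "unresolvable"
  );
}

function shouldAutoApplyEnsemble(
  e: EnsembleResult,
  autoResolve: boolean,
  ensembleThreshold = 0.85,
): boolean {
  return (
    autoResolve === true && // operator opt-in per D17
    e.agreement === 3 && // 3/3 unanimous
    e.minConfidence >= ensembleThreshold &&
    e.verdict !== "unresolvable"
  );
}
```

Keeping both checks free of SQL and LLM calls mirrors the stated design goal of hermetic testability.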
aggregateEnsemble (pure): - 3/3 unanimous → agreement=3, minConfidence=min across the three - 2/3 majority → agreement=2, minConfidence across the agreeing two - 1/1/1 disagreement → tie-break: prefer non-'unresolvable', then alphabetical for determinism - 'unresolvable' from one model NEVER tips a 2-vote majority toward 'unresolvable' — by-label tally only counts a model toward its own label - All three judges failing (allSettled rejected) → verdict='unresolvable' with agreement=0; auto-apply path blocked - Single judge survives + two fail → agreement=1; the lone verdict wins but auto-apply gated by the 3/3 requirement Tests: 16 cases. aggregateEnsemble (6): 3/3, 2/3, 1/1/1, unresolvable-tipping-resistance, all-failed, partial-failed-but-survives. Phase trigger conditions (5): useEnsemble=false default, useEnsemble=true in borderline band, single >= 0.95 skip, single < 0.6 skip, single = 'unresolvable' skip. Phase auto-apply rules (5): 3/3+threshold+autoResolve, 2/3 majority no apply, 3/3 below threshold no apply, one ensemble judge throws still aggregates from allSettled, empty ensembleJudges falls through to single. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * cycle: calibration_profile phase + shared voice gate across surfaces (T6) The calibration narrative layer. Reads TakesScorecard, asks an LLM to write 2-4 conversational pattern statements ("right on tactics, late on macro by 18 months"), passes them through the voice gate, derives active bias tags, writes the row to calibration_profiles. This is the read-side that E1 (think anti-bias rewrite), E3 (contradictions join), E6 (dashboard), and E7 (real-time nudges) all consume. Voice gate (D24 — single function, multiple surfaces): ALL five calibration UX surfaces import the same gateVoice() function from src/core/calibration/voice-gate.ts. 
Mode parameter ('pattern_statement' | 'nudge' | 'forecast_blurb' | 'dashboard_caption' | 'morning_pulse') drives surface-specific tuning via the rubric the gate ships to its Haiku judge. NO forked implementations — voice rubric drift would defeat the gate. Each mode's rubric explicitly forbids preachy / clinical / corporate voice; a structural test pins this. Anchors the cross-cutting voice rule from /plan-ceo-review D2-D8. Fallback policy (D11): Up to 2 generation attempts (configurable). On both rejects → fall back to a hand-written template from src/core/calibration/templates.ts. Templates are intentionally short and a little "robotic" — they're the safety net, not the destination. voice_gate_passed=false + voice_gate_attempts get persisted on the calibration_profiles row so the operator can review the failing examples and tune the rubric over time. Suppressing the surface silently is NEVER an option — that's how voice quality silently degrades. parseJudgeOutput defaults to 'academic' on parse failure (NEVER passes pass-through) so a Haiku output garble falls through to the template rather than letting unverified text reach the user. calibration_profile phase: Extends BaseCyclePhase. Cold-brain skip: <5 resolved takes → no row written, no LLM call. Otherwise: scorecard via engine.getScorecard() → patterns via voice-gated generator → bias tags via separate generator (best-effort; failure logs warning, phase continues). The DB INSERT lands in the v67 calibration_profiles row with source_id, holder, the patterns, voice gate audit fields, active bias tags, and grade_completion (F1 fix — partial-grade state surfaces to the dashboard "60% graded" badge). Budget gate at $0.50/cycle default (mostly Haiku). Below-budget before-LLM-call check returns status='warn' without writing the row. 
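The D11 fallback policy (up to 2 attempts, then the hand-written template, with the audit fields persisted) can be sketched as a small retry loop. This is a simplified, synchronous sketch: the real gate calls a Haiku judge and is presumably asynchronous, and `generateWithVoiceGate` plus its injected `generate` and `gate` callbacks are hypothetical stand-ins. The `voice_gate_passed` and `voice_gate_attempts` field names are the ones named in the commit message.

```typescript
// Simplified synchronous sketch; function and parameter names are assumptions.
interface GateResult {
  text: string;
  voice_gate_passed: boolean;
  voice_gate_attempts: number;
}

function generateWithVoiceGate(
  generate: (attempt: number) => string, // stand-in for the LLM generator
  gate: (text: string) => boolean, // stand-in for the voice-gate judge verdict
  fallbackTemplate: string, // stand-in for a templates.ts entry
  maxAttempts = 2,
): GateResult {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const candidate = generate(attempt);
    if (gate(candidate)) {
      return { text: candidate, voice_gate_passed: true, voice_gate_attempts: attempt };
    }
  }
  // Every attempt was rejected: surface the template instead of suppressing the output.
  return { text: fallbackTemplate, voice_gate_passed: false, voice_gate_attempts: maxAttempts };
}
```

The loop always returns text, which matches the rule that silently suppressing the surface is never an option.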
Per-domain scorecards are a placeholder for v0.36.0.0 ship state — the F12 batchGetTakesScorecards() engine method that powers per-domain rendering lands in Lane C alongside the CLI/MCP surface. Architecture: parsePatternStatementsOutput is tolerant of LLM emitting numbered lists / bulleted lines despite the prompt asking for plain lines. Caps at 4 patterns + drops excessively long lines (>200 chars). parseBiasTagsOutput lowercases input + drops non-kebab-case tokens (defends against the LLM emitting "Over-Confident Geography" with spaces or capitals). Caps at 4 tags. Tests: 43 cases across two new test files. voice-gate.test.ts (24): parseJudgeOutput (7), gateVoice happy path (3), fallback path (5), mode parity (2), templates (7). calibration-profile.test.ts (19): parsers (10), pickFallbackSlots (3), phase integration (6 — cold-brain skip, happy path, voice gate fallback, grade_completion plumbed through, bias-tags failure non-fatal, source_id scope reaches INSERT). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * cli: gbrain calibration + get_calibration_profile MCP op (T7) Public-facing read surface for the v0.36.0.0 calibration wave. CLI prints the active calibration profile; MCP op exposes the same data path for agents. Mirror of the v0.29 salience/anomalies shape (pure data fn + JSON formatter + human formatter + thin CLI dispatch). CLI: `gbrain calibration` Flags: --holder <id> specific holder (default 'garry') --json machine output for piping --regenerate run calibration_profile phase now --undo-wave <ver> [placeholder — wires in Lane D / T17] ab-report [placeholder — wires in Lane D / T18] Human output: Calibration profile — holder: garry, source: default Generated: <local timestamp> [Note: built on 60% graded — partial completion this cycle.] (when grade_completion < 0.9) [Note: voice gate fell back to template (2 attempts).] 
(when voice_gate_passed=false) Resolved: 12 takes Brier: 0.210 (lower is better) Accuracy: 60.0% Partial: 10.0% Pattern statements: • You called early-stage tactics well — 8 of 10 held up. Active bias tags: over-confident-geography Cold-brain fallback message names the exact dream command to run. MCP: `get_calibration_profile` (scope: read) Param: holder?: string (defaults to 'garry') Returns: latest CalibrationProfileRow | null Source-scoping via sourceScopeOpts(ctx): scalar source-bound clients see only their source; federated_read scopes see the union of allowed sources; no source filter when neither is set (CLI default path). Throws GBrainError('INVALID_HOLDER') on empty/non-string holder so remote callers get a structured error instead of a SQL-shape failure. Architecture: getLatestProfile is the pure data fn — engine + opts → CalibrationProfileRow | null. Reused by both the CLI and the MCP op. Source-scoped via the standard v0.34.1 spread pattern (scalar sourceId vs sourceIds array). formatProfileText is pure — null → cold-brain message, populated → full printout. Annotates partial-grade rows and voice-gate-fallback rows so the operator sees data-quality status inline. parseArgs is exported via __testing for unit coverage. Sub-command ('ab-report') vs flag distinction is intentional — keeps the surface parallel with `gbrain eval cross-modal` etc. Tests: 21 cases. parseArgs (6 cases): empty, --holder, --json, --regenerate, --undo-wave, ab-report. getLatestProfile (5 cases): happy, null, scalar source scope, federated array scope, no-source-filter default. formatProfileText (5 cases): cold-brain, happy, partial-grade note, voice-fallback note, published-to-mounts note. getCalibrationProfileOp (5 cases): default holder, scalar source scope, federated scope union, returns-null-on-unknown-holder, throws on empty holder. 
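The holder handling described for `get_calibration_profile` (default 'garry', structured INVALID_HOLDER error on empty or non-string input) can be sketched as below. `GBrainErrorSketch` is a local stand-in for the repo's GBrainError, whose real constructor signature is not shown in this message, and `resolveHolder` is a hypothetical helper name.

```typescript
// Local stand-in for GBrainError; the real class and its signature are assumptions.
class GBrainErrorSketch extends Error {
  readonly code: string;
  constructor(code: string, message: string) {
    super(message);
    this.code = code;
  }
}

const DEFAULT_HOLDER = "garry";

function resolveHolder(raw: unknown): string {
  if (raw === undefined) return DEFAULT_HOLDER; // param omitted: use the default holder
  if (typeof raw !== "string" || raw.length === 0) {
    // Fail with a structured code before the value can reach a SQL query.
    throw new GBrainErrorSketch("INVALID_HOLDER", "holder must be a non-empty string");
  }
  return raw;
}
```

Validating at the op boundary is what lets remote callers see a structured error rather than a SQL-shape failure.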
Lane D follow-ups: --undo-wave (T17) and ab-report (T18) print a clear "lands in Lane D" stderr line + exit 2; the surfaces exist for early testers, the implementations land next. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * think: --with-calibration + anti-bias prompt rewrite (T8 / E1, D22) Optional anti-bias rewrite mode for `gbrain think`. When set, the active calibration profile gets injected per the D22 placement spec (AFTER retrieval evidence, BEFORE the user's question). The bias filter applies to QUESTION FRAMING, not evidence interpretation — matches LLM-as-judge best practice (bias prompts near end of context perform better). Default behavior unchanged (R1 regression guard): omitting --with-calibration produces the v0.28-vintage user-message shape with the question first, then retrieval. Existing think users see no change. Two user-message shapes in buildThinkUserMessage: Default (no calibration): Question: X <pages>...</pages> <takes>...</takes> <graph>...</graph> Respond with a single JSON object... With calibration (D22): <pages>...</pages> <takes>...</takes> <graph>...</graph> <calibration holder="garry"> Track record: Brier 0.210 (lower is better). Active patterns: - You called early-stage tactics well — 8 of 10 held up. Active bias tags: over-confident-geography </calibration> Question: X Respond... Calibration block is built by buildCalibrationBlock (exported for the E3 contradictions probe to render the same shape). System prompt extension (withCalibration:true): - Names BOTH the user's PRIOR (default reasoning) AND the COUNTER-PRIOR from their hedged-domain self. - References active bias tags by name when relevant ("this fits the over-confident-geography pattern"). - Does NOT silently substitute the debiased answer. ALWAYS surfaces both priors transparently. - Adds a "Calibration" section between Conflicts and Gaps in the answer body. 
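The two user-message shapes above differ only in ordering, which a short sketch makes concrete. This is an illustration of the D22 placement, not the actual buildThinkUserMessage: the `ThinkMessageInput` shape, the function name, and the blank-line separator are assumptions.

```typescript
// Hypothetical input shape; only the block ordering comes from the commit message.
interface ThinkMessageInput {
  question: string;
  retrieval: string; // the <pages>/<takes>/<graph> blocks, already rendered
  instruction: string; // the trailing "Respond with ..." line
  calibrationBlock?: string; // the <calibration ...> block when opted in
}

function buildUserMessageSketch(input: ThinkMessageInput): string {
  const question = `Question: ${input.question}`;
  const parts =
    input.calibrationBlock === undefined
      ? // Default (R1 regression guard): question first, then retrieval.
        [question, input.retrieval, input.instruction]
      : // D22: retrieval, then calibration, then question, then instruction.
        [input.retrieval, input.calibrationBlock, question, input.instruction];
  return parts.join("\n\n");
}
```

Placing the calibration block directly before the question is what applies the bias filter to question framing rather than to evidence interpretation.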
RunThinkOpts extension: - withCalibration?: boolean — opt-in - calibrationHolder?: string — defaults to 'garry' When withCalibration=true and no profile exists, runThink falls back to baseline behavior + pushes NO_CALIBRATION_PROFILE to warnings (visible to the operator). When the calibration fetch fails, CALIBRATION_FETCH_FAILED warning surfaces with the underlying error. Either path keeps think working; the calibration loop is enhancement, not requirement. CLI: `gbrain think "<q>" --with-calibration [--calibration-holder <id>]` Tests: 11 cases. buildThinkSystemPrompt (4 cases): R1 regression — default/false/omitted → no anti-bias rules; with calibration → adds PRIOR + COUNTER-PRIOR + bias-tag reference; preserves existing hard rules. buildCalibrationBlock (3 cases): happy path, null brier omitted (not "Brier null"), empty patterns + tags still well-formed. buildThinkUserMessage (4 cases): R1 regression — without calibration: question first; D22 placement — retrieval → calibration → question → instruction; graph + calibration ordering; empty retrieval blocks render placeholders without breaking shape. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * contradictions: calibration-profile join (T9 / E3) Cross-references each contradiction finding against the active calibration profile. When a contradiction's domain matches an active bias tag (e.g. "over-confident-geography" or "late-on-macro-tech"), the output gains a one-line bias context explaining which pattern this fits. Pure functions only — no DB writes, no LLM calls. The probe runner imports tagFindingWithCalibration() and applies it to each finding before emitting. When no profile exists or no tags match, the helper returns null and the runner emits the unchanged finding (regression R2 — contradictions output is byte-identical to v0.32.6 when no calibration profile is present). 
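The R2 regression posture (a null tag leaves the finding untouched) can be sketched as a thin mapping step in the runner. This is a hedged illustration: `annotateFindings`, the `Finding` fields, and the `bias` property used to attach the tag are hypothetical, and the tagger is injected so the sketch makes no claim about the real match heuristic. The `{ bias_tag, context } | null` return shape is the one stated in the commit message.

```typescript
// Hypothetical shapes; only the "null means unchanged finding" rule is from the source.
interface Finding {
  slug: string;
  verdict: string;
  severity: string;
}

interface BiasTag {
  bias_tag: string;
  context: string;
}

interface Profile {
  active_bias_tags: string[];
}

type Tagger = (finding: Finding, profile: Profile | null) => BiasTag | null;
type AnnotatedFinding = Finding & { bias?: BiasTag };

function annotateFindings(
  findings: Finding[],
  profile: Profile | null,
  tagger: Tagger,
): AnnotatedFinding[] {
  return findings.map((finding) => {
    const tag = tagger(finding, profile);
    // No profile or no matching tag: emit the original object so output is unchanged.
    return tag === null ? finding : { ...finding, bias: tag };
  });
}
```

Returning the original object on a null tag keeps serialized output identical to the pre-calibration shape.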
Match heuristic (v0.36.0.0 ship-state): Bias tags are kebab-case axis-then-domain slugs ('over-confident-geography'). computeDomainHint() extracts a domain hint from the finding's slugs + holder + verdict text: - wiki/companies/... → hiring | market-timing - wiki/people/... → founder-behavior - macro / geography / tactics / ai segments in slug → matching tag First-match-wins for ordering determinism. Match is intentionally fuzzy — the v0.32.6 contradictions probe doesn't yet carry structured domain metadata. v0.37+ structured-domain-on-takes (Hindsight-style enum) tightens this. Output: Returns { bias_tag: string, context: string } | null. Context format: "This contradiction fits your active bias pattern \"<tag>\" (Brier 0.31). Verdict: contradiction; severity: medium. Consider reviewing both sides through the lens of that pattern." Tests: 13 cases. R2 regression (2): null profile → null tag; empty active_bias_tags → null tag. computeDomainHint (5): companies / people / macro / geography / unknown paths produce expected hints. Match path (4): macro→late-on-macro-tech, geography→over-confident-geography, mismatch returns null, first-match-wins with multiple candidate tags. buildBiasContextString (2): emits tag+verdict+severity+Brier; omits Brier when null (no "Brier null" leak). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * calibration: Brier-trend forecast at write time (T10 / E5) Pure math layer over existing TakesScorecard data. Zero new LLM cost, zero new schema. Surfaces the user's historical Brier for the take's (holder, domain) bucket at write time so they see "your historical Brier in macro takes is 0.31" before committing the take. Voice-gate-rendered output: The user-facing string goes through gateVoice mode='forecast_blurb' via templates.ts (already in T6). This module is the pure data layer; the template renders the math into the conversational voice. v0.36.0.0 ship state: Bucket dimension is the DOMAIN (slug-prefix). 
The conviction-weight bucket dimension would need a new engine method (engine.batchGetTakeBucketStats per F11) — deferred to v0.37+. Until then, forecast = historical Brier in this holder's domain. resolveDomainPrefix() keeps slug-prefix-looking domain hints ('companies/', 'wiki/macro') and falls back to overall for free-form hints ('macro tech', 'geography'). Hindsight-style structured domain on takes (CDX-11 mitigation TODO) tightens this in v0.37+. MIN_BUCKET_N = 5: Below this sample size, the forecast returns predicted_brier=null with insufficient_data=true. Template renders "Forecast unavailable: only N resolved takes at this conviction yet" instead of a noisy estimate. Architecture: computeForecast(input) — pure function, takes scorecards already fetched; ideal for tests + reuse across batched paths. forecastForTake(engine, input) — convenience wrapper, 1-2 engine round-trips (no domain → 1; with domain → 2). batchForecast(engine, inputs[]) — memoizes per (holder, domainPrefix); N inputs collapse to ≤2*unique_holders unique engine calls. Used by the propose-queue review flow (50 candidates → 1-2 scorecard fetches). Tests: 14 cases. computeForecast (4): insufficient_data branch, stable forecast, overall fallback, MIN_BUCKET_N export. resolveDomainPrefix (5): undefined/empty/whitespace → undefined; slug-prefix → kept; free-form → undefined. forecastForTake (3): 1-call overall, 2-call domain, free-form fallback. batchForecast (2): cache collapse for repeat queries; different holders do not collapse. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * calibration: gstack-learnings coupling on incorrect resolutions (T11 / E4) When the grade_takes phase auto-resolves a take as 'incorrect' or 'partial', optionally write a learning entry to gstack's per-project learnings.jsonl so other gstack skills (plan-ceo-review, ship, investigate, ...) can pull it as context when relevant. The brain teaches every other tool about the user's track record. 
Config gate (D5 / CDX-17 mitigation): `cycle.grade_takes.write_gstack_learnings` defaults FALSE. External users may not have gstack installed; the gstack-learnings binary API isn't stable yet. Garry's brain flips it true to opt in. Quality gate: Only 'incorrect' and 'partial' verdicts trigger the write. 'correct' resolutions are noise (we expected the take to hold up — no learning). 'unresolvable' has no canonical column. Defense-in-depth runtime guard in writeIncorrectResolution() rejects ineligible qualities with reason='quality_not_eligible' so a caller misuse never surfaces a malformed learning entry. Auto-apply only: Coupling fires only when grade_takes both auto-applies AND the verdict is incorrect/partial AND the config flag is enabled. Manual resolutions via `gbrain takes resolve` intentionally DO NOT propagate to gstack — manual writes already carry operator intent; the calibration loop is the noise-prone path that earns coupling. Namespace: Every entry's key starts with 'gbrain:calibration:v0.36.0.0:'. Lane D `gbrain calibration --undo-wave v0.36.0.0` (T17) filters on this prefix for the optional gstack-scrub step. First active bias tag suffixes the key (e.g. 'take-42:over-confident-geography') so future analysis can group learnings by bias pattern. Architecture: buildLearningEntry — pure. Truncates claim at 200 chars + ellipsis; emits Pattern: line when activeBiasTags present; defaults confidence to 0.8 when caller omits it. writeIncorrectResolution — async wrapper. Honors config gate; honors quality gate; calls the injected writer (or defaultGstackWriter in production). Failures are non-fatal: returns { written: false, reason: 'write_failed' | 'binary_missing', error }. The grade_takes phase logs to result.warnings and continues — gstack coupling failure NEVER aborts a cycle. defaultGstackWriter — shells out to gstack-learnings-log binary via execFileSync. 
Throws GBrainError('GSTACK_BINARY_NOT_FOUND') when the binary isn't on PATH; writeIncorrectResolution classifies that error to reason='binary_missing' so the operator sees the install hint instead of a generic write_failed. Wired into grade-takes.ts after engine.resolveTake() inside the auto-apply block. Only fires when shouldApply=true. Tests: 14 cases. buildLearningEntry (7): canonical shape, partial vs incorrect wording, bias-tag suffix, no-tag fallback, claim truncation, default confidence, no-reasoning omission. writeIncorrectResolution (7): config gate, quality gate, happy path, writer-throw graceful degrade, binary-missing classification, async writer awaited, partial quality writes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * doctor: 4 calibration checks — abandoned/freshness/drift/voice (T12) Adds the four calibration doctor checks per the eng-review spec. abandoned_threads: Counts active high-conviction takes (weight >= 0.7) older than 12 months that have never been superseded. Signal, not error — always status='ok' with a count. The hint sends users to `gbrain calibration` for details. calibration_freshness: Warns when the active profile is older than 7 days (configurable via the same env-var pattern other freshness checks use). Cold-brain branch (no profile yet) returns ok without scolding. Hint points at `gbrain calibration --regenerate`. grade_confidence_drift (CDX-11 mitigation): Surfaces the count of auto-applied grade verdicts. Below 30: returns "need 30+ for drift detection". At/above 30: returns "drift math arrives in v0.37+". The surface is wired; the actual confidence-vs-accuracy correlation math is a v0.37+ follow-up once we have 30+ auto-applied verdicts to measure against. Closes the CDX-11 hole structurally — the operator sees the surface even before the math is meaningful. voice_gate_health: Tracks voice gate failure rate over the last 7 days. <30% fail rate → ok (template fallback is fine in isolation). 
>=30% → warn with hint to review src/core/calibration/voice-gate.ts rubric. Anchors the cross-cutting voice rule observability story. All four checks return status='warn' with a diagnostic message on engine errors — non-blocking, never throws. Matches the existing doctor check pattern (see checkSyncFreshness for prior art). Wired into runDoctor after checkRerankerHealth (the v0.35 cluster), in the canonical block 10 slot. Tests: 15 cases. 4 per check (happy path, alt-status, engine-throw diagnostic, plus boundary tests for the freshness staleness gate at exactly 7 days and the grade drift gate at 30 applied verdicts). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * calibration: E7 nudge + 14-day cooldown (T13 / D16 F3) Real-time pattern surfacing when a newly-committed high-conviction take matches an active bias pattern. Conversational nudge text via the templates module; 14-day cooldown per (take_id, nudge_pattern) via take_nudge_log to prevent the feedback loop where each cycle re-fires the same nudge on the same take. Threshold gates (D16 F3): - holder match (profile.holder === take.holder) - conviction-weight > 0.7 (strict greater than) - take's slug-derived domain hint matches an active bias tag (takeDomainHint — same heuristic as eval-contradictions/calibration-join.ts for cross-surface consistency) Cooldown gate: Before firing, probe take_nudge_log for (take_id, nudge_pattern) rows with fired_at >= now() - 14 days. Any hit → silently skip. After firing, insert a new row with channel='stderr' so the next 14 days are gated. Feedback-loop prevention: User hedges a take in response to a nudge (e.g. weight 0.85 → 0.65). Even though the take's `weight` field changed, the cooldown row for the over-confident-geography pattern is still there from the original fire — so the next cycle's evaluateAndFireNudge() silently skips. The user reset path (gbrain takes nudge --reset N) clears the cooldown to re-arm. 
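The threshold gates and the 14-day cooldown above can be sketched with an in-memory log standing in for take_nudge_log. This is a sketch under stated assumptions: the function names, the `Take` and `Profile` shapes, and the substring match between domain hint and bias tag are hypothetical (the message only says the hint "matches" a tag); the strict `> 0.7` comparison, first-match-wins ordering, and the `fired_at >= now - 14 days` probe follow the commit message.

```typescript
// Hypothetical shapes; an array stands in for the take_nudge_log table.
interface Take {
  id: number;
  holder: string;
  weight: number;
  domainHint: string | null; // slug-derived hint, e.g. "geography"
}

interface Profile {
  holder: string;
  active_bias_tags: string[];
}

interface NudgeLogRow {
  take_id: number;
  nudge_pattern: string;
  fired_at: number; // epoch milliseconds
}

const CONVICTION_THRESHOLD = 0.7;
const COOLDOWN_MS = 14 * 24 * 60 * 60 * 1000;

function matchNudgeTag(take: Take, profile: Profile | null): string | null {
  if (profile === null) return null;
  if (profile.holder !== take.holder) return null;
  if (!(take.weight > CONVICTION_THRESHOLD)) return null; // strict greater than
  const hint = take.domainHint;
  if (hint === null) return null;
  // Assumption: a tag matches when it contains the hint; first match wins.
  return profile.active_bias_tags.find((tag) => tag.includes(hint)) ?? null;
}

function inCooldown(log: NudgeLogRow[], takeId: number, pattern: string, now: number): boolean {
  const cutoff = now - COOLDOWN_MS;
  return log.some(
    (row) => row.take_id === takeId && row.nudge_pattern === pattern && row.fired_at >= cutoff,
  );
}
```

Because the cooldown key is (take_id, nudge_pattern) and not the take's weight, hedging a take after a nudge does not re-arm the nudge.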
Output channel (v0.36.0.0 ship state): STDERR only. Schema's `channel` column already supports multi-channel (webhook, admin SPA toast); routing those is a v0.37+ follow-up. Architecture: evaluateNudgeRule(take, profile) — pure rule check. Returns { matched, reason, matchedTag }. No engine call. checkCooldown(engine, takeId, pattern) — engine probe, returns boolean. recordNudgeFire(engine, opts) — INSERT into take_nudge_log. evaluateAndFireNudge(opts) — full pipeline. Returns NudgeDecision. resetNudgeCooldown(engine, takeId) — DELETE...RETURNING for the CLI. buildNudgeText delegates to templates.ts nudgeTemplate (D24 mode='nudge' voice). v0.36.0.0 ship state uses the template directly; LLM-generated nudge text via the voice gate lands in v0.37+ when we have production examples to tune from. Tests: 22 cases. takeDomainHint (5): companies/people/macro/geography/unrecognized. evaluateNudgeRule (6): no_profile, wrong_holder, conviction-at-threshold-is-NOT-eligible (strict >), no matching tag, happy match, first-match-wins for multiple candidate tags. checkCooldown (3): true on row hit, false on no row, cutoff date param verifies the 14-day boundary. evaluateAndFireNudge (4): happy fire (text contains hush command + matched tag), cooldown silent skip (no INSERT, no stderr), no_profile short-circuit, below-conviction short-circuit (no cooldown query fired). buildNudgeText (2): hush command shape, conviction value embedded. resetNudgeCooldown (2): returns count, idempotent on zero rows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * calibration: E8 team-brain sharing + D18 cross-brain query semantics (T14) Cross-brain calibration profile resolution per the D18 4-rule contract. Pins all four cross-brain leak surfaces in dedicated unit tests so future mount features can't silently regress this security model. D18 semantics (committed): Rule 1 — LOCAL-FIRST ORDERING. Query the local brain first. If a profile exists, return it. 
Do NOT also query mounts (avoids stale-mount-overrides-fresh-local). Verified: mountResolver is NOT called when local has a hit. Rule 2 — MOUNT FALLBACK. Only when local has no profile AND canReadMounts=true, walk the mounts in priority order. First match wins. Each mount-side row must have published=true to be visible (D15 asymmetric opt-in). Rule 3 — CROSS-BRAIN ATTRIBUTION. Every returned profile carries source_brain_id + from_mount flag. Consumers (E1 think rewrite, E3 contradictions, E7 nudge, E6 dashboard) MUST surface this via attributionSuffix() so the user sees which brain answered. Rule 4 — SUBAGENT PROHIBITION. canReadMountsForCtx() classifier returns FALSE for subagent loops without trusted-workspace allowedSlugPrefixes. Closes the OAuth-token-to-cross-brain-leak surface — subagents see ONLY their local-brain results regardless of which holder they query. Exception: trusted cycle phases (synthesize/patterns) pass allowedSlugPrefixes set and ARE allowed to read mounts. Pinned in the classifier test. Architecture: queryAcrossBrains(localEngine, opts) — pure orchestrator. Composes getLatestProfile() from src/commands/calibration.ts. Mount engine access is via opts.mountResolver — production wires this to the v0.19+ gbrain mounts subsystem; tests inject a stub returning an ordered list of mocked engines. Decouples cross-brain LOGIC from multi-engine PLUMBING. canReadMountsForCtx(ctx) — pure classifier table. Drives the rule-4 gate. Production callers compose it from OperationContext. attributionSuffix(result) — pure formatter. Emits the "(from mounted brain: <id>)" suffix when from_mount=true; empty string when local. Mandatory for user-visible cross-brain consumers. Tests: 15 cases pinned to the 4 D18 rules + 4 supplementary structural checks. D18-1: published=false profile on mount stays hidden. D18-2/3: subagent context cannot fall back to mounts (2 cases — null on local-empty + canReadMounts=false, local hit still returned). 
D18-4: attribution surfaces source_brain_id (3 cases — mount answer flag, local answer flag, attributionSuffix formatter). Rule 1 local-first ordering (2 cases — mountResolver NOT called on local hit, IS called on local empty). Mount priority order (3 cases — first published=true wins, all published=false returns null, no mounts configured returns null without throwing). canReadMountsForCtx classifier (4 cases — local CLI true, MCP non-subagent true, subagent without trusted-workspace false, subagent WITH trusted-workspace true). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * admin: E6 Calibration tab + D23 server-rendered SVG + TD2 contrast bump (T15) Adds the v0.36.0.0 admin SPA Calibration tab. Per the design review, the approved variant-B (Linear calm clarity) layout: single-column flow, generous whitespace, ONE big sparkline as hero, then patterns, then domain bars, then abandoned threads. D23 server-rendered SVG architecture: src/core/calibration/svg-renderer.ts — pure functions. data → SVG string. No DOM, no React, no chart library dep. Inlines the admin design tokens (#0a0a0f bg, #3b82f6 accent, etc.) so the SVG is visually consistent with the rest of the admin SPA. Four chart renderers: - renderBrierTrend({ series }) — sparkline w/ baseline reference at 0.25 (always-50% baseline) - renderDomainBars({ bars }) — horizontal accuracy bars per domain - renderAbandonedThreadsCard(threads) — D30/TD4 'revisit now' link per row, points at /admin/calibration/revisit/<takeId> - renderPatternStatementsCard(statements) — D29/TD3 clickable drill-down links per row, point at /admin/calibration/pattern/<i> XSS posture: all caller-controlled strings pass through escapeXml(). Numeric inputs are .toFixed()-coerced. Admin SPA renders via dangerouslySetInnerHTML inside a TrustedSVG wrapper component; endpoint is gated by requireAdmin middleware. /admin/api/calibration/profile — returns the active profile row as JSON. 
/admin/api/calibration/charts/:type — returns image/svg+xml markup for type ∈ {brier-trend, domain-bars, pattern-statements, abandoned-threads}. Cache-Control: private, max-age=60. brier-trend currently renders a single-point series from the active profile (the time-series view across calibration_profiles.generated_at history is a v0.37 follow-up once we have multiple snapshots). abandoned-threads pulls the top 5 abandoned rows via the same SQL the doctor check uses. CalibrationPage React component (admin/src/pages/Calibration.tsx): Fetches profile + 4 charts. Loading / error / cold-brain states all handled. Layout includes the audit annotations (partial-grade badge, voice-gate-fell-back-to-template badge) per the approved mockup. TrustedSVG wrapper isolates the dangerouslySetInnerHTML to the SVG surface only. App.tsx nav: added 'calibration' page route + sidebar nav item, hash routing extended to support #calibration. TD2 contrast bump: admin/src/index.css --text-muted: #555 → #777. Old value was contrast 4.0 on the #0a0a0f bg — below WCAG AA 4.5 for body text. New value is ~5.5, passes AA. Improvement is global across Dashboard, Agents, RequestLog, and the new Calibration tab — single-line CSS change with ~10x the impact. admin/dist/ rebuilt via `bun run build` (vite). 36 modules transformed. Tests: 19 cases in test/svg-renderer.test.ts. escapeXml (1): canonical entities. renderBrierTrend (6): empty state, polyline for 2+ points, clamp beyond yMax, design tokens inlined, XSS safety on date strings, text-anchor end on right label. renderDomainBars (4): empty state, label/accuracy/n rendering, out-of-range accuracy clamp, XSS safety on labels. renderAbandonedThreadsCard (4): empty state, row rendering with revisit link, claim truncation at 70 chars, custom revisitHref override. renderPatternStatementsCard (4): empty state, anchor count matches statement count, XSS safety, custom drillHref override. 
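The XSS posture described for the SVG renderer (caller-controlled strings escaped, numeric inputs clamped and `.toFixed()`-coerced) can be sketched as follows. This is a minimal illustration, not the contents of svg-renderer.ts: `renderLabelSketch` and its markup are hypothetical, and this `escapeXml` is a plain five-entity escape written for the example. The `#3b82f6` accent token is the one named in the commit message.

```typescript
// Illustrative escape covering the five predefined XML entities.
function escapeXml(s: string): string {
  return s
    .replace(/&/g, "&amp;") // must run first so later entities are not double-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Hypothetical renderer fragment: data in, SVG string out, no DOM involved.
function renderLabelSketch(label: string, accuracy: number): string {
  const clamped = Math.min(1, Math.max(0, accuracy)); // out-of-range accuracy clamp
  const pct = (clamped * 100).toFixed(1); // numeric input coerced to a fixed-format string
  return `<text fill="#3b82f6">${escapeXml(label)}: ${pct}%</text>`;
}
```

Escaping at render time keeps the output safe even before it reaches the dangerouslySetInnerHTML wrapper on the client.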
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * recall: calibration footer formatter for morning pulse (T16) Pure formatter that turns a CalibrationProfileRow + optional abandoned-threads list into the conversational block the morning pulse will surface: Calibration this quarter: Brier 0.18 (solid). Right on early-stage tactics, late on macro by 18 months. Over-confident on team execution; under-calibrated on regulatory risk. Threads you opened and never came back to: · AI search platform differentiation (17 months silent) · International expansion playbook (12 months silent) Cold-brain branch: returns empty string when no profile or < 5 resolved takes. Caller decides whether to render the block; cold-brain absence is the cleanest non-event. Brier trend note maps the absolute value to conversational copy: <= 0.10 → "(strong calibration)" <= 0.20 → "(solid)" <= 0.25 → "(near baseline)" > 0.25 → "(worse than always-50% baseline — review your high-conviction calls)" v0.36.0.0 ship state has only the current profile snapshot. The "was 0.22 90d ago — improving" comparison shape arrives when we accumulate generated_at history across multiple cycles. R3 regression posture: This module is the FORMATTER only. Wiring into `gbrain recall`'s text output is intentionally NOT in this commit — runRecall's surface stays unchanged. v0.37 wires it under --show-calibration (opt-in initially, default-on later). For now the formatter is callable from the admin tab + custom CLI scripts that want it. Architecture: buildRecallCalibrationFooter(opts) — pure. opts.profile required, opts.abandonedThreads optional, opts.threadColumnWidth defaults to 50. Caps at 4 patterns + 5 abandoned threads to keep the footer scannable. Truncates long abandoned-thread claim text to fit the column width with a trailing ellipsis. Tests: 14 cases. Cold-brain branch (3): null profile, < 5 resolved, zero resolved. 
Happy path (7): header + Brier + patterns, trend note ranges (4 brackets), null brier omits the Brier line but keeps header, caps at 4 patterns. Abandoned threads (4): omit section when none, emit when present, cap at 5, truncate long claim with column-width override. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * calibration: --undo-wave reversal command (T17 / D18 CDX-3) Implements the undo-wave reversal flow. Every new row written by the v0.36.0.0 calibration wave carries wave_version='v0.36.0.0' so a precise revert is possible without touching pre-wave data. CLI surface (replaces the v0.36.0.0 ship-state placeholder): gbrain calibration --undo-wave v0.36.0.0 [--dry-run] [--scrub-gstack] [--json] Reversal scope (4 steps): Step 1 — UNSET takes.resolved_* columns for takes auto-applied by this wave. Identifies wave-applied takes via take_grade_cache.applied=true + wave_version match. Cross-checks resolved_by='gbrain:grade_takes' to ensure we're not un-resolving a take a manual `gbrain takes resolve` override has since claimed. Manual resolutions persist; only auto-grade resolutions revert. Step 1b — Mark take_grade_cache rows applied=false post-undo so the audit trail shows they WERE applied but this wave was reverted. The CDX-11 confidence-drift check filters on applied=true and gets a cleaner sample post-undo. Step 2 — DELETE FROM calibration_profiles WHERE wave_version = ?. Step 3 — DELETE FROM take_nudge_log WHERE wave_version = ?. Step 4 — Optional gstack-learnings-prune via the binary, scoped to the GSTACK_LEARNING_NAMESPACE prefix. Opt-in via --scrub-gstack. Best-effort: binary-missing or failure logs a warning + suggests the manual command; the rest of the undo still succeeded. Dry-run posture: --dry-run computes the counts via SELECT COUNT(*) shapes without emitting any UPDATE or DELETE. Same UndoWaveResult shape returned so operator sees exactly what would be reverted before committing. 
--dry-run intentionally skips the gstack scrub (filesystem write) too; ship-state safety call. Idempotency: Re-running --undo-wave on a brain that's already reverted is a no-op. Each query filters on wave_version; no matching rows → zero counts. Architecture: undoWave(engine, opts) — async, returns UndoWaveResult. Pure data layer; no stderr writes, no process exits. CLI dispatch in src/commands/calibration.ts handles printing. v0.36.0.0 ship state runs steps 1-3 sequentially (no transaction). Partial reversal is recoverable via re-run since each step is idempotent on wave_version match. A future enhancement (v0.37+) can wrap in engine.transaction once that surface lands in BrainEngine. Tests: 8 cases in test/undo-wave.test.ts. Dry-run posture (1): counts emitted, NO UPDATE/DELETE SQL fired. Happy path (3): all 4 steps execute, resolved_by filter scopes UPDATE to wave-applied resolutions, custom resolvedByLabel honored. Empty wave (2): zero counts when no matching rows, idempotent re-run. Wave-version parameter threading (2): supplied version threads through all queries, different wave versions don't collide. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * calibration: A/B harness for think + ab-report (T18 / D19 CDX-18) Structural answer to CDX-18 (anti-bias rewrite may make advice worse). We don't have to guess whether calibration helps — we measure. Architecture: runAbTrial(input) — calls thinkRunner TWICE on the same question (baseline + --with-calibration), surfaces both answers to a preferenceResolver, persists the trial to think_ab_results. buildAbReport(engine, { days }) — aggregates the table over the last N days (default 30). Computes win counts, ties, neither, and a with_calibration_win_rate over DECISIVE trials only (excludes neither/tie). Flags calibration_net_negative when n >= 20 AND win rate < 45%. formatAbReport(report, days) — pretty-prints for stdout; emits the calibration_net_negative warning block when triggered. 
CLI: gbrain calibration ab-report [--days N] [--json] Reads the table, prints the breakdown. Replaces the v0.36.0.0 ship-state placeholder in src/commands/calibration.ts. gbrain think --ab "<question>" Wires into runAbTrial via the dispatch in src/commands/think.ts — follow-up commit. This commit lands the harness layer + schema + report surface; the --ab flag itself flips on in a one-line wiring commit when the runRecall path is ready. Schema (migration v72 / think_ab_results): source_id, wave_version, ran_at, question, baseline_answer, with_calibration_answer, preferred (CHECK in {baseline, with_calibration, neither, tie}), model_id, notes. CHECK constraint enforces preferred enum. Default wave_version 'v0.36.0.0' stamped so --undo-wave can scrub these too. Index on (source_id, ran_at DESC) supports the report's "last N days" query. schema.sql + pglite-schema.ts both updated for fresh-install parity. schema-embedded.ts regenerated via build:schema. calibration_net_negative threshold (D19): Triggers when: - decisive_trials (baseline + with_calibration) >= 20 - with_calibration_win_rate < 0.45 (NOT <= — exact 45% is OK) Small-sample guard (n < 20) prevents the warning from firing on early data with sampling noise. Confidence-flat threshold (no Wilson CI yet) keeps the math simple; v0.37+ adds CI bounds. Tests: 12 cases in test/think-ab.test.ts. runAbTrial (4): both runner calls fire, preferenceResolver receives both answers, INSERT row params shape, throws when thinkRunner missing. buildAbReport (5): zero trials, aggregation, net_negative trigger at n>=20 + win<45%, no trigger at n<20 (small-sample guard), no trigger at exact 45% boundary. formatAbReport (3): zero-state message, decisive-trials breakdown, net_negative warning block. 
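The calibration_net_negative rule above reduces to a few lines of arithmetic. The sketch below is an assumption-laden illustration: `summarizeAbTrials` and the `AbSummary` shape are hypothetical names, and the real buildAbReport reads rows from think_ab_results rather than taking an array.

```typescript
type Preferred = "baseline" | "with_calibration" | "neither" | "tie";

interface AbSummary {
  decisiveTrials: number;
  withCalibrationWinRate: number | null; // null when there are no decisive trials
  calibrationNetNegative: boolean;
}

// Hypothetical helper mirroring the D19 thresholds described in the commit.
function summarizeAbTrials(preferred: Preferred[]): AbSummary {
  const baselineWins = preferred.filter((p) => p === "baseline").length;
  const calibrationWins = preferred.filter((p) => p === "with_calibration").length;
  // "neither" and "tie" are excluded from the denominator.
  const decisiveTrials = baselineWins + calibrationWins;
  const withCalibrationWinRate =
    decisiveTrials === 0 ? null : calibrationWins / decisiveTrials;
  // Small-sample guard (n >= 20) plus a strict < 0.45 comparison.
  const calibrationNetNegative =
    withCalibrationWinRate !== null &&
    decisiveTrials >= 20 &&
    withCalibrationWinRate < 0.45;
  return { decisiveTrials, withCalibrationWinRate, calibrationNetNegative };
}
```

Because the comparison is strict, 9 calibration wins out of 20 decisive trials (exactly 45%) does not trigger the warning, while 8 out of 20 does.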
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * calibration: pattern drill-down route + revisit-now CLI (TD3 / D29 + TD4 / D30) TD3 (D29) — clickable pattern drill-down endpoint: GET /admin/api/calibration/pattern/:id (requireAdmin) Returns the pattern statement at index `id` plus the top 25 resolved takes for the holder, sorted by weight desc. v0.36.0.0 ship-state approximation: surfaces broad provenance evidence (top resolved takes). v0.37+ stores per-pattern source_take_ids[] on a calibration_profile_patterns join table so the drill-down shows the EXACT takes that drove the pattern. Surfaces a `provenance_note` field in the response so the operator sees the v0.36.0.0-vs-v0.37 fidelity boundary inline. The admin SPA's renderPatternStatementsCard SVG already emits anchor tags pointing at /admin/calibration/pattern/<i> (T15 ship state). This route makes those anchors clickable — closes the trust loop that was the rationale for D29 ("pattern statements without their evidence are dressed-up LLM hallucinations"). TD4 (D30) — `gbrain takes revisit <slug>` editor-open action: Adds the `revisit` subcommand to gbrain takes. Opens $EDITOR (falling back to vi) on the source markdown file for the slug. Appends a `<!-- gbrain:revisit -->` cursor marker at the bottom of the page on first invocation so the editor opens with intent visible. Reads sync.repo_path from config to locate the brain repo. Refuses to proceed with a clear error when the repo isn't configured or the page doesn't exist. spawnSync with stdio:'inherit' so the editor takes the terminal. Exit status surfaced on failure. The SVG renderer's revisit-now anchor for each abandoned thread row emits /admin/calibration/revisit/<takeId>. A small route handler that resolves take_id → page_slug then dispatches `gbrain takes revisit` via spawn is a v0.37 follow-up — the CLI command exists now so developers can wire it directly. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: DESIGN.md — formalize de facto design tokens (TD1) Promotes the admin SPA's de facto design tokens (landed v0.26.0) to a canonical DESIGN.md at the repo root. This is the calibration target for /plan-design-review and /design-review going forward — when a question is "does this UI fit the system?", the answer is here. Captures the system as it stands today: Voice (5 surfaces, all routed through gateVoice() with mode-specific rubrics): pattern_statement, nudge, forecast_blurb, dashboard_caption, morning_pulse. Friend-not-doctor; concrete data over abstract metrics; no preachy / clinical / corporate language. Color tokens: 10 CSS variables from admin/src/index.css inlined into the SVG renderer (src/core/calibration/svg-renderer.ts). Dark theme is the only theme — admin is an operator tool. WCAG contrast documented per token; TD2's #555 → #777 bump on --text-muted noted. Typography: Inter for UI, JetBrains Mono for numbers/slugs/data. Type scale (18 / 14 / 13 / 12 / 11) documented as de facto, not yet formalized. Spacing scale: 4 / 8 / 16 / 24 / 32px. Linear-app density. Layout: sidebar 200px, max content 720px (text) / 960px (tables). No 3-column feature grids, no icons in colored circles, no decorative blobs. Charts: server-rendered SVG via pure functions in src/core/calibration/svg-renderer.ts. XSS posture documented: server-side escapeXml on caller-controlled strings, numeric inputs .toFixed()-coerced, admin SPA renders via <TrustedSVG> wrapper. Interaction patterns: keyboard nav required (J/K/space/u/q on the propose-queue), loading/empty/error states ARE features. v0.37+ roadmap: type scale formalization, animation tokens, component library extraction. Light mode explicitly NOT planned. The doc is a living target, not a frozen spec. Major changes route through /plan-design-review per the existing review chain. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* calibration: synthetic corpus scaffold + privacy CI guard (T19 + T20)

T19: synthetic corpus scaffold for extract-takes prompt tuning.

test/fixtures/calibration/extract-takes-corpus/: 5 representative pages across 4 genres (essay, people, companies, meetings/decisions). v0.36.0.0 ships a SMALL representative corpus as proof of structure; the full 50-page training set + 10-page holdout gets generated by the operator via `gbrain calibration build-corpus` (v0.37 follow-up subcommand) or by hand, with the privacy guard catching violations either way.

Privacy contract per D13': every page is SYNTHETIC. None of the names/companies/funds/deals/events refer to anything real. Placeholder names per CLAUDE.md: alice-example, charlie-example, acme-example, widget-co, fund-a/b/c, acme-seed, widget-series-a, meetings/2026-04-03.

test/fixtures/calibration/README.md spells out the privacy contract, generation flow, and what the corpus is (a stable regression set for the extract-takes prompt) vs. is not (real anything).

T20: privacy CI guard (CDX-14 mitigation).

scripts/check-synthetic-corpus-privacy.sh greps the corpus for:
1. Explicit dollar amounts ($50M, $1.2B etc), which would suggest the page memorized a real round size.
2. Out-of-range year references (informational only for v0.36.0.0; deferred to a manual review checklist).
3. Pages that reference ZERO placeholder names, which suggests the page might be referring to real entities. Essay-genre fixtures are exempt (they're anonymized PG-style writing by design).

Wired into `bun run verify` (CI gate) so contributors can't accidentally land a synthetic fixture that leaks real-world specificity. The intent is fail-fast on accidental leakage; the operator can update the allowlist if a generic dollar amount is intentional.
Closes CDX-14: 'CC reads real brain pages locally, writes nothing still risks privacy if any generated synthetic fixture memorizes structure-specific facts. Placeholder names are not enough.' The corpus shipped here is intentionally small but covers the four core gbrain page genres (essay, people, companies, meetings/decisions). The v0.37 corpus-build subcommand will fan out to 50 with the operator spot-checking + the CI guard enforcing the privacy contract. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: R1-R5 IRON RULE regression inventory (T21) Per /plan-eng-review D26 IRON RULE: regressions get added to the test suite as critical requirements, no AskUserQuestion needed. Pins five regressions identified during the v0.36.0.0 wave's coverage diagram: R1: think baseline UNCHANGED when --with-calibration absent. Covered structurally by test/think-with-calibration.test.ts plus assertion-pinned in this file (default user message: question first, then retrieval; system prompt: no anti-bias section). R2: contradictions probe output UNCHANGED when no calibration profile. Covered structurally by test/eval-contradictions-calibration-join.test.ts plus pinned here (null profile → null tag, byte-identical to v0.32.6). R3: takes resolution flow works when grade_takes phase disabled. Pinned import-surface coupling: takes-resolution.ts has zero dependency on grade_takes module. If a future refactor accidentally couples them, this test fails to compile. R4: search/list_pages/get_page work identically through new source_id paths. Marker test referencing existing v0.34.1 source-isolation suite at test/source-isolation-pglite.test.ts. v0.36.0.0 does NOT modify those code paths; the existing tests catch any accidental coupling. R5: existing search modes (conservative/balanced/tokenmax) unaffected. Marker test referencing existing test/search-mode.test.ts. The calibration code DOES NOT IMPORT from src/core/search/mode.ts. 
Plus an inventory test that confirms all 5 regressions have an 'addressed' status — fail-loud if a future contributor removes a guard without updating the inventory. 7 tests total. Pure functions, no engine, hermetic. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: v0.36.0.0 CHANGELOG + CLAUDE.md anchors + calibration convention skill CHANGELOG entry: the user-facing release notes. Leads with the headline ("the brain learns how you tend to be wrong, then argues against your blind spots on every advice call"), 5 'what you can now do' bullets in GStack voice, itemized changes by lane, and the 'To take advantage of v0.36.0.0' upgrade checklist per the CLAUDE.md required-block contract. CLAUDE.md anchors: new 'v0.36.0.0 Hindsight calibration wave (key files cluster)' block inserted before the v0.31.1 thin-client section. 23 new files / extensions annotated with one-paragraph descriptions each, linking back to the convention skill at skills/conventions/calibration.md for the agent-facing rules. skills/conventions/calibration.md: the agent-facing convention skill. Tells future contributors which calibration touchpoint applies to their task — voice gate? BaseCyclePhase? source-scope thread? doctor warning? cross-brain query rules? auto-resolve threshold posture? Test seam patterns. Bug class to avoid (the v0.34.1 source-isolation leak shape). Version trio (per CLAUDE.md mandatory audit): VERSION: 0.36.0.0 package.json: 0.36.0.0 CHANGELOG: ## [0.36.0.0] - 2026-05-17 llms.txt + llms-full.txt regenerated via `bun run build:llms` after the CLAUDE.md edit (per the explicit CLAUDE.md mandate "Any CLAUDE.md edit MUST be followed by `bun run build:llms`"). The `test/build-llms.test.ts` guard runs in CI shard 1; the committed bundles are checked against fresh generator output. bun run verify is clean. typecheck clean. Privacy CI guard passes (0 violations across 6 corpus pages). All ready for /ship. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * cycle: wire propose_takes / grade_takes / calibration_profile into runCycle (T-fix) The three new v0.36.0.0 phases were declared in CyclePhase / ALL_PHASES / NEEDS_LOCK_PHASES but the runCycle orchestrator never dispatched them. ALL_PHASES advertised them, gbrain dream --phase propose_takes accepted them, but `gbrain dream` (default) silently skipped all three. Adds a single dispatch block between consolidate and embed that: - builds an OperationContext on the fly (trusted-workspace caller, remote: false, sourceId resolved via the same helper sync uses) - dispatches the three phases in the order ALL_PHASES declares - records the same skipped-phase shape (no_database) when engine is null Pinned by test/core/cycle.serial.test.ts "default: all 6 phases run in order" which was already failing against ALL_PHASES (the test name lags the actual phase count; left as-is since renaming churns history). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * calibration: expand synthetic corpus + add hand-labeled ground-truth (T19) Adds 8 new synthetic pages modeled on the genre mix observed in the real brain (concepts-with-timeline, meeting-notes, daily-journal, people-pages, essays). Companion .gradeable-claims.json files carry hand-labeled answer keys — what a tuned propose_takes prompt SHOULD extract per page. 
Closes the F1 gate gap from the plan's T19/D19: Training corpus (test/fixtures/calibration/extract-takes-corpus/): + concept-startup-market-dynamics.md (10 claims) + meeting-2026-04-10-fundraise-fund-a.md (6 claims) + daily-2026-04-15.md (5 claims) Blind holdout (test/fixtures/calibration/holdout/): + concept-founder-execution.md (6 claims, F1 >= 0.80) + daily-2026-04-18.md (4 claims, F1 >= 0.80) + meeting-2026-04-17-hiring-charlie.md (5 claims, F1 >= 0.80) + essay-on-conviction.md (7 claims, F1 >= 0.80) + people-bob-example.md (5 claims, F1 >= 0.80) Privacy: - No real-brain content read into any committed artifact. Pages written from scratch using the canonical placeholder set (alice-example, charlie-example, bob-example, acme-example, widget-co, fund-a/b/c). Real-name grep confirms zero leakage: wintermute, garrytan, paul-graham, sam-altman, etc. → 0 hits. - scripts/check-synthetic-corpus-privacy.sh passes: 0 violations across 14 pages (was 6). Genre fidelity: - concept-with-timeline pages mirror the dated-assertion structure real brain uses (verb framing varies: "argues / predicts / I think / I bet / strong conviction / moderate conviction"). - meeting-notes pages carry both prose claims (extracted via hedging language) and explicit ## Takes sections. - daily-journal pages test probabilistic framing ("75/25 in favor", "call it ~0.5") and self-tagged conviction values. - essay-on-conviction is the meta-page that names the author's own bias patterns — primary signal for calibration_profile. - people pages test claim-about-third-party extraction. 
Each JSON ground-truth lists per-claim: - claim_text + kind (prediction|judgment|bet) + domain - conviction (0..1) - since_date - rationale (why this claim is gradeable + how a tuned prompt should infer conviction from the prose) This is the corpus that gates the T19 prompt-tune iteration: - F1 >= 0.85 on training (10+6+5 = 21 claims across 3 pages plus the existing 5 fixtures already shipped) - F1 >= 0.80 on holdout (27 claims across 5 pages) Plan reference: ~/.claude/plans/system-instruction-you-are-working-rippling-knuth.md Privacy gate: scripts/check-synthetic-corpus-privacy.sh (wired into bun run verify). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * calibration: tune propose…
) * fix(sync): accept .tf / .tfvars / .hcl in CODE_EXTENSIONS Terraform repos were invisible to `gbrain sync --strategy code` because the three HCL-family extensions never reached the file walker. Silent data loss — the user thinks the sync covered the repo but the IaC layer was dropped on the floor. detectCodeLanguage() returns null for these extensions, so the chunker falls back to recursive (no tree-sitter grammar for HCL) — the same path toml/yaml take. Closes garrytan#878. Co-Authored-By: johnybradshaw <johnybradshaw@users.noreply.github.com> * fix(upgrade): run `bun update gbrain` from Bun's global install root `gbrain upgrade --strategy bun` was failing on canonical `bun install -g github:garrytan/gbrain` installs because `execSync('bun update gbrain')` ran in the user's shell cwd. Bun's update operates on whatever package.json it finds via cwd-walk, so a user not standing in the global root got "No package.json, so nothing to update". resolveBunGlobalRoot() returns the right directory: 1. `$BUN_INSTALL/install/global` when set (operator override). 2. `~/.bun/install/global` (Bun's documented default). 3. Walk up from realpath(argv[1]) looking for `node_modules/gbrain` — handles non-standard installs without trusting argv naming. execFileSync replaces execSync (no shell), with cwd pinned. Error path prints the exact `cd && bun update` recovery command instead of a vague hint. Closes garrytan#1029. Cherry-picked from PR garrytan#1032. Co-Authored-By: mvanhorn <mvanhorn@users.noreply.github.com> * fix(config): redact sensitive values in `config set` output (closes garrytan#892) `gbrain config set openai_api_key sk-...` was echoing the full key to stderr via `console.log('Set %s = %s', key, value)`. Shell scrollback and tmux scroll buffers commonly retain stderr for hours; a screen-share or shoulder-glance during set leaked the secret. 
The `show` path already redacted values but used a naive `.includes('key')` substring check that would mask 'monkey' or 'parsekey' (no false negatives, but ugly).

Single source of truth: `isSensitiveConfigKey()` uses a word-boundary regex (`/(^|[._-])(key|secret|token|password|pwd|passwd|auth)([._-]|$)/i`) so 'openai_api_key' matches but 'monkey' doesn't. `redactConfigValue()` composes the postgresql:// URL redactor + sensitive-key check, used by both `show` and `set`. Helpers exported for unit tests.

Closes garrytan#892. Cherry-pick of @sharziki's PR garrytan#918 (config.ts hunk only; the extract.ts walker change in that PR is unrelated and tracked in garrytan#202).

Co-Authored-By: sharziki <sharziki@users.noreply.github.com>

* fix(oauth): throw InvalidTokenError so bearerAuth returns 401, not 500

`verifyAccessToken` was throwing bare `Error` on expired or invalid tokens. The MCP SDK's `requireBearerAuth` middleware catches `InvalidTokenError` and returns 401 with WWW-Authenticate; bare Error falls through to 500. Result: legitimate clients with stale tokens hit 500-not-401, so token-refresh logic (which keys off 401) never fires.

Two call sites in verifyAccessToken: token-expired path and invalid-token path. Both now throw InvalidTokenError. Existing tests continue to pass because they assert on the throw, not the message class.

Closes garrytan#935. Cherry-picked from PR garrytan#1012.

Co-Authored-By: Aashiqe10 <Aashiqe10@users.noreply.github.com>

* fix(serve): return 405 on GET /mcp instead of 404

MCP Streamable HTTP spec says GET /mcp opens an optional SSE backchannel for server-initiated messages. gbrain's transport is stateless and doesn't push server-initiated messages, so per spec we MUST return 405 with Allow: POST, DELETE, not 404. Probing clients (claude.ai, etc.) distinguish "endpoint exists, no SSE channel" from "endpoint missing" on this status code; 404 makes them give up.

Cherry-picked from PR garrytan#1076.
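The word-boundary check from the `config set` redaction fix above can be sketched as follows. The regex is the one quoted in the commit message; `redactForDisplay` is a hypothetical stand-in that skips the postgresql:// URL redaction the real `redactConfigValue()` also performs.

```typescript
// Regex as quoted in the commit message: a sensitive word must be delimited
// by start/end of string or one of ".", "_", "-".
const SENSITIVE_KEY_RE =
  /(^|[._-])(key|secret|token|password|pwd|passwd|auth)([._-]|$)/i;

function isSensitiveConfigKey(key: string): boolean {
  return SENSITIVE_KEY_RE.test(key);
}

// Hypothetical helper (not the real redactConfigValue): masks the value
// only when the key name looks sensitive.
function redactForDisplay(key: string, value: string): string {
  return isSensitiveConfigKey(key) ? "***" : value;
}
```

With this shape, `openai_api_key` is masked while `monkey` and `parsekey` pass through, which is the distinction the naive `.includes('key')` check could not make.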
Co-Authored-By: lukejduncan <lukejduncan@users.noreply.github.com> * fix(doctor): resolve whoknows fixture from module location, not cwd `gbrain doctor` warned about a missing whoknows fixture for every install that wasn't standing in the gbrain source repo at run time — which is everyone. The check used `process.cwd()` to locate the fixture, so any real user (running doctor against `~/.gbrain`) saw a spurious warning. `resolveWhoknowsFixturePath()` walks up from `import.meta.url` looking for the source-repo signature (`src/cli.ts` + `skills/RESOLVER.md`), respects `GBRAIN_WHOKNOWS_FIXTURE_PATH` env override (absolute or cwd-relative), and returns null with an actionable warning when the fixture can't be located. Closes garrytan#969. Cherry-picked from PR garrytan#1034. Co-Authored-By: mvanhorn <mvanhorn@users.noreply.github.com> * fix(frontmatter): centralize --fix backups under ~/.gbrain/backups/ `gbrain frontmatter validate --fix` and `gbrain frontmatter generate --fix` wrote `<file>.bak` siblings into the source tree. Users running gbrain over a brain repo found .bak files scattered through people/, companies/, etc. that broke gitignore expectations and showed up in `git status` after every fix pass. Backups now land under `~/.gbrain/backups/frontmatter/<run-id>/<rel>.bak` with an iso-week-sorted run-id so a multi-fix session keeps the same parent directory. Backup directory + per-file structure mirrored from the original file's relative path. The .bak safety contract is intact for both git and non-git brain repos. Also adds `--include-catch-all` opt-in to `frontmatter generate` so the default catch-all rule (`type: note`) is no longer applied to arbitrary workspace documents that happen to live under a brain root. Closes garrytan#902. Cherry-picked from PR garrytan#903. 
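The centralized backup layout can be illustrated with a small path builder. This is a sketch under assumptions: `frontmatterBackupPath` is a hypothetical name, the run-id is passed in rather than generated, and POSIX path semantics are used so the result is the same on every host.

```typescript
import * as path from "node:path";

// Hypothetical helper: maps a file inside the brain repo to its backup
// location under <home>/.gbrain/backups/frontmatter/<run-id>/<rel>.bak.
function frontmatterBackupPath(
  homeDir: string,
  runId: string,
  brainRoot: string,
  filePath: string,
): string {
  const rel = path.posix.relative(brainRoot, filePath);
  // Refuse files outside the brain root so a backup never escapes the run dir.
  if (rel === "" || rel === ".." || rel.startsWith("../")) {
    throw new Error(`file is not inside the brain root: ${filePath}`);
  }
  return path.posix.join(
    homeDir, ".gbrain", "backups", "frontmatter", runId, `${rel}.bak`,
  );
}
```

Mirroring the relative path keeps two files with the same basename (say, one under people/ and one under companies/) from overwriting each other's backups.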
Co-Authored-By: 100yenadmin <100yenadmin@users.noreply.github.com> * fix(config): use path.isAbsolute() for GBRAIN_HOME on Windows The GBRAIN_HOME validator rejected every valid Windows path (`C:\\Users\\...`, `D:\\gbrain`, etc.) because it used `trimmed.startsWith('/')` to check for absoluteness — only POSIX absolute paths pass that. `path.isAbsolute()` is the cross-platform check. Same fix for the `..` traversal check: split on both `/` and `\` so Windows path separators don't sneak `..` through. Closes garrytan#1019. Cherry-picked from PR garrytan#1083. Co-Authored-By: sharziki <sharziki@users.noreply.github.com> * fix(ai): warn only for the configured embedding provider, not all recipes Gateway construction was warning on stderr for every recipe with an embedding touchpoint missing max_batch_tokens — including providers the brain isn't using. Users on Voyage saw noise about OpenAI / Google / DashScope / etc. recipes that never get loaded. Filter the warning to recipes whose provider id is referenced by `embedding_model` or `embedding_multimodal_model` in the active config. The structural protection against forgetting max_batch_tokens stays in place for the recipes that actually run; the noise for unrelated recipes goes away. Cherry-picked from PR garrytan#1117. Co-Authored-By: hnshah <hnshah@users.noreply.github.com> * fix(sync): skip git pull when repo has no origin remote `gbrain sync` ran `git pull` unconditionally and printed scary stderr on every cycle for brains that have no `origin` remote (local-only workflows, single-machine setups, brains initialized via `gbrain init --pglite` against an arbitrary directory). The pull failed harmlessly but the noise was confusing and made operators think sync was broken. `hasOriginRemote()` probes `git remote get-url origin` with stdio ignored; on failure (`no such remote`), skip the pull, print a single informational line, and proceed with the local working tree. Cherry-picked from PR garrytan#1119. 
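For the GBRAIN_HOME fix earlier in this group of commits, the difference between `startsWith('/')` and `path.isAbsolute()` is easy to see in isolation. The sketch below is an assumption about the validator's shape, not the actual code: `isValidGbrainHome` is a hypothetical name, and the absoluteness check is injectable so the Windows behavior can be exercised on any host via `path.win32`.

```typescript
import * as path from "node:path";

// Hypothetical validator: absolute path required, no ".." segment allowed.
function isValidGbrainHome(
  raw: string,
  isAbsolute: (p: string) => boolean = path.isAbsolute,
): boolean {
  const trimmed = raw.trim();
  if (trimmed === "") return false;
  if (!isAbsolute(trimmed)) return false;
  // Split on both separators so "D:\\x\\..\\y" cannot sneak ".." through.
  return !trimmed.split(/[\\\/]/).includes("..");
}

// The pre-fix check, kept only for contrast.
function naiveIsAbsolute(p: string): boolean {
  return p.startsWith("/");
}
```

On a Windows host `path.isAbsolute` resolves to the win32 implementation, so `C:\Users\me\gbrain` is accepted, whereas the naive POSIX-only check rejects it.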
Co-Authored-By: hnshah <hnshah@users.noreply.github.com> * fix(query): drain cache writes before CLI exit The query cache write was fired with `void promise.catch(...)` — true fire-and-forget. On a fast CLI invocation (`gbrain query <q>` exits in ~50ms), the process terminates before the cache write commits. Result: the cache effectively never warms from CLI use; every query is a miss. `awaitPendingSearchCacheWrites()` tracks each in-flight cache write in a module-level Set. The CLI dispatcher awaits the set after `query` finishes formatting output but before the process exits. MCP server path unchanged (long-lived process, fire-and-forget remains correct). Cherry-picked from PR garrytan#1125. Co-Authored-By: hnshah <hnshah@users.noreply.github.com> * fix(backlinks): dedupe (source, target) pairs within a single source page A source page that mentions the same entity N times produced N duplicate "Referenced in" lines on the target. `extractEntityRefs` returns one EntityRef per occurrence, and the per-ref `hasBacklink` check reads a snapshot of `target.content` that's frozen at outer scope — so every iteration sees "no backlink yet" and appends another gap. The cumulative effect on a long meeting note with multiple mentions of the same person was visible in PRs landing 3-5 identical Timeline entries. Track seen target slugs per source page; cap gaps at one pair. Cherry-picked from PR garrytan#967 with a current-master regression test covering both markdown-link and Obsidian-wikilink formats in the same source page. Co-Authored-By: p3ob7o <p3ob7o@users.noreply.github.com> * fix(dream): audit backlinks without mutating pages during cycle The dream/autopilot maintenance cycle ran the backlinks phase in 'fix' mode, which writes "Referenced in" timeline bullets into entity pages every sync. 
The graph extractor + auto-link path is the canonical link store during sync/dream/autopilot — the legacy filesystem fixer wrote markdown that fought with both the user's manual edits and the graph layer's own timeline. Cycle now runs backlinks in 'check' mode (audit-only); the materializer remains available via `gbrain check-backlinks fix` for users who really want markdown backlinks committed to disk. Cherry-picked from PR garrytan#1027. Co-Authored-By: sliday <sliday@users.noreply.github.com> * fix(autopilot --install): source ~/.zshenv before zshrc/bashrc zshenv is the canonical place for env vars in zsh on macOS — zshrc is sourced only for interactive shells, so vars exported in zshrc don't reach a non-interactive subprocess like the autopilot wrapper. Users who exported GBRAIN_DATABASE_URL, OPENAI_API_KEY, or ANTHROPIC_API_KEY in zshrc and assumed autopilot would inherit them hit silent missing- secret failures on the LaunchAgent. Source ~/.zshenv first (always reaches non-interactive shells per zsh docs), then fall back to ~/.zshrc / ~/.bashrc for users on other profile conventions. Cherry-picked from PR garrytan#966. Co-Authored-By: p3ob7o <p3ob7o@users.noreply.github.com> * fix(apply-migrations): return exit 0 on list/dry-run/up-to-date `gbrain apply-migrations list`, `gbrain apply-migrations --dry-run`, and the "All migrations up to date" path were returning from the async function but never calling `process.exit(0)`. The CLI dispatcher in cli.ts treated the implicit fall-through as exit 1 when the parent process inspected status via shell scripts, breaking automation that gates on `apply-migrations list && do-something`. Three call sites: list, dry-run, and the no-op path. All three now exit(0) explicitly. Cherry-picked from PR garrytan#1062. 
Co-Authored-By: nezovskii <nezovskii@users.noreply.github.com> * fix(sync): scope auto-embed to source on incremental syncs `gbrain sync --source-id X` triggered auto-embed for the affected slugs but `runEmbed` ran with no `--source` flag, so it fell back to the default source. For non-default-source syncs the page row lives at (sourceId, slug) — the embed code saw "Page not found" for the right slug under the wrong source, swallowed the error as best-effort, and the sync result reported `embedded: 0` for the wrong reason. `buildAutoEmbedArgs(slugs, sourceId)` is the new helper: when sourceId is set, prepends `--source X`. Exported for the regression test. Pairs with the upcoming source-id write-path audit (P1 #8). Cherry-picked from PR garrytan#1120. Co-Authored-By: hnshah <hnshah@users.noreply.github.com> * fix(query): honor source_id with no-expand for cross-source search Two related corrections: 1. `gbrain query --no-expand` parsed `--no-expand` as the literal key `no_expand` instead of negating the boolean `expand` param. Result: the flag was silently ignored and expansion always ran. Now any `--no-<key>` where `<key>` is a boolean param flips it false. 2. The `query` op's source-id resolution treated `ctx.sourceId` as authoritative, so an explicit per-call `source_id` was overridden by the federated read scope. Now per-call `source_id` wins; `source_id=__all__` is an explicit opt-out for local cross-source search. Cherry-picked from PR garrytan#1124. Co-Authored-By: hnshah <hnshah@users.noreply.github.com> * fix(doctor): child-table orphan detection (closes garrytan#1063) The autopilot orphans phase detects orphan PAGES (no inbound links via page-graph) but never scans FK-child tables. After a bulk delete or a pre-FK-migration code path, orphan rows can persist indefinitely in content_chunks, page_versions, tags, takes, raw_data, timeline_entries, or links — all declared ON DELETE CASCADE, so any orphan row is unexpected. 
`childTableOrphansCheck` enumerates 10 FK columns across 8 tables: - 8 NOT NULL columns (cascade): any value not in pages.id is an orphan. - 2 nullable SET NULL columns (links.origin_page_id, files.page_id): NULL is valid; only NOT-NULL-but-missing-in-pages counts. Surfaces paste-ready cleanup SQL when orphans are found. Cherry-picked from PR garrytan#1064. Co-Authored-By: vincedk-alt <vincedk-alt@users.noreply.github.com> * fix(autopilot,cycle): stop respawn-storm from steady-state 'partial' cycles Two compounding bugs under KeepAlive=true: 1. Autopilot tripped its circuit breaker on cycle.status === 'partial', not just 'failed'. 'partial' means at least one phase warned/failed while others ran — a soft signal, not fatal. On every cycle that warned, autopilot logged a failure and the supervisor respawned the worker. 2. The orphans phase emitted 'warn' when `count > 20` orphan pages. That threshold was tuned for small dev brains; on any corpus past a few hundred pages it fires every cycle in steady state. Together with bug 1, this produced visible respawn storms. Fix: - Autopilot trips only on cycle.status === 'failed'. - Orphans phase warns by ratio: orphans / total_pages > 0.5 (the real "your graph fell apart" signal), not by absolute count. Cherry-picked from PR garrytan#1113. Co-Authored-By: sergeclaesen <sergeclaesen@users.noreply.github.com> * fix(ai): reject partial embedding responses before indexing `embedSubBatch` only validated the FIRST embedding's dimension and never asserted the response length matched the input length. If a provider returned fewer embeddings than requested (rate-limit truncation, malformed response, etc.), the gateway silently indexed an offset-shifted result — every page after the missing index got the embedding of a different page's chunk. Two new guards: 1. `result.embeddings.length === texts.length` — fail loud if any count mismatch, with a paste-ready retry hint. 2. Validate dim on EVERY embedding, not just the first. 
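The two embedding guards listed above can be sketched roughly as follows. This is a minimal illustration, not the gateway's actual code: `EmbedResult`, `assertCompleteEmbeddings`, and the error wording are hypothetical names chosen for the example; only the two checks themselves (count equals input length, dimension validated on every vector) come from the commit message.

```typescript
// Hypothetical result shape for illustration; the real gateway type is not shown in the commit.
interface EmbedResult {
  embeddings: number[][];
}

// Returns the embeddings unchanged when they are complete, otherwise throws.
// Guard 1: the provider must return exactly one vector per input text.
// Guard 2: every vector (not just the first) must have the expected dimension.
function assertCompleteEmbeddings(
  texts: string[],
  result: EmbedResult,
  expectedDims: number,
): number[][] {
  if (result.embeddings.length !== texts.length) {
    throw new Error(
      `embedding count mismatch: sent ${texts.length} texts, ` +
        `got ${result.embeddings.length} embeddings; retry the batch`,
    );
  }
  result.embeddings.forEach((vector, index) => {
    if (vector.length !== expectedDims) {
      throw new Error(
        `embedding ${index} has ${vector.length} dims, expected ${expectedDims}`,
      );
    }
  });
  return result.embeddings;
}
```

Failing before indexing matters here because a short response would otherwise pair each later text with a neighbouring text's vector, which is the offset-shift the commit describes.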
Cherry-picked from PR garrytan#926. Co-Authored-By: 100yenadmin <100yenadmin@users.noreply.github.com> * fix(serve): admin register-client supports auth_code + PKCE public clients The admin dashboard's /admin/api/register-client endpoint hardcoded client_credentials and ignored grantTypes, redirectUris, and tokenEndpointAuthMethod. Result: you couldn't register a browser-based PKCE client (claude.ai Custom Connector, Cursor, etc.) through the dashboard — only confidential machine-to-machine clients worked. Pass grantTypes / redirectUris through to registerClientManual. When tokenEndpointAuthMethod === 'none', NULL out client_secret_hash so the SDK's clientAuth middleware skips the hash-vs-plaintext compare that would otherwise reject the no-secret PKCE flow. Cherry-picked from PR garrytan#1077. Co-Authored-By: lukejduncan <lukejduncan@users.noreply.github.com> * fix(extract-facts): treat slugs:[] as no-op, not unscoped full-walk `runExtractFacts` checked `opts.slugs && opts.slugs.length > 0` to decide between scoped and full-brain walk. Both `undefined` (caller omits → full walk intended) AND `[]` (sync no-op → zero work intended) fall through to the same `else` branch and triggered `engine.getAllSlugs()`. On a multi-thousand-page brain, the unintended full walk exceeded the autopilot-cycle ~600s timeout and dead-lettered the job — visible in production as `[cycle.extract_facts] start` followed by silence until `Autopilot stopping (cycle-failure-cap)`. Use presence (`opts.slugs !== undefined`), not truthiness, to distinguish the two modes. Empty array is a real incremental no-op. Closes garrytan#1096. Three regression cases in test/extract-facts-phase.test.ts: slugs=[] no-op, slugs=undefined still walks, slugs=['a'] walks just one. 
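The presence-versus-truthiness distinction in the extract-facts fix can be shown in a few lines. This is a sketch under assumptions: `WalkMode` and `chooseWalkMode` are hypothetical names, and the real `runExtractFacts` does more than pick a mode.

```typescript
// Hypothetical mode type for illustration.
type WalkMode =
  | { kind: "scoped"; slugs: string[] }
  | { kind: "full" };

// Presence check: an explicit empty array means "scoped to nothing",
// while an omitted field means "walk the whole brain".
function chooseWalkMode(opts: { slugs?: string[] }): WalkMode {
  if (opts.slugs !== undefined) {
    return { kind: "scoped", slugs: opts.slugs };
  }
  return { kind: "full" };
}
```

With the earlier `opts.slugs && opts.slugs.length > 0` test, `{ slugs: [] }` and `{}` both fell into the full-walk branch; here only the second one does.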
Co-Authored-By: navin-moorthy <navin-moorthy@users.noreply.github.com> * fix(serve): embed admin/dist into binary; serve from manifest (closes garrytan#1090) Pre-fix, /admin returned 404 on every globally-installed binary because serve-http.ts:780 resolved admin/dist via process.cwd(). The admin SPA files are checked into git but `bun build --compile` does NOT embed arbitrary directories — only assets imported via `with { type: 'file' }` ESM imports land in the compiled binary. Wire: - scripts/build-admin-embedded.ts walks admin/dist/, emits src/admin-embedded.ts with one `with { type: 'file' }` import per file + a manifest map (request path → resolved path + mime). Auto-invoked by `bun run build:admin`. - src/admin-embedded.ts is the auto-generated module. Bun resolves every file: import to a path that works at runtime inside the compiled binary (same pattern as src/core/chunkers/code.ts WASM imports). - serve-http.ts switches to two-tier resolution: cwd-relative admin/dist for dev (Vite hot-rebuild), embedded manifest otherwise. Embedded path reads bytes lazily and caches per-asset for the lifetime of the process. - scripts/check-admin-embedded.sh CI gate re-runs the generator and fails on drift (mirrors check-wasm-embedded.sh). PRs that rebuild admin/dist but forget to regenerate the embedded module fail loud. - package.json wires build:admin-embedded + check:admin-embedded. Closes garrytan#1090. * test(source-id): lock in routing regression coverage (closes garrytan#891 garrytan#978 garrytan#1078) Audit of every page write path (sync, embed, extract, dream, autopilot, wikilinks, tags, chunks) confirmed that sourceId already threads correctly through importFromContent → engine.putPage → SQL INSERT since v0.18.0. The original bug reports from garrytan#891, garrytan#978, garrytan#1078 were real at the time and got swept by the multi-source refactor; today's master is correct. 
This commit locks in that correctness with six PGLite regression cases (no Postgres fixture needed; runs in CI everywhere): 1. importFromContent({sourceId:"work"}) lands at source_id=work, not the silent 'default' fallback. 2. Two sources hold the same slug independently. 3. Omitting sourceId falls through to 'default' (legacy contract). 4. Chunks land under the requested source. 5. Tags land under the requested source. 6. FK integrity smoke (originally garrytan#1078). The earlier issue reports stay closed by the existing threading; this suite ensures any future refactor of the write path can't silently re-introduce the wrong-source-default bug. The 90-minute write-path audit budget from the plan resolves here. * fix(apply-migrations): unblock PGLite chain (closes garrytan#1100) `gbrain apply-migrations --yes` was wedging on the v0.11.0 (Minions) schema phase for PGLite installs. Two compounding bugs: 1. `apply-migrations` pre-flight schema-version warning connects to PGLite to read config.version, then disconnects. The brief lock hold races with downstream subprocess spawns that try to re-acquire it; the 30s lock timeout fires before the parent fully releases. Pre-flight is a *warning*; on PGLite it adds no information the orchestrators don't already handle. Skip the probe for PGLite. 2. v0.11.0 phase A spawned `gbrain init --migrate-only` as an execSync subprocess to apply schema migrations. PGLite is single-writer; the subprocess inherits HOME and tries to lock the same DB. On Postgres this works (concurrent connections OK); on PGLite it deadlocks. Route in-process for PGLite — create + connect + initSchema + disconnect directly, skipping the subprocess hop. Postgres keeps the legacy execSync path. Verified: fresh PGLite install now walks the full migration chain through v0.32.2 (Facts SoR) and lands "All migrations up to date" on re-run. Closes garrytan#1100. 
* fix(serve): bootstrap token env override + suppress flag (closes garrytan#1024) `gbrain serve --http` regenerated the admin bootstrap token on every restart and printed it to stderr. In supervisor-managed production deployments (LaunchAgent, systemd, k8s) every restart leaks the value into log aggregators and rotates the access for any agent that paste-copied it. Two new knobs: - **GBRAIN_ADMIN_BOOTSTRAP_TOKEN** env var: when set, used as the bootstrap secret instead of a fresh per-process token. Validated: must match `^[A-Za-z0-9_-]{32,}$` (32-char minimum), else refuse to start with a paste-ready generator hint. Failing closed beats silently accepting a weak token. - **--suppress-bootstrap-token** CLI flag: suppresses the printed token line entirely. Operator takes responsibility for tracking the value out-of-band. Startup banner now reflects the chosen source: - `Admin Token: suppressed` when the flag is set. - `Admin Token: from $GBRAIN_ADMIN_BOOTSTRAP_TOKEN` when env-sourced. - Full token print only when both are absent (default behavior, dev installs). Closes garrytan#1024. Co-Authored-By: billy-armstrong <billy-armstrong@users.noreply.github.com> * fix(config): migrate legacy 'provider' + 'model' to 'embedding_model' Pre-v0.32 docs and some community templates used a config shape: { "provider": "voyage", "model": "voyage-4-large" } The canonical shape (since the v0.31.12 gateway seam) is: { "embedding_model": "voyage:voyage-4-large" } Users on the legacy shape hit silent fallthrough to the hardcoded OpenAI default; sync + embed errored out with "OpenAI embedding requires OPENAI_API_KEY" regardless of their actual provider config. loadConfig() now translates the legacy keys at parse time: - emits a one-line stderr nudge with the paste-ready canonical key - preserves the rest of the config unchanged - skipped when `embedding_model` is already set (forward-compat) Closes garrytan#1086. 
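The legacy-key translation described above might look something like the sketch below. It is illustrative only: `RawConfig`, `migrateLegacyEmbeddingConfig`, the injected `warn` callback, and the nudge wording are assumptions, and whether the real `loadConfig()` deletes the legacy keys afterwards is not stated in the commit, so this version leaves them in place.

```typescript
// Hypothetical config shape for illustration.
interface RawConfig {
  provider?: string;
  model?: string;
  embedding_model?: string;
  [key: string]: unknown;
}

// Translates { provider, model } into the canonical "provider:model" key.
// The canonical key wins when it is already present.
function migrateLegacyEmbeddingConfig(
  raw: RawConfig,
  warn: (message: string) => void = () => {},
): RawConfig {
  if (raw.embedding_model !== undefined) {
    return raw; // already canonical: skip the migration
  }
  if (typeof raw.provider !== "string" || typeof raw.model !== "string") {
    return raw; // nothing to translate
  }
  const embeddingModel = `${raw.provider}:${raw.model}`;
  warn(`legacy provider/model config detected; set "embedding_model": "${embeddingModel}"`);
  // Assumption: legacy keys are kept so the rest of the config stays unchanged.
  return { ...raw, embedding_model: embeddingModel };
}
```

Passing the warning sink as a parameter keeps the helper free of direct stderr writes, which makes the three cases named in the later test commit (pure legacy, both present, canonical only) easy to assert.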
Co-Authored-By: jeunessima <jeunessima@users.noreply.github.com> * chore(test): quarantine upgrade tests (process.env mutation) PR garrytan#1032's cherry-picked tests use the static-snapshot + try/finally pattern for env vars instead of the project's withEnv() helper. The test-isolation lint catches process.env mutations outside withEnv to prevent cross-test leakage in parallel runs. Renaming to *.serial.test.ts (the quarantine convention) is the documented out: runs sequentially, no cross-file race. A future cleanup PR can migrate the tests to withEnv() and drop the quarantine. * fix(test): update brain-writer .bak assertion for centralized backup path The v0.36.x frontmatter backup change (bd60cdf — closes garrytan#902) moved .bak files from sibling-of-source to ~/.gbrain/backups/frontmatter/... The old test still asserted on the sibling path, so CI failed even though the production behavior was correct. Updated assertion contract: backup lands under the injected backupRoot (test-isolated), the returned backupPath ends in .bak and exists, and no sibling .bak is created next to the source file. The pre-fix sibling-path is now a negative assertion. * chore: bump version and changelog (v0.36.1.0) v0.36.1.0 — community fix wave (28 atomic fixes + 22 PRs closed as already-shipped + 14 issues triaged). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * test(fix-wave): close test gaps surfaced by post-ship audit After the fix-wave shipped, an audit found 11 commits with no new test file. Some were inherently structural (build pipelines, shell content) or had existing test coverage that worked either way; others had real regression risk with no guard. This commit closes the gaps that matter. New regression tests for: - OAuth `verifyAccessToken` throws `InvalidTokenError` (not bare Error) on both expired and unknown token paths. 
Pre-fix, the SDK's `requireBearerAuth` middleware fell through to 500 instead of 401 → client token-refresh logic never fired (garrytan#935). - `loadConfig` translates legacy `{provider, model}` config shape to the canonical `embedding_model: <provider>:<model>`. 3 cases: pure legacy → migrated; canonical wins over legacy when both present; canonical-only is untouched. Pre-fix, Voyage/Cohere/Mistral users silently fell through to OpenAI (garrytan#1086). - `configDir` rejects relative paths; rejects `..` segments via both separators (regression guard for the Windows path acceptance fix garrytan#1019 / cherry-pick garrytan#1083). - `resolveBootstrapToken` (new exported helper extracted from `runServeHttp`). 9 cases: unset env generates fresh, valid env accepted, hyphens/underscores accepted, < 32 chars rejected, special chars rejected, whitespace trimmed, empty string rejected, 32-char boundary accepted, 31-char one-short rejected. Security-critical validation surface (garrytan#1024). - GET /mcp returns 405 with `Allow: POST, DELETE` (E2E case in `serve-http-oauth.test.ts`). Pre-fix, claude.ai and other probing MCP clients saw 404 and gave up (garrytan#1076). - apply-migrations `process.exit(0)` on list / dry-run / up-to-date paths. Source-shape assertion locks the rule in; shell scripts gating on `$?` work (garrytan#1062). - Autopilot wrapper sources `~/.zshenv` BEFORE `~/.zshrc`. zshenv is the canonical place for env vars in non-interactive zsh; without this ordering, LaunchAgent subprocesses never inherit secrets exported in zshrc (garrytan#966). - `test/fix-wave-structural.test.ts` consolidates source-shape regression guards for fixes whose behavior is hard to runtime-test without heavy mocking: query cache drain (garrytan#1125), admin embed manifest + handler (garrytan#1090), admin register-client PKCE branch (garrytan#1077), PGLite v0.11.0 phase A in-process routing (garrytan#1100), query `--no-expand` negation (garrytan#1124). 9 source-grep assertions. 
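The `resolveBootstrapToken` cases listed above suggest a pure function along these lines. This is a hedged sketch, not the exported helper: the function name suffix, the `source` field, the injected `generate` callback, and the error text are assumptions, as is the reading that whitespace is trimmed before validation. The pattern `^[A-Za-z0-9_-]{32,}$` is the one quoted in the bootstrap-token commit.

```typescript
// Hypothetical result type; the commit only says the helper returns a
// tagged union with kind 'ok' | 'error'.
type BootstrapTokenResult =
  | { kind: "ok"; token: string; source: "env" | "generated" }
  | { kind: "error"; message: string };

const TOKEN_PATTERN = /^[A-Za-z0-9_-]{32,}$/;

// Pure: no process.exit or console output, so it can be unit-tested directly.
// `generate` is injected so the sketch does not depend on a crypto module.
function resolveBootstrapTokenSketch(
  envValue: string | undefined,
  generate: () => string,
): BootstrapTokenResult {
  if (envValue === undefined) {
    return { kind: "ok", token: generate(), source: "generated" };
  }
  const trimmed = envValue.trim(); // assumption: trim first, then validate
  if (!TOKEN_PATTERN.test(trimmed)) {
    return {
      kind: "error",
      message: "GBRAIN_ADMIN_BOOTSTRAP_TOKEN must match ^[A-Za-z0-9_-]{32,}$",
    };
  }
  return { kind: "ok", token: trimmed, source: "env" };
}
```

The caller can then switch on `kind` and decide whether to print, suppress, or exit, which keeps the side effects out of the validation logic.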
Refactored `runServeHttp` to extract `resolveBootstrapToken` as a pure helper. The boot path now consumes the helper's tagged-union result ({kind:'ok'|'error'}); side effects (`process.exit`, `console.error`) moved to the caller. Unit-testable without spinning up Express. Test counts: oauth 71 (was 69), config 20 (was 14), apply-migrations 19 (was 18), autopilot-install 5 (was 4), serve-http-bootstrap-token 9 (new file), fix-wave-structural 9 (new file). Net: +28 cases across 6 files; +1 new exported function with full coverage. Remaining audit gaps (deferred): - e82dda0 admin embed E2E (post-deploy curl smoke covers this) - d93fa81 apply-migrations PGLite chain E2E (already smoke-tested manually in the original commit; subprocess test would be flaky in CI without DATABASE_URL gating) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * test: close the two deferred E2E gaps from the post-ship audit Both gaps now have real behavior coverage. No DATABASE_URL needed (PGLite engine), so they run in standard unit CI alongside the rest of the suite. Serial quarantine because both spawn subprocesses + bind ports / write tmpdirs. test/admin-embed-spawn.serial.test.ts (4 cases, ~6s wall-clock): - Spawns `gbrain serve --http` from a fresh tmpdir so `process.cwd()/admin/dist` does not exist; this forces the embedded-manifest branch (the one under test). Pre-fix, this exact setup hit 404. - GET /admin/ → 200 + SPA shell HTML (title + #root div), content-type text/html. - GET /admin/index.html → same body via explicit path. - GET /admin/agents → SPA fallback returns index.html for deep links. - GET /admin/api/stats → NOT 200 (regression guard: SPA fallback must not swallow /admin/api/* routes and silently return HTML to a JSON client). Closes garrytan#1090. test/apply-migrations-pglite-spawn.serial.test.ts (3 cases, ~25s): - Seeds a fresh PGLite config in a tmpdir, runs `gbrain init --migrate-only` + `gbrain apply-migrations --yes --non-interactive`. 
Pre-fix this hit "GBrain: Timed out waiting for PGLite lock" because apply-migrations' pre-flight probe + v0.11.0's phase A subprocess both wanted the single-writer lock. - Asserts exit 0, no "Timed out" string, no "Phase A failed" string, brain.pglite file written. - Re-run case: idempotent — "All migrations up to date" exits 0 (also locks in the garrytan#1062 exit-code fix end-to-end). - --list path exits 0 (third leg of the garrytan#1062 contract). Closes garrytan#1100. Pinned bootstrap token via GBRAIN_ADMIN_BOOTSTRAP_TOKEN env so the admin test doesn't have to scrape stderr; the startup banner format is allowed to drift, the /health probe is the readiness contract. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(test): consolidate PGLite spawn test to one end-to-end pass CI failed on test/apply-migrations-pglite-spawn.serial.test.ts (Ubuntu, bun 1.3.14). The previous shape ran 3 tests × ~3 spawns each. Each `bun run /abs/src/cli.ts` from a tmpdir cwd pays a full parse/transpile cost (no near-cwd .bun cache); on Ubuntu CI that compounds past the runner's per-test budget. Consolidated to ONE test that exercises the full lifecycle in one brain: init --migrate-only → apply-migrations --yes → re-run → --list. Four spawns instead of eight. Local wall-clock: 32s → 11.5s. All four assertion buckets preserved: no PGLite lock timeout, no Phase A failure, brain.pglite written, idempotent re-run "All migrations up to date" exits 0 (garrytan#1062 end-to-end), --list exits 0. Per-test timeout 480_000ms as insurance against the runner's --timeout=60000 default (bun's API spec: per-test wins). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * test(diag): dump apply-migrations output when CI exit != 0 The PGLite spawn test passes locally on macOS/bun 1.3.13 in ~11s end-to-end but fails on Ubuntu/bun 1.3.14 in 4.92s with apply.exitCode = 1 — fast enough that something is failing early, not timing out. 
The runCli helper captured stdout+stderr but never printed them, so the CI log only showed the bare assertion failure. This commit prints the captured streams from BOTH init and apply when the exit code mismatches expectation. After the next CI run we can read the actual error message and diagnose the Ubuntu-specific failure mode (likely BUN_INSTALL / HOME / PGLite WASM env quirk). No behavior change; pure diagnostic output gate on failure. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(test): shim `gbrain` on PATH for PGLite spawn test Root cause of the Ubuntu CI failure: the v0.11.0 orchestrator's phase B runs `execSync('gbrain jobs smoke')`. PGLite phase A now routes in-process (the garrytan#1100 fix), but phase B and several follow-up phases still shell out to the `gbrain` binary on PATH. Locally the binary resolves via `bun link`; on CI Ubuntu it does not exist on PATH, so execSync exits 127 → orchestrator returns 'failed' → apply-migrations exits 1. Test failed at 4.92s with exitCode=1, well before any timeout. Verified locally by removing ~/.bun/bin/gbrain to simulate CI: pre-shim: apply.exitCode=1 (same as CI) post-shim: apply.exitCode=0 in 8.4s The shim writes a tiny `gbrain` executable to a tmpdir that just `exec`s `bun run <repo>/src/cli.ts "$@"`. Prepended to PATH for the spawned subprocesses. Mirrors the production contract (gbrain on PATH) without depending on `bun link` having run in the CI image. Diagnostic dump from the previous commit stays — useful insurance for the next time something silently fails inside a spawned binary. 
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: johnybradshaw <johnybradshaw@users.noreply.github.com> Co-authored-by: mvanhorn <mvanhorn@users.noreply.github.com> Co-authored-by: sharziki <sharziki@users.noreply.github.com> Co-authored-by: Aashiqe10 <Aashiqe10@users.noreply.github.com> Co-authored-by: lukejduncan <lukejduncan@users.noreply.github.com> Co-authored-by: 100yenadmin <100yenadmin@users.noreply.github.com> Co-authored-by: hnshah <hnshah@users.noreply.github.com> Co-authored-by: p3ob7o <p3ob7o@users.noreply.github.com> Co-authored-by: sliday <sliday@users.noreply.github.com> Co-authored-by: nezovskii <nezovskii@users.noreply.github.com> Co-authored-by: vincedk-alt <vincedk-alt@users.noreply.github.com> Co-authored-by: sergeclaesen <sergeclaesen@users.noreply.github.com> Co-authored-by: navin-moorthy <navin-moorthy@users.noreply.github.com> Co-authored-by: billy-armstrong <billy-armstrong@users.noreply.github.com> Co-authored-by: jeunessima <jeunessima@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…arrytan#1136) * feat(dims): OpenAI text-embedding-3 Matryoshka range validation (D13) dimsProviderOptions now fail-loud at the embed boundary when the configured embedding_dimensions is outside the model's native range (1..1536 for -small, 1..3072 for -large). Paste-ready fix hint in the AIConfigError.fix field. Closes the silent-HTTP-400 path that would have bit OpenAI-fallback users on v0.36.0.0 ZE-default installs. 16 new test cases in test/ai/dims-openai.test.ts pinning the contract across native-openai and openai-compatible adapter paths. * feat(ai): flip defaults to ZeroEntropy zembed-1 1280d + zerank-2 reranker Default embedding model is now zeroentropyai:zembed-1 at 1280d via Matryoshka. Real-corpus benchmark: 2.2x faster than OpenAI, 2.6x cheaper at regular pricing, wins 11/20 head-to-head queries. 1280 is the closest valid ZE Matryoshka step to the prior OpenAI 1536d default (valid set: 2560/1280/640/320/160/80/40). 1024 (Voyage's step) is NOT on ZE's list, pinned by AIConfigError fail-loud in dims.ts. balanced mode bundle now defaults reranker_enabled=true. zerank-2 reshuffles 60% of top-1 results in benchmarks. Missing-key fail-open contract in src/core/search/rerank.ts handles unauthenticated cases. Opt out with: gbrain config set search.reranker.enabled false Existing tests updated (gateway.test.ts, search-mode.test.ts) and a new test/balanced-reranker-default.test.ts (10 cases) pins the fail-open invariants. * feat(retrieval-upgrade): RetrievalUpgradePlanner + interactive prompt UX New src/core/retrieval-upgrade-planner.ts is the consolidated planner that computes the brain's pending retrieval-upgrade work (chunker bumps + ZE switch) in one pass and applies the schema transition + config updates atomically. Tagged-union ApplyResult enum (D15): 'applied' | 'skipped_already_applied' | 'skipped_no_work' | 'declined' | 'planned' | 'failed'. No string-parsing reasons. 
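One way to read the "no string-parsing reasons" point is that callers branch on a closed set of literals and the compiler checks exhaustiveness. The sketch below assumes the six values form a string-literal union; `ApplyStatus`, `needsOperatorAttention`, and the grouping of outcomes are hypothetical and may not match the planner's actual types or semantics.

```typescript
// The six literals come from the commit message; the type name is hypothetical.
type ApplyStatus =
  | "applied"
  | "skipped_already_applied"
  | "skipped_no_work"
  | "declined"
  | "planned"
  | "failed";

// Example consumer: decides whether an outcome needs a human to look at it.
// The grouping below is an illustrative assumption, not the planner's policy.
function needsOperatorAttention(status: ApplyStatus): boolean {
  switch (status) {
    case "applied":
    case "skipped_already_applied":
    case "skipped_no_work":
    case "planned":
      return false;
    case "declined":
    case "failed":
      return true;
    default: {
      // Exhaustiveness guard: adding a seventh literal makes this line a type error.
      const unreachable: never = status;
      throw new Error(`unhandled status: ${unreachable}`);
    }
  }
}
```

Compared with matching substrings of a free-form reason, the `never` assignment turns a forgotten branch into a compile-time failure.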
Three config keys (D12): ze_switch_prompt_shown (UI state), ze_switch_requested (user intent), ze_switch_applied (work done). Plus ze_switch_previous_snapshot (JSON, full prior config for --undo per D16) and ze_switch_declined_at (90-day re-ask window). Schema transition (D18) is atomic: DROP indexes + ALTER COLUMN + CREATE INDEX inside a single engine.transaction(). HNSW recreation is part of the same transaction — no silent slow-search window. C3 eligibility logic: ze_switch_offered iff NOT on ZE + NOT declined recently + NOT applied + (legacy default OR >100 pages). C4 cost math: MAX(chunker_pending, dim_pending) not SUM — one re-embed pass invalidates both surfaces simultaneously. New src/core/retrieval-upgrade-prompt.ts wires the planner to a TTY-only interactive prompt with two-line cost split (D10) and privacy callout for the reranker flip. Tests: test/retrieval-upgrade-planner.test.ts (24 cases) pins the state machine. test/asymmetric-encoding-contract.test.ts (6 cases) pins D17: search read path uses gateway.embedQuery() not embed(), asserted via __setEmbedTransportForTests mock. * feat(cli): gbrain ze-switch — manual lever for the ZE switch New gbrain ze-switch CLI with --dry-run, --json, --resume, --force, --undo, --non-interactive, --confirm-reembed, --ignore-missing-key flags. Mirrors the upgrade prompt's UX symmetry: --undo presents a cost-warning before re-embedding back to the prior width. src/cli.ts: dispatch case + CLI_ONLY entry. ze-switch owns its own engine lifecycle (mirrors the doctor pattern). test/ze-switch-cli.test.ts (11 cases): --help, --dry-run, --json, --non-interactive, --ignore-missing-key, --resume, --undo, --confirm-reembed. Uses captureExit harness to test process.exit() paths without breaking the test process. * feat(doctor): ze_embedding_health + embedding_width_consistency checks Two new doctor checks (D-A5): ze_embedding_health: when embedding_model starts with zeroentropyai:, verify ZEROENTROPY_API_KEY is set (env or config). 
Paste-ready setup hint with the signup URL on failure. embedding_width_consistency: cross-check that the configured embedding_dimensions matches the actual vector(N) column width on content_chunks.embedding. Catches the half-applied switch state (schema migrated but config write crashed) with a paste-ready gbrain ze-switch --resume hint. Wired into runDoctor between reranker_health and the existing sync_freshness checks. Both checks gracefully no-op on non-ZE embedding configs. test/doctor-ze-checks.test.ts (8 cases) pins both checks across happy + missing-key + missing-config + drift paths. Uses withEnv() helper to clear ZEROENTROPY_API_KEY for the no-key path so tests are hermetic against contributor env state. test/e2e/v0_28_5-fix-wave.test.ts + test/openai-compat-multimodal.test.ts: updated to explicit-configure the gateway when the test depends on specific dims that diverge from the v0.36.0.0 default (1280d). * docs: README zero-based rewrite (884 -> 139 lines) + new docs files Strip 4 months of accreted "New in v0.X.Y" hero blocks and reorganize around what gbrain does today. 33 H2s -> 8. The Commands section (136 lines duplicating gbrain --help) moved out; the 6-table skills enumeration collapsed to a one-paragraph capability description with a link to skills/RESOLVER.md. Hero retains load-bearing facts: OpenClaw + Hermes credit, production numbers (17,888 pages / 4,383 people / 723 companies), BrainBench numbers (P@5 49.1% / R@5 97.9% / +31.4 lift), ZE comparison numbers, 30-min install claim. Adds one paragraph announcing the v0.36.0.0 ZE default with the explicit gbrain config set escape for OpenAI/Voyage users. New files: - docs/INSTALL.md: every install path consolidated (agent platform, CLI standalone, MCP server). Thin-client mode covered. - docs/architecture/RETRIEVAL.md: why the hybrid + graph stack works. BrainBench numbers, why each strategy alone fails, the source-aware ranking + intent classification + multi-query expansion story. 
- docs/ethos/ORIGIN.md: origin story lifted from the old README so the front door stays factual + concrete. test/readme-hero-anchors.test.ts (5 cases) is the D9 regression guard. Five load-bearing strings: OpenClaw, Hermes, ZE, production-numbers regex, P@5/R@5. Light anchors that let voice/structure evolve but block accidental loss of headline facts. scripts/check-test-real-names.sh: allowlist entries for OpenClaw + Hermes literals in the anchor test (it explicitly asserts those strings appear in README). * chore: bump version and changelog (v0.36.0.0) ZeroEntropy as the new default for embedding (zembed-1 at 1280d via Matryoshka) and reranker (zerank-2 cross-encoder, on by default in balanced mode bundle). README zero-based rewrite (884 -> 139 lines). 3 new docs files. Two new doctor checks. New gbrain ze-switch CLI with --undo for symmetric reversibility. skills/migrations/v0.36.0.0.md tells the agent how to surface the retrieval-upgrade prompt post-upgrade. llms-full.txt regenerated via bun run build:llms. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(docs): scrub Wintermute from RETRIEVAL.md per privacy rule * chore: rebump version 0.36.0.0 → 0.36.2.0 (queue collision) Three open PRs were claiming v0.36.0.0 (garrytan#1130 skillpack, garrytan#1139 hindsight, garrytan#1136 this PR). Ship-aware queue allocator says this branch lands at v0.36.2.0. Trio audit: VERSION 0.36.2.0 package.json 0.36.2.0 CHANGELOG ## [0.36.2.0] - 2026-05-17 Updates: VERSION, package.json, CHANGELOG header + body refs, README "New default in v0.36.2.0" announcement + credit line, skills/migrations/v0.36.0.0.md renamed to v0.36.2.0.md with frontmatter + body refs updated. llms-full.txt regenerated. * fix(test): pin gateway dim=1536 in cross-file-stateful PGLite tests CI shard 1 reported 10 failures across `query-cache.test.ts` (6) and `consolidate-valid-until.test.ts` (4). 
Both files hardcode 1536-dim vectors but rely on `PGLiteEngine.initSchema()` to size `vector(__EMBEDDING_DIMS__)` at the right width. Root cause: v0.36.2.0 flipped DEFAULT_EMBEDDING_DIMENSIONS from 1536 to 1280 (ZE Matryoshka step). The gateway module is process-singleton; when ANOTHER test file in the same shard's bun-test process configures the gateway before us, `pglite-engine.ts:216` reads `getEmbeddingDimensions() === 1280` and sizes the schema columns at vector(1280). The hardcoded 1536-dim INSERTs then fail with "expected 1280 dimensions, not 1536". Locally these tests pass in isolation because the gateway falls back through the try/catch at pglite-engine.ts:218 (1536 default). CI runs multiple test files in one process, so cross-file state poisons the schema width. Fix: explicit `resetGateway()` + `configureGateway({embedding_dimensions: 1536, ...})` at the top of `beforeAll`, plus `resetGateway()` in `afterAll`. Pins the schema width regardless of cross-file state. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
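The dimension rules stated in this PR's dims commits (1..1536 for text-embedding-3-small, 1..3072 for -large, and the ZE step set 2560/1280/640/320/160/80/40) can be expressed as a small validator. This is a sketch under assumptions: `checkEmbeddingDims`, the result shape, the bare model names used as keys, and the hint text are hypothetical, and the real `dimsProviderOptions` raises an `AIConfigError` rather than returning a value.

```typescript
// Hypothetical result shape; the real code throws AIConfigError with a `fix` field.
type DimsCheck = { ok: true } | { ok: false; fix: string };

// Upper bounds for OpenAI text-embedding-3 models, per the commit message.
const OPENAI_MAX_DIMS: Record<string, number> = {
  "text-embedding-3-small": 1536,
  "text-embedding-3-large": 3072,
};

// Valid ZeroEntropy Matryoshka steps, per the commit message.
const ZE_STEPS = [2560, 1280, 640, 320, 160, 80, 40];

function checkEmbeddingDims(model: string, dims: number): DimsCheck {
  if (Object.prototype.hasOwnProperty.call(OPENAI_MAX_DIMS, model)) {
    const max = OPENAI_MAX_DIMS[model];
    const isWholeNumber = Math.floor(dims) === dims;
    if (isWholeNumber && dims >= 1 && dims <= max) return { ok: true };
    return { ok: false, fix: `set embedding_dimensions between 1 and ${max} for ${model}` };
  }
  if (model === "zembed-1") {
    if (ZE_STEPS.indexOf(dims) !== -1) return { ok: true };
    return { ok: false, fix: `set embedding_dimensions to one of ${ZE_STEPS.join("/")}` };
  }
  // Assumption: models without a known rule are not validated by this sketch.
  return { ok: true };
}
```

For instance, 1024 is accepted for either OpenAI model but rejected for `zembed-1`, which mirrors the "1024 is NOT on ZE's list" note above.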
…an#1164) * feat: migration v68 — eval_candidates.embedding_column Schema migration ALTERs eval_candidates to add a nullable embedding_column TEXT column. Per-row capture metadata so `gbrain eval replay` reproduces the same column the capture ran against (D16 / CDX-10). NULL-tolerant: pre-v0.36 rows fall back to current default. Renumbered v67→v68 because master claimed v67 for facts_typed_claim_columns during this branch's lifetime. PGLite parity via sqlFor.pglite — same ALTER IF NOT EXISTS. * feat: dynamic embedding column — core (resolver, types, gateway, engines) The read-path foundation for routing search through any populated embedding column, not just OpenAI 1536. src/core/search/embedding-column.ts (new) is the canonical seam. Single source of truth for column → provider/dim/type lookup. Validates registry keys via regex (/^[a-z_][a-z0-9_]*$/), uses Object.create(null) + Object.hasOwn so 'constructor' and other inherited names can't masquerade as registered columns. Identifier-quoting on SQL interpolation as defense in depth. src/core/types.ts widens SearchOpts.embeddingColumn to accept ResolvedColumn descriptors at the engine boundary; adds EmbeddingColumnConfig + ResolvedColumn exports. src/core/config.ts merges embedding_columns + search_embedding_column from the DB plane via loadConfigWithEngine, mirroring the existing embedding_multimodal_model pattern. Handles the no-file case so env-only Postgres installs see DB-plane overrides (codex /ship #3). src/core/ai/gateway.ts: embedQuery(text, opts) + embed(texts, opts) accept embeddingModel + dimensions overrides. isAvailable(touchpoint, modelOverride?) so hybrid asks 'is the active column's provider reachable?' not 'is the global default reachable?' (CDX-4 / D10). Engines: searchVector accepts ResolvedColumn descriptors via normalizeEngineColumn; engine code is config-free and unit-testable. getEmbeddingsByChunkIds(ids, column?) 
so cosineReScore hydrates from the active column instead of always 'embedding' (CDX-3 / D9). Identifier-quoting belt at the SQL boundary. src/core/eval-capture.ts threads embedding_column from hybridSearch meta into the persisted capture row. * feat: dynamic embedding column — integration (hybrid, ops, doctor) Wires the resolver into hybridSearch, the query op, doctor, and the config command. src/core/search/hybrid.ts: resolves the column once at the boundary, threads the descriptor into engine calls, routes embedQuery through the resolved column's provider/dims, and calls isCacheSafe (not isDefaultColumn) for cache skip so user overrides of the 'embedding' builtin can't leak across vector spaces (CDX-4). cosineReScore now hydrates from the active column. src/core/search/mode.ts: KNOBS_HASH_VERSION 2→3, append-only new fields col= and prov= alongside floor_ratio. Cache rows from different columns or providers now sit in different keyspaces — cross-column contamination impossible. src/core/operations.ts: query op accepts embedding_column param for per-call A/B benchmarking. search op (keyword-only) deliberately does NOT (CDX-9 / D15) — would be silent UX. src/commands/doctor.ts: new embedding_column_registry check. Batch format_type probe (D13) catches dim drift that information_schema.columns.udt_name can't. Batch pg_indexes probe (D5) warns on missing HNSW. Coverage % on active column, gates at <90% (D14), short-circuits on empty brains (codex /ship #5). src/commands/config.ts: validates embedding_columns JSON shape at set time, runs the coverage gate when setting search_embedding_column, uses Object.hasOwn for the registry lookup. src/commands/eval-replay.ts: replay re-runs queries against the captured embedding_column so post-flip-config replays don't surface as false-positive regressions. 
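The registry hardening described for the embedding-column resolver (key regex, null-prototype map, own-property lookup, identifier quoting) can be sketched as below. Everything named here is hypothetical (`ColumnConfig`, `buildColumnRegistry`, `lookupColumn`, `quoteIdentifier`); only the regex `/^[a-z_][a-z0-9_]*$/` and the general techniques are taken from the commit message. The own-property check uses `Object.prototype.hasOwnProperty.call`, which behaves like `Object.hasOwn` for this purpose on runtimes that have the newer method.

```typescript
// Hypothetical descriptor for illustration.
interface ColumnConfig {
  provider: string;
  dimensions: number;
}

const COLUMN_NAME_PATTERN = /^[a-z_][a-z0-9_]*$/;

// Builds a registry with no prototype, so inherited names such as
// "constructor" or "toString" are never found by accident.
function buildColumnRegistry(
  entries: Array<[string, ColumnConfig]>,
): Record<string, ColumnConfig> {
  const registry: Record<string, ColumnConfig> = Object.create(null);
  entries.forEach(([name, config]) => {
    if (!COLUMN_NAME_PATTERN.test(name)) {
      throw new Error(`invalid embedding column name: ${JSON.stringify(name)}`);
    }
    registry[name] = config;
  });
  return registry;
}

// Own-property lookup only; returns undefined for anything not registered.
function lookupColumn(
  registry: Record<string, ColumnConfig>,
  name: string,
): ColumnConfig | undefined {
  return Object.prototype.hasOwnProperty.call(registry, name)
    ? registry[name]
    : undefined;
}

// Defense in depth: double-quote the identifier and escape embedded quotes
// before it is interpolated into SQL.
function quoteIdentifier(name: string): string {
  return '"' + name.replace(/"/g, '""') + '"';
}
```

Note that the regex alone would accept `constructor` as a name; it is the null prototype plus the own-property check that keeps an unregistered `constructor` from resolving to an inherited function.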
* test: dynamic embedding column — unit + e2e coverage 50 unit cases for the resolver (resolution chain, registry merge, validation, prototype pollution, descriptor passthrough, isCacheSafe, normalizeEngineColumn). 8 gateway override cases — embeddingModel + dimensions flow into providerOptions, isAvailable(touchpoint, override) routes to the right recipe, unknown models throw clean. 4 cosineReScore + 6 ops + 5 knobs-hash + 7 mode + 9 PGLite E2E + 7 Postgres E2E + 5 eval-replay column metadata. Postgres E2E (gated on DATABASE_URL) covers halfvec(2560) end-to-end on real pgvector, EXPLAIN-visible HNSW index on the alternate column, format_type-based dim drift catch, and the <90% coverage gate. Pins every codex /ship fix: prototype-pollution rejection ('constructor' as column name), descriptor passthrough validation (rejects SQL-shaped strings in dimensions), isCacheSafe semantics (space-based, not name-based). Total: 141 new + extended cases, all green. * chore: bump version and changelog (v0.36.3.0) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: sync to v0.36.3.0 Add CLAUDE.md key-files entry for src/core/search/embedding-column.ts. Annotate hybrid.ts, gateway.ts, doctor.ts, and migrate.ts entries with v0.36.3.0 wave changes (ResolvedColumn threading, embedQuery model override, embedding_column_registry check, migration v68). Document knobs_hash v=2 → v=3 bump under the Search Mode section. Regenerate llms-full.txt from the updated CLAUDE.md so the auto-checked bundle matches source (build-llms.test.ts CI guard). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(ci): two CI failures from v0.36.3.0 1. test/loadConfig-merge.test.ts: update the 'returns null when base config is null' contract test. Pre-v0.36 the function returned null for null base; the codex /ship #3 fix changed that to synthesize a minimal `{ engine: 'postgres' }` so env-only installs see DB-plane overrides. 
Test now pins the new contract + adds a round-trip case asserting the merge actually surfaces `embedding_columns` / `search_embedding_column` set via gbrain config set on a null base. 2. test/schema-bootstrap-coverage.test.ts was failing because eval_candidates.embedding_column (added by migration v68) wasn't covered by applyForwardReferenceBootstrap. Fix: add the column to PGLITE_SCHEMA_SQL's eval_candidates CREATE TABLE definition (and src/schema.sql for parity) so fresh installs get it natively. The coverage test's third tier (schemaCreateTableCols) now finds it. Regenerated schema-embedded.ts via bun run build:schema. Schema-blob path is cleaner than COLUMN_EXEMPTIONS — fresh installs skip the migration entirely; upgrade installs still run v68. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
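The null-base contract change pinned by `test/loadConfig-merge.test.ts` can be illustrated with a small sketch. `mergeDbPlane` and the config field types below are hypothetical names for the example, and the precedence (DB-plane values win over the base) is an assumption read from the phrase "DB-plane overrides"; the actual `loadConfigWithEngine` may differ.

```typescript
// Hypothetical, trimmed-down config shape for illustration only.
interface GbrainConfig {
  engine: string;
  embedding_columns?: Record<string, unknown>;
  search_embedding_column?: string;
}

type DbPlane = Partial<
  Pick<GbrainConfig, "embedding_columns" | "search_embedding_column">
>;

function mergeDbPlane(base: GbrainConfig | null, db: DbPlane): GbrainConfig {
  // Pre-v0.36 behavior returned null here; the new contract synthesizes a
  // minimal base so env-only installs still see DB-plane values.
  const merged: GbrainConfig = { ...(base ?? { engine: "postgres" }) };
  // Assumption: a value present in the DB plane overrides the base value.
  if (db.embedding_columns !== undefined) {
    merged.embedding_columns = db.embedding_columns;
  }
  if (db.search_embedding_column !== undefined) {
    merged.search_embedding_column = db.search_embedding_column;
  }
  return merged;
}
```

Under these assumptions, `mergeDbPlane(null, { search_embedding_column: "embedding_voyage" })` yields a config with `engine: "postgres"` and the DB-plane column set, which is the round-trip the updated test describes.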
…stale refs (garrytan#1201) A community member reported docs 'have quite a bit of drift and some broken links' and contradictions like 'says don't use bun but also to use bun.' This PR is a top-to-bottom audit + fix across every doc file at the repo root and under docs/. Where docs disagreed with each other, the code was the tie-breaker. ## Categories of fix ### 1. Stale CLI commands (skillpack install → scaffold) `gbrain skillpack install` was retired in v0.36.0.0 (replaced by the scaffold/reference/migrate-fence model). The CLI now errors out with a hint: $ gbrain skillpack install Error: 'gbrain skillpack install' was removed in v0.33. Use 'gbrain skillpack scaffold <name>' instead. But the docs still recommended it: - README.md line 29 — primary install path - docs/INSTALL.md lines 12 — primary install path Both updated to `gbrain skillpack scaffold --all` with the v0.36.0.0 retirement explained inline + the migrate-fence escape hatch for users upgrading from older releases. ### 2. The 'bun install -g vs bun link' contradiction The community member's exact complaint. The drift: - README.md + docs/INSTALL.md: recommended `bun install -g github:garrytan/gbrain` - INSTALL_FOR_AGENTS.md line 29: 'Do NOT use `bun install -g github:garrytan/gbrain`.' Reading the code + CHANGELOG: `bun install -g` IS the canonical path. Bun occasionally blocks the top-level postinstall hook on global installs (issue garrytan#218), but the postinstall now prints a loud recovery hint when that happens, and `gbrain doctor` flags `schema_version: 0` and routes users to `gbrain apply-migrations --yes`. The 'do not use' warning was correct in 2024 when the postinstall silently swallowed errors with `|| true`; it's stale now. Reconciled: - INSTALL_FOR_AGENTS.md Step 1: now recommends `bun install -g` as the primary path, documents garrytan#218 as a known issue with the recovery command, and keeps `git clone + bun link` as a documented fallback. 
- AGENTS.md Install (5 min): same reconciliation; clone path is the fallback, not the default. - docs/INSTALL.md CLI standalone: added the garrytan#218 callout so the deterministic fallback is one click away when the default fails. ### 3. Broken internal links - README.md → `docs/integrations/voice.md` (file doesn't exist). The real voice recipe lives at `recipes/twilio-voice-brain.md` (Twilio + OpenAI Realtime). Fixed to point there with an accurate one-line summary. - CONTRIBUTING.md → `docs/SQLITE_ENGINE.md` (file doesn't exist; superseded by PGLite per docs/ENGINES.md). Replaced with a paragraph explaining the supersession and pointing at the live ENGINES.md. - docs/GBRAIN_V0.md → `docs/SQLITE_ENGINE.md` (2 references; same supersession). Added a historical-doc banner at the top + rewrote both references to point at the current ENGINES.md. ### 4. Stale API key recommendations INSTALL_FOR_AGENTS.md Step 2 only mentioned OpenAI + Anthropic. As of v0.36.2.0 ZeroEntropy is the default embedding + reranker stack (README opens with this); the agent install guide didn't reflect it. Added `ZEROENTROPY_API_KEY` as the default, kept OpenAI/Voyage as documented fallbacks, noted that keys can live in `~/.gbrain/config.json` (file plane) or env. ### 5. Stale upgrade workflow INSTALL_FOR_AGENTS.md 'Upgrade' section assumed the clone+bun-install model (`cd ~/gbrain && git pull && bun install && gbrain init && gbrain post-upgrade`) and didn't mention `gbrain upgrade` (the single-command path that exists in the CLI today: binary self-update + schema migrations + post-upgrade prompts in one). Split into two paths — `gbrain upgrade` for the bun-install-g case (now the default per Step 1), clone-path for the fallback case. Also fixed AGENTS.md 'Migrate' bullet (was `gbrain apply-migrations` only; now leads with `gbrain upgrade` and keeps apply-migrations as the manual schema-only path). ### 6. 
Stale cron-workflow INSTALL_FOR_AGENTS.md Step 7 referenced cron docs but didn't mention `gbrain autopilot --install` (the built-in self-maintaining daemon that exists in the CLI today) or `gbrain sync --watch` (continuous loop). Added both as alternatives to platform-cron glue. ### 7. ZeroEntropy version typo docs/INSTALL.md said 'the v0.36.0.0 ZE switch' — ZE landed in v0.36.2.0 (v0.36.0.0 was the skillpack-scaffold retirement). Fixed. ## What I did NOT change - CHANGELOG.md, CLAUDE.md, TODOS.md prose mentions of historical commands like `gbrain skillpack install` are correct as history — they're documenting what was true in past releases. Only forward-looking docs got updated. - The 'broken link' false-positive matches in CHANGELOG / CLAUDE / TODOS are inside code-fence examples or regex patterns (`[Name](people/slug)`, `[a-z0-9](?:[a-z0-9-]{0,30}[a-z0-9])`, `[--json](interrupted)`); they're illustrative syntax, not real links. Leaving alone. - llms.txt / llms-full.txt regenerated via `bun run build:llms` so the agent-fetch documentation map matches the new content. ## Verification - `bun run src/cli.ts --help` cross-checked against every command/flag the install docs reference: init, doctor, apply-migrations, upgrade, post-upgrade, skillpack scaffold/reference/migrate-fence, embed --stale, sync --watch, autopilot --install, dream, integrations list, extract links/timeline, graph-query, query, search modes — all real, all current. - `bun run src/cli.ts skillpack install` confirmed to error out with the retirement hint pointing at scaffold (proves the README guidance was actively misleading users into a dead-end). - Re-ran the broken-internal-link scanner across all root .md + docs/**/*.md; zero real broken links remain (5 residual matches are illustrative syntax inside prose, not actionable links). Co-authored-by: garrytan-agents <agents@garrytan-agents.local>
…--remediate + Minions (garrytan#1193) * feat(schema): op_checkpoints table + doctor_run_id partial GIN (v67+v68) T1 of brain-health-100 wave. Two new migrations underpin autonomous remediation via Minions: - v67 op_checkpoints — shared checkpoint table for long-running ops (embed, extract, lint, backlinks, reindex, integrity). Pre-fix each op had its own file-backed checkpoint or none. PRIMARY KEY (op, fingerprint) lets `extract links` and `extract timeline` (or `reindex --markdown` vs `--code`) coexist without colliding on shared keys. - v68 minion_jobs_doctor_run_id_idx — partial GIN on `minion_jobs.data WHERE data ? 'doctor_run_id'`. Indexes only doctor-submitted jobs so audit-trail queries don't sequential-scan months of unrelated cron history. PGLite skips via empty sqlFor. Applied to src/schema.sql + src/core/pglite-schema.ts so both engines get the table on fresh-install. Bootstrap coverage test + 122-case migrate test both pass. Plan: ~/.claude/plans/system-instruction-you-are-working-fluttering-ocean.md (D12 + folded scope B from outside-voice review). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(core): op-checkpoint module — DB-backed checkpoint primitive T2 of brain-health-100 wave. 
Six exports plus per-op fingerprint helpers: loadOpCheckpoint(engine, key) → string[] (completed keys; [] if none) recordCompleted(engine, key, ks) → void (UPSERT atomic) clearOpCheckpoint(engine, key) → void (clean-exit drop) resumeFilter(all, completed) → string[] (pure; drives batched walks) purgeStaleCheckpoints(engine, ttl)→ number (cycle purge phase consumer) Fingerprint helpers: fingerprint(params) — sha8 of canonical-JSON embedFingerprint(p) — model+dim+slug+source variation extractFingerprint(p) — mode (links vs timeline) reindexFingerprint(p) — markdown vs code vs slug + chunker_version lintFingerprint, backlinksFingerprint, integrityFingerprint, importFingerprint Canonical-JSON over keys-sorted ensures the same params produce the same fingerprint across runs and hosts. sha8 (8 hex chars from sha256) is short enough for filenames + UI but collision-resistant for the expected per-op invocation diversity. DB-backed for both engines (PGLite has the table too via v67). Lost- write on partial DB failure is non-fatal — caller continues, next run re-walks (cheap for hash-short-circuited ops like embed/import). Plan: ~/.claude/plans/system-instruction-you-are-working-fluttering-ocean.md (D12 + codex #10–16 from outside-voice review). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(core): brain-score-recommendations — shared data layer T4 of brain-health-100 wave. Pure module — no engine I/O. Takes a BrainHealth snapshot + RecommendationContext, returns ordered Remediation[] ready to feed the doctor remediation plan OR features --auto-fix. Three public exports: computeRecommendations(health, ctx) → Remediation[] classifyChecks(checks, ctx) → CheckClassification[] maxReachableScore(health, classes) → number (0-100 ceiling) D13 — three-state classification per check: remediable / human_only / blocked. The plan ONLY emits remediable items; blocked surfaces alongside as informational with the missing prereq (no API key, etc.). 
Closes the spin-loop bug on empty / API-key-missing brains (codex #20). D14 — every Remediation has a stable string id (sync.repo, embed.stale, backlinks.fix, extract.all). depends_on references ids, not check names. D9 — idempotency_key is content-hash from canonical-JSON of params. Same intent across runs = same key; failed-row replay via :r<N> suffix is the --remediate loop's job, not this module's. Scope item +A (cost-budget gate) — Remediation.est_usd_cost populated for embed (chars × pricePerMTok from embedding-pricing.ts) and Anthropic jobs (estimateAnthropicCost helper). doctor --remediate --max-usd N gates submission against est_total_usd_cost. Both consumers (doctor + features per D15) import from here. Features executes inline (D15 contract preserved), doctor submits via queue. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(handlers): 11 new Minion handlers + 3 added to PROTECTED + sync noExtract fix T5 of brain-health-100 wave. PROTECTED_JOB_NAMES extension (D11): synthesize, patterns, consolidate. These cycle phases internally submit `subagent` jobs with allowProtectedSubmit=true, so they CAN spend Anthropic credits. Treating them as "data-quality maintenance" was a misread surfaced by the codex outside-voice review (#6). Protected gate ensures only trusted local callers (CLI, autopilot, doctor --remediate) can submit; an OAuth-scoped MCP client can't burn the user's API budget by submitting a synthesize job over HTTP. 11 new handlers registered in jobs.ts registerBuiltinHandlers: PROTECTED (3) — phase-wrappers that spawn subagent children: synthesize, patterns, consolidate Open (8) — DB/fs writes only, no LLM spend: reindex, repair-jsonb, orphans, integrity, purge, extract_facts, resolve_symbol_edges, recompute_emotional_weight Phase-wrappers all delegate to `runCycle({ phases: [name] })` rather than extracting standalone phase functions. 
Cycle.ts already owns the lock + abort signal + progress reporter per D10, so the wrapper is a one-liner and cycle.ts remains the single source of truth for phase semantics. Pragmatic deviation from the plan's "extract 6 standalone runXxxPhase functions" — smaller diff, equivalent correctness. Standalone `sync` handler now passes `noExtract: true` (codex #5 fix). Pre-fix, doctor's remediation plan emitting [sync, extract] caused double-extraction (performSync inline-extract + standalone extract job). Now sync defers extract to the dedicated handler. Callers that want inline extract pass { noExtract: false } in job params. Plan: ~/.claude/plans/system-instruction-you-are-working-fluttering-ocean.md (T5 + D10 + D11 + codex #5/#6 from outside-voice review). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(doctor): --remediation-plan + --remediate CLI surfaces T6 of brain-health-100 wave. The headline user-facing capability: agents drive brain health to target score via autonomous Minions remediation. Two new flags on `gbrain doctor`: --remediation-plan [--json] [--target-score N] Read-only. Emits ordered Remediation[] from BrainHealth + context. Uses cheap path (D7) — engine.getHealth() + computeRecommendations, NOT a full doctor walk. JSON shape is stable agent contract. --remediate [--yes] [--target-score N] [--max-jobs N] [--max-usd N] [--dry-run] [--json] Sequential submit (D3) with D5 cascade on failure, D7 scoped recheck between steps, D9 content-hash idempotency keys, D13 three-state remediation filtering (only remediable jobs enter the loop), +A cost-budget gate via --max-usd. Check.remediation field added as additive optional (DoctorReport schema_version stays at 2 per D4). PGLite path: synchronous in-process execution with short polling. Postgres path: durable queue submission with waitForCompletion. The --remediate loop: 1. Compute initial plan from BrainHealth 2. Refuse if --target-score > maxReachableScore(health, classes) 3. 
Refuse if est_total_usd_cost > --max-usd 4. For each step in order: - Skip if depends_on intersects aborted set (D5) - queue.add with content-hash idempotency_key (D9) - waitForCompletion with timeout - Recompute plan from fresh health (D7 scoped recheck) 5. Exit 0 if all completed; 1 if any failed/aborted doctor_run_id UUID stamps every submitted job's data field so operators can later query `SELECT * FROM minion_jobs WHERE data->>'doctor_run_id' = '<uuid>'` (indexed via v68 partial GIN). Plan: ~/.claude/plans/system-instruction-you-are-working-fluttering-ocean.md (T6 + D1/D3/D5/D7/D9/D13 + folded scope A). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(cli): maybeBackground helper + apply --background to embed T7 of brain-health-100 wave. New helper in src/core/cli-options.ts formalizes the --background flag pattern. Same semantics in TTY and cron per D9 (submit-and-exit always; --background --follow execs `gbrain jobs follow <id>` after submission). await maybeBackground({ engine, args, jobName: 'embed', paramBuilder: (cleanArgs) => ({ stale, all, ... }), }) // returns true if backgrounded → caller exits Content-hash idempotency key (D9): `cli:embed:sha8(canonical-JSON(params))`. No time-slot. Same intent across runs = same key. Failed-row replay is the doctor --remediate loop's job, not this path's. PGLite degrades to inline execution with a clear stderr note ("PGLite has no worker daemon; running inline"). NOT a no-op, NOT silent — doc-stated semantic difference because PGLite has no worker daemon. Applied to `gbrain embed` as the reference integration. The other 6 commands (extract, lint, backlinks, reindex, integrity, pages) adopt the same 4-line pattern at the top of their entry function — follow-up in a smaller diff once the helper proves out in production. Plan: ~/.claude/plans/system-instruction-you-are-working-fluttering-ocean.md (T7 + D9 + Gap 6). 
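The content-hash idempotency key `cli:embed:sha8(canonical-JSON(params))` can be sketched as below. This is an illustration under stated assumptions: `canonicalJson`, `sha8`, and `cliIdempotencyKey` are hypothetical helper names, and the exact canonicalization used in the repo is not shown in the commit message. What the sketch keeps from the text is the shape: keys sorted before hashing, 8 hex characters of SHA-256, no time slot.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialize with object keys sorted at every level so that equal params
// always produce the same string regardless of insertion order.
function canonicalJson(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(canonicalJson).join(",")}]`;
  const parts = Object.entries(value)
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
    .map(([k, v]) => `${JSON.stringify(k)}:${canonicalJson(v)}`);
  return `{${parts.join(",")}}`;
}

// First 8 hex characters of the SHA-256 digest.
function sha8(input: string): string {
  return createHash("sha256").update(input).digest("hex").slice(0, 8);
}

// Same intent across runs gives the same key; there is no timestamp component.
function cliIdempotencyKey(jobName: string, params: Json): string {
  return `cli:${jobName}:${sha8(canonicalJson(params))}`;
}
```

For example, `cliIdempotencyKey("embed", { stale: true, all: false })` and `cliIdempotencyKey("embed", { all: false, stale: true })` return the same key, which is the property the queue relies on to deduplicate repeated submissions.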
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(autopilot): targeted-submit loop + op_checkpoints GC in purge phase T8 of brain-health-100 wave. Autopilot dispatch changes (src/commands/autopilot.ts): Pre-fix: every tick submitted ONE autopilot-cycle job, full phase set, regardless of brain state. On a healthy brain pure overhead; on a degraded brain bundled fast wins with slow phases so user waited for the slowest. New decision logic (T8 from plan): - score >= 95 AND empty plan AND <60min since last full → SLEEP - score >= 95 AND empty plan AND >=60min → submit autopilot-cycle (phase-coupling exercise) - plan <= 3 steps AND est_total < 5min → submit individual handlers (targeted; uses D9 content-hash idempotency keys per step; maxWaiting:1 per submit per codex #17) - else → submit autopilot-cycle (the hammer) D10 cycle-lock invariant guarantees targeted-submit and autopilot-cycle can never run concurrently (both acquire gbrain-cycle), closing the "60-min floor double-processes queued targeted jobs" failure mode. Computation uses cheap path (D7) — engine.getHealth() + computeRecommendations, NOT a full doctor walk. Adds ~1 SQL count query per tick; negligible on a 50K-page brain. PROTECTED handlers (synthesize/patterns/consolidate) are submitted with allowProtectedSubmit:true; autopilot is a trusted local caller. Cycle purge phase (src/core/cycle.ts): Added op_checkpoints GC (+C folded scope item). 7-day TTL — any reasonable long-running op finishes inside that window. Non-fatal on pre-v67 brains (table missing). Plan: ~/.claude/plans/system-instruction-you-are-working-fluttering-ocean.md (T8 + D7/D9/D10 + codex #17 + folded scope +C). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(core): brain-score-recommendations + op-checkpoint unit tests T10 of brain-health-100 wave — load-bearing decision-pinning tests. 
test/brain-score-recommendations.test.ts (22 cases): - Healthy brain → empty plan - Per-component remediation paths (sync, embed, backlinks, extract) - depends_on wiring (extract → sync; embed → sync when stale) - Severity ordering (critical > high > medium > low) - D6 #5 determinism: same input twice → byte-identical output - D9 idempotency keys: content-hash format, no time-slot - D9 source isolation: different --source → different key - D13 status field always 'remediable' in output - +A cost-estimate populated for embed - classifyChecks: remediable / blocked / human_only triage - maxReachableScore: all-remediable → 100; all-blocked → current test/op-checkpoint.test.ts (20 cases): - fingerprint stability + key-order invariance (canonical-JSON) - codex #11: extract links vs timeline get different fingerprints - codex #12: reindex markdown vs code get different fingerprints - codex #15: embed model+dim variation produces different fingerprints - reindex chunker_version bump invalidates checkpoint - DB round-trip (load → record → load) - Cross-fingerprint isolation (linksKey vs timelineKey) - clearOpCheckpoint idempotency on missing rows - resumeFilter purity (no I/O, deterministic) - purgeStaleCheckpoints TTL respect 42 new tests, all pass. PGLite engine + resetPgliteState pattern per CLAUDE.md test-isolation guide. Plan: ~/.claude/plans/system-instruction-you-are-working-fluttering-ocean.md (T10 + D6 #5 + D9 + D12 + D13 + codex #11/#12/#15). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(release): v0.36.0.0 — brain-health-100 wave + docs/llms refresh T12 of brain-health-100 wave. VERSION + package.json bumped 0.35.6.0 → 0.36.0.0. CHANGELOG entry leads ELI10 ("your agent can now drive your brain to 90/100 by itself, on a cron, without you watching") then drills into the precise mechanics per CLAUDE.md voice rules. llms.txt + llms-full.txt regenerated via bun run build:llms. 
Trio audit (CLAUDE.md mandatory pre-push check): VERSION: 0.36.0.0 package.json: 0.36.0.0 CHANGELOG: ## [0.36.0.0] - 2026-05-18 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: update README/CLAUDE/AGENTS/maintain for v0.36.4.0 brain-health-100 wave - README.md: New-in-v0.36.4.0 callout — `gbrain doctor --remediate` headline, autopilot health-aware tick, eleven new background-job types, three PROTECTED. - CLAUDE.md: Key Files entries for `op-checkpoint.ts`, `brain-score-recommendations.ts`, doctor.ts / jobs.ts / protected-names.ts / autopilot.ts / cycle.ts / embed.ts / cli-options.ts extensions; new "Key commands added in v0.36.4.0" section. - AGENTS.md: Common-tasks entry pointing agents at the one-command remediation loop. - skills/maintain/SKILL.md: Autonomous Phase (gbrain doctor --remediate) at the top, manual per-dimension walk preserved as the fallback path. - llms-full.txt: regenerated to pick up the CLAUDE.md changes (project rule). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(changelog): respectful tone on spend caps for v0.36.4.0 Reframed the cost-budget callout. Pre-fix language said the spend cap prevents a synthesize loop from "burning $100 of Anthropic credits while you're at lunch" — casually treating $100 as the throwaway number is tone-deaf. $100 is a meaningful amount for many people. New language: "spend cap so a synthesize loop can't run up your Anthropic bill while you're at lunch. The cap is yours to set per run." And: "Pass --max-usd 5 (or whatever cap you're comfortable with)." And: "Pick the cap that fits your wallet." Also reframed three adjacent lines: - "healthy brains stop burning cycles" → "stop spending tokens on work that has nothing to do" - "agent can't submit them and burn your API budget" → "can't submit them on your behalf. 
Your provider bill stays in your hands" - Table cell "Cron with cost cap" / "--max-usd 5" → "Cron with spend cap" / "--max-usd N" llms-full.txt regenerated to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
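The autopilot tick decision described in the targeted-submit commit of garrytan#1193 can be sketched as a pure function. `decideTick`, `TickInput`, and the action names are hypothetical; the thresholds (score 95, 60 minutes, 3 steps, 5 minutes) come from the commit message. The rules are applied in the order listed there, so an empty plan on a brain scoring below 95 falls into the targeted branch and submits nothing under this literal reading.

```typescript
type TickAction = "sleep" | "autopilot_cycle" | "targeted";

interface TickInput {
  score: number;                // current brain health score, 0-100
  planSteps: number;            // number of remediations in the computed plan
  estTotalMinutes: number;      // estimated total runtime of the plan
  minutesSinceLastFull: number; // time since the last full autopilot-cycle
}

function decideTick(input: TickInput): TickAction {
  const healthy = input.score >= 95 && input.planSteps === 0;
  if (healthy) {
    // Healthy brain: sleep, but still run a full cycle once per hour.
    return input.minutesSinceLastFull < 60 ? "sleep" : "autopilot_cycle";
  }
  if (input.planSteps <= 3 && input.estTotalMinutes < 5) {
    // Small, fast plan: submit the individual handlers.
    return "targeted";
  }
  // Anything larger: submit the full cycle.
  return "autopilot_cycle";
}
```

Keeping the decision pure makes the branch table easy to pin in unit tests without an engine or queue.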
…"database_url"]) (garrytan#1192) * v0.36.5.0 feat: secure DATABASE_URL access for shell jobs (inherit: ["database_url"]) Replaces PR garrytan#1137's plaintext-config / plaintext-env workarounds with code. Shell-job params gain `inherit: ["database_url"]`, validated pre-enqueue in both the CLI (`gbrain jobs submit`) and `submit_job` MCP op handler. Worker resolves the value from its own loadConfig() at child-spawn time; the persisted `minion_jobs.data` row stores only the name. Plain `env: { GBRAIN_DATABASE_URL: ... }` / `env: { DATABASE_URL: ... }` / `env: { GBRAIN_DIRECT_DATABASE_URL: ... }` are rejected pre-enqueue with a paste-ready hint pointing at `inherit:`. Codex pre-landing review caught two bypasses + one missing shadow name: - H1: cmd/argv inline-secret regex scan (cmd:"GBRAIN_DATABASE_URL=... gbrain sync" was a clean bypass — fixed) - H3: GBRAIN_DIRECT_DATABASE_URL added to shadowKeys - H2: honest docs about output-side leakage (stdout_tail/stderr_tail can still carry the value if the script prints it; that's the script author's responsibility, not gbrain's) Also: gbrain doctor learns home_dir_in_worktree (warns when ~/.gbrain lives inside a git worktree); ~/.gbrain/.gitignore retroactive via saveConfig + post-upgrade. New canonical guide: docs/guides/agent-to-gbrain.md (two-domain framing for downstream agent authors: MCP ops via OAuth vs localOnly admin ops via shell-job inherit:). Closes garrytan#1137. Tests: +53 new (21 validator + 12 inherit-record + 6 ensureGitignore + 5 doctor + 2 PGLite E2E + 7 codex-driven H1/H3 cases). Credit: @WinterMute filed PR garrytan#1137 which made the env-stripping gap visible enough to fix in code. Thank you. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * v0.36.5.0 redesign: free-form inherit:, drop closed enum User feedback: "agent spawning minions should have agency to do what it wants with secrets and pass only the ones that it needs. don't be a security nazi please." 
Replaces the closed INHERITABLE enum (database_url only) with three small helpers in shell-inherit.ts: - INHERIT_NAME_RE: snake_case shape guard. Rejects __proto__, leading underscore, uppercase, path-traversal. Prototype-pollution defense. - deriveEnvKey(name): config-key → child-env-key. Uppercase by default with one override: database_url → GBRAIN_DATABASE_URL. - resolveInheritValue(cfg, name): value lookup with Object.hasOwn. inherit: now accepts any snake_case config-key the worker has. Agent picks what it needs per-job (database_url, anthropic_api_key, voyage_api_key, or any custom field). Validator does NOT police WHICH keys — single-uid trust model treats agent as peer of worker. Drops the v0.36.5.0-RC rules that were paternalistic for the actual threat model: - closed-enum check - env-shadow rejection - cmd/argv inline-secret scan Keeps the parts that defend real problems: - pre-enqueue validation (closes the persistence-before-throw window) - snake_case regex (prototype-pollution + audit-log readability) - fail-fast on missing config value (UX guardrail, not security) Tests: shell-validate (existing rules + new free-form + prototype-pollution defense + T1 regression guard) and shell-inherit (regex matrix, deriveEnvKey per-name, resolveInheritValue with hasOwn defense). E2E case now exercises inherit:["anthropic_api_key"] to prove genuinely free-form. Docs and CHANGELOG rewritten to reflect the open design + the design-arc story (closed → cut → free-form). Migration file too. 7653 unit tests green. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * v0.36.5.0 add: redact_secrets opt-in for stdout/stderr scrubbing Honest defense for the documented output-side leakage. When a script prints an inherited secret, the value lands plaintext in result.stdout_tail / result.stderr_tail / error_text. 
v0.36.5.0 adds: - `redact_secrets: true` ShellJobParams field - `--redact-secrets` CLI convenience flag on `gbrain jobs submit shell` - shell-redact.ts: pure `redactSecretsInText(text, secrets)` helper (string-mode replaceAll; regex metachars in values stay literal) - Handler post-processes both tails before throw/return, so the persisted row carries `<REDACTED:name>` tokens instead of values Only inherit-resolved values are scrubbed. env: values are not (those are the agent's "fine in the row" channel by design). Heuristic — defeats accidental `echo "$GBRAIN_DATABASE_URL"`, not adversarial encode-then-print. Default false for back-compat. Tests: - test/minions-shell-redact.test.ts (9 cases): pure-function behavior, regex-metachar safety, multi-secret independent redaction, substring overlap, empty-input/map edge cases - test/minions-shell-validate.test.ts: +4 cases for redact_secrets shape - test/e2e/minions-shell-pglite.test.ts: +2 cases proving redact_secrets: true scrubs persisted row AND redact_secrets:false preserves plaintext (back-compat regression guard) Docs + CHANGELOG + migration file + CLAUDE.md updated. 7667 unit tests green. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
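The `inherit:` helpers and the `redact_secrets` scrubbing described in garrytan#1192 can be sketched as follows. The function names match the commit message, but the bodies are illustrative guesses: the exact `INHERIT_NAME_RE` pattern, the error messages, the `Record<string, string>` shape of `secrets`, and the longest-value-first ordering for overlapping secrets are all assumptions.

```typescript
// Assumed pattern: lowercase snake_case with a leading letter. This rejects
// "__proto__", leading underscores, uppercase, and path-like names.
const INHERIT_NAME_RE = /^[a-z][a-z0-9_]*$/;

// Config key -> child env key. Uppercase by default, with the single
// override named in the commit message.
function deriveEnvKey(name: string): string {
  return name === "database_url" ? "GBRAIN_DATABASE_URL" : name.toUpperCase();
}

// Look the value up as an own property only, and fail fast if it is missing.
function resolveInheritValue(cfg: Record<string, unknown>, name: string): string {
  if (!INHERIT_NAME_RE.test(name)) {
    throw new Error(`invalid inherit name: ${JSON.stringify(name)}`);
  }
  const value = Object.prototype.hasOwnProperty.call(cfg, name) ? cfg[name] : undefined;
  if (typeof value !== "string" || value === "") {
    throw new Error(`inherit: no config value for ${name}`);
  }
  return value;
}

// Literal (non-regex) replacement of each secret value with a named token.
// split/join behaves like string-mode replaceAll, so regex metacharacters
// in a secret stay literal.
function redactSecretsInText(text: string, secrets: Record<string, string>): string {
  const entries = Object.entries(secrets)
    .filter(([, value]) => value.length > 0)
    // Assumption: replace longer values first so a secret that contains
    // another secret is redacted as a whole.
    .sort((a, b) => b[1].length - a[1].length);
  let out = text;
  for (const [name, value] of entries) {
    out = out.split(value).join(`<REDACTED:${name}>`);
  }
  return out;
}
```

As the commit message notes, this kind of scrubbing is a heuristic: it catches an accidental `echo "$GBRAIN_DATABASE_URL"` but not a script that encodes the value before printing it.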
Review skipped: too many files. This PR contains 296 files, which is 146 over the limit of 150. To get a review, narrow the scope.
TL;DR
This PR catches Eva Brain up to upstream GBrain `v0.36.5.0` (garrytan/gbrain@e2279650) while keeping Eva's downstream product contract intact: OpenClaw-native install, `/plugins/gbrain/extract`, Codex/OAuth extraction posture, support-KB setup, safe updater behavior, and Voyage 4 Large 2048d defaults.

Closes #103.
Why This Is The Right Shape
Eva is healthiest as a thin fork. Upstream should own GBrain's database/search/sync/doctor/skillpack core as soon as merged upstream code is better or broadly useful. Eva should keep only the pieces upstream does not yet provide for our users: OpenClaw plugin packaging, no-key host runtime extraction, public install/update helpers, and defaults that match our fleet.
Upstream v0.36.x Accepted
inherit: ["database_url"].

Eva Product Surface Preserved
- `plugins/openclaw-gbrain` and `/plugins/gbrain/extract` remain the extraction boundary.
- `CodexExtractionClient` still sends prompt/media payloads without API keys, OAuth tokens, or refresh tokens.
- `import-media` and `ingest-media --extract openclaw` remain as transitional media bridge commands.
- `plugins/gbrain-codex` and install docs remain repo-owned.
- `.gbrain/gbrain.env`, provider-auth seam, and advisory-only `postinstall` remain intact.
- `gbrain ze-switch`, but Eva defaults remain `voyage:voyage-4-large` at `2048d` so fresh installs do not silently size themselves for the wrong fleet provider.

Conflict Decisions Worth Reviewing
- `src/core/ai/gateway.ts` and `src/core/search/embedding-column.ts`: accepted upstream dynamic column work, but restored Eva's Voyage 2048 fallback defaults.
- `src/commands/upgrade.ts`: accepted upstream `ensureGitignore()` hygiene, fixed a merge artifact that redeclared `upgradeFrom`, and preserved Eva source-install updater behavior.
- `openclaw.plugin.json`: unioned upstream skillpack-harvest wiring with Eva plugin defaults and bumped the manifest version to `0.36.5.0`.
- `src/commands/sync.ts`: kept upstream walker/no-origin fixes and preserved source-scoped auto-embed behavior.
- `CHANGELOG.md`: redacted deployment-specific path literals so the privacy gate stays strict.

Local Validation
Ran from `/Volumes/LEXAR/repos/eva-brain-worktrees/eva-merge-upstream-v0.36.5.0`:

Results:
- 149 pass, 0 fail.
- 25 pass, 0 fail.
- `bun run verify`: passed, including privacy checks, admin build, admin scope drift, wasm embedded check, system-of-record check, synthetic corpus privacy, and `tsc --noEmit`.
- `git diff --check`: clean.

Confidence
I am >95% confident this is the right catch-up path because the PR accepts upstream's merged core improvements, preserves every known Eva product boundary, and has focused local coverage over the exact collision surfaces: provider defaults, dynamic embedding columns, source sync, OpenClaw extraction, media ingestion, install docs, and privacy/static gates. Full-suite confidence should come from GitHub Actions rather than a heavy local run.