Self-improvement engine: plans + D1 logging substrate + D2 selector auto-tuning #100
dfrostar wants to merge 14 commits into
3–5 day scoped plan for measuring whether L0/L1/L2/L3 compressed context preserves answer faithfulness vs. full source. Colocated with the future runner under evals/faithfulness/. Plan only — no implementation.
Three subsystems: (A) selector auto-tuning of L2 recall and L3 escalation thresholds, (B) L2 community-summary refinement with A/B promotion, (C) closed-loop eval-driven tuning that depends on the faithfulness study landing first. Plan reflects the actual MCP/selector architecture: drill events are really query events, L0/L1 are computed not stored so refinement only applies to L2, and all three subsystems share one query_events table in the existing synapse SQLite store. D1 delivers only the logging substrate (no behavior change). A and B ride on top.
Extends the existing memory.py opt-in logger rather than building a new substrate (the plan was over-spec'd; memory.py already does most of it).
- log_wakeup_event(): new event type, parallel to log_query_event, written to the same JSONL files. Wired into core.NeuralMind.wakeup(). Needed for the "L0/L1 was sufficient" positive signal: a wakeup with no follow-up query in the same session.
- session_id on both event types via _current_session_id(): honors CLAUDE_SESSION_ID, falls back to a stable per-process uuid. Required for intra-session signals (re-query, wakeup-without-followup).
- Aggregation helpers: read_events (no event_type filter), recent_events, escalation_rate, re_query_rate, wakeup_only_rate. These are the read-side primitives subsystem A (selector autotuning) will consume.

No behavior change for users without the consent sentinel: all logging remains gated by is_memory_logging_enabled(). Plan doc updated to reflect the discovery that the substrate already exists, and to drop the new-table proposal in favor of extending memory.py.
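The session-id fallback described above can be sketched as follows. This is a minimal illustration, not the code in memory.py: the function name `current_session_id` and the module-level cache `_PROCESS_SESSION_ID` are assumptions; only the `CLAUDE_SESSION_ID` variable and the per-process uuid fallback come from the commit message.

```python
import os
import uuid

# Assumed module-level cache: generated once, so every event logged by this
# process shares the same fallback id.
_PROCESS_SESSION_ID = uuid.uuid4().hex


def current_session_id() -> str:
    """Prefer CLAUDE_SESSION_ID; otherwise reuse the per-process fallback."""
    # An empty string is treated the same as an unset variable.
    return os.environ.get("CLAUDE_SESSION_ID") or _PROCESS_SESSION_ID
```

Generating the fallback at import time keeps it stable for the life of the process, which is what intra-session signals such as re-query detection rely on.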
The meta table existed but only decay() wrote to it via raw SQL. The selector auto-tuner (D2) needs a clean key-value read/write path to persist l2_recall_k across sessions. https://claude.ai/code/session_01KPSwpRuBicQJYjcwVcArnH
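A key-value path of this kind might look like the sketch below. `MetaStore` is a hypothetical stand-in for the synapse store, and the table schema and the `default` parameter are assumptions; only the `get_meta`/`set_meta` names and the `INSERT OR REPLACE` statement appear in the PR.

```python
import sqlite3
from typing import Optional


class MetaStore:
    """Hypothetical stand-in for the synapse store's meta helpers."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        # Assumed schema: one text value per unique key.
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS meta (key TEXT PRIMARY KEY, value TEXT)"
        )

    def get_meta(self, key: str, default: Optional[str] = None) -> Optional[str]:
        """Return the stored value, or `default` when the key is absent."""
        row = self._conn.execute(
            "SELECT value FROM meta WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else default

    def set_meta(self, key: str, value: object) -> None:
        """Store str(value) under key, replacing any previous value."""
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "INSERT OR REPLACE INTO meta(key, value) VALUES (?, ?)",
                (key, str(value)),
            )
```

Because values are stored as text, a caller that persists an integer such as `l2_recall_k` has to convert it back with `int()` on read.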
ContextResult.layers_used elements are decorated by the selector
("L3:Search(4 results)", not "L3"), so the bare `"L3" in layers_used`
membership check never matched real events — escalation_rate always
returned 0.0. The D1 tests used unrealistic fixtures so the bug passed
review.
Match by "L3" prefix instead. re_query_rate (keys off
communities_loaded) and wakeup_only_rate (keys off event_type) are
unaffected. Updates the escalation test fixtures to realistic decorated
strings and adds a regression case.
https://claude.ai/code/session_01KPSwpRuBicQJYjcwVcArnH
tune_selector() reads logged query events, computes re_query_rate over the events since the last tune, and adjusts l2_recall_k (L2 community recall count) up or down by one step, persisting it to the synapse store's meta table. The rule is deliberately conservative: a 50-event warm-up gate, a dead band between the two thresholds, single-step moves bounded to [2, 6], and event windowing keyed off l2_recall_k_tuned_at so the tuner doesn't chase a distribution it just perturbed. re_query_rate is a weak signal and the thresholds are provisional — subsystem C (eval-driven tuning) will replace the hand-tuned rule later. tune_selector never raises (fail-open, safe for the hook path). selector_report() is the read-only view for the CLI. https://claude.ai/code/session_01KPSwpRuBicQJYjcwVcArnH
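The shape of the rule can be illustrated with a small pure function. The warm-up size and the [2, 6] bounds come from the commit message; the two threshold values and the function name `next_recall_k` are placeholders invented for this sketch, and the direction (high re-query rate raises k) is inferred from the later fix note that a low rate lowers k.

```python
WARMUP_MIN_EVENTS = 50   # warm-up gate from the commit message
K_MIN, K_MAX = 2, 6      # bounds from the commit message
LOWER_THRESHOLD = 0.10   # placeholder value, not from the PR
RAISE_THRESHOLD = 0.30   # placeholder value, not from the PR


def next_recall_k(current_k: int, total_queries: int, rate: float) -> int:
    """Return the next l2_recall_k: at most one step, clamped to the bounds."""
    if total_queries < WARMUP_MIN_EVENTS:
        return current_k  # not enough history yet: hold
    if rate > RAISE_THRESHOLD:
        return min(current_k + 1, K_MAX)  # users re-query often: recall more
    if rate < LOWER_THRESHOLD:
        return max(current_k - 1, K_MIN)  # re-queries are rare: recall less
    return current_k  # inside the dead band: hold
```

The dead band between the two thresholds is what keeps the value from oscillating when the rate hovers near a single cut-off.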
ContextSelector gains an l2_recall_k constructor arg (defensively clamped) that get_context forwards to get_l2_context as max_communities. core.py reads the per-project tuned value from the synapse meta table before constructing the selector, falling back to the default if it's absent or unreadable — the read path is fully guarded so context selection never hard-fails on a missing DB. Read once at construction rather than per-query: the value changes at most once per session (in the SessionStart tuner tick), so a per-call DB read would be pure overhead. https://claude.ai/code/session_01KPSwpRuBicQJYjcwVcArnH
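A guarded read with defensive clamping could be sketched like this. The default of 3, the reuse of the tuner's [2, 6] bounds for the clamp, and both function names are assumptions made for illustration; the PR does not state the constructor's actual clamp range.

```python
from typing import Any, Callable, Optional

DEFAULT_L2_RECALL_K = 3  # assumed default, not confirmed by the PR
K_MIN, K_MAX = 2, 6      # assumed to match the tuner's bounds


def clamp_recall_k(value: Any) -> int:
    """Coerce to int and clamp; fall back to the default on bad input."""
    try:
        k = int(value)
    except (TypeError, ValueError):
        return DEFAULT_L2_RECALL_K
    return max(K_MIN, min(K_MAX, k))


def load_recall_k(read_meta: Callable[[str], Optional[str]]) -> int:
    """Read the tuned value once; never raise if the store is unavailable."""
    try:
        raw = read_meta("l2_recall_k")
    except Exception:
        return DEFAULT_L2_RECALL_K  # missing or unreadable DB: use the default
    return DEFAULT_L2_RECALL_K if raw is None else clamp_recall_k(raw)
```

Passing the reader in as a callable keeps the sketch independent of the real store class; in the PR the value is read from the synapse meta table before the selector is constructed.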
The SessionStart hook now calls tune_selector() after the decay tick, gated on NEURALMIND_SELECTOR_AUTOTUNE=1 (opt-in, since it is net behavior change). One JSONL read plus at most one meta write — stays fast — and tuner failures are swallowed so the hook always returns 0. https://claude.ai/code/session_01KPSwpRuBicQJYjcwVcArnH
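The opt-in gate and fail-open behavior can be sketched as below. `session_start_tick` is a hypothetical name and the tuner is passed in as a callable so the example stays self-contained; only the environment variable and the "always return 0" contract come from the commit message.

```python
import os
from typing import Callable


def session_start_tick(tune: Callable[[], object]) -> int:
    """Run the tuner only when opted in; always report success."""
    if os.environ.get("NEURALMIND_SELECTOR_AUTOTUNE") == "1":
        try:
            tune()
        except Exception:
            # Fail open: a tuner error must not break the SessionStart hook.
            pass
    return 0
```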
`neuralmind self-improve status [project]` prints the selector tuner state: current l2_recall_k, when it was last tuned, event counts, windowed re_query_rate, and whether autotune is enabled. Read-only, --json supported. Nested under a `self-improve` parser so future self-improvement subsystems can attach their own sub-actions. https://claude.ai/code/session_01KPSwpRuBicQJYjcwVcArnH
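Nesting the action under its own parser might be wired with argparse roughly as follows. This is a sketch of the layout only: `build_parser`, the `dest` names, and the `"."` default for the optional project argument are assumptions, not the code in cli.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="neuralmind")
    commands = parser.add_subparsers(dest="command", required=True)

    # "self-improve" owns its own sub-actions so later subsystems can add more.
    self_improve = commands.add_parser("self-improve")
    actions = self_improve.add_subparsers(dest="action", required=True)

    status = actions.add_parser("status")
    status.add_argument("project", nargs="?", default=".")  # assumed default
    status.add_argument("--json", action="store_true")
    return parser
```

With this layout, a future subsystem only needs another `actions.add_parser(...)` call and does not touch the top-level command list.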
The window is keyed off l2_recall_k_tuned_at, so immediately after a tune — before fresh events arrive — it is empty. re_query_rate([]) returns 0.0, which the rule misread as "low re-query rate → lower k", causing the tuner to undo its own previous move on the very next tick. Add a WINDOW_MIN_EVENTS (20) guard: an empty/thin window means "no signal, hold", not "rate is zero". Caught by the manual round-trip in D2 verification. https://claude.ai/code/session_01KPSwpRuBicQJYjcwVcArnH
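The guard amounts to treating a thin window as "no signal" before any rate is computed. In the sketch below the `timestamp` field name, its float representation, and both function names are assumptions; only WINDOW_MIN_EVENTS = 20 and the hold-on-thin-window behavior come from the commit message.

```python
from typing import Any, Optional

WINDOW_MIN_EVENTS = 20  # from the commit message


def events_since(
    events: list[dict[str, Any]], tuned_at: Optional[float]
) -> list[dict[str, Any]]:
    """Keep events logged after the last tune (all events if never tuned)."""
    if tuned_at is None:
        return list(events)
    # "timestamp" is an assumed field name for this illustration.
    return [e for e in events if e.get("timestamp", 0.0) > tuned_at]


def window_decision(window: list[dict[str, Any]]) -> str:
    """A thin or empty window means 'hold', never 'the rate is zero'."""
    return "evaluate" if len(window) >= WINDOW_MIN_EVENTS else "hold"
```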
D2 shipped subsystem A scoped to l2_recall_k only. Documents the re_query_rate-driven rule, signal windowing, the dropped l3_escalation_threshold (L3 gate deferred to the faithfulness study), the D1 escalation_rate bug fix, and the actual module/CLI/test names. https://claude.ai/code/session_01KPSwpRuBicQJYjcwVcArnH
…-nwJak # Conflicts: # tests/test_cli.py
NeuralMind self-benchmark
Status:
Phase 1: Reduction on committed fixture
Phase 2: Learning uplift
Note: uplift numbers on a 500-line fixture are intentionally modest; the point is to
Assumptions
Per-model token reduction
Rows marked measured use the provider's real tokenizer. Rows marked
Automated by
CI Lint runs `black --check`; reformat the self-improvement files to match repo style. No behavior change. https://claude.ai/code/session_01KPSwpRuBicQJYjcwVcArnH
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a51de67e67
    events = read_events(project_query_events_file(project_path))
    total = len(events)
Count only query events for tuner warmup/window gates
The tuner’s warmup and recent-window guards are derived from read_events(...), which now includes both wakeup and query events, but the optimization signal (re_query_rate) is computed only from queries. In projects with many wakeups and few query pairs, this can satisfy WARMUP_MIN_EVENTS/WINDOW_MIN_EVENTS without real query evidence and then incorrectly lower l2_recall_k because re_query_rate falls to 0.0. This makes the auto-tuner drift downward based on unrelated event volume rather than retrieval behavior.
Addressed in 6ab7ecd — _compute now filters to event_type == "query" before applying the warm-up and window gates, so wakeup volume can no longer satisfy the thresholds without real query evidence. Added test_wakeup_events_do_not_count_toward_warmup.
Generated by Claude Code
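The fix described in the reply can be sketched as filtering before counting. WARMUP_MIN_EVENTS and WINDOW_MIN_EVENTS are named in the review and their values (50 and 20) appear in the commit messages; the helper names `query_events` and `gates_pass` are assumptions for this illustration and do not reproduce `_compute`.

```python
from typing import Any

WARMUP_MIN_EVENTS = 50
WINDOW_MIN_EVENTS = 20


def query_events(events: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Drop wakeups and any other non-query events."""
    return [e for e in events if e.get("event_type") == "query"]


def gates_pass(
    all_events: list[dict[str, Any]], window_events: list[dict[str, Any]]
) -> bool:
    """Both gates count query events only, so wakeup volume cannot open them."""
    return (
        len(query_events(all_events)) >= WARMUP_MIN_EVENTS
        and len(query_events(window_events)) >= WINDOW_MIN_EVENTS
    )
```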
    pairs += 1
    prev = set(
        (qs[i - 1].get("retrieval_summary") or {}).get("communities_loaded", [])
    )
    cur = set(
        (qs[i].get("retrieval_summary") or {}).get("communities_loaded", [])
    )
    denom = min(len(prev), len(cur))
    if denom == 0:
        continue
Exclude empty-community pairs from re-query denominator
pairs is incremented before checking whether either query has any loaded communities. When denom == 0, the pair contributes to the denominator but can never contribute to re_query_count, systematically depressing re_query_rate. Queries that return no communities (e.g., sparse/failed retrieval) will therefore bias the tuner toward lowering recall even though those pairs contain no overlap signal.
Addressed in 6ab7ecd — pairs is now incremented only after the denom == 0 check, so empty-community pairs no longer count toward the denominator. Added test_re_query_rate_excludes_empty_community_pairs.
Generated by Claude Code
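For a single session's ordered queries, the corrected loop would look roughly like this. The body follows the snippet under review with the increment moved below the `denom == 0` check; the function name `pair_re_query_rate` and the single-session framing are assumptions for this sketch.

```python
from typing import Any


def pair_re_query_rate(qs: list[dict[str, Any]]) -> float:
    """Fraction of consecutive query pairs that reload mostly the same communities."""
    pairs = 0
    re_query_count = 0
    for i in range(1, len(qs)):
        prev = set((qs[i - 1].get("retrieval_summary") or {}).get("communities_loaded", []))
        cur = set((qs[i].get("retrieval_summary") or {}).get("communities_loaded", []))
        denom = min(len(prev), len(cur))
        if denom == 0:
            continue  # no overlap signal: keep the pair out of the denominator
        pairs += 1
        if len(prev & cur) / denom >= 0.5:
            re_query_count += 1
    return (re_query_count / pairs) if pairs else 0.0
```

With the original ordering, a session of two identical queries followed by an empty retrieval would score 0.5; with the increment moved, the empty pair is ignored and the score is 1.0.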
Pull request overview
Implements phases D1–D2 of the self-improvement engine: extends the existing opt-in memory logging substrate to record wakeup events + session IDs, and adds a conservative selector auto-tuner that adjusts l2_recall_k based on recent re_query_rate, persisted in the synapse DB meta table and gated behind NEURALMIND_SELECTOR_AUTOTUNE=1.
Changes:
- Add wakeup event logging + session_id support and read-side aggregation helpers (`recent_events`, `escalation_rate`, `re_query_rate`, `wakeup_only_rate`).
- Add selector auto-tuning module (`tune_selector`, `selector_report`) and wire it into SessionStart hook + CLI status command; persist tunables in `synapses.meta`.
- Thread `l2_recall_k` through `core` → `ContextSelector` and add/extend tests and planning docs.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `neuralmind/memory.py` | Adds session IDs, wakeup logging, and aggregation helpers used for tuning signals. |
| `neuralmind/self_improve.py` | New module implementing the selector tuner and reporting. |
| `neuralmind/synapses.py` | Adds get_meta/set_meta helpers on the synapse store. |
| `neuralmind/core.py` | Logs wakeup events and reads persisted l2_recall_k at build time. |
| `neuralmind/context_selector.py` | Adds l2_recall_k constructor parameter and forwards it to L2 recall sizing. |
| `neuralmind/hooks.py` | Runs the tuner on SessionStart when explicitly opted-in. |
| `neuralmind/cli.py` | Adds neuralmind self-improve status command (human + JSON output). |
| `tests/test_memory.py` | Adds coverage for session id resolution, wakeup logging, and aggregation helpers. |
| `tests/test_self_improve.py` | New tests covering tuner decisions, gates, windowing, and fail-open behavior. |
| `tests/test_synapses.py` | Adds tests for meta get/set behavior. |
| `tests/test_hooks_synapses.py` | Adds tests for SessionStart autotune gating and fail-open behavior. |
| `tests/test_context_selector.py` | Adds tests for l2_recall_k plumbing/clamping and forwarding to get_l2_context. |
| `tests/test_cli.py` | Adds tests for self-improve status output modes and tuned value reflection. |
| `evals/self_improvement/PLAN.md` | Planning document for the multi-subsystem self-improvement engine. |
| `evals/faithfulness/PLAN.md` | Planning document for future eval-driven tuning fitness function. |
Comments suppressed due to low confidence (1)
neuralmind/memory.py:322
- In `re_query_rate()`, `pairs` is incremented before checking whether either query has an empty `communities_loaded` set. When `denom == 0` the loop `continue`s but the pair is still counted, which will artificially deflate the rate for "no-signal" pairs. Increment `pairs` only after `denom > 0` (or explicitly skip such pairs).

        for i in range(1, len(qs)):
            pairs += 1
            prev = set((qs[i - 1].get("retrieval_summary") or {}).get("communities_loaded", []))
            cur = set((qs[i].get("retrieval_summary") or {}).get("communities_loaded", []))
            denom = min(len(prev), len(cur))
            if denom == 0:
                continue
            if len(prev & cur) / denom >= 0.5:
                re_query_count += 1
        return (re_query_count / pairs) if pairs else 0.0


    def wakeup_only_rate(events: list[dict[str, Any]]) -> float:
        """Fraction of sessions whose only events are wakeups (no queries).
    def set_meta(self, key: str, value: str) -> None:
        """Write a value to the key-value meta table."""
        with self._connect() as conn:
            conn.execute(
                "INSERT OR REPLACE INTO meta(key, value) VALUES (?, ?)",
                (key, str(value)),
            )
Addressed in 6ab7ecd — widened the hint to value: object and noted the str() coercion in the docstring.
Generated by Claude Code
    def escalation_rate(events: list[dict[str, Any]]) -> float:
        """Fraction of *query* events whose layers_used includes L3.

        L3 is the deep-search layer; high escalation suggests L2 community
        summaries are under-recalling for the query distribution.

        layers_used elements are decorated strings produced by the selector
        (e.g. "L3:Search(4 results)"), so match by prefix rather than exact
        membership.
        """
        queries = [e for e in events if e.get("event_type") == "query"]
        if not queries:
            return 0.0
        escalated = sum(
            1
            for e in queries
            if any(
                str(layer).startswith("L3")
                for layer in (e.get("retrieval_summary") or {}).get("layers_used", [])
            )
        )
Addressed in 6ab7ecd — the predicate now matches "L3" exactly or "L3:"-prefixed strings, so a hypothetical "L30" no longer false-matches. Added test_escalation_rate_does_not_match_l3_prefix_collision.
Generated by Claude Code
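The tightened predicate can be sketched as below. `is_l3_layer` is a hypothetical helper name, and the final `escalated / len(queries)` return is an assumption, since the snippet under review is cut off before its return statement.

```python
from typing import Any


def is_l3_layer(layer: object) -> bool:
    """Match the bare "L3" label or a decorated "L3:..." string only."""
    text = str(layer)
    return text == "L3" or text.startswith("L3:")


def escalation_rate(events: list[dict[str, Any]]) -> float:
    """Fraction of query events whose layers_used includes the L3 layer."""
    queries = [e for e in events if e.get("event_type") == "query"]
    if not queries:
        return 0.0
    escalated = sum(
        1
        for e in queries
        if any(
            is_l3_layer(layer)
            for layer in (e.get("retrieval_summary") or {}).get("layers_used", [])
        )
    )
    return escalated / len(queries)  # assumed; the reviewed snippet ends earlier
```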
    for e in events:
        if e.get("event_type") != "query":
            continue
        sid = e.get("session_id") or ""
Addressed in 6ab7ecd — re_query_rate now skips query events with no session_id instead of grouping them under "". Added test_re_query_rate_skips_events_without_session_id.
Generated by Claude Code
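Grouping that skips id-less events might look like the following sketch. `queries_by_session` is a hypothetical helper; the PR states only that `re_query_rate` now skips query events with no `session_id` rather than grouping them under an empty string.

```python
from collections import defaultdict
from typing import Any


def queries_by_session(events: list[dict[str, Any]]) -> dict[str, list[dict[str, Any]]]:
    """Group query events by session, dropping events that carry no session id."""
    grouped: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for e in events:
        if e.get("event_type") != "query":
            continue
        sid = e.get("session_id")
        if not sid:
            continue  # older logs without ids would otherwise merge into one fake session
        grouped[sid].append(e)
    return dict(grouped)
```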
Five fixes from Codex/Copilot review, all in the tuner's signal path:
- _compute: warm-up / window gates now count only *query* events.
read_events() returns wakeups too; counting them let the gates pass
without query evidence, then re_query_rate 0.0 drifted l2_recall_k
down on unrelated event volume.
- re_query_rate: increment the pair denominator only when a pair
carries community signal (denom > 0). Empty-community pairs were
counted but could never contribute, deflating the rate.
- re_query_rate: skip events with no session_id (pre-D1 logs) instead
of lumping unrelated history under a shared empty-string key.
- escalation_rate: match the L3 layer exactly ("L3" / "L3:..."), not
any "L3"-prefixed string (would false-match a hypothetical "L30").
- set_meta: widen value type hint to object — it coerces via str().
https://claude.ai/code/session_01KPSwpRuBicQJYjcwVcArnH
Converting to draft: parked while the v0.6.0 → v0.7.0 (install anywhere) → v0.8 (always-on) → v0.7.x/v0.8.x (enterprise) release train ships. Nothing wrong with the work; it just needs a deliberate decision about whether the v0.4–v0.5 framing still maps to the v0.7+ product surface before sinking a few hours into the rebase against current main.
When ready to revive: the
Triaged as part of #117 Phase 0 hygiene.
Generated by Claude Code
Summary
Phases 1–2 of the self-improvement engine — the outer loop that tunes
NeuralMind's retrieval behavior from observed agent usage. Both the
data layer and the first tuning loop ship off by default; there is
no behavior change for users who don't opt in.
- Plans: `evals/self_improvement/PLAN.md` (the full engine: subsystems A/B/C) and `evals/faithfulness/PLAN.md` (the eval that subsystem C will later use as a fitness function).
- D1 logging substrate (`memory.py`): wakeup-event logging, `session_id` on events, and read-side aggregation helpers (`recent_events`, `escalation_rate`, `re_query_rate`, `wakeup_only_rate`).
- D2 selector auto-tuning: tunes `l2_recall_k` (L2 community recall count) from `re_query_rate`, persisted across sessions in the synapse `meta` table.

D2 details
- `synapses.py`: `get_meta`/`set_meta` helpers on `SynapseStore`.
- `self_improve.py` (new): `tune_selector()` (`re_query_rate`-driven, bounded `[2,6]`, hysteretic dead band, 50-event warm-up, signal windowed off the last tune timestamp, fail-open) and `selector_report()`.
- `context_selector.py` + `core.py`: `l2_recall_k` threads from the synapse meta table through to `get_l2_context`, read once at construction, fully guarded against a missing DB.
- `hooks.py`: SessionStart runs the tuner when `NEURALMIND_SELECTOR_AUTOTUNE=1` (opt-in); tuner failures are swallowed so the hook always exits 0.
- `cli.py`: `neuralmind self-improve status [project]` (read-only, `--json`).

`escalation_rate` used a bare `"L3" in layers_used` check that never matched real decorated layer strings (`"L3:Search(...)"`); fixed to match by prefix.

Scope notes
- D2 is scoped to `l2_recall_k`. The originally-planned `l3_escalation_threshold` was dropped: L3 deep search currently runs unconditionally, so gating it is net-new behavior, deferred until `evals/faithfulness/` can validate it's safe. `PLAN.md` documents this.
- `re_query_rate` is a deliberately weak signal and the thresholds are provisional, which is exactly why the tuner is off by default. Subsystem C (eval-driven) is the real answer; D2 builds the mechanism.

Test plan
- graphify fixtures, firewall-blocked S3 model download).
- `test_synapses.py` (meta helpers), `test_memory.py` (`escalation_rate` fix + D1 aggregations), `test_self_improve.py` (16 cases: rule, warm-up/window guards, windowing, fail-open), `test_context_selector.py` (`l2_recall_k` threading), `test_hooks_synapses.py` (autotune gate + fail-open), `test_cli.py` (status command).
- `l2_recall_k` 3→4 and persists; second tick correctly holds.

https://claude.ai/code/session_01KPSwpRuBicQJYjcwVcArnH
Generated by Claude Code