Self-improvement engine: plans + D1 logging substrate + D2 selector auto-tuning by dfrostar · Pull Request #100 · dfrostar/neuralmind

dfrostar · 2026-05-14T15:47:47Z

Summary

Phases 1–2 of the self-improvement engine — the outer loop that tunes
NeuralMind's retrieval behavior from observed agent usage. Both the
data layer and the first tuning loop ship off by default; there is
no behavior change for users who don't opt in.

- Planning docs: evals/self_improvement/PLAN.md (the full
  engine: subsystems A/B/C) and evals/faithfulness/PLAN.md (the eval
  that subsystem C will later use as a fitness function).
- D1, logging substrate (memory.py): wakeup-event logging,
  session_id on events, and read-side aggregation helpers
  (recent_events, escalation_rate, re_query_rate,
  wakeup_only_rate).
- D2, subsystem A, selector auto-tuning: a tuner that adjusts
  l2_recall_k (L2 community recall count) from re_query_rate,
  persisted across sessions in the synapse meta table.

D2 details

- synapses.py: get_meta/set_meta helpers on SynapseStore.
- self_improve.py (new): tune_selector() (re_query_rate-driven,
  bounded [2,6], hysteretic dead band, 50-event warm-up, signal
  windowed off the last tune timestamp, fail-open) and
  selector_report().
- context_selector.py + core.py: l2_recall_k threads from the
  synapse meta table through to get_l2_context, read once at
  construction, fully guarded against a missing DB.
- hooks.py: SessionStart runs the tuner when
  NEURALMIND_SELECTOR_AUTOTUNE=1 (opt-in); tuner failures are
  swallowed so the hook always exits 0.
- cli.py: neuralmind self-improve status [project] (read-only,
  --json).
- Bug fix: D1's escalation_rate used a bare "L3" in layers_used
  check that never matched real decorated layer strings
  ("L3:Search(...)"); fixed to match by prefix.

Scope notes

- D2 tunes only l2_recall_k. The originally planned
  l3_escalation_threshold was dropped: L3 deep search currently runs
  unconditionally, so gating it is net-new behavior and is deferred
  until evals/faithfulness/ can validate that it is safe. PLAN.md
  documents this.
- re_query_rate is a deliberately weak signal and the thresholds are
  provisional, which is exactly why the tuner is off by default.
  Subsystem C (eval-driven) is the real answer; D2 builds the
  mechanism.

Test plan

- Full suite green locally (pre-existing skips only: missing
  graphify fixtures, firewall-blocked S3 model download).
- New coverage: test_synapses.py (meta helpers),
  test_memory.py (escalation_rate fix + D1 aggregations),
  test_self_improve.py (16 cases: rule, warm-up/window guards,
  windowing, fail-open), test_context_selector.py (l2_recall_k
  threading), test_hooks_synapses.py (autotune gate + fail-open),
  test_cli.py (status command).
- Manual tuner round-trip: 60 high-overlap events → tick raises
  l2_recall_k 3→4 and persists; second tick correctly holds.

https://claude.ai/code/session_01KPSwpRuBicQJYjcwVcArnH

Generated by Claude Code

3–5 day scoped plan for measuring whether L0/L1/L2/L3 compressed context preserves answer faithfulness vs. full source. Colocated with the future runner under evals/faithfulness/. Plan only — no implementation.

Three subsystems: (A) selector auto-tuning of L2 recall and L3 escalation thresholds, (B) L2 community-summary refinement with A/B promotion, (C) closed-loop eval-driven tuning that depends on the faithfulness study landing first. Plan reflects the actual MCP/selector architecture: drill events are really query events, L0/L1 are computed not stored so refinement only applies to L2, and all three subsystems share one query_events table in the existing synapse SQLite store. D1 delivers only the logging substrate (no behavior change). A and B ride on top.

Extends the existing memory.py opt-in logger rather than building a new substrate (the plan was over-spec'd; memory.py already does most of it).
- log_wakeup_event(): new event type, parallel to log_query_event, written to the same JSONL files. Wired into core.NeuralMind.wakeup(). Needed for the "L0/L1 was sufficient" positive signal: a wakeup with no follow-up query in the same session.
- session_id on both event types via _current_session_id(): honors CLAUDE_SESSION_ID, falls back to a stable per-process uuid. Required for intra-session signals (re-query, wakeup-without-followup).
- Aggregation helpers: read_events (no event_type filter), recent_events, escalation_rate, re_query_rate, wakeup_only_rate. These are the read-side primitives subsystem A (selector autotuning) will consume.
No behavior change for users without the consent sentinel: all logging remains gated by is_memory_logging_enabled(). Plan doc updated to reflect the discovery that the substrate already exists, and to drop the new-table proposal in favor of extending memory.py.

The meta table existed but only decay() wrote to it via raw SQL. The selector auto-tuner (D2) needs a clean key-value read/write path to persist l2_recall_k across sessions. https://claude.ai/code/session_01KPSwpRuBicQJYjcwVcArnH
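A key-value read/write path of this kind might look like the sketch below. `MetaStore` is a hypothetical stand-in for SynapseStore, and the table schema and the `default` parameter are assumptions; only the `INSERT OR REPLACE` statement and the `str()` coercion come from the diff quoted later in this thread.

```python
import os
import sqlite3
import tempfile
from contextlib import contextmanager
from typing import Iterator, Optional


class MetaStore:
    """Hypothetical stand-in for the meta helpers on SynapseStore."""

    def __init__(self, db_path: str) -> None:
        self._db_path = db_path
        with self._connect() as conn:
            # Assumed schema: one text value per key.
            conn.execute(
                "CREATE TABLE IF NOT EXISTS meta(key TEXT PRIMARY KEY, value TEXT)"
            )

    @contextmanager
    def _connect(self) -> Iterator[sqlite3.Connection]:
        conn = sqlite3.connect(self._db_path)
        try:
            yield conn
            conn.commit()
        finally:
            conn.close()

    def get_meta(self, key: str, default: Optional[str] = None) -> Optional[str]:
        """Return the stored string, or default when the key is absent."""
        with self._connect() as conn:
            row = conn.execute(
                "SELECT value FROM meta WHERE key = ?", (key,)
            ).fetchone()
        return default if row is None else row[0]

    def set_meta(self, key: str, value: object) -> None:
        """Write a value to the meta table; it is coerced with str()."""
        with self._connect() as conn:
            conn.execute(
                "INSERT OR REPLACE INTO meta(key, value) VALUES (?, ?)",
                (key, str(value)),
            )


with tempfile.TemporaryDirectory() as tmp:
    store = MetaStore(os.path.join(tmp, "synapses.db"))
    store.set_meta("l2_recall_k", 4)  # the int is stored as the string "4"
    stored = store.get_meta("l2_recall_k")
    missing = store.get_meta("absent", default="3")
```

Because values come back as strings, a caller that needs an integer has to convert and validate the result itself.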

ContextResult.layers_used elements are decorated by the selector ("L3:Search(4 results)", not "L3"), so the bare `"L3" in layers_used` membership check never matched real events — escalation_rate always returned 0.0. The D1 tests used unrealistic fixtures so the bug passed review. Match by "L3" prefix instead. re_query_rate (keys off communities_loaded) and wakeup_only_rate (keys off event_type) are unaffected. Updates the escalation test fixtures to realistic decorated strings and adds a regression case. https://claude.ai/code/session_01KPSwpRuBicQJYjcwVcArnH
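The difference between the two checks is easy to reproduce in isolation. In this small illustration only the decorated "L3:Search(4 results)" string comes from the PR; the other layer strings are made up for the example.

```python
layers_used = ["L0", "L1", "L2", "L3:Search(4 results)"]

# List membership compares whole elements, so the decorated entry never equals "L3".
bare_match = "L3" in layers_used

# Matching by prefix finds the decorated entry.
prefix_match = any(str(layer).startswith("L3") for layer in layers_used)
```

With these values `bare_match` is `False` and `prefix_match` is `True`, which is why the original check reported an escalation rate of 0.0 on real events.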

tune_selector() reads logged query events, computes re_query_rate over the events since the last tune, and adjusts l2_recall_k (L2 community recall count) up or down by one step, persisting it to the synapse store's meta table. The rule is deliberately conservative: a 50-event warm-up gate, a dead band between the two thresholds, single-step moves bounded to [2, 6], and event windowing keyed off l2_recall_k_tuned_at so the tuner doesn't chase a distribution it just perturbed. re_query_rate is a weak signal and the thresholds are provisional — subsystem C (eval-driven tuning) will replace the hand-tuned rule later. tune_selector never raises (fail-open, safe for the hook path). selector_report() is the read-only view for the CLI. https://claude.ai/code/session_01KPSwpRuBicQJYjcwVcArnH
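The shape of the rule can be sketched as a pure function. The bounds and the warm-up count are the ones stated in the PR; the two threshold values and the function name `next_recall_k` are hypothetical, since the PR does not quote the actual thresholds.

```python
K_MIN, K_MAX = 2, 6        # bounds stated in the PR
WARMUP_MIN_EVENTS = 50     # warm-up gate stated in the PR
RAISE_ABOVE = 0.40         # hypothetical threshold, not from the PR
LOWER_BELOW = 0.10         # hypothetical threshold, not from the PR


def next_recall_k(current_k: int, re_query_rate: float, total_queries: int) -> int:
    """Return the next l2_recall_k: at most one step, clamped to [K_MIN, K_MAX]."""
    if total_queries < WARMUP_MIN_EVENTS:
        return current_k  # not enough evidence yet, hold
    if re_query_rate > RAISE_ABOVE:
        step = 1   # agents keep re-querying: recall more communities
    elif re_query_rate < LOWER_BELOW:
        step = -1  # little re-querying: recall fewer
    else:
        step = 0   # inside the dead band: hold
    return max(K_MIN, min(K_MAX, current_k + step))
```

The dead band between the two thresholds is what provides the hysteresis: a rate that hovers near a single cut-off would otherwise flip the value on every tick.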

ContextSelector gains an l2_recall_k constructor arg (defensively clamped) that get_context forwards to get_l2_context as max_communities. core.py reads the per-project tuned value from the synapse meta table before constructing the selector, falling back to the default if it's absent or unreadable — the read path is fully guarded so context selection never hard-fails on a missing DB. Read once at construction rather than per-query: the value changes at most once per session (in the SessionStart tuner tick), so a per-call DB read would be pure overhead. https://claude.ai/code/session_01KPSwpRuBicQJYjcwVcArnH
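A guarded read with defensive clamping could look roughly like this. The helper names, the default of 3 (taken from the starting value in the manual round-trip), and the reuse of the tuner's [2, 6] bounds for the clamp are assumptions for this sketch.

```python
DEFAULT_L2_RECALL_K = 3  # assumed default; the manual test in the PR starts from 3
K_MIN, K_MAX = 2, 6      # assumed to match the tuner's bounds


def clamp_recall_k(value: object) -> int:
    """Coerce to int and clamp; fall back to the default on garbage input."""
    try:
        k = int(value)
    except (TypeError, ValueError):
        return DEFAULT_L2_RECALL_K
    return max(K_MIN, min(K_MAX, k))


def load_recall_k(read_meta) -> int:
    """read_meta is any callable that returns the stored string or None."""
    try:
        raw = read_meta("l2_recall_k")
    except Exception:
        return DEFAULT_L2_RECALL_K  # missing or unreadable DB: never hard-fail
    return DEFAULT_L2_RECALL_K if raw is None else clamp_recall_k(raw)


def broken_read(key):
    raise OSError("database is missing")


from_db = load_recall_k(lambda key: "5")
from_broken_db = load_recall_k(broken_read)
```

Calling `load_recall_k` once at construction and passing the result into the selector keeps the query path free of database reads.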

The SessionStart hook now calls tune_selector() after the decay tick, gated on NEURALMIND_SELECTOR_AUTOTUNE=1 (opt-in, since it is a net behavior change). The tick costs one JSONL read plus at most one meta write, so it stays fast, and tuner failures are swallowed so the hook always returns 0. https://claude.ai/code/session_01KPSwpRuBicQJYjcwVcArnH
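The opt-in gate and the fail-open behavior can be sketched as below. The function name and its parameters are hypothetical; only the environment variable and the "always return 0" contract come from the PR.

```python
import os


def session_start_hook(run_tuner, environ=None) -> int:
    """Run the tuner only when opted in; never let it fail the hook."""
    env = os.environ if environ is None else environ
    if env.get("NEURALMIND_SELECTOR_AUTOTUNE") == "1":
        try:
            run_tuner()
        except Exception:
            pass  # fail-open: a tuner error must not break session start
    return 0


def failing_tuner():
    raise RuntimeError("tuner blew up")


opted_in = {"NEURALMIND_SELECTOR_AUTOTUNE": "1"}
exit_code = session_start_hook(failing_tuner, environ=opted_in)
```

Even with a tuner that raises, `exit_code` is 0.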

`neuralmind self-improve status [project]` prints the selector tuner state: current l2_recall_k, when it was last tuned, event counts, windowed re_query_rate, and whether autotune is enabled. Read-only, --json supported. Nested under a `self-improve` parser so future self-improvement subsystems can attach their own sub-actions. https://claude.ai/code/session_01KPSwpRuBicQJYjcwVcArnH
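One way to nest a sub-action under a `self-improve` parser with argparse is sketched here. The `dest` names and the default project of `"."` are assumptions; the command shape and the `--json` flag follow the description above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="neuralmind")
    commands = parser.add_subparsers(dest="command", required=True)

    # Parent parser for all self-improvement sub-actions.
    self_improve = commands.add_parser("self-improve")
    actions = self_improve.add_subparsers(dest="action", required=True)

    # Read-only status view; later subsystems can add sibling actions here.
    status = actions.add_parser("status")
    status.add_argument("project", nargs="?", default=".")
    status.add_argument("--json", action="store_true")
    return parser


args = build_parser().parse_args(["self-improve", "status", "--json"])
```

Keeping `status` as a sub-action rather than a top-level command leaves room for further actions without changing the existing invocation.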

The window is keyed off l2_recall_k_tuned_at, so immediately after a tune — before fresh events arrive — it is empty. re_query_rate([]) returns 0.0, which the rule misread as "low re-query rate → lower k", causing the tuner to undo its own previous move on the very next tick. Add a WINDOW_MIN_EVENTS (20) guard: an empty/thin window means "no signal, hold", not "rate is zero". Caught by the manual round-trip in D2 verification. https://claude.ai/code/session_01KPSwpRuBicQJYjcwVcArnH
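The distinction between "no signal" and "rate is zero" can be made explicit by returning `None` for a thin window. In this sketch the function name, the `"timestamp"` key, and its numeric type are assumptions; the guard value of 20 is the one stated above.

```python
WINDOW_MIN_EVENTS = 20  # guard value stated in the PR


def windowed_rate(query_events, tuned_at, rate_fn):
    """Return rate_fn over events newer than tuned_at, or None for a thin window.

    None means "no signal, hold"; it is deliberately distinct from a rate of 0.0.
    """
    window = [
        e for e in query_events
        if tuned_at is None or e["timestamp"] > tuned_at
    ]
    if len(window) < WINDOW_MIN_EVENTS:
        return None
    return rate_fn(window)


events = [{"timestamp": t} for t in range(30)]
just_tuned = windowed_rate(events, tuned_at=29, rate_fn=lambda w: 0.0)
enough_data = windowed_rate(events, tuned_at=9, rate_fn=lambda w: 0.0)
```

Right after a tune the window is empty and the result is `None`, so a caller can hold instead of treating the missing data as a low rate.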

D2 shipped subsystem A scoped to l2_recall_k only. Documents the re_query_rate-driven rule, signal windowing, the dropped l3_escalation_threshold (L3 gate deferred to the faithfulness study), the D1 escalation_rate bug fix, and the actual module/CLI/test names. https://claude.ai/code/session_01KPSwpRuBicQJYjcwVcArnH

…-nwJak # Conflicts: # tests/test_cli.py

github-actions · 2026-05-14T15:49:12Z

NeuralMind self-benchmark

Status: PASS — floor 4×, measured 6.1×.

Phase 1 — Reduction on committed fixture

Average reduction: 6.1×
Top-k retrieval hit rate: 71.7%
Naive baseline: 47,360 tokens (all fixture files concatenated)
NeuralMind total: 7,874 tokens across 10 queries
Estimated monthly savings @ 100 queries/day on Claude 3.5 Sonnet: ~$35.54

| # | Query | Shape | Naive | NeuralMind | Ratio | Hit |
|---|---|---|---|---|---|---|
| 1 | `auth-flow` | cross-file | 4,736 | 790 | 6.0× | 33.3% |
| 2 | `api-endpoints` | focused | 4,736 | 784 | 6.0× | 100.0% |
| 3 | `billing-flow` | cross-file | 4,736 | 816 | 5.8× | 33.3% |
| 4 | `user-storage` | cross-file | 4,736 | 647 | 7.3× | 50.0% |
| 5 | `jwt-verify` | focused | 4,736 | 651 | 7.3× | 100.0% |
| 6 | `stripe-webhook` | focused | 4,736 | 808 | 5.9× | 100.0% |
| 7 | `create-user` | cross-file | 4,736 | 769 | 6.2× | 50.0% |
| 8 | `refund` | focused | 4,736 | 797 | 5.9× | 100.0% |
| 9 | `db-choice` | identity | 4,736 | 874 | 5.4× | 100.0% |
| 10 | `invoice-send` | cross-file | 4,736 | 938 | 5.0× | 50.0% |

Phase 2 — Learning uplift

Memory events logged: 20
Learned patterns: 20
Reduction ratio after neuralmind learn: 6.1× (Δ +0.00× vs. cold)
Top-k hit rate after learning: 71.7% (Δ +0.0 points vs. cold)

Note: uplift numbers on a 500-line fixture are intentionally modest — the point is to
verify the learning mechanism persists and applies. On real production repos the lift
is larger; this test only catches regressions in persistence.

Assumptions

Baseline: every .py file in tests/fixtures/sample_project/ concatenated.
Tokenizer: tiktoken GPT-4o encoding (per-model breakdown in multi_model.json if generated).
Pricing: Claude 3.5 Sonnet input @ $3.0/MTok.
Regression floor: 4× — well below NeuralMind's typical 40–70× on real repos.
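The headline numbers in Phase 1 can be re-derived from the per-query table and the assumptions above. This is a back-of-the-envelope check rather than the benchmark's own code; the 30-day month and the reading of 33.3% as exactly one third are assumptions.

```python
naive_per_query = 4_736
neuralmind_tokens = [790, 784, 816, 647, 651, 808, 769, 797, 874, 938]
hit_fractions = [1 / 3, 1, 1 / 3, 1 / 2, 1, 1, 1 / 2, 1, 1, 1 / 2]

# Mean of the per-query ratios and hit rates.
avg_ratio = sum(naive_per_query / t for t in neuralmind_tokens) / len(neuralmind_tokens)
avg_hit_rate = 100 * sum(hit_fractions) / len(hit_fractions)

# Savings at 100 queries/day and $3.0 per million input tokens.
queries_per_month = 100 * 30  # assumes a 30-day month
price_per_token = 3.0 / 1_000_000
avg_neuralmind = sum(neuralmind_tokens) / len(neuralmind_tokens)
monthly_savings = (naive_per_query - avg_neuralmind) * queries_per_month * price_per_token

print(f"{avg_ratio:.1f}x, {avg_hit_rate:.1f}%, ${monthly_savings:.2f}")
```

Under those assumptions the script reproduces the reported 6.1×, 71.7%, and $35.54.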

Per-model token reduction

| Model | Tokenizer | Naive | NeuralMind | Ratio | Source |
|---|---|---|---|---|---|
| GPT-4o / GPT-4o-mini | `tiktoken o200k_base` | 4,739 | 779 | 6.1× | measured |
| GPT-4 / GPT-3.5-turbo | `tiktoken cl100k_base` | 4,710 | 770 | 6.1× | measured |
| Claude 3.5 Sonnet | `estimated: GPT-4o × 1.08 (install anthropic for an exact count)` | 5,118 | 841 | 6.1× | estimated |
| Llama 3 (70B) | `estimated: GPT-4o × 1.22 (Llama tokenizer requires model weights; estimate based on published vocab ratios)` | 5,781 | 950 | 6.1× | estimated |

Rows marked measured use the provider's real tokenizer. Rows marked
estimated apply a published vocab-size correction to the GPT-4o count —
honest approximations, not hardcoded claims.
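The estimated rows are consistent with scaling the measured GPT-4o counts by the listed factor and truncating to an integer. The truncation is inferred from the figures in the table (4,739 × 1.22 ≈ 5,781.58 is reported as 5,781), not stated by the benchmark.

```python
GPT4O_NAIVE, GPT4O_NEURALMIND = 4_739, 779
FACTORS = {"Claude 3.5 Sonnet": 1.08, "Llama 3 (70B)": 1.22}

# int() truncates toward zero; this matches the table, rounding would not.
estimates = {
    model: (int(GPT4O_NAIVE * factor), int(GPT4O_NEURALMIND * factor))
    for model, factor in FACTORS.items()
}
ratios = {model: naive / nm for model, (naive, nm) in estimates.items()}
```

Because both counts are scaled by the same factor, the reduction ratio stays at about 6.1× for every row.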

Automated by .github/workflows/ci-benchmark.yml — regenerate locally with python -m tests.benchmark.run.

CI Lint runs `black --check`; reformat the self-improvement files to match repo style. No behavior change. https://claude.ai/code/session_01KPSwpRuBicQJYjcwVcArnH

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a51de67e67


chatgpt-codex-connector · 2026-05-14T15:50:31Z

+    events = read_events(project_query_events_file(project_path))
+    total = len(events)


Count only query events for tuner warmup/window gates

The tuner’s warmup and recent-window guards are derived from read_events(...), which now includes both wakeup and query events, but the optimization signal (re_query_rate) is computed only from queries. In projects with many wakeups and few query pairs, this can satisfy WARMUP_MIN_EVENTS/WINDOW_MIN_EVENTS without real query evidence and then incorrectly lower l2_recall_k because re_query_rate falls to 0.0. This makes the auto-tuner drift downward based on unrelated event volume rather than retrieval behavior.


Addressed in 6ab7ecd — _compute now filters to event_type == "query" before applying the warm-up and window gates, so wakeup volume can no longer satisfy the thresholds without real query evidence. Added test_wakeup_events_do_not_count_toward_warmup.

Generated by Claude Code
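The gate fix amounts to filtering before counting. In this minimal sketch `passes_warmup` is a hypothetical helper name and the `"wakeup"` event-type string is an assumption; the 50-event warm-up and the `event_type == "query"` filter are from the thread.

```python
WARMUP_MIN_EVENTS = 50  # warm-up gate stated in the PR


def passes_warmup(events: list[dict]) -> bool:
    """Count only query events toward the gate; wakeups carry no retrieval signal."""
    queries = [e for e in events if e.get("event_type") == "query"]
    return len(queries) >= WARMUP_MIN_EVENTS


wakeups = [{"event_type": "wakeup"}] * 60
few_queries = [{"event_type": "query"}] * 5
gate_open = passes_warmup(wakeups + few_queries)
```

Sixty wakeups plus five queries leave the gate closed, whereas counting all 65 events would have opened it.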

chatgpt-codex-connector · 2026-05-14T15:50:31Z

+            pairs += 1
+            prev = set(
+                (qs[i - 1].get("retrieval_summary") or {}).get("communities_loaded", [])
+            )
+            cur = set(
+                (qs[i].get("retrieval_summary") or {}).get("communities_loaded", [])
+            )
+            denom = min(len(prev), len(cur))
+            if denom == 0:
+                continue


Exclude empty-community pairs from re-query denominator

pairs is incremented before checking whether either query has any loaded communities. When denom == 0, the pair contributes to the denominator but can never contribute to re_query_count, systematically depressing re_query_rate. Queries that return no communities (e.g., sparse/failed retrieval) will therefore bias the tuner toward lowering recall even though those pairs contain no overlap signal.


Addressed in 6ab7ecd — pairs is now incremented only after the denom == 0 check, so empty-community pairs no longer count toward the denominator. Added test_re_query_rate_excludes_empty_community_pairs.

Generated by Claude Code
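Putting the denominator fix together with the session-id handling discussed elsewhere in this review, the corrected function might read as follows. It is a reconstruction from the snippets quoted in the thread, not the merged code; it assumes events arrive in chronological order and that pairs are formed between consecutive queries of the same session.

```python
from typing import Any


def re_query_rate(events: list[dict[str, Any]]) -> float:
    """Fraction of consecutive same-session query pairs with >= 50% community overlap."""
    by_session: dict[str, list[dict[str, Any]]] = {}
    for e in events:
        if e.get("event_type") != "query":
            continue
        sid = e.get("session_id")
        if not sid:
            continue  # no session id: cannot pair the event with its neighbors
        by_session.setdefault(sid, []).append(e)

    pairs = 0
    re_query_count = 0
    for qs in by_session.values():
        for i in range(1, len(qs)):
            prev = set((qs[i - 1].get("retrieval_summary") or {}).get("communities_loaded", []))
            cur = set((qs[i].get("retrieval_summary") or {}).get("communities_loaded", []))
            denom = min(len(prev), len(cur))
            if denom == 0:
                continue  # no overlap signal: keep the pair out of the denominator
            pairs += 1
            if len(prev & cur) / denom >= 0.5:
                re_query_count += 1
    return (re_query_count / pairs) if pairs else 0.0


def query(sid, communities):
    """Build a minimal query event for the example."""
    return {
        "event_type": "query",
        "session_id": sid,
        "retrieval_summary": {"communities_loaded": communities},
    }


events = [query("s1", ["a", "b"]), query("s1", ["a", "c"]), query("s1", []), query(None, ["a"])]
rate = re_query_rate(events)
```

Only the first pair carries signal, and it overlaps by half, so the rate is 1.0; counting the empty pair would have reported 0.5.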

Copilot

Pull request overview

Implements phases D1–D2 of the self-improvement engine: extends the existing opt-in memory logging substrate to record wakeup events + session IDs, and adds a conservative selector auto-tuner that adjusts l2_recall_k based on recent re_query_rate, persisted in the synapse DB meta table and gated behind NEURALMIND_SELECTOR_AUTOTUNE=1.

Changes:

Add wakeup event logging + session_id support and read-side aggregation helpers (recent_events, escalation_rate, re_query_rate, wakeup_only_rate).
Add selector auto-tuning module (tune_selector, selector_report) and wire it into SessionStart hook + CLI status command; persist tunables in synapses.meta.
Thread l2_recall_k through core → ContextSelector and add/extend tests and planning docs.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.

Show a summary per file

| File | Description |
|---|---|
| `neuralmind/memory.py` | Adds session IDs, wakeup logging, and aggregation helpers used for tuning signals. |
| `neuralmind/self_improve.py` | New module implementing the selector tuner and reporting. |
| `neuralmind/synapses.py` | Adds `get_meta`/`set_meta` helpers on the synapse store. |
| `neuralmind/core.py` | Logs wakeup events and reads persisted `l2_recall_k` at build time. |
| `neuralmind/context_selector.py` | Adds `l2_recall_k` constructor parameter and forwards it to L2 recall sizing. |
| `neuralmind/hooks.py` | Runs the tuner on SessionStart when explicitly opted-in. |
| `neuralmind/cli.py` | Adds `neuralmind self-improve status` command (human + JSON output). |
| `tests/test_memory.py` | Adds coverage for session id resolution, wakeup logging, and aggregation helpers. |
| `tests/test_self_improve.py` | New tests covering tuner decisions, gates, windowing, and fail-open behavior. |
| `tests/test_synapses.py` | Adds tests for meta get/set behavior. |
| `tests/test_hooks_synapses.py` | Adds tests for SessionStart autotune gating and fail-open behavior. |
| `tests/test_context_selector.py` | Adds tests for `l2_recall_k` plumbing/clamping and forwarding to `get_l2_context`. |
| `tests/test_cli.py` | Adds tests for `self-improve status` output modes and tuned value reflection. |
| `evals/self_improvement/PLAN.md` | Planning document for the multi-subsystem self-improvement engine. |
| `evals/faithfulness/PLAN.md` | Planning document for future eval-driven tuning fitness function. |

Comments suppressed due to low confidence (1)

neuralmind/memory.py:322

In re_query_rate(), pairs is incremented before checking whether either query has an empty communities_loaded set. When denom == 0 the loop continues but the pair is still counted, which will artificially deflate the rate for “no-signal” pairs. Increment pairs only after denom > 0 (or explicitly skip such pairs).

        for i in range(1, len(qs)):
            pairs += 1
            prev = set((qs[i - 1].get("retrieval_summary") or {}).get("communities_loaded", []))
            cur = set((qs[i].get("retrieval_summary") or {}).get("communities_loaded", []))
            denom = min(len(prev), len(cur))
            if denom == 0:
                continue
            if len(prev & cur) / denom >= 0.5:
                re_query_count += 1
    return (re_query_count / pairs) if pairs else 0.0


def wakeup_only_rate(events: list[dict[str, Any]]) -> float:
    """Fraction of sessions whose only events are wakeups (no queries).


dfrostar · 2026-05-15T02:22:22Z

+    def set_meta(self, key: str, value: str) -> None:
+        """Write a value to the key-value meta table."""
+        with self._connect() as conn:
+            conn.execute(
+                "INSERT OR REPLACE INTO meta(key, value) VALUES (?, ?)",
+                (key, str(value)),
+            )


Addressed in 6ab7ecd — widened the hint to value: object and noted the str() coercion in the docstring.

Generated by Claude Code

dfrostar · 2026-05-15T02:22:24Z

+def escalation_rate(events: list[dict[str, Any]]) -> float:
+    """Fraction of *query* events whose layers_used includes L3.
+
+    L3 is the deep-search layer; high escalation suggests L2 community
+    summaries are under-recalling for the query distribution.
+
+    layers_used elements are decorated strings produced by the selector
+    (e.g. "L3:Search(4 results)"), so match by prefix rather than exact
+    membership.
+    """
+    queries = [e for e in events if e.get("event_type") == "query"]
+    if not queries:
+        return 0.0
+    escalated = sum(
+        1
+        for e in queries
+        if any(
+            str(layer).startswith("L3")
+            for layer in (e.get("retrieval_summary") or {}).get("layers_used", [])
+        )
+    )


Addressed in 6ab7ecd — the predicate now matches "L3" exactly or "L3:"-prefixed strings, so a hypothetical "L30" no longer false-matches. Added test_escalation_rate_does_not_match_l3_prefix_collision.

Generated by Claude Code
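The tightened predicate can be written as a small helper. The name `is_l3_layer` is hypothetical; the matching rule ("L3" exactly, or an "L3:" prefix) is the one described in the reply above.

```python
def is_l3_layer(layer: object) -> bool:
    """True for "L3" itself or a decorated "L3:..." string, but not for "L30"."""
    text = str(layer)
    return text == "L3" or text.startswith("L3:")


checks = {
    name: is_l3_layer(name)
    for name in ["L3", "L3:Search(4 results)", "L30", "L2:Communities"]
}
```

Only the first two entries match; a bare `startswith("L3")` would also have accepted "L30".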

dfrostar · 2026-05-15T02:22:25Z

+    for e in events:
+        if e.get("event_type") != "query":
+            continue
+        sid = e.get("session_id") or ""


Addressed in 6ab7ecd — re_query_rate now skips query events with no session_id instead of grouping them under "". Added test_re_query_rate_skips_events_without_session_id.

Generated by Claude Code

Five fixes from Codex/Copilot review, all in the tuner's signal path:
- _compute: warm-up / window gates now count only *query* events. read_events() returns wakeups too; counting them let the gates pass without query evidence, then re_query_rate 0.0 drifted l2_recall_k down on unrelated event volume.
- re_query_rate: increment the pair denominator only when a pair carries community signal (denom > 0). Empty-community pairs were counted but could never contribute, deflating the rate.
- re_query_rate: skip events with no session_id (pre-D1 logs) instead of lumping unrelated history under a shared empty-string key.
- escalation_rate: match the L3 layer exactly ("L3" / "L3:..."), not any "L3"-prefixed string (would false-match a hypothetical "L30").
- set_meta: widen value type hint to object, since it coerces via str().
https://claude.ai/code/session_01KPSwpRuBicQJYjcwVcArnH

dfrostar · 2026-05-18T00:30:57Z

Converting to draft — parked while the v0.6.0 → v0.7.0 (install anywhere) → v0.8 (always-on) → v0.7.x/v0.8.x (enterprise) release train ships. Nothing wrong with the work; it just needs a deliberate decision about whether the v0.4–v0.5 framing still maps to the v0.7+ product surface before sinking a few hours into the rebase against current main.

When ready to revive: the evals/self_improvement/PLAN.md and evals/faithfulness/PLAN.md planning docs are the unique value (worth preserving even if the D1/D2 code itself ends up rewritten); D2's opt-in NEURALMIND_SELECTOR_AUTOTUNE=1 design means it's safe to ship without behavior regressions when the time comes.

Triaged as part of #117 Phase 0 hygiene.

Generated by Claude Code

claude added 12 commits May 6, 2026 19:30

docs(evals): add faithfulness study plan

3da222f


Merge remote-tracking branch 'origin/main' into claude/plan-next-task…

a51de67


Copilot AI review requested due to automatic review settings May 14, 2026 15:47

github-actions Bot added bug Something isn't working documentation Improvements or additions to documentation enhancement New feature or request performance Performance improvements question Further information is requested labels May 14, 2026

Copilot started reviewing on behalf of dfrostar May 14, 2026 15:48 View session

style: apply black formatting to D1/D2 files

8aae463


chatgpt-codex-connector Bot reviewed May 14, 2026


Copilot AI reviewed May 14, 2026


dfrostar mentioned this pull request May 15, 2026

[v0.6.1 Phase 0] Hygiene cleanup — branches, About-box, topics #117

Open

12 tasks

dfrostar marked this pull request as draft May 18, 2026 00:30

dfrostar wants to merge 14 commits into main from claude/plan-next-task-nwJak
