Self-improvement engine: plans + D1 logging substrate + D2 selector auto-tuning #100

Draft
dfrostar wants to merge 14 commits into main from claude/plan-next-task-nwJak

@dfrostar
Owner

Summary

Phases 1–2 of the self-improvement engine — the outer loop that tunes
NeuralMind's retrieval behavior from observed agent usage. Both the
data layer and the first tuning loop ship off by default; there is
no behavior change for users who don't opt in.

  • Planning docs: evals/self_improvement/PLAN.md (the full
    engine: subsystems A/B/C) and evals/faithfulness/PLAN.md (the eval
    that subsystem C will later use as a fitness function).
  • D1 — logging substrate (memory.py): wakeup-event logging,
    session_id on events, and read-side aggregation helpers
    (recent_events, escalation_rate, re_query_rate,
    wakeup_only_rate).
  • D2 — subsystem A, selector auto-tuning: a tuner that adjusts
    l2_recall_k (L2 community recall count) from re_query_rate,
    persisted across sessions in the synapse meta table.

D2 details

  • synapses.py: get_meta/set_meta helpers on SynapseStore.
  • self_improve.py (new) — tune_selector() (re_query_rate-driven,
    bounded [2,6], hysteretic dead band, 50-event warm-up, signal
    windowed off the last tune timestamp, fail-open) and
    selector_report().
  • context_selector.py + core.py: l2_recall_k threads from the
    synapse meta table through to get_l2_context, read once at
    construction, fully guarded against a missing DB.
  • hooks.py — SessionStart runs the tuner when
    NEURALMIND_SELECTOR_AUTOTUNE=1 (opt-in); tuner failures are
    swallowed so the hook always exits 0.
  • cli.py: neuralmind self-improve status [project] (read-only,
    --json).
  • Bug fix — D1's escalation_rate used a bare "L3" in layers_used check that never matched real decorated layer strings
    ("L3:Search(...)"); fixed to match by prefix.

Scope notes

  • D2 tunes only l2_recall_k. The originally-planned
    l3_escalation_threshold was dropped: L3 deep search currently runs
    unconditionally, so gating it is net-new behavior — deferred until
    evals/faithfulness/ can validate it's safe. PLAN.md documents
    this.
  • re_query_rate is a deliberately weak signal and the thresholds are
    provisional — which is exactly why the tuner is off by default.
    Subsystem C (eval-driven) is the real answer; D2 builds the
    mechanism.

Test plan

  • Full suite green locally (pre-existing skips only: missing
    graphify fixtures, firewall-blocked S3 model download).
  • New coverage: test_synapses.py (meta helpers),
    test_memory.py (escalation_rate fix + D1 aggregations),
    test_self_improve.py (16 cases — rule, warm-up/window guards,
    windowing, fail-open), test_context_selector.py (l2_recall_k
    threading), test_hooks_synapses.py (autotune gate + fail-open),
    test_cli.py (status command).
  • Manual tuner round-trip: 60 high-overlap events → tick raises
    l2_recall_k 3→4 and persists; second tick correctly holds.

https://claude.ai/code/session_01KPSwpRuBicQJYjcwVcArnH


Generated by Claude Code

claude added 12 commits May 6, 2026 19:30
3–5 day scoped plan for measuring whether L0/L1/L2/L3 compressed context
preserves answer faithfulness vs. full source. Colocated with the future
runner under evals/faithfulness/. Plan only — no implementation.
Three subsystems: (A) selector auto-tuning of L2 recall and L3
escalation thresholds, (B) L2 community-summary refinement with A/B
promotion, (C) closed-loop eval-driven tuning that depends on the
faithfulness study landing first.

Plan reflects the actual MCP/selector architecture: drill events are
really query events, L0/L1 are computed not stored so refinement only
applies to L2, and all three subsystems share one query_events table
in the existing synapse SQLite store.

D1 delivers only the logging substrate (no behavior change). A and B
ride on top.
Extends the existing memory.py opt-in logger rather than building a new
substrate (the plan was over-spec'd; memory.py already does most of it).

- log_wakeup_event() — new event type, parallel to log_query_event,
  written to the same JSONL files. Wired into core.NeuralMind.wakeup().
  Needed for the "L0/L1 was sufficient" positive signal: a wakeup with
  no follow-up query in the same session.
- session_id on both event types via _current_session_id() — honors
  CLAUDE_SESSION_ID, falls back to a stable per-process uuid. Required
  for intra-session signals (re-query, wakeup-without-followup).
- Aggregation helpers: read_events (no event_type filter),
  recent_events, escalation_rate, re_query_rate, wakeup_only_rate.
  These are the read-side primitives subsystem A (selector autotuning)
  will consume.

No behavior change for users without the consent sentinel — all logging
remains gated by is_memory_logging_enabled().

Plan doc updated to reflect the discovery that the substrate already
exists, and to drop the new-table proposal in favor of extending
memory.py.
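As a rough illustration of the session-id resolution and the wakeup-only signal described in this commit, here is a minimal sketch. It is not the code in memory.py: the public name `current_session_id`, the `"wakeup"` event_type value, and the module-level fallback constant are assumptions made for the example.

```python
import os
import uuid

# Generated once at import time, so every event logged by this process
# shares the same fallback id (hypothetical name).
_PROCESS_SESSION_ID = uuid.uuid4().hex


def current_session_id() -> str:
    """Prefer CLAUDE_SESSION_ID; otherwise use the stable per-process id."""
    return os.environ.get("CLAUDE_SESSION_ID") or _PROCESS_SESSION_ID


def wakeup_only_rate(events: list[dict]) -> float:
    """Fraction of sessions whose only events are wakeups (no queries)."""
    kinds_by_session: dict[str, set] = {}
    for event in events:
        sid = event.get("session_id")
        if not sid:
            continue  # events without a session id carry no intra-session signal
        kinds_by_session.setdefault(sid, set()).add(event.get("event_type"))
    if not kinds_by_session:
        return 0.0
    wakeup_only = sum(1 for kinds in kinds_by_session.values() if kinds == {"wakeup"})
    return wakeup_only / len(kinds_by_session)
```

A session that logs a wakeup and then a query counts against the rate; a session with wakeups only counts toward it, which is the "L0/L1 was sufficient" signal.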
The meta table existed but only decay() wrote to it via raw SQL. The
selector auto-tuner (D2) needs a clean key-value read/write path to
persist l2_recall_k across sessions.

https://claude.ai/code/session_01KPSwpRuBicQJYjcwVcArnH
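The key-value path this commit adds can be pictured with a small standalone sketch. `MetaStore` is a hypothetical stand-in for SynapseStore, the table schema and the `default` parameter on `get_meta` are assumptions, and only the `INSERT OR REPLACE` statement is taken from the snippet quoted later in this review.

```python
import sqlite3
from contextlib import closing


class MetaStore:
    """Hypothetical stand-in showing get_meta/set_meta over a meta table."""

    def __init__(self, db_path: str) -> None:
        self._db_path = db_path
        with closing(self._connect()) as conn, conn:
            # Assumed schema: one text value per key.
            conn.execute(
                "CREATE TABLE IF NOT EXISTS meta (key TEXT PRIMARY KEY, value TEXT)"
            )

    def _connect(self) -> sqlite3.Connection:
        return sqlite3.connect(self._db_path)

    def get_meta(self, key: str, default=None):
        """Return the stored string for key, or default if the key is absent."""
        with closing(self._connect()) as conn:
            row = conn.execute(
                "SELECT value FROM meta WHERE key = ?", (key,)
            ).fetchone()
        return row[0] if row else default

    def set_meta(self, key: str, value: object) -> None:
        """Write a value to the meta table; the value is coerced with str()."""
        with closing(self._connect()) as conn, conn:
            conn.execute(
                "INSERT OR REPLACE INTO meta(key, value) VALUES (?, ?)",
                (key, str(value)),
            )
```

Because values are stored as text, a caller reading `l2_recall_k` back has to convert it to an integer itself.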
ContextResult.layers_used elements are decorated by the selector
("L3:Search(4 results)", not "L3"), so the bare `"L3" in layers_used`
membership check never matched real events — escalation_rate always
returned 0.0. The D1 tests used unrealistic fixtures so the bug passed
review.

Match by "L3" prefix instead. re_query_rate (keys off
communities_loaded) and wakeup_only_rate (keys off event_type) are
unaffected. Updates the escalation test fixtures to realistic decorated
strings and adds a regression case.

https://claude.ai/code/session_01KPSwpRuBicQJYjcwVcArnH
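The difference between the two checks is easy to reproduce in isolation. The sketch below uses the decorated string quoted in the commit message; the `is_l3` helper is a hypothetical name for the stricter predicate that a later review reply in this thread describes ("L3" exactly or "L3:"-prefixed).

```python
layers_used = ["L3:Search(4 results)"]

# List membership compares whole elements, so the bare check never matches
# a decorated string.
bare_match = "L3" in layers_used

# Prefix matching finds the decorated element.
prefix_match = any(str(layer).startswith("L3") for layer in layers_used)


def is_l3(layer: object) -> bool:
    """Stricter predicate: "L3" exactly, or an "L3:"-decorated string."""
    text = str(layer)
    return text == "L3" or text.startswith("L3:")


print(bare_match, prefix_match, is_l3("L3:Search(4 results)"), is_l3("L30"))
```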
tune_selector() reads logged query events, computes re_query_rate over
the events since the last tune, and adjusts l2_recall_k (L2 community
recall count) up or down by one step, persisting it to the synapse
store's meta table.

The rule is deliberately conservative: a 50-event warm-up gate, a dead
band between the two thresholds, single-step moves bounded to [2, 6],
and event windowing keyed off l2_recall_k_tuned_at so the tuner doesn't
chase a distribution it just perturbed. re_query_rate is a weak signal
and the thresholds are provisional — subsystem C (eval-driven tuning)
will replace the hand-tuned rule later.

tune_selector never raises (fail-open, safe for the hook path).
selector_report() is the read-only view for the CLI.

https://claude.ai/code/session_01KPSwpRuBicQJYjcwVcArnH
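The rule described in this commit can be sketched as a pure function. The bounds [2, 6], the 50-event warm-up, and the single-step moves come from the commit message; the two threshold values and the function name `next_recall_k` are hypothetical, since the PR text does not state the actual thresholds.

```python
K_MIN, K_MAX = 2, 6          # bounds stated in the PR
WARMUP_MIN_EVENTS = 50       # warm-up gate stated in the PR
RAISE_ABOVE = 0.40           # hypothetical upper threshold
LOWER_BELOW = 0.15           # hypothetical lower threshold


def next_recall_k(current_k: int, query_event_count: int, re_query_rate: float) -> int:
    """Return the next l2_recall_k: one step up, one step down, or unchanged."""
    if query_event_count < WARMUP_MIN_EVENTS:
        return current_k  # not enough history yet: hold
    if re_query_rate >= RAISE_ABOVE:
        return min(current_k + 1, K_MAX)  # frequent re-queries: recall more
    if re_query_rate <= LOWER_BELOW:
        return max(current_k - 1, K_MIN)  # rare re-queries: recall less
    return current_k  # inside the dead band: hold
```

Keeping the decision separate from I/O means the rule can be tested without a database or log files; the real tune_selector() additionally reads events and persists the result.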
ContextSelector gains an l2_recall_k constructor arg (defensively
clamped) that get_context forwards to get_l2_context as
max_communities. core.py reads the per-project tuned value from the
synapse meta table before constructing the selector, falling back to
the default if it's absent or unreadable — the read path is fully
guarded so context selection never hard-fails on a missing DB.

Read once at construction rather than per-query: the value changes at
most once per session (in the SessionStart tuner tick), so a per-call
DB read would be pure overhead.

https://claude.ai/code/session_01KPSwpRuBicQJYjcwVcArnH
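A guarded read of this kind might look like the sketch below. The helper names, the default of 3 (the starting value in the manual round-trip), and the reuse of the tuner's [2, 6] bounds for clamping are assumptions; the PR only says the value is "defensively clamped" and that the read falls back to the default.

```python
DEFAULT_L2_RECALL_K = 3      # assumed default
K_MIN, K_MAX = 2, 6          # assumed to match the tuner's bounds


def clamp_recall_k(value: object, default: int = DEFAULT_L2_RECALL_K) -> int:
    """Coerce a stored value to an int within [K_MIN, K_MAX], or use default."""
    try:
        k = int(value)
    except (TypeError, ValueError):
        return default
    return max(K_MIN, min(K_MAX, k))


def load_recall_k(read_meta) -> int:
    """Read l2_recall_k once; never raise, whatever the store does.

    read_meta is any callable taking a key and returning the stored string,
    or None when the key is absent (hypothetical interface).
    """
    try:
        raw = read_meta("l2_recall_k")
    except Exception:
        return DEFAULT_L2_RECALL_K  # missing or unreadable DB: fall back
    if raw is None:
        return DEFAULT_L2_RECALL_K
    return clamp_recall_k(raw)
```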
The SessionStart hook now calls tune_selector() after the decay tick,
gated on NEURALMIND_SELECTOR_AUTOTUNE=1 (opt-in, since it is a net-new
behavior change). The tick costs one JSONL read plus at most one meta
write, so it stays fast, and tuner failures are swallowed so the hook
always returns 0.

https://claude.ai/code/session_01KPSwpRuBicQJYjcwVcArnH
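The gate and the fail-open behavior can be shown with a small sketch. `run_autotune_tick` and its parameters are hypothetical; only the environment variable name and the "always return 0" contract come from the commit message.

```python
import os


def run_autotune_tick(tune, environ=None) -> int:
    """Run the tuner only when opted in; always report success to the hook."""
    env = os.environ if environ is None else environ
    if env.get("NEURALMIND_SELECTOR_AUTOTUNE") == "1":
        try:
            tune()
        except Exception:
            # Fail open: a tuner problem must never break session start.
            pass
    return 0
```

Passing the environment as a parameter is a testing convenience in this sketch; it lets a test exercise both the opted-in and opted-out paths without mutating the process environment.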
`neuralmind self-improve status [project]` prints the selector tuner
state: current l2_recall_k, when it was last tuned, event counts,
windowed re_query_rate, and whether autotune is enabled. Read-only,
--json supported. Nested under a `self-improve` parser so future
self-improvement subsystems can attach their own sub-actions.

https://claude.ai/code/session_01KPSwpRuBicQJYjcwVcArnH
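Nesting the action under its own parser could be done with argparse sub-parsers along these lines. This is a sketch of the command shape only: the `dest` names and the "." default for the optional project argument are assumptions, not details taken from cli.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="neuralmind")
    commands = parser.add_subparsers(dest="command", required=True)

    # "self-improve" gets its own sub-parser group so later subsystems
    # can register further actions next to "status".
    self_improve = commands.add_parser("self-improve")
    actions = self_improve.add_subparsers(dest="action", required=True)

    status = actions.add_parser("status")
    status.add_argument("project", nargs="?", default=".")  # assumed default
    status.add_argument("--json", action="store_true")
    return parser


args = build_parser().parse_args(["self-improve", "status", "myproj", "--json"])
print(args.command, args.action, args.project, args.json)
```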
The window is keyed off l2_recall_k_tuned_at, so immediately after a
tune — before fresh events arrive — it is empty. re_query_rate([])
returns 0.0, which the rule misread as "low re-query rate → lower k",
causing the tuner to undo its own previous move on the very next tick.

Add a WINDOW_MIN_EVENTS (20) guard: an empty/thin window means "no
signal, hold", not "rate is zero". Caught by the manual round-trip in
D2 verification.

https://claude.ai/code/session_01KPSwpRuBicQJYjcwVcArnH
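The guard can be illustrated with a minimal sketch. WINDOW_MIN_EVENTS = 20 is from the commit message; the `timestamp` field name, the numeric timestamps, and the helper names are assumptions. The sketch also filters to query events, which is the behavior after the later review fix in this thread.

```python
WINDOW_MIN_EVENTS = 20  # from the commit message


def windowed_queries(events: list[dict], tuned_at) -> list[dict]:
    """Query events logged after the last tune (all of them if never tuned)."""
    queries = [e for e in events if e.get("event_type") == "query"]
    if tuned_at is None:
        return queries
    return [e for e in queries if e.get("timestamp", 0) > tuned_at]


def has_signal(window: list[dict]) -> bool:
    """An empty or thin window means "no signal, hold", not "rate is zero"."""
    return len(window) >= WINDOW_MIN_EVENTS
```

Right after a tune the window is empty, so `has_signal` is false and the tuner holds instead of reading a 0.0 rate as a reason to lower k.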
D2 shipped subsystem A scoped to l2_recall_k only. Documents the
re_query_rate-driven rule, signal windowing, the dropped
l3_escalation_threshold (L3 gate deferred to the faithfulness study),
the D1 escalation_rate bug fix, and the actual module/CLI/test names.

https://claude.ai/code/session_01KPSwpRuBicQJYjcwVcArnH
Copilot AI review requested due to automatic review settings May 14, 2026 15:47
@github-actions github-actions Bot added the bug, documentation, enhancement, performance, and question labels May 14, 2026
@github-actions
Contributor

github-actions Bot commented May 14, 2026

NeuralMind self-benchmark

Status: PASS — floor , measured 6.1×.

Phase 1 — Reduction on committed fixture

  • Average reduction: 6.1×
  • Top-k retrieval hit rate: 71.7%
  • Naive baseline: 47,360 tokens across 10 queries (4,736 per query: all fixture files concatenated)
  • NeuralMind total: 7,874 tokens across 10 queries
  • Estimated monthly savings @ 100 queries/day on Claude 3.5 Sonnet: ~$35.54
| # | Query | Shape | Naive | NeuralMind | Ratio | Hit |
|---|-------|-------|-------|------------|-------|-----|
| 1 | auth-flow | cross-file | 4,736 | 790 | 6.0× | 33.3% |
| 2 | api-endpoints | focused | 4,736 | 784 | 6.0× | 100.0% |
| 3 | billing-flow | cross-file | 4,736 | 816 | 5.8× | 33.3% |
| 4 | user-storage | cross-file | 4,736 | 647 | 7.3× | 50.0% |
| 5 | jwt-verify | focused | 4,736 | 651 | 7.3× | 100.0% |
| 6 | stripe-webhook | focused | 4,736 | 808 | 5.9× | 100.0% |
| 7 | create-user | cross-file | 4,736 | 769 | 6.2× | 50.0% |
| 8 | refund | focused | 4,736 | 797 | 5.9× | 100.0% |
| 9 | db-choice | identity | 4,736 | 874 | 5.4× | 100.0% |
| 10 | invoice-send | cross-file | 4,736 | 938 | 5.0× | 50.0% |

Phase 2 — Learning uplift

  • Memory events logged: 20
  • Learned patterns: 20
  • Reduction ratio after neuralmind learn: 6.1× (Δ +0.00× vs. cold)
  • Top-k hit rate after learning: 71.7% (Δ +0.0 points vs. cold)

Note: uplift numbers on a 500-line fixture are intentionally modest — the point is to
verify the learning mechanism persists and applies. On real production repos the lift
is larger; this test only catches regressions in persistence.

Assumptions

  • Baseline: every .py file in tests/fixtures/sample_project/ concatenated.
  • Tokenizer: tiktoken GPT-4o encoding (per-model breakdown in multi_model.json if generated).
  • Pricing: Claude 3.5 Sonnet input @ $3.0/MTok.
  • Regression floor: — well below NeuralMind's typical 40–70× on real repos.
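The headline figures in this comment can be rechecked from the per-query table. The sketch below assumes the average reduction is the mean of the per-query ratios and that "monthly" means 30 days; under those assumptions it reproduces the reported 6.1× and ~$35.54, but the benchmark script itself may compute them differently.

```python
naive_per_query = 4736
neuralmind_tokens = [790, 784, 816, 647, 651, 808, 769, 797, 874, 938]

# Mean of the per-query reduction ratios.
ratios = [naive_per_query / tokens for tokens in neuralmind_tokens]
avg_ratio = sum(ratios) / len(ratios)

# Tokens saved per query, then priced at $3.0 per million input tokens.
avg_neuralmind = sum(neuralmind_tokens) / len(neuralmind_tokens)
saved_per_query = naive_per_query - avg_neuralmind
queries_per_month = 100 * 30  # 100 queries/day, assuming a 30-day month
monthly_savings = saved_per_query * queries_per_month * 3.0 / 1_000_000

print(f"{avg_ratio:.1f}x, ${monthly_savings:.2f}/month")
```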

Per-model token reduction

| Model | Tokenizer | Naive | NeuralMind | Ratio | Source |
|-------|-----------|-------|------------|-------|--------|
| GPT-4o / GPT-4o-mini | tiktoken o200k_base | 4,739 | 779 | 6.1× | measured |
| GPT-4 / GPT-3.5-turbo | tiktoken cl100k_base | 4,710 | 770 | 6.1× | measured |
| Claude 3.5 Sonnet | estimated: GPT-4o × 1.08 (install anthropic for an exact count) | 5,118 | 841 | 6.1× | estimated |
| Llama 3 (70B) | estimated: GPT-4o × 1.22 (Llama tokenizer requires model weights; estimate based on published vocab ratios) | 5,781 | 950 | 6.1× | estimated |

Rows marked measured use the provider's real tokenizer. Rows marked
estimated apply a published vocab-size correction to the GPT-4o count —
honest approximations, not hardcoded claims.


Automated by .github/workflows/ci-benchmark.yml — regenerate locally with python -m tests.benchmark.run.

CI Lint runs `black --check`; reformat the self-improvement files to
match repo style. No behavior change.

https://claude.ai/code/session_01KPSwpRuBicQJYjcwVcArnH

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a51de67e67

Comment thread neuralmind/self_improve.py Outdated
Comment on lines +74 to +75
events = read_events(project_query_events_file(project_path))
total = len(events)

P1: Count only query events for tuner warmup/window gates

The tuner’s warmup and recent-window guards are derived from read_events(...), which now includes both wakeup and query events, but the optimization signal (re_query_rate) is computed only from queries. In projects with many wakeups and few query pairs, this can satisfy WARMUP_MIN_EVENTS/WINDOW_MIN_EVENTS without real query evidence and then incorrectly lower l2_recall_k because re_query_rate falls to 0.0. This makes the auto-tuner drift downward based on unrelated event volume rather than retrieval behavior.

Owner Author

Addressed in 6ab7ecd: _compute now filters to event_type == "query" before applying the warm-up and window gates, so wakeup volume can no longer satisfy the thresholds without real query evidence. Added test_wakeup_events_do_not_count_toward_warmup.


Generated by Claude Code

Comment thread neuralmind/memory.py Outdated
Comment on lines +310 to +319
    pairs += 1
    prev = set(
        (qs[i - 1].get("retrieval_summary") or {}).get("communities_loaded", [])
    )
    cur = set(
        (qs[i].get("retrieval_summary") or {}).get("communities_loaded", [])
    )
    denom = min(len(prev), len(cur))
    if denom == 0:
        continue

P2: Exclude empty-community pairs from re-query denominator

pairs is incremented before checking whether either query has any loaded communities. When denom == 0, the pair contributes to the denominator but can never contribute to re_query_count, systematically depressing re_query_rate. Queries that return no communities (e.g., sparse/failed retrieval) will therefore bias the tuner toward lowering recall even though those pairs contain no overlap signal.

Owner Author

Addressed in 6ab7ecd: pairs is now incremented only after the denom == 0 check, so empty-community pairs no longer count toward the denominator. Added test_re_query_rate_excludes_empty_community_pairs.


Generated by Claude Code

Copilot AI left a comment

Pull request overview

Implements phases D1–D2 of the self-improvement engine: extends the existing opt-in memory logging substrate to record wakeup events + session IDs, and adds a conservative selector auto-tuner that adjusts l2_recall_k based on recent re_query_rate, persisted in the synapse DB meta table and gated behind NEURALMIND_SELECTOR_AUTOTUNE=1.

Changes:

  • Add wakeup event logging + session_id support and read-side aggregation helpers (recent_events, escalation_rate, re_query_rate, wakeup_only_rate).
  • Add selector auto-tuning module (tune_selector, selector_report) and wire it into SessionStart hook + CLI status command; persist tunables in synapses.meta.
  • Thread l2_recall_k through core → ContextSelector and add/extend tests and planning docs.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.

| File | Description |
|------|-------------|
| neuralmind/memory.py | Adds session IDs, wakeup logging, and aggregation helpers used for tuning signals. |
| neuralmind/self_improve.py | New module implementing the selector tuner and reporting. |
| neuralmind/synapses.py | Adds get_meta/set_meta helpers on the synapse store. |
| neuralmind/core.py | Logs wakeup events and reads persisted l2_recall_k at build time. |
| neuralmind/context_selector.py | Adds l2_recall_k constructor parameter and forwards it to L2 recall sizing. |
| neuralmind/hooks.py | Runs the tuner on SessionStart when explicitly opted-in. |
| neuralmind/cli.py | Adds neuralmind self-improve status command (human + JSON output). |
| tests/test_memory.py | Adds coverage for session id resolution, wakeup logging, and aggregation helpers. |
| tests/test_self_improve.py | New tests covering tuner decisions, gates, windowing, and fail-open behavior. |
| tests/test_synapses.py | Adds tests for meta get/set behavior. |
| tests/test_hooks_synapses.py | Adds tests for SessionStart autotune gating and fail-open behavior. |
| tests/test_context_selector.py | Adds tests for l2_recall_k plumbing/clamping and forwarding to get_l2_context. |
| tests/test_cli.py | Adds tests for self-improve status output modes and tuned value reflection. |
| evals/self_improvement/PLAN.md | Planning document for the multi-subsystem self-improvement engine. |
| evals/faithfulness/PLAN.md | Planning document for future eval-driven tuning fitness function. |
Comments suppressed due to low confidence (1)

neuralmind/memory.py:322

  • In re_query_rate(), pairs is incremented before checking whether either query has an empty communities_loaded set. When denom == 0 the loop continues but the pair is still counted, which will artificially deflate the rate for “no-signal” pairs. Increment pairs only after denom > 0 (or explicitly skip such pairs).
        for i in range(1, len(qs)):
            pairs += 1
            prev = set((qs[i - 1].get("retrieval_summary") or {}).get("communities_loaded", []))
            cur = set((qs[i].get("retrieval_summary") or {}).get("communities_loaded", []))
            denom = min(len(prev), len(cur))
            if denom == 0:
                continue
            if len(prev & cur) / denom >= 0.5:
                re_query_count += 1
    return (re_query_count / pairs) if pairs else 0.0


def wakeup_only_rate(events: list[dict[str, Any]]) -> float:
    """Fraction of sessions whose only events are wakeups (no queries).

Comment thread neuralmind/synapses.py Outdated
Comment on lines +214 to +220
    def set_meta(self, key: str, value: str) -> None:
        """Write a value to the key-value meta table."""
        with self._connect() as conn:
            conn.execute(
                "INSERT OR REPLACE INTO meta(key, value) VALUES (?, ?)",
                (key, str(value)),
            )
Owner Author

Addressed in 6ab7ecd — widened the hint to value: object and noted the str() coercion in the docstring.


Generated by Claude Code

Comment thread neuralmind/memory.py
Comment on lines +267 to +287
def escalation_rate(events: list[dict[str, Any]]) -> float:
    """Fraction of *query* events whose layers_used includes L3.

    L3 is the deep-search layer; high escalation suggests L2 community
    summaries are under-recalling for the query distribution.

    layers_used elements are decorated strings produced by the selector
    (e.g. "L3:Search(4 results)"), so match by prefix rather than exact
    membership.
    """
    queries = [e for e in events if e.get("event_type") == "query"]
    if not queries:
        return 0.0
    escalated = sum(
        1
        for e in queries
        if any(
            str(layer).startswith("L3")
            for layer in (e.get("retrieval_summary") or {}).get("layers_used", [])
        )
    )
Owner Author

Addressed in 6ab7ecd — the predicate now matches "L3" exactly or "L3:"-prefixed strings, so a hypothetical "L30" no longer false-matches. Added test_escalation_rate_does_not_match_l3_prefix_collision.


Generated by Claude Code

Comment thread neuralmind/memory.py Outdated
    for e in events:
        if e.get("event_type") != "query":
            continue
        sid = e.get("session_id") or ""
Owner Author

Addressed in 6ab7ecd: re_query_rate now skips query events with no session_id instead of grouping them under "". Added test_re_query_rate_skips_events_without_session_id.


Generated by Claude Code

Five fixes from Codex/Copilot review, all in the tuner's signal path:

- _compute: warm-up / window gates now count only *query* events.
  read_events() returns wakeups too; counting them let the gates pass
  without query evidence, then re_query_rate 0.0 drifted l2_recall_k
  down on unrelated event volume.
- re_query_rate: increment the pair denominator only when a pair
  carries community signal (denom > 0). Empty-community pairs were
  counted but could never contribute, deflating the rate.
- re_query_rate: skip events with no session_id (pre-D1 logs) instead
  of lumping unrelated history under a shared empty-string key.
- escalation_rate: match the L3 layer exactly ("L3" / "L3:..."), not
  any "L3"-prefixed string (would false-match a hypothetical "L30").
- set_meta: widen value type hint to object — it coerces via str().

https://claude.ai/code/session_01KPSwpRuBicQJYjcwVcArnH
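Putting the two re_query_rate fixes together, the corrected function might read roughly as follows. This is a reconstruction from the snippets quoted in the review threads, not the committed code: the grouping with defaultdict is an assumption, and events are assumed to arrive in chronological order.

```python
from collections import defaultdict
from typing import Any


def re_query_rate(events: list[dict[str, Any]]) -> float:
    """Fraction of consecutive same-session query pairs with >= 50% overlap."""
    by_session: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for e in events:
        if e.get("event_type") != "query":
            continue
        sid = e.get("session_id")
        if not sid:
            continue  # pre-D1 logs: do not lump them under a shared "" key
        by_session[sid].append(e)

    pairs = 0
    re_query_count = 0
    for qs in by_session.values():
        for i in range(1, len(qs)):
            prev = set((qs[i - 1].get("retrieval_summary") or {}).get("communities_loaded", []))
            cur = set((qs[i].get("retrieval_summary") or {}).get("communities_loaded", []))
            denom = min(len(prev), len(cur))
            if denom == 0:
                continue  # no community signal: excluded from the denominator
            pairs += 1
            if len(prev & cur) / denom >= 0.5:
                re_query_count += 1
    return (re_query_count / pairs) if pairs else 0.0
```

With `pairs` incremented after the `denom == 0` check, a run of empty retrievals no longer drags the rate toward zero and, through it, the tuner toward a lower l2_recall_k.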
Owner Author

Converting to draft — parked while the v0.6.0 → v0.7.0 (install anywhere) → v0.8 (always-on) → v0.7.x/v0.8.x (enterprise) release train ships. Nothing wrong with the work; it just needs a deliberate decision about whether the v0.4–v0.5 framing still maps to the v0.7+ product surface before sinking a few hours into the rebase against current main.

When ready to revive: the evals/self_improvement/PLAN.md and evals/faithfulness/PLAN.md planning docs are the unique value (worth preserving even if the D1/D2 code itself ends up rewritten); D2's opt-in NEURALMIND_SELECTOR_AUTOTUNE=1 design means it's safe to ship without behavior regressions when the time comes.

Triaged as part of #117 Phase 0 hygiene.


Generated by Claude Code


3 participants