
Commit 0d58152

docs: update all documentation for v0.2.1 adaptive noise filtering
- copilot-instructions.md: add pre-Haiku gates, rejection store, new config fields, new thresholds, gotchas #11 and #12
- CLAUDE.md: expand adaptive quality learning with rejection store details
- README.md: update knowledge capture section, add pipeline config, roadmap
- docs/INTERNALS.md: add pre-Haiku gate steps to section 8.2, new thresholds
- docs/INGEST_PIPELINE.md: add pre-LLM gates and adaptive learning to comparison
- website write-path: add 4 new pre-store filtering stages, adaptive learning
- website quality-loop: add adaptive noise filtering section
- website architecture: add content scoring, top-K, feedback loop design
- website configuration: add pipeline config (ingest_min_len, content_score_pre_gate)
- website intro: mention adaptive filtering in knowledge capture
1 parent 666332e commit 0d58152

13 files changed

Lines changed: 340 additions & 21 deletions

.github/copilot-instructions.md

Lines changed: 37 additions & 8 deletions
@@ -41,7 +41,8 @@ internal/
   atlas.go    Atlas-proper implementation — hybrid search, RRF, MMR re-ranking
   chunker/    Paragraph-boundary text splitting (~512 token chunks)
   redact/     Security scrubbing (AWS keys, API tokens, passwords, PII)
-  quality/    Adaptive learning — tracks retrieval hits, learning threshold
+  quality/    Adaptive learning — tracks retrieval hits, content scoring, noise prototypes
+  rejection/  Ring-buffer of rejected exchanges, adaptive noise learning
   steward/    Background quality maintenance (scoring, pruning, merging)
   ingest/     Source ingestion: crawl → chunk → batch-embed → store
   crawler/    BFS web crawler with SHA256 change detection
@@ -92,13 +93,24 @@ if hs, ok := rp.store.(store.HybridSearcher); ok {

 **Write path** (every response):
 1. Response buffered from SSE stream
-2. Text chunked at paragraph boundaries (~512 tokens)
-3. Noise filtered (< 20 chars, < 40% alphanumeric)
-4. Secrets redacted before embedding
-5. All chunks batch-embedded in single HTTP call to llama.cpp
-6. Each chunk dedup-checked against store (cosine ≥ 0.92 = skip)
-7. Similar-to-source chunks tagged as extensions (cosine ≥ 0.75)
-8. Stored. All of this runs async in a goroutine — zero latency to Claude.
+2. **Pre-Haiku gates** (before any LLM call):
+   a. `QuickFilter` — pure string heuristic rejects procedural exchanges
+   b. Length gate — responses < `ingest_min_len` (default 80 chars) skipped
+   c. Content score gate — raw text embedded & scored against noise prototypes; below `content_score_pre_gate` (default 0.35) → skipped
+3. LLM synthesis gate (`SynthesizeQA`) — Haiku distills or returns "SKIP"
+4. Text chunked at paragraph boundaries (~512 tokens)
+5. Noise filtered (< 20 chars, < 40% alphanumeric)
+6. Secrets redacted before embedding
+7. All chunks batch-embedded in single HTTP call to llama.cpp
+8. Each chunk dedup-checked against store (cosine ≥ 0.92 = skip)
+9. Similar-to-source chunks tagged as extensions (cosine ≥ 0.75)
+10. Stored. All of this runs async in a goroutine — zero latency to Claude.
+
+**Rejection store** (adaptive noise learning):
+- Exchanges rejected by QuickFilter or synthesizer are logged to a ring buffer (500 entries)
+- Every 25 rejections, assistant texts are re-embedded as noise prototypes
+- Hot-swapped into the ContentScorer — the system learns what noise looks like
+- Persisted as JSONL at `~/.memoryd/rejection_log.jsonl`

 **Steward** (hourly background sweep):
 1. Score memories: `log2(hit_count + 1) / log2(maxHits + 1) × 0.5^(timeSinceRetrieval / 7d)`
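
For illustration, the steward score on the context line above combines a usage term with a 7-day-half-life recency term. A minimal Go sketch of that formula, using assumed names (`memoryScore`, `sinceRetrieval`) rather than the actual steward API:

```go
package steward

import (
	"math"
	"time"
)

// memoryScore sketches the documented formula:
// log2(hit_count + 1) / log2(maxHits + 1) * 0.5^(timeSinceRetrieval / 7d)
func memoryScore(hitCount, maxHits int, sinceRetrieval time.Duration) float64 {
	if maxHits <= 0 {
		return 0 // no retrievals recorded yet; nothing to normalize against
	}
	usage := math.Log2(float64(hitCount)+1) / math.Log2(float64(maxHits)+1)
	recency := math.Pow(0.5, sinceRetrieval.Hours()/(7*24))
	return usage * recency
}
```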
@@ -174,6 +186,12 @@ type Embedder interface {
 | `RetrievalTopK` | 5 | config default | Memories per search |
 | `RetrievalMaxTokens` | 2048 | config default | Context budget for injection |
 | `QualityLearningThreshold` | 50 | quality/ | Retrievals before quality filtering activates |
+| `IngestMinLen` | 80 | config/PipelineConfig | Responses shorter than this skip Haiku entirely |
+| `ContentScorePreGate` | 0.35 | config/PipelineConfig | Pre-Haiku noise gate: below this → skip |
+| `noiseTopK` | 3 | quality/content.go | Top-K noise prototypes used in scoring (prevents dilution) |
+| `maxRejectionProtos` | 150 | quality/content.go | Max rejection texts used as noise prototypes |
+| `RebuildEvery` | 25 | rejection/store.go | Rejections between scorer rebuilds |
+| `DefaultMaxSize` | 500 | rejection/store.go | Ring buffer capacity |
 | `PruneThreshold` | 0.1 | steward config | Score below which memories get pruned |
 | `PruneGracePeriod` | 24h | steward config | Minimum age before pruning eligible |
 | `DecayHalfLife` | 90d | steward config | Unretrieved memory score half-life |
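
To make the ordering of the two new pre-Haiku thresholds concrete, here is a hedged sketch of the gate sequence. `quickFilter` and `preScore` are stand-ins for the real heuristic and ContentScorer, passed in as functions so the snippet stays self-contained; the real gates live in proxy/ and quality/content.go.

```go
package proxy

// shouldCallHaiku applies the three pre-Haiku gates in order: string
// heuristic, length gate (IngestMinLen), adaptive noise gate (ContentScorePreGate).
func shouldCallHaiku(
	userMsg, assistantMsg string,
	ingestMinLen int, // default 80
	preGate float64, // default 0.35
	quickFilter func(user, assistant string) bool,
	preScore func(text string) float64,
) bool {
	if quickFilter(userMsg, assistantMsg) {
		return false // procedural exchange; this rejection feeds the rejection store
	}
	if len(assistantMsg) < ingestMinLen {
		return false // too short to be worth an LLM call
	}
	if preScore(assistantMsg) < preGate {
		return false // reads like known noise; deliberately does not feed the store
	}
	return true // worth spending a Haiku call on
}
```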
@@ -204,6 +222,10 @@ steward:
   decay_half_days: 90
   merge_threshold: 0.88
   batch_size: 500
+
+pipeline:
+  ingest_min_len: 80            # responses < this skip Haiku entirely
+  content_score_pre_gate: 0.35  # pre-Haiku noise gate threshold
 ```

 ---
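
The new `pipeline:` block above plausibly deserializes into a small config struct; a sketch assuming yaml tags matching the example keys, not the verbatim `config` source:

```go
package config

// PipelineConfig backs the pipeline: block in config.yaml. Field names
// follow the thresholds table; the yaml tags are assumptions.
type PipelineConfig struct {
	IngestMinLen        int     `yaml:"ingest_min_len"`         // default 80
	ContentScorePreGate float64 `yaml:"content_score_pre_gate"` // default 0.35
}
```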
@@ -324,6 +346,10 @@ export ANTHROPIC_BASE_URL=http://127.0.0.1:7432 # point Claude Code at it

 10. **Graceful shutdown.** The daemon catches SIGINT/SIGTERM, stops the steward, stops the HTTP server, then cancels context. The order matters.

+11. **Content score pre-gate does NOT feed rejection store.** Exchanges filtered by the content score gate are NOT added to the rejection store — only QuickFilter and synthesizer rejections feed back. This prevents a positive feedback loop where the scorer would amplify its own noise signal.
+
+12. **Top-K noise scoring, not averaging.** The ContentScorer uses the top-3 most similar noise prototypes, not the average of all. When the rejection store grows to 150+ entries, averaging would converge to a constant, destroying discriminative power.
+
 ---

 ## Memory Data Model
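
Gotcha 12 above is the subtle one, so here is a minimal sketch of top-K noise scoring. `cosineSim` is defined inline and vectors are assumed pre-normalized; this is illustrative, not the `quality/content.go` source.

```go
package quality

import "sort"

// noiseScore averages similarity over only the k most similar prototypes
// (k = noiseTopK = 3); averaging all 150+ would flatten toward a constant.
func noiseScore(embedding []float32, protos [][]float32, k int) float64 {
	if len(protos) == 0 {
		return 0
	}
	sims := make([]float64, 0, len(protos))
	for _, p := range protos {
		sims = append(sims, cosineSim(embedding, p))
	}
	sort.Sort(sort.Reverse(sort.Float64Slice(sims))) // most similar first
	if len(sims) > k {
		sims = sims[:k]
	}
	sum := 0.0
	for _, s := range sims {
		sum += s
	}
	return sum / float64(len(sims))
}

func cosineSim(a, b []float32) float64 { // plain dot product; unit-length inputs assumed
	var dot float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
	}
	return dot
}
```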
@@ -370,3 +396,6 @@ Additional collections: `retrieval_events`, `sources`, `source_pages`
 - Write path changes go in `pipeline/write.go`
 - Context formatting in `pipeline/inject.go`
 - Keep write path async — never block the response to Claude
+- Pre-Haiku gate changes go in `proxy/anthropic.go` and `proxy/api.go`
+- Rejection store logic is in `rejection/store.go`
+- Content scoring prototypes and noise learning are in `quality/content.go`

CLAUDE.md

Lines changed: 2 additions & 0 deletions
@@ -58,3 +58,5 @@ When you store a new memory that's similar (but not identical) to an existing so
 ## Adaptive quality learning

 The system tracks which memories get retrieved and how often. While in "learning mode" (< 50 retrieval events), it keeps everything. Use `quality_stats` to check the current learning status. Over time, memories that are never retrieved will score lower, helping the system learn what's worth keeping.
+
+The system also learns what **noise** looks like. Exchanges rejected by the pre-filter or synthesizer are accumulated in a ring buffer. Every 25 rejections, the assistant texts are re-embedded as noise prototypes and hot-swapped into the content scorer. This means the system adapts to your team's specific noise patterns — the more it sees procedural chatter, the better it gets at filtering it before spending an LLM call.
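
A minimal sketch of that accumulate-then-rebuild loop, with assumed names (`Store.Add`, `rebuildScorer`) rather than the actual `rejection/store.go` API:

```go
package rejection

// Store sketches the rejection ring buffer: capacity 500 (DefaultMaxSize),
// oldest entry dropped when full, scorer rebuilt every 25 rejections (RebuildEvery).
type Store struct {
	entries []string // rejected assistant texts
	max     int      // 500
	count   int      // total rejections seen
}

func (s *Store) Add(text string, rebuildScorer func(texts []string)) {
	if len(s.entries) >= s.max {
		s.entries = s.entries[1:] // ring-buffer overwrite of the oldest entry
	}
	s.entries = append(s.entries, text)
	s.count++
	if s.count%25 == 0 {
		rebuildScorer(s.entries) // re-embed as noise prototypes, hot-swap into scorer
	}
}
```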

README.md

Lines changed: 8 additions & 2 deletions
@@ -70,7 +70,7 @@ See the **[Getting Started](https://memory-daemon.github.io/memoryd/getting-star

 ### [Knowledge Capture](https://memory-daemon.github.io/memoryd/how-it-works/write-path)

-Every AI response is captured asynchronously (zero latency impact), broken into meaningful pieces, scrubbed of secrets (API keys, tokens, passwords — 13 detection patterns), deduplicated, and stored in the shared database.
+Every AI response is captured asynchronously (zero latency impact), passed through a multi-stage quality filter (length gate, adaptive content scoring, LLM synthesis gate), scrubbed of secrets (API keys, tokens, passwords — 13 detection patterns), deduplicated, and stored in the shared database. The system learns what noise looks like from rejected exchanges, improving filtering accuracy over time.

 ### [Context Retrieval](https://memory-daemon.github.io/memoryd/how-it-works/read-path)

@@ -112,7 +112,8 @@ internal/
   mongo.go    MongoDB implementation
   atlas.go    Atlas hybrid search (vector + text + RRF + MMR)
   redact/     Secret scrubbing (13 patterns)
-  quality/    Usage tracking and quality scoring
+  quality/    Usage tracking, content scoring, adaptive noise learning
+  rejection/  Rejection store — ring buffer for adaptive noise prototype learning
   steward/    Background maintenance (score → prune → merge)
   ingest/     Source ingestion and change detection
   crawler/    Web crawler with change detection
@@ -152,6 +153,10 @@ steward:
   prune_threshold: 0.1
   merge_threshold: 0.88
   decay_half_days: 90
+
+pipeline:
+  ingest_min_len: 80            # Skip short responses before LLM call
+  content_score_pre_gate: 0.35  # Adaptive noise score threshold
 ```

 See the full **[Configuration Reference](https://memory-daemon.github.io/memoryd/configuration)**.
@@ -181,6 +186,7 @@ cd website && npm start
 - [x] Quality maintenance (scoring, pruning, merging)
 - [x] Atlas hybrid search (vector + text + RRF + MMR)
 - [x] Secret scrubbing (13 detection patterns)
+- [x] Adaptive noise filtering (pre-Haiku gates, rejection-based learning)
 - [x] Documentation site
 - [x] macOS menu bar app
 - [ ] Team-scoped knowledge (overlapping layers per team/BU)

docs/INGEST_PIPELINE.md

Lines changed: 2 additions & 0 deletions
@@ -647,6 +647,8 @@ By redacting before embedding, the vector captures the semantic meaning of the s
 | **Change detection** | SHA-256 per page/file | Not applicable (each response is new) |
 | **Redaction** | Yes — `redact.Clean()` per section | Yes — `redact.Clean()` per chunk |
 | **Noise filtering** | Drop sections < 30 chars | Drop chunks < 20 chars or < 40% alphanumeric |
+| **Pre-LLM gates** | None — ingested content is assumed worth embedding | 3-stage: QuickFilter → length gate (< 80 chars) → content score gate (< 0.35) |
+| **Adaptive learning** | None | Rejection store feeds noise prototypes back into content scorer |
 | **Embedding** | Batch per page (all sections in one call) | Single or batch per response |
 | **Execution** | Async goroutine, 30-min timeout | Async goroutine, fire-and-forget |

docs/INTERNALS.md

Lines changed: 10 additions & 4 deletions
@@ -880,10 +880,12 @@ When LLM synthesis is enabled, the proxy does more than raw capture:

 **Per-exchange (`ingest()`):**
 1. Extract the last user message from the request
-2. **Pre-filter:** `QuickFilter()` checks if both user and assistant messages are procedural → reject immediately
-3. **LLM quality gate:** `SynthesizeQA()` asks the model to distill or return `"SKIP"` → reject if no durable value
-4. **Store:** Distilled entry goes through `ProcessDirect()` (no chunking, already formatted)
-5. **Fallback:** If no synthesizer, store raw Q&A pair
+2. **Pre-filter:** `QuickFilter()` checks if both user and assistant messages are procedural → reject immediately (feeds rejection store)
+3. **Length gate:** Responses shorter than `ingest_min_len` (default 80 chars) are skipped — no LLM call
+4. **Content score gate:** Raw assistant text is embedded and scored against noise prototypes via `PreScore()`. Below `content_score_pre_gate` (default 0.35) → skipped. Does NOT feed rejection store (prevents positive feedback loop)
+5. **LLM quality gate:** `SynthesizeQA()` asks the model to distill or return `"SKIP"` → reject if no durable value (feeds rejection store)
+6. **Store:** Distilled entry goes through `ProcessDirect()` (no chunking, already formatted)
+7. **Fallback:** If no synthesizer, store raw Q&A pair

 **Session synthesis:**
 - Fired at 3 complete Q&A pairs, then every 5 pairs after
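
The session-synthesis cadence ("at 3 pairs, then every 5") reduces to a small predicate. A sketch with an assumed name, consistent with `sessionSynthesisInterval` = 5 in the thresholds table below, not the `proxy/anthropic.go` source:

```go
package proxy

// shouldSynthesizeSession fires at the 3rd complete Q&A pair, then every
// 5th pair after that: pairs 3, 8, 13, 18, ...
func shouldSynthesizeSession(completePairs int) bool {
	if completePairs < 3 {
		return false
	}
	return (completePairs-3)%5 == 0
}
```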
@@ -990,6 +992,10 @@ Created automatically on first run with sensible defaults. All pipeline threshol
 | `sessionSynthesisInterval` | 5 | proxy/anthropic.go | Pairs between subsequent summaries |
 | `RebuildEvery` (rejection) | 25 | rejection/store.go | Rejections between scorer rebuilds |
 | `DefaultMaxSize` (rejection) | 500 | rejection/store.go | Max entries in rejection ring buffer |
+| `IngestMinLen` | 80 | config/PipelineConfig | Responses shorter than this skip Haiku entirely |
+| `ContentScorePreGate` | 0.35 | config/PipelineConfig | Pre-Haiku noise gate: below this → skip |
+| `noiseTopK` | 3 | quality/content.go | Top-K noise prototypes used in scoring |
+| `maxRejectionProtos` | 150 | quality/content.go | Max rejection texts used as noise prototypes |

 ---

scripts/analyze_hf_dataset.py

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@ (new file)

```python
#!/usr/bin/env python3
"""Analyze the HF dataset for diversity and create a larger benchmark dataset."""
import json
import random
import collections
import sys


def analyze():
    rows = []
    with open("data/dataset-hf.jsonl") as f:
        for line in f:
            rows.append(json.loads(line))

    print(f"Total rows: {len(rows)}")

    # Check diversity by looking at user_prompt first 50 chars
    prefixes = collections.Counter()
    for r in rows:
        p = r.get("user_prompt", "")[:50]
        prefixes[p] += 1

    print(f"Unique prompt prefixes (50ch): {len(prefixes)}")
    print("Top 10:")
    for prefix, count in prefixes.most_common(10):
        print(f"  {count:>5}x {repr(prefix[:60])}")

    # Check response length distribution
    lens = [len(r.get("assistant_response", "")) for r in rows]
    lens.sort()
    print(f"\nResponse length: min={lens[0]}, p25={lens[len(lens)//4]}, "
          f"median={lens[len(lens)//2]}, p75={lens[3*len(lens)//4]}, max={lens[-1]}")

    # Check for content type diversity in random 1000
    random.seed(42)
    sample = random.sample(rows, 1000)
    short = sum(1 for r in sample if len(r.get("assistant_response", "")) < 80)
    medium = sum(1 for r in sample if 80 <= len(r.get("assistant_response", "")) < 500)
    long_resp = sum(1 for r in sample if len(r.get("assistant_response", "")) >= 500)
    print(f"\nRandom 1000 sample: short(<80)={short}, medium(80-500)={medium}, long(500+)={long_resp}")

    # Check for actual content patterns in the sample
    ack_patterns = ["Sure", "I'll", "Let me", "Here", "OK", "Done", "Got it", "Understood"]
    ack_count = 0
    code_count = 0
    for r in sample:
        resp = r.get("assistant_response", "")
        if any(resp.strip().startswith(p) for p in ack_patterns) and len(resp) < 200:
            ack_count += 1
        if "```" in resp or "func " in resp or "def " in resp or "class " in resp:
            code_count += 1

    print(f"Acknowledgments (short + starts with ack pattern): {ack_count}")
    print(f"Contains code blocks/definitions: {code_count}")

    # Show sample of short responses
    print("\nSample short responses:")
    short_samples = [r for r in sample if len(r.get("assistant_response", "")) < 100]
    random.shuffle(short_samples)
    for r in short_samples[:10]:
        resp = r.get("assistant_response", "").strip()[:100]
        print(f"  [{len(r['assistant_response']):>4}ch] {repr(resp)}")

    # Show some medium responses
    print("\nSample medium responses:")
    med_samples = [r for r in sample if 200 <= len(r.get("assistant_response", "")) < 600]
    random.shuffle(med_samples)
    for r in med_samples[:5]:
        resp = r.get("assistant_response", "").strip()[:120]
        print(f"  [{len(r['assistant_response']):>4}ch] {repr(resp)}")


if __name__ == "__main__":
    analyze()
```

scripts/clean_analysis.py

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@ (new file)

```python
#!/usr/bin/env python3
"""Clean cross-run analysis excluding errors."""
import json

dataset = []
with open("data/eval-large.jsonl") as f:
    for line in f:
        dataset.append(json.loads(line))

for run_idx, fname in enumerate(["benchmark-1k-r1.jsonl", "benchmark-1k-r2.jsonl", "benchmark-1k-r3.jsonl"], 1):
    results = []
    with open(fname) as fh:
        for line in fh:
            results.append(json.loads(line))

    valid = [r for r in results if r["stage"] != "error"]
    total_valid = len(valid)

    stages = {}
    for r in valid:
        stages[r["stage"]] = stages.get(r["stage"], 0) + 1

    pre_haiku = stages.get("pre_filter", 0) + stages.get("length_filter", 0) + stages.get("content_score_filter", 0)
    haiku_calls = stages.get("synthesizer_skip", 0) + stages.get("stored", 0)

    mixed_sub_total = 0
    mixed_sub_stored = 0
    for r in valid:
        idx = r["index"]
        lbl = dataset[idx].get("label", "")
        resp = dataset[idx].get("assistant_response", "")
        prompt = dataset[idx].get("user_prompt", "")
        is_mixed = "hyperswitch" not in resp.lower() and "hyperswitch" not in prompt.lower()
        if lbl == "substantive" and is_mixed:
            mixed_sub_total += 1
            if r["stage"] == "stored":
                mixed_sub_stored += 1

    print(f"Run {run_idx} (excl {len(results)-len(valid)} errors):")
    print(f"  Valid entries: {total_valid}")
    print(f"  Pre-Haiku: {pre_haiku} ({pre_haiku/total_valid*100:.0f}%)")
    print(f"    length_filter: {stages.get('length_filter', 0)}")
    print(f"    content_score: {stages.get('content_score_filter', 0)}")
    print(f"  Haiku calls: {haiku_calls} ({haiku_calls/total_valid*100:.0f}%)")
    print(f"    stored: {stages.get('stored', 0)}")
    print(f"    synth_skip: {stages.get('synthesizer_skip', 0)}")
    print(f"  Hand-crafted substantive recall: {mixed_sub_stored}/{mixed_sub_total} ({mixed_sub_stored/max(mixed_sub_total,1)*100:.0f}%)")
    print()
```

scripts/detailed_cross_run.py

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@ (new file)

```python
#!/usr/bin/env python3
"""Detailed cross-run analysis for the 1000-row benchmark."""
import json, sys, collections

dataset_path = "data/eval-large.jsonl"
run_files = ["benchmark-large1.jsonl", "benchmark-large2.jsonl", "benchmark-large3.jsonl"]

# Load ground-truth labels and origin
labels = {}
origins = {}  # "mixed" or "hf"
with open(dataset_path) as f:
    for i, line in enumerate(f):
        d = json.loads(line)
        labels[i] = d.get("label", "unknown")
        # Hand-crafted entries have shorter responses (< 2000 chars typically)
        # and specific labels, while HF responses can be very long.
        # We can identify by checking if user_prompt is from eval-mixed patterns.
        resp = d.get("assistant_response", "")
        # Simple heuristic: eval-mixed entries don't contain hyperswitch references
        if "hyperswitch" in resp.lower() or "hyperswitch" in d.get("user_prompt", "").lower():
            origins[i] = "hf"
        elif len(resp) < 3000 and labels[i] in ("noise", "low", "substantive"):
            origins[i] = "mixed"  # likely hand-crafted
        else:
            origins[i] = "hf"

for run_idx, run_file in enumerate(run_files, 1):
    results = []
    with open(run_file) as f:
        for line in f:
            results.append(json.loads(line))

    print(f"{'='*60}")
    print(f"RUN {run_idx}: {run_file}")
    print(f"{'='*60}")

    # Stage counts
    stages = collections.Counter(r["stage"] for r in results)
    print(f"\nStage distribution:")
    for stage in ["pre_filter", "length_filter", "content_score_filter", "synthesizer_skip", "stored", "error"]:
        print(f"  {stage:<24} {stages.get(stage, 0):>4}")

    # Count Haiku calls = total - pre_filter - length_filter - content_score_filter
    pre_haiku = stages.get("pre_filter", 0) + stages.get("length_filter", 0) + stages.get("content_score_filter", 0)
    haiku_calls = len(results) - pre_haiku - stages.get("error", 0)
    print(f"\n  Pre-Haiku filtered: {pre_haiku}")
    print(f"  Haiku calls: {haiku_calls}")

    # Substantive recall split by origin
    print(f"\nSubstantive recall by origin:")
    for origin in ["mixed", "hf"]:
        stored = 0
        total = 0
        for r in results:
            idx = r["index"]
            if labels.get(idx) == "substantive" and origins.get(idx) == origin:
                total += 1
                if r["stage"] == "stored":
                    stored += 1
        if total > 0:
            print(f"  {origin:>5}: {stored}/{total} stored ({stored/total*100:.0f}%)")

    # Filtered substantive by stage, split by origin
    print(f"\nFiltered substantive by stage+origin:")
    for origin in ["mixed", "hf"]:
        stage_counts = collections.Counter()
        for r in results:
            idx = r["index"]
            if labels.get(idx) == "substantive" and origins.get(idx) == origin and r["stage"] != "stored":
                stage_counts[r["stage"]] += 1
        if stage_counts:
            print(f"  {origin}: {dict(stage_counts)}")

    print()
```
