EtanHey · May 21, 2026 · May 22, 2026
@@ -0,0 +1,87 @@
+# Phase 4a — Eval Framework
+
+> **Status**: SEED PR (DRAFT) — data files only. Runner, metrics, and CI gate are next iteration.
+>
+> **Purpose**: 80-query eval set with Hebrew weighting that GATES any retrieval-shape change (Phase 4b hnlx + Int8 + dual-datastore cannot ship until this passes).
+>
+> **Source dispatch brief**: `~/Gits/orchestrator/docs.local/handoffs/2026-05-22/phase-4a-eval-framework-dispatch.md`
+
+## Directory contents
+
+```
+evals/phase4a/
+├── README.md          # this file
+├── queries.yaml       # 80 queries: 15 Hebrew + 12 health + 15 conceptual + 15 frustration + 8 temporal + 15 entity
+├── sentinel.yaml      # 5 fast pre-commit smoke queries (one per category except `temporal`)
+├── thresholds.yaml    # pass criteria (recall@20 per category, ndcg@10, latency budgets)
+├── runner.py          # TODO next iteration: invokes brain_search × queries, captures metrics
+├── metrics.py         # TODO next iteration: Ranx + RAGAS + DeepEval wrappers
+├── ci_gate.py         # TODO next iteration: CLI `bl-eval smoke` / `bl-eval full --baseline baseline.json`
+├── baseline.json      # TODO next iteration: committed snapshot of metrics from current DB
+└── tests/             # TODO next iteration: test_runner / test_metrics / test_ci_gate
+```
+
+## Why this is a SEED PR
+
+The 80 queries are pre-curated based on:
+- Today's BrainBar verification work (Hebrew probes from CD-1 latency report)
+- Etan's verbatim corrections captured in BrainLayer (`fru-*` category)
+- coachClaude domain (`hlt-*` category)
+- Known entities from BrainLayer's `kg_entities` (`ent-*` category)
+
+Landing the query set FIRST lets the runner be developed against fixed data. Subsequent commits add the framework.
+
+## Categories and weighting (per `queries.yaml`)
+
+| Category | Count | Purpose | Pass threshold (min recall@20) |
+|----------|-------|---------|-------------------|
+| `hebrew` | 15 | Token coverage + trigram fuzziness + cross-script transliteration (Bug E) | 85% (heb-05 is known miss pre-multilingual-embed) |
+| `health` | 12 | coachClaude alignment + WHOOP/sleep/recovery domain | 92% |
+| `conceptual` | 15 | Abstract retrieval — phrasing variability tests | 90% |
+| `frustration` | 15 | Recurring user corrections — MUST surface | 95% |
+| `temporal` | 8 | Time-anchored retrieval — recency intent tests | 85% |
+| `entity` | 15 | Known kg_entities — baseline anchor | 95% |
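
The table can be expressed as data and sanity-checked. The sketch below is illustrative only: it reads the last column as the minimum recall@20 for each category (an interpretation, consistent with the per-category overrides mentioned under Thresholds), and the names `CATEGORIES`, `total_queries`, and `failing_categories` are hypothetical, not taken from `thresholds.yaml`.

```python
from __future__ import annotations

# Counts and minimums copied from the category table.
# Treating the last column as "minimum recall@20" is an assumption.
CATEGORIES: dict[str, dict[str, float]] = {
    "hebrew": {"count": 15, "min_recall_at_20": 0.85},
    "health": {"count": 12, "min_recall_at_20": 0.92},
    "conceptual": {"count": 15, "min_recall_at_20": 0.90},
    "frustration": {"count": 15, "min_recall_at_20": 0.95},
    "temporal": {"count": 8, "min_recall_at_20": 0.85},
    "entity": {"count": 15, "min_recall_at_20": 0.95},
}


def total_queries(categories: dict[str, dict[str, float]]) -> int:
    """Sum the per-category query counts."""
    return int(sum(spec["count"] for spec in categories.values()))


def failing_categories(
    observed: dict[str, float],
    categories: dict[str, dict[str, float]] = CATEGORIES,
) -> list[str]:
    """Return categories whose observed recall@20 is below the minimum.

    A category missing from `observed` is treated as recall 0.0.
    """
    return [
        name
        for name, spec in categories.items()
        if observed.get(name, 0.0) < spec["min_recall_at_20"]
    ]
```

With these numbers the counts sum to 80, matching the size of `queries.yaml`.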
+
+## Sentinel set (`sentinel.yaml`)
+
+5 queries that run in <30s total for pre-commit hooks. One representative from each category except `temporal`. All must pass; CI workflow blocks if any sentinel fails.
+
+## Thresholds (`thresholds.yaml`)
+
+- `recall@20` minimum 90% per category (with category overrides per known-miss tolerance)
+- `ndcg@10` aggregate ≥0.85
+- No category may regress more than 5% vs `baseline.json`
+- Latency: sentinel total <30s, full eval <120s, per-query p95 <500ms, per-query max <8s
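
For reference, the two ranking metrics and the regression rule can be sketched in plain Python. This is not the planned `metrics.py` (which is described as wrapping Ranx, RAGAS, and DeepEval); it assumes binary relevance, unique result IDs, and that "regress more than 5%" means an absolute drop of more than 0.05. All function names here are hypothetical.

```python
from __future__ import annotations

import math


def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 20) -> float:
    """Fraction of relevant IDs that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 10) -> float:
    """nDCG@k with binary relevance (gain 1 for a relevant ID, else 0)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    # Ideal ranking places every relevant ID first, capped at k positions.
    ideal_hits = min(k, len(relevant_ids))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def regressed(current: float, baseline: float, max_drop: float = 0.05) -> bool:
    """True when the metric dropped by more than `max_drop` (absolute; an assumption)."""
    return (baseline - current) > max_drop
```

If the 5% rule is meant as a relative change instead, `regressed` would compare `(baseline - current) / baseline` against `max_drop`.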
+
+## How Phase 4b will use this
+
+Phase 4b (hnlx + Int8 + dual-datastore) MUST pass:
+- All sentinel queries (smoke pre-commit)
+- Full 80-query eval ≥thresholds (CI gate before merge)
+
+If Phase 4b regresses >5% on any category → block merge, iterate or split into smaller PRs.
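
The gate script in this PR documents three exit codes (0 PASS, 1 FAIL, 2 WARN). A minimal sketch of how a merge decision could be derived from them follows. Treating WARN as blocking is an assumption, since the text above does not say how CI should handle it; `GateExit` and `may_merge` are hypothetical names.

```python
from enum import IntEnum


class GateExit(IntEnum):
    """Exit codes as documented in the ci_gate.py module docstring."""

    PASS = 0
    FAIL = 1
    WARN = 2


def may_merge(smoke_exit: int, full_exit: int) -> bool:
    """Allow a merge only when both the smoke and full gates exit 0.

    WARN (e.g. no baseline to compare against) is treated as blocking here,
    which is an assumption rather than documented policy.
    """
    return smoke_exit == GateExit.PASS and full_exit == GateExit.PASS
```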
+
+## Why this lands BEFORE the runner
+
+`/post-merge-deploy-check` lesson from 2026-05-22: data + code shipping in lockstep risks subtle drift. By shipping queries first:
+- Queries can be reviewed independently for accuracy (Hebrew spelling, entity names, frustration phrasing)
+- Runner can be developed against fixed query data (TDD-friendly)
+- Future query updates don't require Python changes
+
+## Next iteration
+
+After this DRAFT PR is reviewed for query accuracy:
+1. Land `runner.py` (~50 LOC) — invokes brain_search, captures latency + result IDs
+2. Land `metrics.py` (~100 LOC) — Ranx + RAGAS + DeepEval wrappers
+3. Land `ci_gate.py` (~50 LOC) — CLI entry + threshold comparison
+4. Land `baseline.json` — initial snapshot from current DB
+5. Land `tests/` (~200 LOC) — TDD coverage
+6. Land `.github/workflows/eval.yml` — CI wiring
+
+Then mark PR ready-for-review + merge.
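
As a rough picture of step 1, the sketch below produces a summary with the keys that `ci_gate.py` reads (`n_queries`, `elapsed_total_seconds`, `results`, `aggregate_latency`). The `search` callable is a hypothetical stand-in for `brain_search`, whose real signature is not shown in this PR; `run_queries`, `percentile`, and the query dict shape (`id`, `query`) are likewise assumptions.

```python
from __future__ import annotations

import math
import time
from typing import Callable


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; returns 0.0 for an empty list."""
    if not values:
        return 0.0
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def run_queries(
    queries: list[dict],
    search: Callable[[str, int], list[str]],  # hypothetical stand-in for brain_search
    n_results: int = 20,
) -> dict:
    """Run each query once, recording result IDs, latency, and any error."""
    results: list[dict] = []
    latencies_ms: list[float] = []
    started = time.perf_counter()
    for q in queries:
        record = {"id": q["id"], "query": q["query"], "result_ids": [],
                  "n_returned": 0, "error": None}
        t0 = time.perf_counter()
        try:
            ids = search(q["query"], n_results)
            record["result_ids"] = ids
            record["n_returned"] = len(ids)
        except Exception as exc:  # keep going so one failure does not hide others
            record["error"] = str(exc)
        record["latency_ms"] = (time.perf_counter() - t0) * 1000
        latencies_ms.append(record["latency_ms"])
        results.append(record)
    return {
        "n_queries": len(results),
        "elapsed_total_seconds": time.perf_counter() - started,
        "results": results,
        "aggregate_latency": {
            "p50_ms": percentile(latencies_ms, 50),
            "p95_ms": percentile(latencies_ms, 95),
            "max_ms": max(latencies_ms, default=0.0),
        },
    }
```

Passing the search function in as a parameter keeps the runner testable against a stub, which fits the TDD-friendly goal stated earlier.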
+
+## Cross-references
+
+- Dispatch brief: `~/Gits/orchestrator/docs.local/handoffs/2026-05-22/phase-4a-eval-framework-dispatch.md`
+- Phase 4b skeleton: `~/Gits/orchestrator/docs.local/handoffs/2026-05-22/phase-4b-hnlx-int8-dual-datastore-dispatch.md`
+- Binding design doc: `~/Gits/orchestrator/docs.local/plans/2026-05-21-brainlayer-readpath-redesign/PHASE4-DESIGN.md`
@@ -0,0 +1,146 @@
+"""Phase 4a CI gate: `bl-eval smoke` / `bl-eval full --baseline baseline.json`.
+
+Wraps runner.py + metrics.py for CI invocation. Returns exit codes:
+- 0 PASS (all checks satisfied)
+- 1 FAIL (verdict failed OR runner errored)
+- 2 WARN (partial — e.g., baseline missing but eval ran)
+
+Usage:
+    python evals/phase4a/ci_gate.py smoke
+        # Runs sentinel.yaml, asserts every query returns non-empty within latency budget
+
+    python evals/phase4a/ci_gate.py full --baseline evals/phase4a/baseline.json
+        # Runs all 80 queries, compares to baseline, asserts no category regresses >5%
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from pathlib import Path
+
+# Module-relative imports
+HERE = Path(__file__).parent
+sys.path.insert(0, str(HERE))
+
+from runner import run_eval  # noqa: E402
+from metrics import compute_all, compare_to_baseline  # noqa: E402
+
+
+def smoke(args: argparse.Namespace) -> int:
+    queries_path = args.queries or (HERE / "sentinel.yaml")
+    output_path = args.output or Path("/tmp/phase4a-smoke-results.json")
+    print(f"=== smoke: running {queries_path.name} ===")
+
+    summary = run_eval(queries_path, output_path, n_results=args.num or 5)
+
+    # Smoke gates:
+    # 1. Wall-clock total must not exceed 30s.
+    # 2. No query errored (a runner-reported timeout is expected to surface as `error`).
+    # 3. Every query returns at least 1 chunk, unless its result record carries an
+    #    explicit `expected_min_recall_at_20: 0` (assumes the runner copies that
+    #    field from the query definition onto the result).
+    elapsed_s = summary.get("elapsed_total_seconds", 0)
+    if elapsed_s > 30:
+        print(f"FAIL: smoke total wall-clock {elapsed_s:.1f}s > 30s budget")
+        return 1
+
+    n_failures = 0
+    for r in summary.get("results", []):
+        qid = r.get("id", "?")
+        if r.get("error"):
+            print(f"FAIL: {qid} errored: {str(r['error'])[:120]}")
+            n_failures += 1
+            continue
+        allows_empty = r.get("expected_min_recall_at_20") == 0
+        if r.get("n_returned", 0) == 0 and not allows_empty:
+            print(f"FAIL: {qid} returned no results (query: {r.get('query', '')[:50]})")
+            n_failures += 1
+
+    if n_failures:
+        print(f"=== smoke FAILED: {n_failures} sentinel queries failed ===")
+        return 1
+
+    lat = summary.get("aggregate_latency", {})
+    print(
+        f"=== smoke PASSED: {summary['n_queries']} queries in {elapsed_s:.1f}s "
+        f"(p50={lat.get('p50_ms', 0):.0f}ms p95={lat.get('p95_ms', 0):.0f}ms max={lat.get('max_ms', 0):.0f}ms) ==="
+    )
+    return 0
+
+
+def full(args: argparse.Namespace) -> int:
+    queries_path = args.queries or (HERE / "queries.yaml")
+    output_path = args.output or Path("/tmp/phase4a-full-results.json")
+    print(f"=== full: running {queries_path.name} ===")
+
+    summary = run_eval(queries_path, output_path, n_results=args.num or 20)
+    metrics = compute_all(summary)
+
+    # If a baseline is supplied, compare
+    baseline_path = args.baseline
+    thresholds_path = args.thresholds or (HERE / "thresholds.yaml")
+    if baseline_path and baseline_path.exists() and thresholds_path.exists():
+        try:
+            import yaml
+        except ImportError:
+            print("WARN: PyYAML not available — cannot load thresholds; skipping comparison")
+            return 2
+
+        baseline_data = json.loads(baseline_path.read_text())
+        thresholds_data = yaml.safe_load(thresholds_path.read_text())
+        # If baseline is itself a metrics output (compute_all shape), use directly
+        baseline_metrics = baseline_data if "per_category" in baseline_data else compute_all(baseline_data)
+        verdict = compare_to_baseline(metrics, baseline_metrics, thresholds_data)
+
+        # Persist current run + verdict
+        if args.output:
+            args.output.write_text(json.dumps({
+                "summary": summary,
+                "metrics": metrics,
+                "verdict": verdict,
+            }, indent=2, ensure_ascii=False))
+
+        if verdict["passed"]:
+            print(f"=== full PASSED ({metrics['n_queries']} queries) ===")
+            print(f"  overall: recall@20={metrics['overall']['recall_at_20_mean']} "
+                  f"ndcg@10={metrics['overall']['ndcg_at_10_mean']}")
+            return 0
+        else:
+            print(f"=== full FAILED: {verdict['n_failures']} threshold breaches ===")
+            for f in verdict["failures"]:
+                print(f"  - {f}")
+            return 1
+
+    # No baseline: just emit metrics for review (WARN exit)
+    print("=== full eval run COMPLETE (no baseline for comparison; WARN exit) ===")
+    print(json.dumps(metrics["overall"], indent=2))
+    return 2
+
+
+def main(argv: list[str] | None = None) -> int:
+    parser = argparse.ArgumentParser(prog="bl-eval", description="Phase 4a CI gate")
+    sub = parser.add_subparsers(dest="cmd", required=True)
+
+    p_smoke = sub.add_parser("smoke", help="Run sentinel queries (target <30s)")
+    p_smoke.add_argument("--queries", type=Path, help="Override sentinel.yaml path")
+    p_smoke.add_argument("--output", type=Path, help="Override output JSON path")
+    p_smoke.add_argument("--num", type=int, help="Override num_results per query")
+
+    p_full = sub.add_parser("full", help="Run all 80 queries + optional baseline compare")
+    p_full.add_argument("--queries", type=Path, help="Override queries.yaml path")
+    p_full.add_argument("--output", type=Path, help="Override output JSON path")
+    p_full.add_argument("--baseline", type=Path, help="baseline.json for regression comparison")
+    p_full.add_argument("--thresholds", type=Path, help="Override thresholds.yaml path")
+    p_full.add_argument("--num", type=int, help="Override num_results per query")
+
+    args = parser.parse_args(argv)
+
+    if args.cmd == "smoke":
+        return smoke(args)
+    if args.cmd == "full":
+        return full(args)
+    parser.error(f"unknown command: {args.cmd}")
+    return 2
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())