diff --git a/docs/specs/polish-fact-check/decisions.md b/docs/specs/polish-fact-check/decisions.md new file mode 100644 index 0000000..1784f85 --- /dev/null +++ b/docs/specs/polish-fact-check/decisions.md @@ -0,0 +1,43 @@ +# Spec: Polish Fact-Check — Decisions + +> Pre-committed decisions per the existing lesson "Pre-committed +> decision matrices survive contact with data." Edits to this file +> after Phase 1 ships require a follow-up PR with rationale. + +--- + +## Decision matrix + +| Decision | Choice | Rationale | +|---|---|---| +| Phase 1 default failure mode | Soft-fail (`## Unresolved references` block at bottom of file) | Lets us measure noise vs signal before tightening the gate. Mirrors test-quality-program's "measure first, gate later" rubric pattern. | +| Phase 1 strict-fail escalation criterion | Move to strict-fail if soft-fail rate drops below 5% across two consecutive **weekly** regens | Weekly cadence matches the help-system's intended regen rhythm; monthly would delay the escalation decision unnecessarily. | +| Phase 1 numeric-claim check | Required (not stretch) | Patrick tightened the acceptance gate from 4/6 to 5/6 errors caught at Phase 1 ship. Numeric claims are AST-pattern-detectable; only the "missing security callout" failure mode stays for Phase 3. | +| Phase 1 CLI-ref version coupling | Acceptable, with proactive user messaging | When a flag isn't found in `attune --help`, the finding message includes (a) the installed attune-ai version, (b) instructions to verify against the target version, (c) override snippet. See `design.md` § Check 2. | +| Phase 3 default faithfulness threshold | `0.95` (mean across paragraphs in a single file) | Untested at spec-draft time. Will be **calibrated** against the ops-dashboard regression fixture in Phase 3 task #3.3 before defaulting. If calibration shows pre-fix mean ≥ 0.9 or post-fix mean < 0.95, the threshold gets re-decided. | +| Phase 3 threshold override mechanism | `pyproject.toml` `[tool.attune-author.fact-check]` + per-invocation CLI flag | Two-level override: project-wide config for sustained policy, CLI flag for one-off runs. | +| Phase 3 budget cap | Skip judge call if estimated cost > `$0.10` for a single feature regen | Hard cap protects against unexpected cost when regenerating a feature with many kinds. Configurable. | +| Phase 4 default | Tier 0 (static analysis only); execution requires explicit opt-in | Static analysis catches the documented failure modes (e.g. `_readers` private-module hallucinations) without executing untrusted LLM-generated code. Execution tiers documented in design.md for Phase 4.2 follow-up. | +| Phase 4 execution opt-in mechanism | `# attune-author: exec` frontmatter on individual code samples | Sample-level granularity, not file-level — keeps the human reviewer responsible for confirming each blessed sample has no side effects. | +| Spec-file convention going forward | Include `decisions.md` alongside `requirements.md` / `design.md` / `tasks.md` | Patrick's call (2026-05-14). Extracts pre-committed decisions from the spec body so they're easy to audit and update independently. | + +--- + +## Calibration record + +To be filled in during Phase 3 implementation: + +- [ ] **Phase 3 threshold calibration** — Phase 3 task #3.3 / #3.4 + - Pre-fix ops-dashboard mean faithfulness score: _TBD_ + - Post-fix ops-dashboard mean faithfulness score: _TBD_ + - Default threshold after calibration: _TBD_ + +--- + +## Decision-change log + +> Append entries here when a decision above is revised. Reference the PR +> that revised it. + +- 2026-05-14 — Initial decisions captured during spec draft. Patrick + approved. diff --git a/docs/specs/polish-fact-check/design.md b/docs/specs/polish-fact-check/design.md new file mode 100644 index 0000000..6c27a34 --- /dev/null +++ b/docs/specs/polish-fact-check/design.md @@ -0,0 +1,408 @@ +# Spec: Polish Fact-Check — Design + +## Phase 2: Design + +**Status**: draft + +--- + +## Overall architecture + +The four phases are independent layers stacked on the existing polish +pipeline. Each phase ships as its own PR and is opt-in via configuration +until Phase 4 (at which point all are on by default). + +``` +┌──────────────────────────────────────────────────────────────────┐ +│ Existing polish pipeline (attune_author/polish.py) │ +│ │ +│ source files ──► prompt builder ──► LLM ──► polished output │ +│ ▲ │ │ +│ │ │ │ +└──────────── Phase 2 ───┼──────────── Phase 1 ─────┼──────────────┘ + │ │ + │ ▼ + ┌──────────┴──────────┐ ┌──────────────────────┐ + │ Ground-truth │ │ AST fact-check pass │ + │ context injection │ │ (post-generation) │ + │ • CLI --help │ │ • imports │ + │ • __all__ list │ │ • CLI flags │ + │ • dataclass fields│ │ • md links │ + └─────────────────────┘ │ • numeric claims* │ + └──────────┬───────────┘ + │ + ▼ + ┌──────────────────────┐ + │ Phase 3: faithfulness│ + │ judge (attune-rag) │ + └──────────┬───────────┘ + │ + ▼ + ┌──────────────────────┐ + │ Phase 4: static check│ + │ of tutorial samples │ + │ (mypy + ast.parse) │ + └──────────────────────┘ + +* Numeric-claim checking is Phase 1 stretch; documented but skippable. +``` + +--- + +## Phase 1 — AST-based post-generation fact-check + +### Module layout + +New module `src/attune_author/fact_check/` containing: + +``` +fact_check/ +├── __init__.py # public entry: check_polished_file(path, source_paths) +├── python_refs.py # import-statement and dotted-path verification +├── cli_refs.py # `attune --flag` verification +├── md_links.py # [label](target.md) target existence +└── report.py # collect findings, format soft-fail block +``` + +### Public API + +```python +def check_polished_file( + polished_path: Path, + source_paths: list[Path], + *, + project_root: Path, + config: FactCheckConfig | None = None, +) -> FactCheckReport +``` + +- `polished_path` — the just-written polished markdown file +- `source_paths` — the source `.py` files the polish pass had as context +- `project_root` — used to resolve relative markdown links and run + ` --help` invocations +- `config` — optional config; defaults come from `pyproject.toml` + +Returns a `FactCheckReport` with a list of `Finding(severity, location, +message)` entries. The caller decides whether to soft-fail (append to +file) or strict-fail (raise). + +### Check 1: Python imports and dotted-path references + +For each Python code fence and inline `attune.X.Y` reference in the +polished file: + +1. Extract import statements with `ast.parse` and walk the tree. +2. For each `from X import Y`, attempt `importlib.import_module(X)` in + the active venv, then `getattr(module, Y)`. Failure = unresolved. +3. For each prose reference matching `attune\.[a-z_.]+\.[A-Za-z_]+`, + resolve the full dotted path the same way. + +Why import in the venv rather than parse-only: catches the +`attune.ops._readers` class of bug where the path *parses* fine but +doesn't actually exist. This was the most damaging failure mode in the +ops-dashboard fixture and parse-only AST won't catch it. + +### Check 2: CLI flag references + +For each `attune ` pattern in the polished file: + +1. Build a cache: for each subcommand referenced, run + `attune --help` once and parse the output for flag + names (regex: `--[a-z][a-z0-9-]*`). +2. For each `--flag` in the polished doc, confirm it appears in the + cached help output for that subcommand. +3. Unknown subcommands are themselves a finding. + +Cache scope: per-file, so a regen of one feature doesn't reinvoke +`--help` many times. Cache invalidation isn't needed (cache lives only +for the duration of the check call). + +#### Version coupling — user-facing messaging + +CLI-ref checks resolve against whatever `attune-ai` is installed in the +active venv. If a consumer is regenerating docs against a different +attune-ai version, false positives are possible. Every CLI-ref finding +includes proactive context so the user can resolve the ambiguity +without spelunking: + +``` +Line 17 (prose): `attune ops --read-only` — flag not found. + +Detected against attune-ai 6.8.0 (installed in active venv). If you +are regenerating against a different attune-ai version, verify the +flag exists in that version's `attune ops --help`. + +To override: + - One-off: attune-author generate FEATURE --skip-check cli_refs + - Per file: [tool.attune-author.fact-check.skip] + "docs/how-to/ops-dashboard.md" = ["check_cli_refs"] +``` + +The installed version is read from `attune.__version__` (or the +`importlib.metadata` fallback). If the consumer CLI isn't `attune` — +e.g. a third-party project using attune-author against its own CLI — +the same template renders with that project's CLI name and version. + +### Check 3: Markdown link targets + +For each `[label](path/to.md)` or `[label](path/to.md#anchor)`: + +1. Resolve relative to `polished_path`'s directory. +2. Confirm the target file exists. +3. (Stretch: anchor existence by parsing target headers.) + +External URLs (`http://...`, `https://...`) are skipped — not in scope. + +### Check 4: Numeric claims + +Promoted from stretch to required (per decisions.md). Catches the +`498 templates`-class of hallucination where the LLM invents a count +that has no source-of-truth. + +For each sentence matching patterns like `\d+\s+(templates|features|workflows|skills|agents|workflows|tools|kinds)`: + +1. Extract the noun and the count. +2. For nouns we can verify deterministically, count against the project + filesystem: + - `templates` → `find .help/templates -name "*.md" | wc -l` + - `features` → number of top-level keys in `.help/features.yaml` + - `workflows` → `list_workflows()` from the consumer's registry (if + declared in `features.yaml`) + - `kinds` → constant from attune-author (currently 11) +3. Severity: `error` if the doc's count doesn't match the verified count, + `warning` if the noun isn't in the verifiable set (still surfaced for + human review). + +For nouns we *can't* verify (e.g., "thousands of LLM calls"), emit a +`warning` severity finding asking the human to confirm. + +The exact noun-to-resolver mapping lives in +`fact_check/numeric_refs.py` and is extensible per consumer. + +### Soft-fail output format + +When findings exist, append to the polished file before the closing +`` comment: + +```markdown +## Unresolved references + +> Auto-generated by attune-author fact-check. Review and either fix the +> source code, fix this doc, or add an override. + +| Location | Severity | Issue | +|---|---|---| +| Line 77 (code fence) | error | `from attune.ops._readers import …` — module not found | +| Line 17 (prose) | error | `attune ops --allow-run` — flag not in `attune ops --help` | +| Line 124 (See also) | warning | `[Concept: Template design patterns](concepts/template-patterns.md)` — target file does not exist | +``` + +### Configuration + +```toml +[tool.attune-author.fact-check] +enabled = true +soft_fail = true # false = raise on findings + +check_python_refs = true +check_cli_refs = true +check_md_links = true +check_numeric_claims = false # stretch; default off + +# Per-file or per-feature opt-out +[tool.attune-author.fact-check.skip] +"docs/architecture/some-feature.md" = ["check_md_links"] +``` + +CLI overrides for one-off runs: + +```bash +attune-author generate FEATURE --fact-check=strict +attune-author generate FEATURE --no-fact-check +``` + +### Phase 1 acceptance + +1. `check_polished_file` exists, callable from a test. +2. Running it on the four ops-dashboard docs from attune-ai PR #351 + produces findings that match the editorial pass diff: **5 of the 6** + errors fixed in `20438e8d` are flagged (Python refs ×2, CLI refs ×1, + Markdown link ×1, numeric claim ×1). The 6th error + (missing-security-callout for `0.0.0.0`) is explicitly Phase 3 scope. +3. Running it on the post-fix versions (current main of + `feat/ops-dashboard-help-templates`) produces zero error-severity + findings. +4. Soft-fail mode writes the unresolved-references block; strict mode + raises `FactCheckError`. +5. Every CLI-ref finding includes the version-coupling messaging block + (installed version + override snippet). + +--- + +## Phase 2 — Ground-truth context injection + +### Hook point + +`attune_author/polish.py` builds prompts in `_build_polish_prompt` (or +equivalent — verified during implementation). Inject ground-truth +context as additional `` blocks in the prompt before the +existing source-file block. + +### Context sources + +For a feature being polished: + +1. **CLI help**: if `attune ` is the consumer's CLI entry + for this feature (configured per-feature in `features.yaml` as + `cli_command: ops`), run ` --help` once + and inject the full output under a `` sentinel tag. +2. **Public API**: for each source `.py` file, extract `__all__` (if + defined) and the signatures of public functions/classes (no + underscore prefix). Inject under a `` sentinel. +3. **Dataclass fields**: for any `@dataclass` in the source files, + extract field names and types. Inject under a `` + sentinel. + +### Prompt anchoring + +Add a system-prompt clause (per the existing attune-rag +citation-forced-prompting pattern): + +> The following context blocks contain **ground-truth surface details** +> for this feature. When you reference a CLI flag, public function, +> import path, or dataclass field, it MUST appear verbatim in the +> matching context block. If you need to describe something that isn't +> in the ground truth, describe the behavior without inventing a +> specific name. + +### Context budget + +Cap injected context at 5KB total (configurable). Measured against the +ops-dashboard fixture: rendered `--help` is ~1.5KB, `__all__` + signatures +is ~2-3KB, dataclasses ~1KB. Fits comfortably. + +If budget exceeded, drop in this order: dataclasses → public API +signatures → `--help`. Drop with a log warning so the operator sees it. + +### Phase 2 acceptance + +1. Polishing `ops-dashboard` with Phase 2 enabled and Phase 1 disabled + produces docs where 0 of the 3 high-severity errors from the fixture + recur (CLI flag, private imports, route paths). +2. Total polish cost increases by less than 10% on average across the + regression fixture (3 features). +3. Context-budget violations log a warning but don't fail the polish. + +--- + +## Phase 3 — Faithfulness judge + +### Integration point + +attune-author depends on attune-rag (already; see +`pyproject.toml`). Import `attune_rag.eval.faithfulness.FaithfulnessJudge` +and wrap as a polish-pipeline post-step. + +```python +from attune_rag.eval.faithfulness import FaithfulnessJudge + +judge = FaithfulnessJudge(model="claude-haiku-4-5-20251001") +result = judge.score( + answer=polished_text, + sources=[src.read_text() for src in source_paths], +) +if result.mean_score < config.faithfulness_threshold: + # Append review block to file; do not block commit + ... +``` + +### Threshold calibration + +Before defaulting, run the judge against: + +- The pre-Phase-1 ops-dashboard fixture (6 errors): mean score should + be < 0.9. +- The post-fix versions (after `20438e8d`): mean score should be ≥ 0.95. + +If those two don't bracket cleanly, raise the threshold or tune the +prompt before defaulting. + +### Budget cap + +Estimated cost per file: ~$0.01-0.05 on Haiku 4.5. Per-feature +(11 kinds): ~$0.11-0.55. Per full regen (9 features): ~$1-5. + +Hard cap: skip the judge call if estimated cost exceeds $0.10 for a +single feature. The cap is configurable. + +### Phase 3 acceptance + +1. Judge runs on every polished file and writes a `## Faithfulness + review` block when mean score is below threshold. +2. Cost telemetry is reported at the end of each `attune-author + regenerate` run. +3. Threshold is configurable in `pyproject.toml` and via CLI flag. + +--- + +## Phase 4 — Tutorial code-sample static check + +### Scope + +Only `docs/tutorials/.md` files generated by attune-author. Other +docs (how-to, reference, architecture) may have code samples but +tutorials are where reader follow-along expectations are highest. + +### Static check pipeline + +For each polished tutorial: + +1. Extract all ` ```python ` fences. +2. For each fence: + a. `ast.parse(code)` — must succeed (syntax check). + b. Write to a temp file; run `mypy --strict --no-error-summary` with + attune installed in the active venv. + c. Collect failures into `FactCheckReport` findings with severity + `error` (mypy errors) or `warning` (mypy notes). + +### Sample opt-out + +For samples that intentionally use unresolved types (e.g., illustrative +pseudocode), add frontmatter inside the fence: + +```python +# attune-author: skip-mypy +some_function_we_havent_built_yet() +``` + +The line is stripped before publication. + +### No execution in Phase 4.1 + +Explicit non-goal: Phase 4.1 does **not** execute any sample. Reasoning +documented in the requirements doc (security + performance). + +### Phase 4 acceptance + +1. Running on the pre-edit tutorial `docs/tutorials/ops-dashboard.md` + flags both `_readers` and `_models` imports (mypy will report them + as unresolved imports against the installed `attune` package). +2. Running on the post-edit version produces zero errors. +3. Static check time per tutorial < 10s. + +--- + +## Open design questions (resolve before Phase 1 implementation) + +1. **`pyproject.toml` vs `.attune-author.toml`**: the regen-pipeline spec + uses env vars; we should match its config-file convention if one + exists. Confirm during Phase 1 task #1. +2. **Import resolution in the venv**: `importlib.import_module` against + the active venv works but couples the check to whichever attune-ai + version is installed. If a consumer is regenerating against an older + attune-ai, false negatives are possible. Acceptable tradeoff; + document. +3. **Phase 2 prompt-budget measurement**: do we measure context size in + characters, tokens, or both? Tokens are more accurate but require + running tokenizer. Start with characters; add tokenizer if drift + suggests it. diff --git a/docs/specs/polish-fact-check/requirements.md b/docs/specs/polish-fact-check/requirements.md new file mode 100644 index 0000000..7fd1f5b --- /dev/null +++ b/docs/specs/polish-fact-check/requirements.md @@ -0,0 +1,129 @@ +# Spec: Polish Fact-Check + +> Reduce attune-author polish-pass hallucinations through automated +> verification. Umbrella spec; ships as four sequential PRs (Phases 1–4). + +--- + +## Phase 1: Requirements + +**Status**: draft + +### Problem statement + +The attune-author polish pass routinely invents plausible-sounding surface +details that don't exist in the source files it was given. Concrete evidence +from a single feature regen (attune-ai's `ops-dashboard`, 2026-05-14, see +[attune-ai PR #351](https://github.com/Smart-AI-Memory/attune-ai/pull/351)): + +| Failure class | Count | Example | +|---|---|---| +| Hallucinated CLI flag | 1 | `--allow-run` (real: `--read-only`, inverted semantics) | +| Hallucinated private module path | 2 | `from attune.ops._readers import …` (`ModuleNotFoundError`) | +| Hallucinated cross-references | 4 | `Concept: Template design patterns` (no such doc) | +| Hallucinated count | 1 | `498 templates` (real: 259) | +| Wrong route path | 2 | `POST /run` (real: `POST /workflows/{name}/run`) | +| Insecure example | 1 | `host="0.0.0.0"` with no auth callout | + +Six distinct factual errors in a single feature's docs. Of these, three +(the CLI flag, the private import, and the wrong route) actively break +readers who follow the documentation literally. The current mitigation is +a manual editorial pass per feature — expensive, doesn't scale to the +remaining 9 stale features, and worse: it doesn't scale to a +weekly-or-faster regen cadence which is the whole premise of the living +help system. + +All six failure modes share a pattern: the LLM is filling in surrounding +scaffolding from priors rather than being constrained to the source +files it was given. The fix is to **shift verification work from human +review to automated checks**, while keeping the polish pass's freedom to +phrase, organize, and elaborate. + +### Scope + +**In scope:** + +- A four-phase intervention ladder, each phase shipping as its own PR: + 1. **AST-based post-generation fact-check** — Python-AST + CLI-help + + Markdown-link verification of polished output. Soft-fail (emit an + `## Unresolved references` block) initially; configurable to + strict-fail later. + 2. **Inject ground-truth context into the polish prompt** — for any + feature with a CLI surface, render `--help` output and inject it + under a `` sentinel tag in the prompt. Same for module + `__all__` and dataclass field lists. + 3. **Adapt the attune-rag faithfulness judge to polish output** — run + each polished file through `attune_rag.eval.faithfulness.FaithfulnessJudge` + against the source files; flag for review when score is below a + configurable threshold (default `0.95`). + 4. **Static analysis of tutorial code samples** — for `docs/tutorials/*.md` + specifically, extract Python code fences and run `mypy --strict` + + `ast.parse` against each. Catches the entire `_readers`/`_models` + hallucination class without executing untrusted code. +- A regression fixture: the ops-dashboard editorial pass diffs from + attune-ai PR #351 form a ground-truth set. Every check must catch the + errors that pass actually fixed. + +**Out of scope:** + +- Phase 4.2: actual execution of tutorial samples (Tier 1–3 sandboxing). + Discussed in the design doc as a future follow-up; gated on Phase 4.1 + data showing static analysis isn't sufficient. +- Polish prompt-engineering changes unrelated to ground-truth injection + (Phase 2 is narrowly scoped to context-injection, not prompt rewriting). +- Changes to the attune-rag faithfulness judge itself; Phase 3 only + *uses* the existing judge. +- Cost-side changes to the polish pass — none of the four phases changes + per-feature LLM cost beyond Phase 3's additional Haiku call (~$0.01-0.05 + per file). + +### Acceptance criteria + +**Per-phase exit criteria** are documented in `design.md` and `tasks.md`. +For the umbrella spec to be considered complete: + +1. **Phase 1 ships and the ops-dashboard regression fixture is reduced + from 6 errors → ≤1 error**. The remaining error is the + missing-security-callout for the `0.0.0.0` example — a genuinely + different failure shape (missing content, not wrong content) that + Phase 3 (faithfulness judge) is better suited to catch. Phase 1 + covers 5 of 6: Python refs, CLI refs, Markdown links, and **numeric + claims** (promoted from stretch to required per the decision matrix). +2. **Phases 2–4 ship in order**, each with its own PR and its own + regression delta against the same fixture. +3. **No regression in polish output quality** — measured by spot-checking + 3 features post-Phase-4 against pre-Phase-1 versions. The polish pass + should write *less* invented scaffolding, not less useful content. +4. **All four checks are configurable** — thresholds, severities, and + opt-out per-feature via `pyproject.toml` `[tool.attune-author.fact-check]`. + +### Non-goals / explicitly deferred + +- **Strict-fail by default in Phase 1**. The soft-fail default lets us + measure noise vs signal before tightening the gate. Pattern matches + the test-quality-program rubric's "measure first, gate later" + approach. +- **CI integration**. All four checks run during `attune-author generate` + / `attune-author regenerate`. CI integration (failing builds when + docs/ has unresolved references) is a follow-up after Phase 4 lands. +- **Generalizing beyond attune-ai's docs**. The fact-check operates on + any feature's generated templates; it doesn't assume the consumer is + attune-ai. But the regression fixture comes from attune-ai's + ops-dashboard, and we won't try to validate against arbitrary + third-party usage in this spec. + +### Decisions + +Pre-committed decisions live in [`decisions.md`](./decisions.md) and are +the source of truth. Calibration records and decision-change history +also live there. + +### Risks + +| Risk | Severity | Mitigation | +|---|---|---| +| AST checks produce too many false positives (soft-fail noise drowns signal) | Med | Soft-fail blocks at file bottom are scannable; track soft-fail rate per regen and tune resolvers | +| Phase 2 context injection blows polish prompt budget | Low | Measure context size before/after on the ops-dashboard fixture; cap injection at 5KB | +| Phase 3 faithfulness judge disagrees with our regression fixture | Med | Calibrate threshold against the ops-dashboard fixture before defaulting; document calibration | +| Phase 4 mypy false positives on legitimate `# type: ignore` patterns | Low | Allow `# attune-author: skip-mypy` frontmatter on individual samples | +| Bundled umbrella spec gates Phase 1 on broader approval than it needs | Low | Phase 1 framed as the "buy your way to value first" entry; explicitly approvable without committing to Phases 2–4 | diff --git a/docs/specs/polish-fact-check/tasks.md b/docs/specs/polish-fact-check/tasks.md new file mode 100644 index 0000000..d957c87 --- /dev/null +++ b/docs/specs/polish-fact-check/tasks.md @@ -0,0 +1,183 @@ +# Spec: Polish Fact-Check — Tasks + +## Phase 3: Tasks + +**Status**: draft + +> Four phases, each shipping as its own PR. Phase 1 is the recommended +> first commitment; Phases 2–4 build on it. Phase 1 can be approved and +> shipped independently of Phases 2–4. + +--- + +## Phase 1: AST-based post-generation fact-check + +**Target PR scope:** ~600 LOC including tests. + +| # | Task | Layer | Status | Notes | +|---|------|-------|--------|-------| +| 1.1 | Decide config-file location (`pyproject.toml` vs new `.attune-author.toml`) | attune-author | todo | Match regen-pipeline convention; document decision in PR | +| 1.2 | Create `src/attune_author/fact_check/` package skeleton with `__init__.py`, `python_refs.py`, `cli_refs.py`, `md_links.py`, `numeric_refs.py`, `report.py` | attune-author | todo | One module per check + shared `FactCheckReport` dataclass | +| 1.3 | Implement `python_refs.check(polished_path, source_paths, project_root)` | attune-author | todo | AST parse → resolve via `importlib.import_module` in active venv | +| 1.4 | Implement `cli_refs.check(polished_path, project_root)` | attune-author | todo | Per-file cache of `attune --help` output; regex extract flag names. **Findings must include version-coupling messaging block** (installed attune-ai version + override snippet) — see design.md | +| 1.5 | Implement `md_links.check(polished_path, project_root)` | attune-author | todo | Resolve relative links; confirm target file exists | +| 1.5.1 | Implement `numeric_refs.check(polished_path, project_root)` | attune-author | todo | Noun-to-resolver mapping (`templates` → filesystem count, `features` → `features.yaml` key count, etc.). Severity: `error` on mismatch, `warning` on unverifiable nouns | +| 1.6 | Implement `report.format_unresolved_block(findings)` | attune-author | todo | Markdown table; severity column; appended above `` | +| 1.7 | Wire into `attune_author/polish.py` after the polish write | attune-author | todo | Soft-fail: append to file. Strict mode: raise `FactCheckError` | +| 1.8 | Add `[tool.attune-author.fact-check]` config schema + parser | attune-author | todo | `enabled`, `soft_fail`, per-check toggles, skip-list | +| 1.9 | Add `--fact-check=strict` / `--no-fact-check` CLI flags to `generate` and `regenerate` | attune-author | todo | Match existing CLI style | +| 1.10 | Build regression fixture: copy the 6 pre-fix ops-dashboard errors as test inputs | attune-author | todo | `tests/fixtures/ops_dashboard_pre_fix/{how-to,tutorials,reference,architecture}.md` | +| 1.11 | Test: each check fires on the matching fixture error | attune-author | todo | `test_python_refs_catches_underscore_module`, `test_cli_refs_catches_invented_flag`, `test_md_links_catches_missing_target`, `test_numeric_refs_catches_invented_count` | +| 1.11.1 | Test: CLI-ref finding contains version-coupling messaging | attune-author | todo | Assert installed version + override snippet appear in finding text | +| 1.12 | Test: zero findings on post-fix ops-dashboard versions | attune-author | todo | Pull from attune-ai PR #351 head | +| 1.13 | Test: soft-fail writes the block; strict mode raises | attune-author | todo | Two test cases | +| 1.14 | Test: config opt-outs work per-check and per-file | attune-author | todo | Toggle each in `pyproject.toml` test fixture | +| 1.15 | Update CHANGELOG with the four checks and the soft-fail default | attune-author | todo | Reference attune-ai PR #351 as motivation | +| 1.16 | Update README with a short "Fact-check" section + one example output | attune-author | todo | Keep it scannable; full docs go in attune-author's own help corpus later | + +### Phase 1 testing strategy + +- Pytest unit tests per check. Mock `importlib.import_module` only for + edge cases (e.g., a module that imports successfully but then raises); + prefer real imports against the actual attune package installed in + the test venv. +- Regression fixture frozen in-repo: the 4 pre-fix ops-dashboard docs + serve as ground truth. Test asserts that running fact-check on those + files produces ≥ the specific findings list in + `tests/fixtures/ops_dashboard_findings.yaml`. +- No external network in tests. `cli_refs` runs `attune --help` + against the locally-installed attune; this is acceptable because + attune-author already declares attune-ai as a dev dep. + +### Phase 1 exit checklist + +- [ ] All tasks 1.1–1.16 done +- [ ] CI green +- [ ] Regression fixture: **5/6 ops-dashboard errors caught** (Python + refs ×2 + CLI refs ×1 + Markdown links ×1 + numeric claims ×1). + The 6th error (missing-security-callout for `0.0.0.0`) is + explicitly Phase 3 scope. +- [ ] Zero findings on post-fix ops-dashboard versions +- [ ] CLI-ref findings include version-coupling messaging (verified by + test 1.11.1) +- [ ] CHANGELOG + README updated +- [ ] Spec status updated to `complete (Phase 1)` here + +--- + +## Phase 2: Ground-truth context injection + +**Target PR scope:** ~400 LOC including tests. Depends on Phase 1 (uses +the same regression fixture but is otherwise independent of fact-check +code). + +| # | Task | Layer | Status | Notes | +|---|------|-------|--------|-------| +| 2.1 | Add `cli_command` field to `Feature` (the manifest model) | attune-author | todo | Optional; absence skips CLI-help injection | +| 2.2 | Implement `ground_truth.extract_cli_help(cli_cmd, subcommand, project_root)` | attune-author | todo | `subprocess.run(...)` with timeout; cache per (cmd, subcommand) pair | +| 2.3 | Implement `ground_truth.extract_public_api(source_paths)` | attune-author | todo | AST-walk for `__all__` + non-underscore-prefixed defs | +| 2.4 | Implement `ground_truth.extract_dataclasses(source_paths)` | attune-author | todo | AST-walk for `@dataclass`; collect field names + type strings | +| 2.5 | Add ``, ``, `` sentinel blocks to polish prompt builder | attune-author | todo | Match existing context-block format | +| 2.6 | Add system-prompt anchoring clause | attune-author | todo | "Ground-truth context blocks contain surface details — names you use must appear verbatim" | +| 2.7 | Implement 5KB context budget enforcement with drop order | attune-author | todo | Log warning on drop; never fail | +| 2.8 | Add `[tool.attune-author.context-injection]` config + CLI flags | attune-author | todo | Defaults: all three sources on, 5KB budget | +| 2.9 | Test: ground-truth extractors produce expected output on ops-dashboard source | attune-author | todo | Snapshot tests | +| 2.10 | Test: polishing ops-dashboard with Phase 2 on, Phase 1 off recurs 0/3 high-severity errors | attune-author | todo | The acceptance gate from `design.md` | +| 2.11 | Test: budget enforcement drops sources in documented order | attune-author | todo | Artificial 1KB cap forces drops | +| 2.12 | Cost-delta measurement: 3-feature regression set with vs without Phase 2 | attune-author | todo | Record in CHANGELOG; should be < 10% | +| 2.13 | Update CHANGELOG + README | attune-author | todo | | + +### Phase 2 exit checklist + +- [ ] Tasks 2.1–2.13 done +- [ ] 0/3 high-severity ops-dashboard errors recur in Phase-2-only polish +- [ ] Cost delta < 10% +- [ ] Spec status updated + +--- + +## Phase 3: Faithfulness judge integration + +**Target PR scope:** ~300 LOC including tests. Depends on Phase 1 for +the `FactCheckReport` plumbing. + +| # | Task | Layer | Status | Notes | +|---|------|-------|--------|-------| +| 3.1 | Add faithfulness-threshold + budget-cap config to `[tool.attune-author.fact-check]` | attune-author | todo | Default threshold `0.95`; default cap `$0.10/feature` | +| 3.2 | Implement `faithfulness.judge_polished_file(polished_path, source_paths, config)` wrapper | attune-author | todo | Wraps `attune_rag.eval.faithfulness.FaithfulnessJudge` | +| 3.3 | Calibrate threshold against ops-dashboard fixture | attune-author | todo | Pre-fix should score < 0.9 mean; post-fix ≥ 0.95 | +| 3.4 | Document calibration result in `decisions.md` (or design doc) | attune-author | todo | Pre-committed matrix entry; concrete numbers | +| 3.5 | Wire judge into post-polish pipeline (after Phase 1 fact-check) | attune-author | todo | Append `## Faithfulness review` block when below threshold | +| 3.6 | Cost telemetry: aggregate per-feature judge cost; report at end of `regenerate` | attune-author | todo | Use existing telemetry hooks if any; otherwise log | +| 3.7 | Test: judge runs and writes review block on a deliberately unfaithful synthetic input | attune-author | todo | Construct a polished file that contradicts the source | +| 3.8 | Test: budget cap skips judge call when estimated cost exceeds threshold | attune-author | todo | | +| 3.9 | Update CHANGELOG + README | attune-author | todo | | + +### Phase 3 exit checklist + +- [ ] Tasks 3.1–3.9 done +- [ ] Calibration shows clean separation between pre-fix and post-fix + fixture scores +- [ ] Threshold + cap configurable +- [ ] Spec status updated + +--- + +## Phase 4: Tutorial code-sample static check + +**Target PR scope:** ~250 LOC including tests. Depends on Phase 1 for +plumbing. + +| # | Task | Layer | Status | Notes | +|---|------|-------|--------|-------| +| 4.1 | Add `tutorial_static_check.check(polished_path, project_root)` to `fact_check/` package | attune-author | todo | Operates only on `docs/tutorials/*.md` | +| 4.2 | Code-fence extractor: pull all ```python fences with line numbers | attune-author | todo | Skip fences with `# attune-author: skip-mypy` first line | +| 4.3 | `ast.parse` each fence; collect syntax errors as findings | attune-author | todo | Cheap pre-check before invoking mypy | +| 4.4 | Run `mypy --strict --no-error-summary` per fence | attune-author | todo | Subprocess; timeout 10s; capture stderr | +| 4.5 | Parse mypy output into findings | attune-author | todo | Map line numbers back to original fence position | +| 4.6 | Strip `# attune-author: skip-mypy` directives before publication | attune-author | todo | Apply only to the file written; preserve in source if any | +| 4.7 | Add `[tool.attune-author.fact-check.tutorial_static]` config | attune-author | todo | `enabled`, `mypy_args`, `timeout_seconds` | +| 4.8 | Test: pre-fix `tutorials/ops-dashboard.md` flags `_readers` + `_models` imports | attune-author | todo | The headline acceptance gate | +| 4.9 | Test: post-fix version produces zero errors | attune-author | todo | | +| 4.10 | Test: `skip-mypy` directive is honored and stripped from output | attune-author | todo | | +| 4.11 | Test: total static-check time per tutorial < 10s | attune-author | todo | Bench against the ops-dashboard tutorial | +| 4.12 | Update CHANGELOG + README | attune-author | todo | Note Phase 4.2 (execution) explicitly deferred | +| 4.13 | Add design.md follow-up section on Phase 4.2 execution tiers | attune-author | todo | Reference the security + perf walkthrough from spec discussion | + +### Phase 4 exit checklist + +- [ ] Tasks 4.1–4.13 done +- [ ] Pre-fix fixture flagged correctly; post-fix clean +- [ ] Per-tutorial check time < 10s +- [ ] Spec status updated; full umbrella spec marked `complete` + +--- + +## Cross-phase notes + +### Testing strategy across all phases + +- One regression fixture (the ops-dashboard editorial pass diff) used + by all phases. Lives at `tests/fixtures/ops_dashboard_pre_fix/` and + `tests/fixtures/ops_dashboard_post_fix/`. +- No mocking of LLM calls in integration tests. Mock at the unit-test + level (`anthropic.Anthropic`) per the regen-pipeline pattern. +- CI runs all four checks on every PR after Phase 4 lands; before that, + each phase's CI lane is its own job. + +### Rollback strategy + +Each phase has its own `enabled = true|false` toggle. If a phase causes +unexpected breakage in production regens, the operator can disable it +in `pyproject.toml` without touching code. Phase 1 is the only phase +whose disablement loses notable safety; the others gracefully degrade +to "no extra check." + +### Sequencing rationale + +Why this order: Phase 1 is the cheapest (no LLM, deterministic) and +catches the most distinct error types. Phase 2 has higher impact per +LLM-token but only matters if Phase 1's findings show consistent +hallucination patterns. Phase 3 needs Phase 1's plumbing for the +report-block format. Phase 4 is tutorial-specific and benefits from +Phase 1's `FactCheckReport` already existing.