diff --git a/docs/specs/polish-fact-check/decisions.md b/docs/specs/polish-fact-check/decisions.md
new file mode 100644
index 0000000..1784f85
--- /dev/null
+++ b/docs/specs/polish-fact-check/decisions.md
@@ -0,0 +1,43 @@
+# Spec: Polish Fact-Check — Decisions
+
+> Pre-committed decisions per the existing lesson "Pre-committed
+> decision matrices survive contact with data." Edits to this file
+> after Phase 1 ships require a follow-up PR with rationale.
+
+---
+
+## Decision matrix
+
+| Decision | Choice | Rationale |
+|---|---|---|
+| Phase 1 default failure mode | Soft-fail (`## Unresolved references` block at bottom of file) | Lets us measure noise vs signal before tightening the gate. Mirrors test-quality-program's "measure first, gate later" rubric pattern. |
+| Phase 1 strict-fail escalation criterion | Move to strict-fail if soft-fail rate drops below 5% across two consecutive **weekly** regens | Weekly cadence matches the help-system's intended regen rhythm; monthly would delay the escalation decision unnecessarily. |
+| Phase 1 numeric-claim check | Required (not stretch) | Patrick tightened the acceptance gate from 4/6 to 5/6 errors caught at Phase 1 ship. Numeric claims are AST-pattern-detectable; only the "missing security callout" failure mode stays for Phase 3. |
+| Phase 1 CLI-ref version coupling | Acceptable, with proactive user messaging | When a flag isn't found in `attune <cmd> --help`, the finding message includes (a) the installed attune-ai version, (b) instructions to verify against the target version, (c) override snippet. See `design.md` § Check 2. |
+| Phase 3 default faithfulness threshold | `0.95` (mean across paragraphs in a single file) | Untested at spec-draft time. Will be **calibrated** against the ops-dashboard regression fixture in Phase 3 task #3.3 before defaulting. If calibration shows pre-fix mean ≥ 0.9 or post-fix mean < 0.95, the threshold gets re-decided. |
+| Phase 3 threshold override mechanism | `pyproject.toml` `[tool.attune-author.fact-check]` + per-invocation CLI flag | Two-level override: project-wide config for sustained policy, CLI flag for one-off runs. |
+| Phase 3 budget cap | Skip judge call if estimated cost > `$0.10` for a single feature regen | Hard cap protects against unexpected cost when regenerating a feature with many kinds. Configurable. |
+| Phase 4 default | Tier 0 (static analysis only); execution requires explicit opt-in | Static analysis catches the documented failure modes (e.g. `_readers` private-module hallucinations) without executing untrusted LLM-generated code. Execution tiers documented in design.md for Phase 4.2 follow-up. |
+| Phase 4 execution opt-in mechanism | `# attune-author: exec` frontmatter on individual code samples | Sample-level granularity, not file-level — keeps the human reviewer responsible for confirming each blessed sample has no side effects. |
+| Spec-file convention going forward | Include `decisions.md` alongside `requirements.md` / `design.md` / `tasks.md` | Patrick's call (2026-05-14). Extracts pre-committed decisions from the spec body so they're easy to audit and update independently. |
+
+---
+
+## Calibration record
+
+To be filled in during Phase 3 implementation:
+
+- [ ] **Phase 3 threshold calibration** — Phase 3 task #3.3 / #3.4
+  - Pre-fix ops-dashboard mean faithfulness score: _TBD_
+  - Post-fix ops-dashboard mean faithfulness score: _TBD_
+  - Default threshold after calibration: _TBD_
+
+---
+
+## Decision-change log
+
+> Append entries here when a decision above is revised. Reference the PR
+> that revised it.
+
+- 2026-05-14 — Initial decisions captured during spec draft. Patrick
+  approved.
diff --git a/docs/specs/polish-fact-check/design.md b/docs/specs/polish-fact-check/design.md
new file mode 100644
index 0000000..6c27a34
--- /dev/null
+++ b/docs/specs/polish-fact-check/design.md
@@ -0,0 +1,408 @@
+# Spec: Polish Fact-Check — Design
+
+## Phase 2: Design
+
+**Status**: draft
+
+---
+
+## Overall architecture
+
+The four phases are independent layers stacked on the existing polish
+pipeline. Each phase ships as its own PR and is opt-in via configuration
+until Phase 4 (at which point all are on by default).
+
+```
+┌──────────────────────────────────────────────────────────────────┐
+│  Existing polish pipeline (attune_author/polish.py)              │
+│                                                                  │
+│  source files ──► prompt builder ──► LLM ──► polished output     │
+│                        ▲                          │              │
+│                        │                          │              │
+└──────────── Phase 2 ───┼──────────── Phase 1 ─────┼──────────────┘
+                         │                          │
+                         │                          ▼
+              ┌──────────┴──────────┐    ┌──────────────────────┐
+              │ Ground-truth        │    │ AST fact-check pass  │
+              │ context injection   │    │ (post-generation)    │
+              │   • CLI --help      │    │   • imports          │
+              │   • __all__ list    │    │   • CLI flags        │
+              │   • dataclass fields│    │   • md links         │
+              └─────────────────────┘    │   • numeric claims*  │
+                                          └──────────┬───────────┘
+                                                     │
+                                                     ▼
+                                          ┌──────────────────────┐
+                                          │ Phase 3: faithfulness│
+                                          │ judge (attune-rag)   │
+                                          └──────────┬───────────┘
+                                                     │
+                                                     ▼
+                                          ┌──────────────────────┐
+                                          │ Phase 4: static check│
+                                          │ of tutorial samples  │
+                                          │ (mypy + ast.parse)   │
+                                          └──────────────────────┘
+
+* Numeric-claim checking is Phase 1 stretch; documented but skippable.
+```
+
+---
+
+## Phase 1 — AST-based post-generation fact-check
+
+### Module layout
+
+New module `src/attune_author/fact_check/` containing:
+
+```
+fact_check/
+├── __init__.py       # public entry: check_polished_file(path, source_paths)
+├── python_refs.py    # import-statement and dotted-path verification
+├── cli_refs.py       # `attune <cmd> --flag` verification
+├── md_links.py       # [label](target.md) target existence
+└── report.py         # collect findings, format soft-fail block
+```
+
+### Public API
+
+```python
+def check_polished_file(
+    polished_path: Path,
+    source_paths: list[Path],
+    *,
+    project_root: Path,
+    config: FactCheckConfig | None = None,
+) -> FactCheckReport
+```
+
+- `polished_path` — the just-written polished markdown file
+- `source_paths` — the source `.py` files the polish pass had as context
+- `project_root` — used to resolve relative markdown links and run
+  `<cmd> --help` invocations
+- `config` — optional config; defaults come from `pyproject.toml`
+
+Returns a `FactCheckReport` with a list of `Finding(severity, location,
+message)` entries. The caller decides whether to soft-fail (append to
+file) or strict-fail (raise).
+
+### Check 1: Python imports and dotted-path references
+
+For each Python code fence and inline `attune.X.Y` reference in the
+polished file:
+
+1. Extract import statements with `ast.parse` and walk the tree.
+2. For each `from X import Y`, attempt `importlib.import_module(X)` in
+   the active venv, then `getattr(module, Y)`. Failure = unresolved.
+3. For each prose reference matching `attune\.[a-z_.]+\.[A-Za-z_]+`,
+   resolve the full dotted path the same way.
+
+Why import in the venv rather than parse-only: catches the
+`attune.ops._readers` class of bug where the path *parses* fine but
+doesn't actually exist. This was the most damaging failure mode in the
+ops-dashboard fixture and parse-only AST won't catch it.
+
+### Check 2: CLI flag references
+
+For each `attune <subcommand> <flag>` pattern in the polished file:
+
+1. Build a cache: for each subcommand referenced, run
+   `attune <subcommand> --help` once and parse the output for flag
+   names (regex: `--[a-z][a-z0-9-]*`).
+2. For each `--flag` in the polished doc, confirm it appears in the
+   cached help output for that subcommand.
+3. Unknown subcommands are themselves a finding.
+
+Cache scope: per-file, so a regen of one feature doesn't reinvoke
+`--help` many times. Cache invalidation isn't needed (cache lives only
+for the duration of the check call).
+
+#### Version coupling — user-facing messaging
+
+CLI-ref checks resolve against whatever `attune-ai` is installed in the
+active venv. If a consumer is regenerating docs against a different
+attune-ai version, false positives are possible. Every CLI-ref finding
+includes proactive context so the user can resolve the ambiguity
+without spelunking:
+
+```
+Line 17 (prose): `attune ops --read-only` — flag not found.
+
+Detected against attune-ai 6.8.0 (installed in active venv). If you
+are regenerating against a different attune-ai version, verify the
+flag exists in that version's `attune ops --help`.
+
+To override:
+  - One-off:  attune-author generate FEATURE --skip-check cli_refs
+  - Per file: [tool.attune-author.fact-check.skip]
+              "docs/how-to/ops-dashboard.md" = ["check_cli_refs"]
+```
+
+The installed version is read from `attune.__version__` (or the
+`importlib.metadata` fallback). If the consumer CLI isn't `attune` —
+e.g. a third-party project using attune-author against its own CLI —
+the same template renders with that project's CLI name and version.
+
+### Check 3: Markdown link targets
+
+For each `[label](path/to.md)` or `[label](path/to.md#anchor)`:
+
+1. Resolve relative to `polished_path`'s directory.
+2. Confirm the target file exists.
+3. (Stretch: anchor existence by parsing target headers.)
+
+External URLs (`http://...`, `https://...`) are skipped — not in scope.
+
+### Check 4: Numeric claims
+
+Promoted from stretch to required (per decisions.md). Catches the
+`498 templates`-class of hallucination where the LLM invents a count
+that has no source-of-truth.
+
+For each sentence matching patterns like `\d+\s+(templates|features|workflows|skills|agents|workflows|tools|kinds)`:
+
+1. Extract the noun and the count.
+2. For nouns we can verify deterministically, count against the project
+   filesystem:
+   - `templates` → `find .help/templates -name "*.md" | wc -l`
+   - `features` → number of top-level keys in `.help/features.yaml`
+   - `workflows` → `list_workflows()` from the consumer's registry (if
+     declared in `features.yaml`)
+   - `kinds` → constant from attune-author (currently 11)
+3. Severity: `error` if the doc's count doesn't match the verified count,
+   `warning` if the noun isn't in the verifiable set (still surfaced for
+   human review).
+
+For nouns we *can't* verify (e.g., "thousands of LLM calls"), emit a
+`warning` severity finding asking the human to confirm.
+
+The exact noun-to-resolver mapping lives in
+`fact_check/numeric_refs.py` and is extensible per consumer.
+
+### Soft-fail output format
+
+When findings exist, append to the polished file before the closing
+`<!-- attune-generated ... -->` comment:
+
+```markdown
+## Unresolved references
+
+> Auto-generated by attune-author fact-check. Review and either fix the
+> source code, fix this doc, or add an override.
+
+| Location | Severity | Issue |
+|---|---|---|
+| Line 77 (code fence) | error | `from attune.ops._readers import …` — module not found |
+| Line 17 (prose) | error | `attune ops --allow-run` — flag not in `attune ops --help` |
+| Line 124 (See also) | warning | `[Concept: Template design patterns](concepts/template-patterns.md)` — target file does not exist |
+```
+
+### Configuration
+
+```toml
+[tool.attune-author.fact-check]
+enabled = true
+soft_fail = true                    # false = raise on findings
+
+check_python_refs = true
+check_cli_refs = true
+check_md_links = true
+check_numeric_claims = false        # stretch; default off
+
+# Per-file or per-feature opt-out
+[tool.attune-author.fact-check.skip]
+"docs/architecture/some-feature.md" = ["check_md_links"]
+```
+
+CLI overrides for one-off runs:
+
+```bash
+attune-author generate FEATURE --fact-check=strict
+attune-author generate FEATURE --no-fact-check
+```
+
+### Phase 1 acceptance
+
+1. `check_polished_file` exists, callable from a test.
+2. Running it on the four ops-dashboard docs from attune-ai PR #351
+   produces findings that match the editorial pass diff: **5 of the 6**
+   errors fixed in `20438e8d` are flagged (Python refs ×2, CLI refs ×1,
+   Markdown link ×1, numeric claim ×1). The 6th error
+   (missing-security-callout for `0.0.0.0`) is explicitly Phase 3 scope.
+3. Running it on the post-fix versions (current main of
+   `feat/ops-dashboard-help-templates`) produces zero error-severity
+   findings.
+4. Soft-fail mode writes the unresolved-references block; strict mode
+   raises `FactCheckError`.
+5. Every CLI-ref finding includes the version-coupling messaging block
+   (installed version + override snippet).
+
+---
+
+## Phase 2 — Ground-truth context injection
+
+### Hook point
+
+`attune_author/polish.py` builds prompts in `_build_polish_prompt` (or
+equivalent — verified during implementation). Inject ground-truth
+context as additional `<context>` blocks in the prompt before the
+existing source-file block.
+
+### Context sources
+
+For a feature being polished:
+
+1. **CLI help**: if `attune <subcommand>` is the consumer's CLI entry
+   for this feature (configured per-feature in `features.yaml` as
+   `cli_command: ops`), run `<consumer-cli> <subcommand> --help` once
+   and inject the full output under a `<cli_help>` sentinel tag.
+2. **Public API**: for each source `.py` file, extract `__all__` (if
+   defined) and the signatures of public functions/classes (no
+   underscore prefix). Inject under a `<public_api>` sentinel.
+3. **Dataclass fields**: for any `@dataclass` in the source files,
+   extract field names and types. Inject under a `<dataclasses>`
+   sentinel.
+
+### Prompt anchoring
+
+Add a system-prompt clause (per the existing attune-rag
+citation-forced-prompting pattern):
+
+> The following context blocks contain **ground-truth surface details**
+> for this feature. When you reference a CLI flag, public function,
+> import path, or dataclass field, it MUST appear verbatim in the
+> matching context block. If you need to describe something that isn't
+> in the ground truth, describe the behavior without inventing a
+> specific name.
+
+### Context budget
+
+Cap injected context at 5KB total (configurable). Measured against the
+ops-dashboard fixture: rendered `--help` is ~1.5KB, `__all__` + signatures
+is ~2-3KB, dataclasses ~1KB. Fits comfortably.
+
+If budget exceeded, drop in this order: dataclasses → public API
+signatures → `--help`. Drop with a log warning so the operator sees it.
+
+### Phase 2 acceptance
+
+1. Polishing `ops-dashboard` with Phase 2 enabled and Phase 1 disabled
+   produces docs where 0 of the 3 high-severity errors from the fixture
+   recur (CLI flag, private imports, route paths).
+2. Total polish cost increases by less than 10% on average across the
+   regression fixture (3 features).
+3. Context-budget violations log a warning but don't fail the polish.
+
+---
+
+## Phase 3 — Faithfulness judge
+
+### Integration point
+
+attune-author depends on attune-rag (already; see
+`pyproject.toml`). Import `attune_rag.eval.faithfulness.FaithfulnessJudge`
+and wrap as a polish-pipeline post-step.
+
+```python
+from attune_rag.eval.faithfulness import FaithfulnessJudge
+
+judge = FaithfulnessJudge(model="claude-haiku-4-5-20251001")
+result = judge.score(
+    answer=polished_text,
+    sources=[src.read_text() for src in source_paths],
+)
+if result.mean_score < config.faithfulness_threshold:
+    # Append review block to file; do not block commit
+    ...
+```
+
+### Threshold calibration
+
+Before defaulting, run the judge against:
+
+- The pre-Phase-1 ops-dashboard fixture (6 errors): mean score should
+  be < 0.9.
+- The post-fix versions (after `20438e8d`): mean score should be ≥ 0.95.
+
+If those two don't bracket cleanly, raise the threshold or tune the
+prompt before defaulting.
+
+### Budget cap
+
+Estimated cost per file: ~$0.01-0.05 on Haiku 4.5. Per-feature
+(11 kinds): ~$0.11-0.55. Per full regen (9 features): ~$1-5.
+
+Hard cap: skip the judge call if estimated cost exceeds $0.10 for a
+single feature. The cap is configurable.
+
+### Phase 3 acceptance
+
+1. Judge runs on every polished file and writes a `## Faithfulness
+   review` block when mean score is below threshold.
+2. Cost telemetry is reported at the end of each `attune-author
+   regenerate` run.
+3. Threshold is configurable in `pyproject.toml` and via CLI flag.
+
+---
+
+## Phase 4 — Tutorial code-sample static check
+
+### Scope
+
+Only `docs/tutorials/<feature>.md` files generated by attune-author. Other
+docs (how-to, reference, architecture) may have code samples but
+tutorials are where reader follow-along expectations are highest.
+
+### Static check pipeline
+
+For each polished tutorial:
+
+1. Extract all ` ```python ` fences.
+2. For each fence:
+   a. `ast.parse(code)` — must succeed (syntax check).
+   b. Write to a temp file; run `mypy --strict --no-error-summary` with
+      attune installed in the active venv.
+   c. Collect failures into `FactCheckReport` findings with severity
+      `error` (mypy errors) or `warning` (mypy notes).
+
+### Sample opt-out
+
+For samples that intentionally use unresolved types (e.g., illustrative
+pseudocode), add frontmatter inside the fence:
+
+```python
+# attune-author: skip-mypy
+some_function_we_havent_built_yet()
+```
+
+The line is stripped before publication.
+
+### No execution in Phase 4.1
+
+Explicit non-goal: Phase 4.1 does **not** execute any sample. Reasoning
+documented in the requirements doc (security + performance).
+
+### Phase 4 acceptance
+
+1. Running on the pre-edit tutorial `docs/tutorials/ops-dashboard.md`
+   flags both `_readers` and `_models` imports (mypy will report them
+   as unresolved imports against the installed `attune` package).
+2. Running on the post-edit version produces zero errors.
+3. Static check time per tutorial < 10s.
+
+---
+
+## Open design questions (resolve before Phase 1 implementation)
+
+1. **`pyproject.toml` vs `.attune-author.toml`**: the regen-pipeline spec
+   uses env vars; we should match its config-file convention if one
+   exists. Confirm during Phase 1 task #1.
+2. **Import resolution in the venv**: `importlib.import_module` against
+   the active venv works but couples the check to whichever attune-ai
+   version is installed. If a consumer is regenerating against an older
+   attune-ai, false negatives are possible. Acceptable tradeoff;
+   document.
+3. **Phase 2 prompt-budget measurement**: do we measure context size in
+   characters, tokens, or both? Tokens are more accurate but require
+   running tokenizer. Start with characters; add tokenizer if drift
+   suggests it.
diff --git a/docs/specs/polish-fact-check/requirements.md b/docs/specs/polish-fact-check/requirements.md
new file mode 100644
index 0000000..7fd1f5b
--- /dev/null
+++ b/docs/specs/polish-fact-check/requirements.md
@@ -0,0 +1,129 @@
+# Spec: Polish Fact-Check
+
+> Reduce attune-author polish-pass hallucinations through automated
+> verification. Umbrella spec; ships as four sequential PRs (Phases 1–4).
+
+---
+
+## Phase 1: Requirements
+
+**Status**: draft
+
+### Problem statement
+
+The attune-author polish pass routinely invents plausible-sounding surface
+details that don't exist in the source files it was given. Concrete evidence
+from a single feature regen (attune-ai's `ops-dashboard`, 2026-05-14, see
+[attune-ai PR #351](https://github.com/Smart-AI-Memory/attune-ai/pull/351)):
+
+| Failure class | Count | Example |
+|---|---|---|
+| Hallucinated CLI flag | 1 | `--allow-run` (real: `--read-only`, inverted semantics) |
+| Hallucinated private module path | 2 | `from attune.ops._readers import …` (`ModuleNotFoundError`) |
+| Hallucinated cross-references | 4 | `Concept: Template design patterns` (no such doc) |
+| Hallucinated count | 1 | `498 templates` (real: 259) |
+| Wrong route path | 2 | `POST /run` (real: `POST /workflows/{name}/run`) |
+| Insecure example | 1 | `host="0.0.0.0"` with no auth callout |
+
+Six distinct factual errors in a single feature's docs. Of these, three
+(the CLI flag, the private import, and the wrong route) actively break
+readers who follow the documentation literally. The current mitigation is
+a manual editorial pass per feature — expensive, doesn't scale to the
+remaining 9 stale features, and worse: it doesn't scale to a
+weekly-or-faster regen cadence which is the whole premise of the living
+help system.
+
+All six failure modes share a pattern: the LLM is filling in surrounding
+scaffolding from priors rather than being constrained to the source
+files it was given. The fix is to **shift verification work from human
+review to automated checks**, while keeping the polish pass's freedom to
+phrase, organize, and elaborate.
+
+### Scope
+
+**In scope:**
+
+- A four-phase intervention ladder, each phase shipping as its own PR:
+  1. **AST-based post-generation fact-check** — Python-AST + CLI-help +
+     Markdown-link verification of polished output. Soft-fail (emit an
+     `## Unresolved references` block) initially; configurable to
+     strict-fail later.
+  2. **Inject ground-truth context into the polish prompt** — for any
+     feature with a CLI surface, render `--help` output and inject it
+     under a `<cli_help>` sentinel tag in the prompt. Same for module
+     `__all__` and dataclass field lists.
+  3. **Adapt the attune-rag faithfulness judge to polish output** — run
+     each polished file through `attune_rag.eval.faithfulness.FaithfulnessJudge`
+     against the source files; flag for review when score is below a
+     configurable threshold (default `0.95`).
+  4. **Static analysis of tutorial code samples** — for `docs/tutorials/*.md`
+     specifically, extract Python code fences and run `mypy --strict` +
+     `ast.parse` against each. Catches the entire `_readers`/`_models`
+     hallucination class without executing untrusted code.
+- A regression fixture: the ops-dashboard editorial pass diffs from
+  attune-ai PR #351 form a ground-truth set. Every check must catch the
+  errors that pass actually fixed.
+
+**Out of scope:**
+
+- Phase 4.2: actual execution of tutorial samples (Tier 1–3 sandboxing).
+  Discussed in the design doc as a future follow-up; gated on Phase 4.1
+  data showing static analysis isn't sufficient.
+- Polish prompt-engineering changes unrelated to ground-truth injection
+  (Phase 2 is narrowly scoped to context-injection, not prompt rewriting).
+- Changes to the attune-rag faithfulness judge itself; Phase 3 only
+  *uses* the existing judge.
+- Cost-side changes to the polish pass — none of the four phases changes
+  per-feature LLM cost beyond Phase 3's additional Haiku call (~$0.01-0.05
+  per file).
+
+### Acceptance criteria
+
+**Per-phase exit criteria** are documented in `design.md` and `tasks.md`.
+For the umbrella spec to be considered complete:
+
+1. **Phase 1 ships and the ops-dashboard regression fixture is reduced
+   from 6 errors → ≤1 error**. The remaining error is the
+   missing-security-callout for the `0.0.0.0` example — a genuinely
+   different failure shape (missing content, not wrong content) that
+   Phase 3 (faithfulness judge) is better suited to catch. Phase 1
+   covers 5 of 6: Python refs, CLI refs, Markdown links, and **numeric
+   claims** (promoted from stretch to required per the decision matrix).
+2. **Phases 2–4 ship in order**, each with its own PR and its own
+   regression delta against the same fixture.
+3. **No regression in polish output quality** — measured by spot-checking
+   3 features post-Phase-4 against pre-Phase-1 versions. The polish pass
+   should write *less* invented scaffolding, not less useful content.
+4. **All four checks are configurable** — thresholds, severities, and
+   opt-out per-feature via `pyproject.toml` `[tool.attune-author.fact-check]`.
+
+### Non-goals / explicitly deferred
+
+- **Strict-fail by default in Phase 1**. The soft-fail default lets us
+  measure noise vs signal before tightening the gate. Pattern matches
+  the test-quality-program rubric's "measure first, gate later"
+  approach.
+- **CI integration**. All four checks run during `attune-author generate`
+  / `attune-author regenerate`. CI integration (failing builds when
+  docs/ has unresolved references) is a follow-up after Phase 4 lands.
+- **Generalizing beyond attune-ai's docs**. The fact-check operates on
+  any feature's generated templates; it doesn't assume the consumer is
+  attune-ai. But the regression fixture comes from attune-ai's
+  ops-dashboard, and we won't try to validate against arbitrary
+  third-party usage in this spec.
+
+### Decisions
+
+Pre-committed decisions live in [`decisions.md`](./decisions.md) and are
+the source of truth. Calibration records and decision-change history
+also live there.
+
+### Risks
+
+| Risk | Severity | Mitigation |
+|---|---|---|
+| AST checks produce too many false positives (soft-fail noise drowns signal) | Med | Soft-fail blocks at file bottom are scannable; track soft-fail rate per regen and tune resolvers |
+| Phase 2 context injection blows polish prompt budget | Low | Measure context size before/after on the ops-dashboard fixture; cap injection at 5KB |
+| Phase 3 faithfulness judge disagrees with our regression fixture | Med | Calibrate threshold against the ops-dashboard fixture before defaulting; document calibration |
+| Phase 4 mypy false positives on legitimate `# type: ignore` patterns | Low | Allow `# attune-author: skip-mypy` frontmatter on individual samples |
+| Bundled umbrella spec gates Phase 1 on broader approval than it needs | Low | Phase 1 framed as the "buy your way to value first" entry; explicitly approvable without committing to Phases 2–4 |
diff --git a/docs/specs/polish-fact-check/tasks.md b/docs/specs/polish-fact-check/tasks.md
new file mode 100644
index 0000000..d957c87
--- /dev/null
+++ b/docs/specs/polish-fact-check/tasks.md
@@ -0,0 +1,183 @@
+# Spec: Polish Fact-Check — Tasks
+
+## Phase 3: Tasks
+
+**Status**: draft
+
+> Four phases, each shipping as its own PR. Phase 1 is the recommended
+> first commitment; Phases 2–4 build on it. Phase 1 can be approved and
+> shipped independently of Phases 2–4.
+
+---
+
+## Phase 1: AST-based post-generation fact-check
+
+**Target PR scope:** ~600 LOC including tests.
+
+| # | Task | Layer | Status | Notes |
+|---|------|-------|--------|-------|
+| 1.1 | Decide config-file location (`pyproject.toml` vs new `.attune-author.toml`) | attune-author | todo | Match regen-pipeline convention; document decision in PR |
+| 1.2 | Create `src/attune_author/fact_check/` package skeleton with `__init__.py`, `python_refs.py`, `cli_refs.py`, `md_links.py`, `numeric_refs.py`, `report.py` | attune-author | todo | One module per check + shared `FactCheckReport` dataclass |
+| 1.3 | Implement `python_refs.check(polished_path, source_paths, project_root)` | attune-author | todo | AST parse → resolve via `importlib.import_module` in active venv |
+| 1.4 | Implement `cli_refs.check(polished_path, project_root)` | attune-author | todo | Per-file cache of `attune <cmd> --help` output; regex extract flag names. **Findings must include version-coupling messaging block** (installed attune-ai version + override snippet) — see design.md |
+| 1.5 | Implement `md_links.check(polished_path, project_root)` | attune-author | todo | Resolve relative links; confirm target file exists |
+| 1.5.1 | Implement `numeric_refs.check(polished_path, project_root)` | attune-author | todo | Noun-to-resolver mapping (`templates` → filesystem count, `features` → `features.yaml` key count, etc.). Severity: `error` on mismatch, `warning` on unverifiable nouns |
+| 1.6 | Implement `report.format_unresolved_block(findings)` | attune-author | todo | Markdown table; severity column; appended above `<!-- attune-generated ... -->` |
+| 1.7 | Wire into `attune_author/polish.py` after the polish write | attune-author | todo | Soft-fail: append to file. Strict mode: raise `FactCheckError` |
+| 1.8 | Add `[tool.attune-author.fact-check]` config schema + parser | attune-author | todo | `enabled`, `soft_fail`, per-check toggles, skip-list |
+| 1.9 | Add `--fact-check=strict` / `--no-fact-check` CLI flags to `generate` and `regenerate` | attune-author | todo | Match existing CLI style |
+| 1.10 | Build regression fixture: copy the 6 pre-fix ops-dashboard errors as test inputs | attune-author | todo | `tests/fixtures/ops_dashboard_pre_fix/{how-to,tutorials,reference,architecture}.md` |
+| 1.11 | Test: each check fires on the matching fixture error | attune-author | todo | `test_python_refs_catches_underscore_module`, `test_cli_refs_catches_invented_flag`, `test_md_links_catches_missing_target`, `test_numeric_refs_catches_invented_count` |
+| 1.11.1 | Test: CLI-ref finding contains version-coupling messaging | attune-author | todo | Assert installed version + override snippet appear in finding text |
+| 1.12 | Test: zero findings on post-fix ops-dashboard versions | attune-author | todo | Pull from attune-ai PR #351 head |
+| 1.13 | Test: soft-fail writes the block; strict mode raises | attune-author | todo | Two test cases |
+| 1.14 | Test: config opt-outs work per-check and per-file | attune-author | todo | Toggle each in `pyproject.toml` test fixture |
+| 1.15 | Update CHANGELOG with the four checks and the soft-fail default | attune-author | todo | Reference attune-ai PR #351 as motivation |
+| 1.16 | Update README with a short "Fact-check" section + one example output | attune-author | todo | Keep it scannable; full docs go in attune-author's own help corpus later |
+
+### Phase 1 testing strategy
+
+- Pytest unit tests per check. Mock `importlib.import_module` only for
+  edge cases (e.g., a module that imports successfully but then raises);
+  prefer real imports against the actual attune package installed in
+  the test venv.
+- Regression fixture frozen in-repo: the 4 pre-fix ops-dashboard docs
+  serve as ground truth. Test asserts that running fact-check on those
+  files produces ≥ the specific findings list in
+  `tests/fixtures/ops_dashboard_findings.yaml`.
+- No external network in tests. `cli_refs` runs `attune <cmd> --help`
+  against the locally-installed attune; this is acceptable because
+  attune-author already declares attune-ai as a dev dep.
+
+### Phase 1 exit checklist
+
+- [ ] All tasks 1.1–1.16 done
+- [ ] CI green
+- [ ] Regression fixture: **5/6 ops-dashboard errors caught** (Python
+      refs ×2 + CLI refs ×1 + Markdown links ×1 + numeric claims ×1).
+      The 6th error (missing-security-callout for `0.0.0.0`) is
+      explicitly Phase 3 scope.
+- [ ] Zero findings on post-fix ops-dashboard versions
+- [ ] CLI-ref findings include version-coupling messaging (verified by
+      test 1.11.1)
+- [ ] CHANGELOG + README updated
+- [ ] Spec status updated to `complete (Phase 1)` here
+
+---
+
+## Phase 2: Ground-truth context injection
+
+**Target PR scope:** ~400 LOC including tests. Depends on Phase 1 (uses
+the same regression fixture but is otherwise independent of fact-check
+code).
+
+| # | Task | Layer | Status | Notes |
+|---|------|-------|--------|-------|
+| 2.1 | Add `cli_command` field to `Feature` (the manifest model) | attune-author | todo | Optional; absence skips CLI-help injection |
+| 2.2 | Implement `ground_truth.extract_cli_help(cli_cmd, subcommand, project_root)` | attune-author | todo | `subprocess.run(...)` with timeout; cache per (cmd, subcommand) pair |
+| 2.3 | Implement `ground_truth.extract_public_api(source_paths)` | attune-author | todo | AST-walk for `__all__` + non-underscore-prefixed defs |
+| 2.4 | Implement `ground_truth.extract_dataclasses(source_paths)` | attune-author | todo | AST-walk for `@dataclass`; collect field names + type strings |
+| 2.5 | Add `<cli_help>`, `<public_api>`, `<dataclasses>` sentinel blocks to polish prompt builder | attune-author | todo | Match existing context-block format |
+| 2.6 | Add system-prompt anchoring clause | attune-author | todo | "Ground-truth context blocks contain surface details — names you use must appear verbatim" |
+| 2.7 | Implement 5KB context budget enforcement with drop order | attune-author | todo | Log warning on drop; never fail |
+| 2.8 | Add `[tool.attune-author.context-injection]` config + CLI flags | attune-author | todo | Defaults: all three sources on, 5KB budget |
+| 2.9 | Test: ground-truth extractors produce expected output on ops-dashboard source | attune-author | todo | Snapshot tests |
+| 2.10 | Test: polishing ops-dashboard with Phase 2 on, Phase 1 off recurs 0/3 high-severity errors | attune-author | todo | The acceptance gate from `design.md` |
+| 2.11 | Test: budget enforcement drops sources in documented order | attune-author | todo | Artificial 1KB cap forces drops |
+| 2.12 | Cost-delta measurement: 3-feature regression set with vs without Phase 2 | attune-author | todo | Record in CHANGELOG; should be < 10% |
+| 2.13 | Update CHANGELOG + README | attune-author | todo | |
+
+### Phase 2 exit checklist
+
+- [ ] Tasks 2.1–2.13 done
+- [ ] 0/3 high-severity ops-dashboard errors recur in Phase-2-only polish
+- [ ] Cost delta < 10%
+- [ ] Spec status updated
+
+---
+
+## Phase 3: Faithfulness judge integration
+
+**Target PR scope:** ~300 LOC including tests. Depends on Phase 1 for
+the `FactCheckReport` plumbing.
+
+| # | Task | Layer | Status | Notes |
+|---|------|-------|--------|-------|
+| 3.1 | Add faithfulness-threshold + budget-cap config to `[tool.attune-author.fact-check]` | attune-author | todo | Default threshold `0.95`; default cap `$0.10/feature` |
+| 3.2 | Implement `faithfulness.judge_polished_file(polished_path, source_paths, config)` wrapper | attune-author | todo | Wraps `attune_rag.eval.faithfulness.FaithfulnessJudge` |
+| 3.3 | Calibrate threshold against ops-dashboard fixture | attune-author | todo | Pre-fix should score < 0.9 mean; post-fix ≥ 0.95 |
+| 3.4 | Document calibration result in `decisions.md` (or design doc) | attune-author | todo | Pre-committed matrix entry; concrete numbers |
+| 3.5 | Wire judge into post-polish pipeline (after Phase 1 fact-check) | attune-author | todo | Append `## Faithfulness review` block when below threshold |
+| 3.6 | Cost telemetry: aggregate per-feature judge cost; report at end of `regenerate` | attune-author | todo | Use existing telemetry hooks if any; otherwise log |
+| 3.7 | Test: judge runs and writes review block on a deliberately unfaithful synthetic input | attune-author | todo | Construct a polished file that contradicts the source |
+| 3.8 | Test: budget cap skips judge call when estimated cost exceeds threshold | attune-author | todo | |
+| 3.9 | Update CHANGELOG + README | attune-author | todo | |
+
+### Phase 3 exit checklist
+
+- [ ] Tasks 3.1–3.9 done
+- [ ] Calibration shows clean separation between pre-fix and post-fix
+      fixture scores
+- [ ] Threshold + cap configurable
+- [ ] Spec status updated
+
+---
+
+## Phase 4: Tutorial code-sample static check
+
+**Target PR scope:** ~250 LOC including tests. Depends on Phase 1 for
+plumbing.
+
+| # | Task | Layer | Status | Notes |
+|---|------|-------|--------|-------|
+| 4.1 | Add `tutorial_static_check.check(polished_path, project_root)` to `fact_check/` package | attune-author | todo | Operates only on `docs/tutorials/*.md` |
+| 4.2 | Code-fence extractor: pull all ```python fences with line numbers | attune-author | todo | Skip fences with `# attune-author: skip-mypy` first line |
+| 4.3 | `ast.parse` each fence; collect syntax errors as findings | attune-author | todo | Cheap pre-check before invoking mypy |
+| 4.4 | Run `mypy --strict --no-error-summary` per fence | attune-author | todo | Subprocess; timeout 10s; capture stderr |
+| 4.5 | Parse mypy output into findings | attune-author | todo | Map line numbers back to original fence position |
+| 4.6 | Strip `# attune-author: skip-mypy` directives before publication | attune-author | todo | Apply only to the file written; preserve in source if any |
+| 4.7 | Add `[tool.attune-author.fact-check.tutorial_static]` config | attune-author | todo | `enabled`, `mypy_args`, `timeout_seconds` |
+| 4.8 | Test: pre-fix `tutorials/ops-dashboard.md` flags `_readers` + `_models` imports | attune-author | todo | The headline acceptance gate |
+| 4.9 | Test: post-fix version produces zero errors | attune-author | todo | |
+| 4.10 | Test: `skip-mypy` directive is honored and stripped from output | attune-author | todo | |
+| 4.11 | Test: total static-check time per tutorial < 10s | attune-author | todo | Bench against the ops-dashboard tutorial |
+| 4.12 | Update CHANGELOG + README | attune-author | todo | Note Phase 4.2 (execution) explicitly deferred |
+| 4.13 | Add design.md follow-up section on Phase 4.2 execution tiers | attune-author | todo | Reference the security + perf walkthrough from spec discussion |
+
+### Phase 4 exit checklist
+
+- [ ] Tasks 4.1–4.13 done
+- [ ] Pre-fix fixture flagged correctly; post-fix clean
+- [ ] Per-tutorial check time < 10s
+- [ ] Spec status updated; full umbrella spec marked `complete`
+
+---
+
+## Cross-phase notes
+
+### Testing strategy across all phases
+
+- One regression fixture (the ops-dashboard editorial pass diff) used
+  by all phases. Lives at `tests/fixtures/ops_dashboard_pre_fix/` and
+  `tests/fixtures/ops_dashboard_post_fix/`.
+- No mocking of LLM calls in integration tests. Mock at the unit-test
+  level (`anthropic.Anthropic`) per the regen-pipeline pattern.
+- CI runs all four checks on every PR after Phase 4 lands; before that,
+  each phase's CI lane is its own job.
+
+### Rollback strategy
+
+Each phase has its own `enabled = true|false` toggle. If a phase causes
+unexpected breakage in production regens, the operator can disable it
+in `pyproject.toml` without touching code. Phase 1 is the only phase
+whose disablement loses notable safety; the others gracefully degrade
+to "no extra check."
+
+### Sequencing rationale
+
+Why this order: Phase 1 is the cheapest (no LLM, deterministic) and
+catches the most distinct error types. Phase 2 has higher impact per
+LLM-token but only matters if Phase 1's findings show consistent
+hallucination patterns. Phase 3 needs Phase 1's plumbing for the
+report-block format. Phase 4 is tutorial-specific and benefits from
+Phase 1's `FactCheckReport` already existing.