From 684d4789f534a234f94f560f06b23d61893d92bd Mon Sep 17 00:00:00 2001 From: GeneAI Date: Thu, 14 May 2026 13:55:09 -0400 Subject: [PATCH] =?UTF-8?q?spec(docs):=20polish-fact-check=20umbrella=20sp?= =?UTF-8?q?ec=20=E2=80=94=204=20phases=20to=20reduce=20hallucinations?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds docs/specs/polish-fact-check/ — an umbrella spec for a four-phase intervention ladder that shifts polish-pass verification work from human editorial review to automated checks. Motivated by a regression fixture from attune-ai PR #351 (Smart-AI-Memory/attune-ai#351), where the ops-dashboard regen produced six factual errors in a single feature's docs: - 1 hallucinated CLI flag (--allow-run, real flag is --read-only) - 2 hallucinated private module paths (attune.ops._readers) - 4 hallucinated cross-references - 1 hallucinated count (498 templates vs real 259) - 2 wrong route paths - 1 insecure example (0.0.0.0 without auth callout) Three of the six (CLI flag, private imports, wrong routes) would actively break readers who follow the docs literally. Four phases, each shipping as its own PR: Phase 1: AST-based post-generation fact-check (Python refs, CLI flags, Markdown links, numeric claims) — catches 5 of 6 fixture errors. Cheapest, no LLM cost. Phase 2: Ground-truth context injection into polish prompt (CLI --help output, __all__, dataclass fields). Phase 3: Adapt attune-rag faithfulness judge as a polish post-step. Catches the 6th fixture error (missing-security-callout). Phase 4: Static analysis of tutorial code samples (mypy + ast.parse). Execution tiers explicitly deferred to Phase 4.2 for security reasons. Files: - requirements.md — problem statement, scope, acceptance - design.md — architecture, per-phase API shapes, open design questions - tasks.md — numbered tasks per phase, exit checklists - decisions.md — pre-committed decision matrix (introduces a spec-file convention) Status: draft. Awaiting review/approval before Phase 1 implementation begins. Co-Authored-By: Claude Opus 4.7 --- docs/specs/polish-fact-check/decisions.md | 43 ++ docs/specs/polish-fact-check/design.md | 408 +++++++++++++++++++ docs/specs/polish-fact-check/requirements.md | 129 ++++++ docs/specs/polish-fact-check/tasks.md | 183 +++++++++ 4 files changed, 763 insertions(+) create mode 100644 docs/specs/polish-fact-check/decisions.md create mode 100644 docs/specs/polish-fact-check/design.md create mode 100644 docs/specs/polish-fact-check/requirements.md create mode 100644 docs/specs/polish-fact-check/tasks.md diff --git a/docs/specs/polish-fact-check/decisions.md b/docs/specs/polish-fact-check/decisions.md new file mode 100644 index 0000000..1784f85 --- /dev/null +++ b/docs/specs/polish-fact-check/decisions.md @@ -0,0 +1,43 @@ +# Spec: Polish Fact-Check — Decisions + +> Pre-committed decisions per the existing lesson "Pre-committed +> decision matrices survive contact with data." Edits to this file +> after Phase 1 ships require a follow-up PR with rationale. + +--- + +## Decision matrix + +| Decision | Choice | Rationale | +|---|---|---| +| Phase 1 default failure mode | Soft-fail (`## Unresolved references` block at bottom of file) | Lets us measure noise vs signal before tightening the gate. Mirrors test-quality-program's "measure first, gate later" rubric pattern. | +| Phase 1 strict-fail escalation criterion | Move to strict-fail if soft-fail rate drops below 5% across two consecutive **weekly** regens | Weekly cadence matches the help-system's intended regen rhythm; monthly would delay the escalation decision unnecessarily. | +| Phase 1 numeric-claim check | Required (not stretch) | Patrick tightened the acceptance gate from 4/6 to 5/6 errors caught at Phase 1 ship. Numeric claims are AST-pattern-detectable; only the "missing security callout" failure mode stays for Phase 3. | +| Phase 1 CLI-ref version coupling | Acceptable, with proactive user messaging | When a flag isn't found in `attune --help`, the finding message includes (a) the installed attune-ai version, (b) instructions to verify against the target version, (c) override snippet. See `design.md` § Check 2. | +| Phase 3 default faithfulness threshold | `0.95` (mean across paragraphs in a single file) | Untested at spec-draft time. Will be **calibrated** against the ops-dashboard regression fixture in Phase 3 task #3.3 before defaulting. If calibration shows pre-fix mean ≥ 0.9 or post-fix mean < 0.95, the threshold gets re-decided. | +| Phase 3 threshold override mechanism | `pyproject.toml` `[tool.attune-author.fact-check]` + per-invocation CLI flag | Two-level override: project-wide config for sustained policy, CLI flag for one-off runs. | +| Phase 3 budget cap | Skip judge call if estimated cost > `$0.10` for a single feature regen | Hard cap protects against unexpected cost when regenerating a feature with many kinds. Configurable. | +| Phase 4 default | Tier 0 (static analysis only); execution requires explicit opt-in | Static analysis catches the documented failure modes (e.g. `_readers` private-module hallucinations) without executing untrusted LLM-generated code. Execution tiers documented in design.md for Phase 4.2 follow-up. | +| Phase 4 execution opt-in mechanism | `# attune-author: exec` frontmatter on individual code samples | Sample-level granularity, not file-level — keeps the human reviewer responsible for confirming each blessed sample has no side effects. | +| Spec-file convention going forward | Include `decisions.md` alongside `requirements.md` / `design.md` / `tasks.md` | Patrick's call (2026-05-14). Extracts pre-committed decisions from the spec body so they're easy to audit and update independently. | + +--- + +## Calibration record + +To be filled in during Phase 3 implementation: + +- [ ] **Phase 3 threshold calibration** — Phase 3 task #3.3 / #3.4 + - Pre-fix ops-dashboard mean faithfulness score: _TBD_ + - Post-fix ops-dashboard mean faithfulness score: _TBD_ + - Default threshold after calibration: _TBD_ + +--- + +## Decision-change log + +> Append entries here when a decision above is revised. Reference the PR +> that revised it. + +- 2026-05-14 — Initial decisions captured during spec draft. Patrick + approved. diff --git a/docs/specs/polish-fact-check/design.md b/docs/specs/polish-fact-check/design.md new file mode 100644 index 0000000..6c27a34 --- /dev/null +++ b/docs/specs/polish-fact-check/design.md @@ -0,0 +1,408 @@ +# Spec: Polish Fact-Check — Design + +## Phase 2: Design + +**Status**: draft + +--- + +## Overall architecture + +The four phases are independent layers stacked on the existing polish +pipeline. Each phase ships as its own PR and is opt-in via configuration +until Phase 4 (at which point all are on by default). + +``` +┌──────────────────────────────────────────────────────────────────┐ +│ Existing polish pipeline (attune_author/polish.py) │ +│ │ +│ source files ──► prompt builder ──► LLM ──► polished output │ +│ ▲ │ │ +│ │ │ │ +└──────────── Phase 2 ───┼──────────── Phase 1 ─────┼──────────────┘ + │ │ + │ ▼ + ┌──────────┴──────────┐ ┌──────────────────────┐ + │ Ground-truth │ │ AST fact-check pass │ + │ context injection │ │ (post-generation) │ + │ • CLI --help │ │ • imports │ + │ • __all__ list │ │ • CLI flags │ + │ • dataclass fields│ │ • md links │ + └─────────────────────┘ │ • numeric claims* │ + └──────────┬───────────┘ + │ + ▼ + ┌──────────────────────┐ + │ Phase 3: faithfulness│ + │ judge (attune-rag) │ + └──────────┬───────────┘ + │ + ▼ + ┌──────────────────────┐ + │ Phase 4: static check│ + │ of tutorial samples │ + │ (mypy + ast.parse) │ + └──────────────────────┘ + +* Numeric-claim checking is Phase 1 stretch; documented but skippable. +``` + +--- + +## Phase 1 — AST-based post-generation fact-check + +### Module layout + +New module `src/attune_author/fact_check/` containing: + +``` +fact_check/ +├── __init__.py # public entry: check_polished_file(path, source_paths) +├── python_refs.py # import-statement and dotted-path verification +├── cli_refs.py # `attune --flag` verification +├── md_links.py # [label](target.md) target existence +└── report.py # collect findings, format soft-fail block +``` + +### Public API + +```python +def check_polished_file( + polished_path: Path, + source_paths: list[Path], + *, + project_root: Path, + config: FactCheckConfig | None = None, +) -> FactCheckReport +``` + +- `polished_path` — the just-written polished markdown file +- `source_paths` — the source `.py` files the polish pass had as context +- `project_root` — used to resolve relative markdown links and run + ` --help` invocations +- `config` — optional config; defaults come from `pyproject.toml` + +Returns a `FactCheckReport` with a list of `Finding(severity, location, +message)` entries. The caller decides whether to soft-fail (append to +file) or strict-fail (raise). + +### Check 1: Python imports and dotted-path references + +For each Python code fence and inline `attune.X.Y` reference in the +polished file: + +1. Extract import statements with `ast.parse` and walk the tree. +2. For each `from X import Y`, attempt `importlib.import_module(X)` in + the active venv, then `getattr(module, Y)`. Failure = unresolved. +3. For each prose reference matching `attune\.[a-z_.]+\.[A-Za-z_]+`, + resolve the full dotted path the same way. + +Why import in the venv rather than parse-only: catches the +`attune.ops._readers` class of bug where the path *parses* fine but +doesn't actually exist. This was the most damaging failure mode in the +ops-dashboard fixture and parse-only AST won't catch it. + +### Check 2: CLI flag references + +For each `attune ` pattern in the polished file: + +1. Build a cache: for each subcommand referenced, run + `attune --help` once and parse the output for flag + names (regex: `--[a-z][a-z0-9-]*`). +2. For each `--flag` in the polished doc, confirm it appears in the + cached help output for that subcommand. +3. Unknown subcommands are themselves a finding. + +Cache scope: per-file, so a regen of one feature doesn't reinvoke +`--help` many times. Cache invalidation isn't needed (cache lives only +for the duration of the check call). + +#### Version coupling — user-facing messaging + +CLI-ref checks resolve against whatever `attune-ai` is installed in the +active venv. If a consumer is regenerating docs against a different +attune-ai version, false positives are possible. Every CLI-ref finding +includes proactive context so the user can resolve the ambiguity +without spelunking: + +``` +Line 17 (prose): `attune ops --read-only` — flag not found. + +Detected against attune-ai 6.8.0 (installed in active venv). If you +are regenerating against a different attune-ai version, verify the +flag exists in that version's `attune ops --help`. + +To override: + - One-off: attune-author generate FEATURE --skip-check cli_refs + - Per file: [tool.attune-author.fact-check.skip] + "docs/how-to/ops-dashboard.md" = ["check_cli_refs"] +``` + +The installed version is read from `attune.__version__` (or the +`importlib.metadata` fallback). If the consumer CLI isn't `attune` — +e.g. a third-party project using attune-author against its own CLI — +the same template renders with that project's CLI name and version. + +### Check 3: Markdown link targets + +For each `[label](path/to.md)` or `[label](path/to.md#anchor)`: + +1. Resolve relative to `polished_path`'s directory. +2. Confirm the target file exists. +3. (Stretch: anchor existence by parsing target headers.) + +External URLs (`http://...`, `https://...`) are skipped — not in scope. + +### Check 4: Numeric claims + +Promoted from stretch to required (per decisions.md). Catches the +`498 templates`-class of hallucination where the LLM invents a count +that has no source-of-truth. + +For each sentence matching patterns like `\d+\s+(templates|features|workflows|skills|agents|workflows|tools|kinds)`: + +1. Extract the noun and the count. +2. For nouns we can verify deterministically, count against the project + filesystem: + - `templates` → `find .help/templates -name "*.md" | wc -l` + - `features` → number of top-level keys in `.help/features.yaml` + - `workflows` → `list_workflows()` from the consumer's registry (if + declared in `features.yaml`) + - `kinds` → constant from attune-author (currently 11) +3. Severity: `error` if the doc's count doesn't match the verified count, + `warning` if the noun isn't in the verifiable set (still surfaced for + human review). + +For nouns we *can't* verify (e.g., "thousands of LLM calls"), emit a +`warning` severity finding asking the human to confirm. + +The exact noun-to-resolver mapping lives in +`fact_check/numeric_refs.py` and is extensible per consumer. + +### Soft-fail output format + +When findings exist, append to the polished file before the closing +`` comment: + +```markdown +## Unresolved references + +> Auto-generated by attune-author fact-check. Review and either fix the +> source code, fix this doc, or add an override. + +| Location | Severity | Issue | +|---|---|---| +| Line 77 (code fence) | error | `from attune.ops._readers import …` — module not found | +| Line 17 (prose) | error | `attune ops --allow-run` — flag not in `attune ops --help` | +| Line 124 (See also) | warning | `[Concept: Template design patterns](concepts/template-patterns.md)` — target file does not exist | +``` + +### Configuration + +```toml +[tool.attune-author.fact-check] +enabled = true +soft_fail = true # false = raise on findings + +check_python_refs = true +check_cli_refs = true +check_md_links = true +check_numeric_claims = false # stretch; default off + +# Per-file or per-feature opt-out +[tool.attune-author.fact-check.skip] +"docs/architecture/some-feature.md" = ["check_md_links"] +``` + +CLI overrides for one-off runs: + +```bash +attune-author generate FEATURE --fact-check=strict +attune-author generate FEATURE --no-fact-check +``` + +### Phase 1 acceptance + +1. `check_polished_file` exists, callable from a test. +2. Running it on the four ops-dashboard docs from attune-ai PR #351 + produces findings that match the editorial pass diff: **5 of the 6** + errors fixed in `20438e8d` are flagged (Python refs ×2, CLI refs ×1, + Markdown link ×1, numeric claim ×1). The 6th error + (missing-security-callout for `0.0.0.0`) is explicitly Phase 3 scope. +3. Running it on the post-fix versions (current main of + `feat/ops-dashboard-help-templates`) produces zero error-severity + findings. +4. Soft-fail mode writes the unresolved-references block; strict mode + raises `FactCheckError`. +5. Every CLI-ref finding includes the version-coupling messaging block + (installed version + override snippet). + +--- + +## Phase 2 — Ground-truth context injection + +### Hook point + +`attune_author/polish.py` builds prompts in `_build_polish_prompt` (or +equivalent — verified during implementation). Inject ground-truth +context as additional `` blocks in the prompt before the +existing source-file block. + +### Context sources + +For a feature being polished: + +1. **CLI help**: if `attune ` is the consumer's CLI entry + for this feature (configured per-feature in `features.yaml` as + `cli_command: ops`), run ` --help` once + and inject the full output under a `` sentinel tag. +2. **Public API**: for each source `.py` file, extract `__all__` (if + defined) and the signatures of public functions/classes (no + underscore prefix). Inject under a `` sentinel. +3. **Dataclass fields**: for any `@dataclass` in the source files, + extract field names and types. Inject under a `` + sentinel. + +### Prompt anchoring + +Add a system-prompt clause (per the existing attune-rag +citation-forced-prompting pattern): + +> The following context blocks contain **ground-truth surface details** +> for this feature. When you reference a CLI flag, public function, +> import path, or dataclass field, it MUST appear verbatim in the +> matching context block. If you need to describe something that isn't +> in the ground truth, describe the behavior without inventing a +> specific name. + +### Context budget + +Cap injected context at 5KB total (configurable). Measured against the +ops-dashboard fixture: rendered `--help` is ~1.5KB, `__all__` + signatures +is ~2-3KB, dataclasses ~1KB. Fits comfortably. + +If budget exceeded, drop in this order: dataclasses → public API +signatures → `--help`. Drop with a log warning so the operator sees it. + +### Phase 2 acceptance + +1. Polishing `ops-dashboard` with Phase 2 enabled and Phase 1 disabled + produces docs where 0 of the 3 high-severity errors from the fixture + recur (CLI flag, private imports, route paths). +2. Total polish cost increases by less than 10% on average across the + regression fixture (3 features). +3. Context-budget violations log a warning but don't fail the polish. + +--- + +## Phase 3 — Faithfulness judge + +### Integration point + +attune-author depends on attune-rag (already; see +`pyproject.toml`). Import `attune_rag.eval.faithfulness.FaithfulnessJudge` +and wrap as a polish-pipeline post-step. + +```python +from attune_rag.eval.faithfulness import FaithfulnessJudge + +judge = FaithfulnessJudge(model="claude-haiku-4-5-20251001") +result = judge.score( + answer=polished_text, + sources=[src.read_text() for src in source_paths], +) +if result.mean_score < config.faithfulness_threshold: + # Append review block to file; do not block commit + ... +``` + +### Threshold calibration + +Before defaulting, run the judge against: + +- The pre-Phase-1 ops-dashboard fixture (6 errors): mean score should + be < 0.9. +- The post-fix versions (after `20438e8d`): mean score should be ≥ 0.95. + +If those two don't bracket cleanly, raise the threshold or tune the +prompt before defaulting. + +### Budget cap + +Estimated cost per file: ~$0.01-0.05 on Haiku 4.5. Per-feature +(11 kinds): ~$0.11-0.55. Per full regen (9 features): ~$1-5. + +Hard cap: skip the judge call if estimated cost exceeds $0.10 for a +single feature. The cap is configurable. + +### Phase 3 acceptance + +1. Judge runs on every polished file and writes a `## Faithfulness + review` block when mean score is below threshold. +2. Cost telemetry is reported at the end of each `attune-author + regenerate` run. +3. Threshold is configurable in `pyproject.toml` and via CLI flag. + +--- + +## Phase 4 — Tutorial code-sample static check + +### Scope + +Only `docs/tutorials/.md` files generated by attune-author. Other +docs (how-to, reference, architecture) may have code samples but +tutorials are where reader follow-along expectations are highest. + +### Static check pipeline + +For each polished tutorial: + +1. Extract all ` ```python ` fences. +2. For each fence: + a. `ast.parse(code)` — must succeed (syntax check). + b. Write to a temp file; run `mypy --strict --no-error-summary` with + attune installed in the active venv. + c. Collect failures into `FactCheckReport` findings with severity + `error` (mypy errors) or `warning` (mypy notes). + +### Sample opt-out + +For samples that intentionally use unresolved types (e.g., illustrative +pseudocode), add frontmatter inside the fence: + +```python +# attune-author: skip-mypy +some_function_we_havent_built_yet() +``` + +The line is stripped before publication. + +### No execution in Phase 4.1 + +Explicit non-goal: Phase 4.1 does **not** execute any sample. Reasoning +documented in the requirements doc (security + performance). + +### Phase 4 acceptance + +1. Running on the pre-edit tutorial `docs/tutorials/ops-dashboard.md` + flags both `_readers` and `_models` imports (mypy will report them + as unresolved imports against the installed `attune` package). +2. Running on the post-edit version produces zero errors. +3. Static check time per tutorial < 10s. + +--- + +## Open design questions (resolve before Phase 1 implementation) + +1. **`pyproject.toml` vs `.attune-author.toml`**: the regen-pipeline spec + uses env vars; we should match its config-file convention if one + exists. Confirm during Phase 1 task #1. +2. **Import resolution in the venv**: `importlib.import_module` against + the active venv works but couples the check to whichever attune-ai + version is installed. If a consumer is regenerating against an older + attune-ai, false negatives are possible. Acceptable tradeoff; + document. +3. **Phase 2 prompt-budget measurement**: do we measure context size in + characters, tokens, or both? Tokens are more accurate but require + running tokenizer. Start with characters; add tokenizer if drift + suggests it. diff --git a/docs/specs/polish-fact-check/requirements.md b/docs/specs/polish-fact-check/requirements.md new file mode 100644 index 0000000..7fd1f5b --- /dev/null +++ b/docs/specs/polish-fact-check/requirements.md @@ -0,0 +1,129 @@ +# Spec: Polish Fact-Check + +> Reduce attune-author polish-pass hallucinations through automated +> verification. Umbrella spec; ships as four sequential PRs (Phases 1–4). + +--- + +## Phase 1: Requirements + +**Status**: draft + +### Problem statement + +The attune-author polish pass routinely invents plausible-sounding surface +details that don't exist in the source files it was given. Concrete evidence +from a single feature regen (attune-ai's `ops-dashboard`, 2026-05-14, see +[attune-ai PR #351](https://github.com/Smart-AI-Memory/attune-ai/pull/351)): + +| Failure class | Count | Example | +|---|---|---| +| Hallucinated CLI flag | 1 | `--allow-run` (real: `--read-only`, inverted semantics) | +| Hallucinated private module path | 2 | `from attune.ops._readers import …` (`ModuleNotFoundError`) | +| Hallucinated cross-references | 4 | `Concept: Template design patterns` (no such doc) | +| Hallucinated count | 1 | `498 templates` (real: 259) | +| Wrong route path | 2 | `POST /run` (real: `POST /workflows/{name}/run`) | +| Insecure example | 1 | `host="0.0.0.0"` with no auth callout | + +Six distinct factual errors in a single feature's docs. Of these, three +(the CLI flag, the private import, and the wrong route) actively break +readers who follow the documentation literally. The current mitigation is +a manual editorial pass per feature — expensive, doesn't scale to the +remaining 9 stale features, and worse: it doesn't scale to a +weekly-or-faster regen cadence which is the whole premise of the living +help system. + +All six failure modes share a pattern: the LLM is filling in surrounding +scaffolding from priors rather than being constrained to the source +files it was given. The fix is to **shift verification work from human +review to automated checks**, while keeping the polish pass's freedom to +phrase, organize, and elaborate. + +### Scope + +**In scope:** + +- A four-phase intervention ladder, each phase shipping as its own PR: + 1. **AST-based post-generation fact-check** — Python-AST + CLI-help + + Markdown-link verification of polished output. Soft-fail (emit an + `## Unresolved references` block) initially; configurable to + strict-fail later. + 2. **Inject ground-truth context into the polish prompt** — for any + feature with a CLI surface, render `--help` output and inject it + under a `` sentinel tag in the prompt. Same for module + `__all__` and dataclass field lists. + 3. **Adapt the attune-rag faithfulness judge to polish output** — run + each polished file through `attune_rag.eval.faithfulness.FaithfulnessJudge` + against the source files; flag for review when score is below a + configurable threshold (default `0.95`). + 4. **Static analysis of tutorial code samples** — for `docs/tutorials/*.md` + specifically, extract Python code fences and run `mypy --strict` + + `ast.parse` against each. Catches the entire `_readers`/`_models` + hallucination class without executing untrusted code. +- A regression fixture: the ops-dashboard editorial pass diffs from + attune-ai PR #351 form a ground-truth set. Every check must catch the + errors that pass actually fixed. + +**Out of scope:** + +- Phase 4.2: actual execution of tutorial samples (Tier 1–3 sandboxing). + Discussed in the design doc as a future follow-up; gated on Phase 4.1 + data showing static analysis isn't sufficient. +- Polish prompt-engineering changes unrelated to ground-truth injection + (Phase 2 is narrowly scoped to context-injection, not prompt rewriting). +- Changes to the attune-rag faithfulness judge itself; Phase 3 only + *uses* the existing judge. +- Cost-side changes to the polish pass — none of the four phases changes + per-feature LLM cost beyond Phase 3's additional Haiku call (~$0.01-0.05 + per file). + +### Acceptance criteria + +**Per-phase exit criteria** are documented in `design.md` and `tasks.md`. +For the umbrella spec to be considered complete: + +1. **Phase 1 ships and the ops-dashboard regression fixture is reduced + from 6 errors → ≤1 error**. The remaining error is the + missing-security-callout for the `0.0.0.0` example — a genuinely + different failure shape (missing content, not wrong content) that + Phase 3 (faithfulness judge) is better suited to catch. Phase 1 + covers 5 of 6: Python refs, CLI refs, Markdown links, and **numeric + claims** (promoted from stretch to required per the decision matrix). +2. **Phases 2–4 ship in order**, each with its own PR and its own + regression delta against the same fixture. +3. **No regression in polish output quality** — measured by spot-checking + 3 features post-Phase-4 against pre-Phase-1 versions. The polish pass + should write *less* invented scaffolding, not less useful content. +4. **All four checks are configurable** — thresholds, severities, and + opt-out per-feature via `pyproject.toml` `[tool.attune-author.fact-check]`. + +### Non-goals / explicitly deferred + +- **Strict-fail by default in Phase 1**. The soft-fail default lets us + measure noise vs signal before tightening the gate. Pattern matches + the test-quality-program rubric's "measure first, gate later" + approach. +- **CI integration**. All four checks run during `attune-author generate` + / `attune-author regenerate`. CI integration (failing builds when + docs/ has unresolved references) is a follow-up after Phase 4 lands. +- **Generalizing beyond attune-ai's docs**. The fact-check operates on + any feature's generated templates; it doesn't assume the consumer is + attune-ai. But the regression fixture comes from attune-ai's + ops-dashboard, and we won't try to validate against arbitrary + third-party usage in this spec. + +### Decisions + +Pre-committed decisions live in [`decisions.md`](./decisions.md) and are +the source of truth. Calibration records and decision-change history +also live there. + +### Risks + +| Risk | Severity | Mitigation | +|---|---|---| +| AST checks produce too many false positives (soft-fail noise drowns signal) | Med | Soft-fail blocks at file bottom are scannable; track soft-fail rate per regen and tune resolvers | +| Phase 2 context injection blows polish prompt budget | Low | Measure context size before/after on the ops-dashboard fixture; cap injection at 5KB | +| Phase 3 faithfulness judge disagrees with our regression fixture | Med | Calibrate threshold against the ops-dashboard fixture before defaulting; document calibration | +| Phase 4 mypy false positives on legitimate `# type: ignore` patterns | Low | Allow `# attune-author: skip-mypy` frontmatter on individual samples | +| Bundled umbrella spec gates Phase 1 on broader approval than it needs | Low | Phase 1 framed as the "buy your way to value first" entry; explicitly approvable without committing to Phases 2–4 | diff --git a/docs/specs/polish-fact-check/tasks.md b/docs/specs/polish-fact-check/tasks.md new file mode 100644 index 0000000..d957c87 --- /dev/null +++ b/docs/specs/polish-fact-check/tasks.md @@ -0,0 +1,183 @@ +# Spec: Polish Fact-Check — Tasks + +## Phase 3: Tasks + +**Status**: draft + +> Four phases, each shipping as its own PR. Phase 1 is the recommended +> first commitment; Phases 2–4 build on it. Phase 1 can be approved and +> shipped independently of Phases 2–4. + +--- + +## Phase 1: AST-based post-generation fact-check + +**Target PR scope:** ~600 LOC including tests. + +| # | Task | Layer | Status | Notes | +|---|------|-------|--------|-------| +| 1.1 | Decide config-file location (`pyproject.toml` vs new `.attune-author.toml`) | attune-author | todo | Match regen-pipeline convention; document decision in PR | +| 1.2 | Create `src/attune_author/fact_check/` package skeleton with `__init__.py`, `python_refs.py`, `cli_refs.py`, `md_links.py`, `numeric_refs.py`, `report.py` | attune-author | todo | One module per check + shared `FactCheckReport` dataclass | +| 1.3 | Implement `python_refs.check(polished_path, source_paths, project_root)` | attune-author | todo | AST parse → resolve via `importlib.import_module` in active venv | +| 1.4 | Implement `cli_refs.check(polished_path, project_root)` | attune-author | todo | Per-file cache of `attune --help` output; regex extract flag names. **Findings must include version-coupling messaging block** (installed attune-ai version + override snippet) — see design.md | +| 1.5 | Implement `md_links.check(polished_path, project_root)` | attune-author | todo | Resolve relative links; confirm target file exists | +| 1.5.1 | Implement `numeric_refs.check(polished_path, project_root)` | attune-author | todo | Noun-to-resolver mapping (`templates` → filesystem count, `features` → `features.yaml` key count, etc.). Severity: `error` on mismatch, `warning` on unverifiable nouns | +| 1.6 | Implement `report.format_unresolved_block(findings)` | attune-author | todo | Markdown table; severity column; appended above `` | +| 1.7 | Wire into `attune_author/polish.py` after the polish write | attune-author | todo | Soft-fail: append to file. Strict mode: raise `FactCheckError` | +| 1.8 | Add `[tool.attune-author.fact-check]` config schema + parser | attune-author | todo | `enabled`, `soft_fail`, per-check toggles, skip-list | +| 1.9 | Add `--fact-check=strict` / `--no-fact-check` CLI flags to `generate` and `regenerate` | attune-author | todo | Match existing CLI style | +| 1.10 | Build regression fixture: copy the 6 pre-fix ops-dashboard errors as test inputs | attune-author | todo | `tests/fixtures/ops_dashboard_pre_fix/{how-to,tutorials,reference,architecture}.md` | +| 1.11 | Test: each check fires on the matching fixture error | attune-author | todo | `test_python_refs_catches_underscore_module`, `test_cli_refs_catches_invented_flag`, `test_md_links_catches_missing_target`, `test_numeric_refs_catches_invented_count` | +| 1.11.1 | Test: CLI-ref finding contains version-coupling messaging | attune-author | todo | Assert installed version + override snippet appear in finding text | +| 1.12 | Test: zero findings on post-fix ops-dashboard versions | attune-author | todo | Pull from attune-ai PR #351 head | +| 1.13 | Test: soft-fail writes the block; strict mode raises | attune-author | todo | Two test cases | +| 1.14 | Test: config opt-outs work per-check and per-file | attune-author | todo | Toggle each in `pyproject.toml` test fixture | +| 1.15 | Update CHANGELOG with the four checks and the soft-fail default | attune-author | todo | Reference attune-ai PR #351 as motivation | +| 1.16 | Update README with a short "Fact-check" section + one example output | attune-author | todo | Keep it scannable; full docs go in attune-author's own help corpus later | + +### Phase 1 testing strategy + +- Pytest unit tests per check. Mock `importlib.import_module` only for + edge cases (e.g., a module that imports successfully but then raises); + prefer real imports against the actual attune package installed in + the test venv. +- Regression fixture frozen in-repo: the 4 pre-fix ops-dashboard docs + serve as ground truth. Test asserts that running fact-check on those + files produces ≥ the specific findings list in + `tests/fixtures/ops_dashboard_findings.yaml`. +- No external network in tests. `cli_refs` runs `attune --help` + against the locally-installed attune; this is acceptable because + attune-author already declares attune-ai as a dev dep. + +### Phase 1 exit checklist + +- [ ] All tasks 1.1–1.16 done +- [ ] CI green +- [ ] Regression fixture: **5/6 ops-dashboard errors caught** (Python + refs ×2 + CLI refs ×1 + Markdown links ×1 + numeric claims ×1). + The 6th error (missing-security-callout for `0.0.0.0`) is + explicitly Phase 3 scope. +- [ ] Zero findings on post-fix ops-dashboard versions +- [ ] CLI-ref findings include version-coupling messaging (verified by + test 1.11.1) +- [ ] CHANGELOG + README updated +- [ ] Spec status updated to `complete (Phase 1)` here + +--- + +## Phase 2: Ground-truth context injection + +**Target PR scope:** ~400 LOC including tests. Depends on Phase 1 (uses +the same regression fixture but is otherwise independent of fact-check +code). + +| # | Task | Layer | Status | Notes | +|---|------|-------|--------|-------| +| 2.1 | Add `cli_command` field to `Feature` (the manifest model) | attune-author | todo | Optional; absence skips CLI-help injection | +| 2.2 | Implement `ground_truth.extract_cli_help(cli_cmd, subcommand, project_root)` | attune-author | todo | `subprocess.run(...)` with timeout; cache per (cmd, subcommand) pair | +| 2.3 | Implement `ground_truth.extract_public_api(source_paths)` | attune-author | todo | AST-walk for `__all__` + non-underscore-prefixed defs | +| 2.4 | Implement `ground_truth.extract_dataclasses(source_paths)` | attune-author | todo | AST-walk for `@dataclass`; collect field names + type strings | +| 2.5 | Add ``, ``, `` sentinel blocks to polish prompt builder | attune-author | todo | Match existing context-block format | +| 2.6 | Add system-prompt anchoring clause | attune-author | todo | "Ground-truth context blocks contain surface details — names you use must appear verbatim" | +| 2.7 | Implement 5KB context budget enforcement with drop order | attune-author | todo | Log warning on drop; never fail | +| 2.8 | Add `[tool.attune-author.context-injection]` config + CLI flags | attune-author | todo | Defaults: all three sources on, 5KB budget | +| 2.9 | Test: ground-truth extractors produce expected output on ops-dashboard source | attune-author | todo | Snapshot tests | +| 2.10 | Test: polishing ops-dashboard with Phase 2 on, Phase 1 off recurs 0/3 high-severity errors | attune-author | todo | The acceptance gate from `design.md` | +| 2.11 | Test: budget enforcement drops sources in documented order | attune-author | todo | Artificial 1KB cap forces drops | +| 2.12 | Cost-delta measurement: 3-feature regression set with vs without Phase 2 | attune-author | todo | Record in CHANGELOG; should be < 10% | +| 2.13 | Update CHANGELOG + README | attune-author | todo | | + +### Phase 2 exit checklist + +- [ ] Tasks 2.1–2.13 done +- [ ] 0/3 high-severity ops-dashboard errors recur in Phase-2-only polish +- [ ] Cost delta < 10% +- [ ] Spec status updated + +--- + +## Phase 3: Faithfulness judge integration + +**Target PR scope:** ~300 LOC including tests. Depends on Phase 1 for +the `FactCheckReport` plumbing. + +| # | Task | Layer | Status | Notes | +|---|------|-------|--------|-------| +| 3.1 | Add faithfulness-threshold + budget-cap config to `[tool.attune-author.fact-check]` | attune-author | todo | Default threshold `0.95`; default cap `$0.10/feature` | +| 3.2 | Implement `faithfulness.judge_polished_file(polished_path, source_paths, config)` wrapper | attune-author | todo | Wraps `attune_rag.eval.faithfulness.FaithfulnessJudge` | +| 3.3 | Calibrate threshold against ops-dashboard fixture | attune-author | todo | Pre-fix should score < 0.9 mean; post-fix ≥ 0.95 | +| 3.4 | Document calibration result in `decisions.md` (or design doc) | attune-author | todo | Pre-committed matrix entry; concrete numbers | +| 3.5 | Wire judge into post-polish pipeline (after Phase 1 fact-check) | attune-author | todo | Append `## Faithfulness review` block when below threshold | +| 3.6 | Cost telemetry: aggregate per-feature judge cost; report at end of `regenerate` | attune-author | todo | Use existing telemetry hooks if any; otherwise log | +| 3.7 | Test: judge runs and writes review block on a deliberately unfaithful synthetic input | attune-author | todo | Construct a polished file that contradicts the source | +| 3.8 | Test: budget cap skips judge call when estimated cost exceeds threshold | attune-author | todo | | +| 3.9 | Update CHANGELOG + README | attune-author | todo | | + +### Phase 3 exit checklist + +- [ ] Tasks 3.1–3.9 done +- [ ] Calibration shows clean separation between pre-fix and post-fix + fixture scores +- [ ] Threshold + cap configurable +- [ ] Spec status updated + +--- + +## Phase 4: Tutorial code-sample static check + +**Target PR scope:** ~250 LOC including tests. Depends on Phase 1 for +plumbing. + +| # | Task | Layer | Status | Notes | +|---|------|-------|--------|-------| +| 4.1 | Add `tutorial_static_check.check(polished_path, project_root)` to `fact_check/` package | attune-author | todo | Operates only on `docs/tutorials/*.md` | +| 4.2 | Code-fence extractor: pull all ```python fences with line numbers | attune-author | todo | Skip fences with `# attune-author: skip-mypy` first line | +| 4.3 | `ast.parse` each fence; collect syntax errors as findings | attune-author | todo | Cheap pre-check before invoking mypy | +| 4.4 | Run `mypy --strict --no-error-summary` per fence | attune-author | todo | Subprocess; timeout 10s; capture stderr | +| 4.5 | Parse mypy output into findings | attune-author | todo | Map line numbers back to original fence position | +| 4.6 | Strip `# attune-author: skip-mypy` directives before publication | attune-author | todo | Apply only to the file written; preserve in source if any | +| 4.7 | Add `[tool.attune-author.fact-check.tutorial_static]` config | attune-author | todo | `enabled`, `mypy_args`, `timeout_seconds` | +| 4.8 | Test: pre-fix `tutorials/ops-dashboard.md` flags `_readers` + `_models` imports | attune-author | todo | The headline acceptance gate | +| 4.9 | Test: post-fix version produces zero errors | attune-author | todo | | +| 4.10 | Test: `skip-mypy` directive is honored and stripped from output | attune-author | todo | | +| 4.11 | Test: total static-check time per tutorial < 10s | attune-author | todo | Bench against the ops-dashboard tutorial | +| 4.12 | Update CHANGELOG + README | attune-author | todo | Note Phase 4.2 (execution) explicitly deferred | +| 4.13 | Add design.md follow-up section on Phase 4.2 execution tiers | attune-author | todo | Reference the security + perf walkthrough from spec discussion | + +### Phase 4 exit checklist + +- [ ] Tasks 4.1–4.13 done +- [ ] Pre-fix fixture flagged correctly; post-fix clean +- [ ] Per-tutorial check time < 10s +- [ ] Spec status updated; full umbrella spec marked `complete` + +--- + +## Cross-phase notes + +### Testing strategy across all phases + +- One regression fixture (the ops-dashboard editorial pass diff) used + by all phases. Lives at `tests/fixtures/ops_dashboard_pre_fix/` and + `tests/fixtures/ops_dashboard_post_fix/`. +- No mocking of LLM calls in integration tests. Mock at the unit-test + level (`anthropic.Anthropic`) per the regen-pipeline pattern. +- CI runs all four checks on every PR after Phase 4 lands; before that, + each phase's CI lane is its own job. + +### Rollback strategy + +Each phase has its own `enabled = true|false` toggle. If a phase causes +unexpected breakage in production regens, the operator can disable it +in `pyproject.toml` without touching code. Phase 1 is the only phase +whose disablement loses notable safety; the others gracefully degrade +to "no extra check." + +### Sequencing rationale + +Why this order: Phase 1 is the cheapest (no LLM, deterministic) and +catches the most distinct error types. Phase 2 has higher impact per +LLM-token but only matters if Phase 1's findings show consistent +hallucination patterns. Phase 3 needs Phase 1's plumbing for the +report-block format. Phase 4 is tutorial-specific and benefits from +Phase 1's `FactCheckReport` already existing.