
spec(docs): polish-fact-check — umbrella spec to reduce LLM polish-pass hallucinations#27

Open
silversurfer562 wants to merge 1 commit into main from feat/polish-fact-check

Conversation

@silversurfer562
Member

Summary

Draft umbrella spec at docs/specs/polish-fact-check/. Proposes a four-phase intervention ladder to shift polish-pass verification work from manual editorial review to automated checks.

Motivation

Regression fixture from attune-ai#351: a single attune-author regen of one feature (ops-dashboard, 11 templates + 4 published-site docs) produced factual errors in six distinct failure classes that required a manual editorial pass to fix:

| Failure class | Count | Example |
| --- | --- | --- |
| Hallucinated CLI flag | 1 | `--allow-run` (real: `--read-only`, inverted semantics) |
| Hallucinated private module path | 2 | `from attune.ops._readers import …` (`ModuleNotFoundError`) |
| Hallucinated cross-reference | 4 | `Concept: Template design patterns` (no such doc) |
| Hallucinated count | 1 | `498 templates` (real: 259) |
| Wrong route path | 2 | `POST /run` (real: `POST /workflows/{name}/run`) |
| Insecure example | 1 | `host="0.0.0.0"` with no auth callout |

Three of the six (CLI flag, private imports, wrong routes) actively break readers who follow the docs literally. The current mitigation (manual editorial review) doesn't scale to the 9 stale features queued or to the weekly regen cadence the living-help system requires.

Four phases, each shipping as its own PR

  1. AST-based post-generation fact-check — Python imports, CLI flags, Markdown links, numeric claims. Cheapest; no LLM cost. Catches 5 of 6 fixture errors (see the import-check sketch after this list).
  2. Ground-truth context injection into the polish prompt (CLI `--help` output, `__all__`, dataclass fields). Anchors the model against hallucinations rather than blocking them after the fact.
  3. Adapt the attune-rag faithfulness judge as a polish post-step. Catches the 6th fixture error (missing security callout). Configurable threshold; budget-capped.
  4. Static analysis of tutorial code samples (`mypy --strict` + `ast.parse`). Execution tiers are explicitly deferred to Phase 4.2 for security reasons documented in `design.md`.
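For a concrete sense of what the Phase 1 checks could look like, here is a minimal sketch of the import check, assuming polished output is plain Markdown with fenced `python` blocks. The function names and shape are illustrative only, not the spec's API; the real checker would also cover CLI flags, cross-reference links, and numeric claims.

```python
"""Sketch of the Phase 1 import check (names are illustrative, not the spec's API)."""

import ast
import importlib.util
import re
from pathlib import Path

# Simplistic fence matcher for the sketch; a real checker would use a Markdown parser.
FENCE_RE = re.compile(r"```python\n(.*?)```", re.DOTALL)


def extract_python_blocks(markdown: str) -> list[str]:
    """Pull fenced python blocks out of a polished Markdown doc."""
    return FENCE_RE.findall(markdown)


def unresolved_imports(code: str) -> list[str]:
    """Return imported module paths that do not resolve against the installed package."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return ["<unparseable code block>"]

    missing: list[str] = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            # find_spec resolves the dotted path without executing the target
            # module (parent packages may be imported), so a hallucinated path
            # like attune.ops._readers surfaces here as a finding.
            try:
                spec = importlib.util.find_spec(name)
            except (ImportError, ValueError):
                spec = None
            if spec is None:
                missing.append(name)
    return missing


def check_doc(path: Path) -> list[str]:
    findings: list[str] = []
    for block in extract_python_blocks(path.read_text()):
        findings.extend(unresolved_imports(block))
    return findings
```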

Files

  • `requirements.md` — problem statement, in-scope / out-of-scope, acceptance criteria, risks
  • `design.md` — architecture, per-phase API shapes, configuration schemas, three open design questions
  • `tasks.md` — numbered tasks per phase (1.1–1.16, 2.1–2.13, 3.1–3.9, 4.1–4.13), per-phase exit checklists, cross-phase testing strategy, rollback story
  • `decisions.md` — pre-committed decision matrix (introduces a new spec-file convention; previously decisions were captured inline in requirements/design)

Notes for review

  • Phase 1 is independently approvable. If Phases 2–4 stall or change scope, Phase 1 still ships 5/6 coverage of the fixture and is the highest-leverage single deliverable.
  • CLI-ref findings include version-coupling messaging (installed attune-ai version + override snippet) — see `design.md` § Check 2.
  • Phase 1 default failure mode is soft-fail (`## Unresolved references` block appended to file). Strict-fail escalation criterion is pre-committed in `decisions.md` (5% noise rate across two weekly regens).
  • Phase 3 threshold (0.95) is uncalibrated at draft time; calibration is task 3.3 against the same regression fixture. The calibration record lives in `decisions.md`.
  • Phase 4 explicitly does not execute LLM-generated code in Phase 4.1; a parse-and-type-check-only sketch follows these notes. Execution tiers (subprocess, network-blocked, Docker) are documented for Phase 4.2 but deferred.
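As a rough illustration of the Phase 4.1 boundary (parse and type-check, never execute), a check over extracted tutorial samples could look like the sketch below. The helper is hypothetical; `mypy` is assumed to be installed and is invoked only as a subprocess against a temp file, so the sample is never imported or run by the checker process.

```python
"""Sketch of a Phase 4.1 static check: parse and type-check a tutorial sample
without executing it (helper name and tiering are illustrative)."""

import ast
import subprocess
import tempfile
from pathlib import Path


def static_check(sample: str) -> list[str]:
    problems: list[str] = []

    # Tier 0: does the sample even parse?
    try:
        ast.parse(sample)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg} (line {exc.lineno})"]

    # Tier 1: mypy --strict in a subprocess; the sample is written to a temp
    # file and type-checked, never imported or executed here.
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "sample.py"
        target.write_text(sample)
        result = subprocess.run(
            ["mypy", "--strict", "--no-error-summary", str(target)],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            problems.extend(result.stdout.splitlines())

    return problems
```

Execution tiers (sandboxed subprocess, network-blocked, Docker) would slot in after these static tiers in Phase 4.2, per `design.md`.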

Status: `draft`. Awaiting review/approval before Phase 1 implementation begins.

🤖 Generated with Claude Code

…ucinations

Adds docs/specs/polish-fact-check/ — an umbrella spec for a
four-phase intervention ladder that shifts polish-pass
verification work from human editorial review to automated checks.

Motivated by a regression fixture from attune-ai PR #351
(Smart-AI-Memory/attune-ai#351), where the ops-dashboard regen
produced factual errors in six failure classes in a single feature's docs:
  - 1 hallucinated CLI flag (--allow-run, real flag is --read-only)
  - 2 hallucinated private module paths (attune.ops._readers)
  - 4 hallucinated cross-references
  - 1 hallucinated count (498 templates vs real 259)
  - 2 wrong route paths
  - 1 insecure example (0.0.0.0 without auth callout)

Three of the six (CLI flag, private imports, wrong routes) would
actively break readers who follow the docs literally.

Four phases, each shipping as its own PR:
  Phase 1: AST-based post-generation fact-check (Python refs,
           CLI flags, Markdown links, numeric claims) — catches
           5 of 6 fixture errors. Cheapest, no LLM cost.
  Phase 2: Ground-truth context injection into polish prompt
           (CLI --help output, __all__, dataclass fields).
  Phase 3: Adapt attune-rag faithfulness judge as a polish
           post-step. Catches the 6th fixture error
           (missing-security-callout).
  Phase 4: Static analysis of tutorial code samples (mypy +
           ast.parse). Execution tiers explicitly deferred to
           Phase 4.2 for security reasons.

Files:
  - requirements.md — problem statement, scope, acceptance
  - design.md       — architecture, per-phase API shapes,
                      open design questions
  - tasks.md        — numbered tasks per phase, exit checklists
  - decisions.md    — pre-committed decision matrix
                      (introduces a spec-file convention)

Status: draft. Awaiting review/approval before Phase 1
implementation begins.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
