
spec(docs): polish-fact-check — umbrella spec to reduce LLM polish-pass hallucinations#27

Open
silversurfer562 wants to merge 1 commit into main from feat/polish-fact-check

Conversation

@silversurfer562
Member

Summary

Draft umbrella spec at docs/specs/polish-fact-check/. Proposes a four-phase intervention ladder to shift polish-pass verification work from manual editorial review to automated checks.

Motivation

Regression fixture from attune-ai#351: a single attune-author regen of one feature (ops-dashboard, 11 templates + 4 published-site docs) produced factual errors in six distinct failure classes that required a manual editorial pass to fix:

| Failure class | Count | Example |
| --- | --- | --- |
| Hallucinated CLI flag | 1 | `--allow-run` (real: `--read-only`, inverted semantics) |
| Hallucinated private module path | 2 | `from attune.ops._readers import …` (`ModuleNotFoundError`) |
| Hallucinated cross-reference | 4 | `Concept: Template design patterns` (no such doc) |
| Hallucinated count | 1 | `498 templates` (real: 259) |
| Wrong route path | 2 | `POST /run` (real: `POST /workflows/{name}/run`) |
| Insecure example | 1 | `host="0.0.0.0"` with no auth callout |

Three of the six (CLI flag, private imports, wrong routes) actively break readers who follow the docs literally. The current mitigation (manual editorial review) doesn't scale to the 9 stale features queued or to the weekly regen cadence the living-help system requires.

Four phases, each shipping as its own PR

  1. AST-based post-generation fact-check — Python imports, CLI flags, Markdown links, numeric claims. Cheapest; no LLM cost. Catches 5 of 6 fixture errors (see the import-check sketch after this list).
  2. Ground-truth context injection into the polish prompt (CLI `--help` output, `__all__`, dataclass fields). Anchors the model against hallucinations rather than blocking them after the fact.
  3. Adapt the attune-rag faithfulness judge as a polish post-step. Catches the 6th fixture error (missing security callout). Configurable threshold; budget-capped.
  4. Static analysis of tutorial code samples (`mypy --strict` + `ast.parse`). Execution tiers are explicitly deferred to Phase 4.2 for security reasons documented in `design.md`.
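For a concrete sense of what the Phase 1 checks could look like, here is a minimal sketch of the import check, assuming polished output is plain Markdown with fenced `python` blocks. The function names and shape are illustrative only, not the spec's API; the real checker would also cover CLI flags, cross-reference links, and numeric claims.

```python
"""Sketch of the Phase 1 import check (names are illustrative, not the spec's API)."""

import ast
import importlib.util
import re
from pathlib import Path

# Simplistic fence matcher for the sketch; a real checker would use a Markdown parser.
FENCE_RE = re.compile(r"```python\n(.*?)```", re.DOTALL)


def extract_python_blocks(markdown: str) -> list[str]:
    """Pull fenced python blocks out of a polished Markdown doc."""
    return FENCE_RE.findall(markdown)


def unresolved_imports(code: str) -> list[str]:
    """Return imported module paths that do not resolve against the installed package."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return ["<unparseable code block>"]

    missing: list[str] = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            # find_spec resolves the dotted path without executing the target
            # module (parent packages may be imported), so a hallucinated path
            # like attune.ops._readers surfaces here as a finding.
            try:
                spec = importlib.util.find_spec(name)
            except (ImportError, ValueError):
                spec = None
            if spec is None:
                missing.append(name)
    return missing


def check_doc(path: Path) -> list[str]:
    findings: list[str] = []
    for block in extract_python_blocks(path.read_text()):
        findings.extend(unresolved_imports(block))
    return findings
```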

Files

  • `requirements.md` — problem statement, in-scope / out-of-scope, acceptance criteria, risks
  • `design.md` — architecture, per-phase API shapes, configuration schemas, three open design questions
  • `tasks.md` — numbered tasks per phase (1.1–1.16, 2.1–2.13, 3.1–3.9, 4.1–4.13), per-phase exit checklists, cross-phase testing strategy, rollback story
  • `decisions.md` — pre-committed decision matrix (introduces a new spec-file convention; previously decisions were captured inline in requirements/design)

Notes for review

  • Phase 1 is independently approvable. If Phases 2–4 stall or change scope, Phase 1 still ships 5/6 coverage of the fixture and is the highest-leverage single deliverable.
  • CLI-ref findings include version-coupling messaging (installed attune-ai version + override snippet) — see `design.md` § Check 2.
  • Phase 1 default failure mode is soft-fail (`## Unresolved references` block appended to file). Strict-fail escalation criterion is pre-committed in `decisions.md` (5% noise rate across two weekly regens).
  • Phase 3 threshold (0.95) is uncalibrated at draft time; calibration is task 3.3 against the same regression fixture. The calibration record lives in `decisions.md`.
  • Phase 4 explicitly does not execute LLM-generated code in Phase 4.1; a parse-and-type-check-only sketch follows these notes. Execution tiers (subprocess, network-blocked, Docker) are documented for Phase 4.2 but deferred.
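As a rough illustration of the Phase 4.1 boundary (parse and type-check, never execute), a check over extracted tutorial samples could look like the sketch below. The helper is hypothetical; `mypy` is assumed to be installed and is invoked only as a subprocess against a temp file, so the sample is never imported or run by the checker process.

```python
"""Sketch of a Phase 4.1 static check: parse and type-check a tutorial sample
without executing it (helper name and tiering are illustrative)."""

import ast
import subprocess
import tempfile
from pathlib import Path


def static_check(sample: str) -> list[str]:
    problems: list[str] = []

    # Tier 0: does the sample even parse?
    try:
        ast.parse(sample)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg} (line {exc.lineno})"]

    # Tier 1: mypy --strict in a subprocess; the sample is written to a temp
    # file and type-checked, never imported or executed here.
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "sample.py"
        target.write_text(sample)
        result = subprocess.run(
            ["mypy", "--strict", "--no-error-summary", str(target)],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            problems.extend(result.stdout.splitlines())

    return problems
```

Execution tiers (sandboxed subprocess, network-blocked, Docker) would slot in after these static tiers in Phase 4.2, per `design.md`.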

Status: `draft`. Awaiting review/approval before Phase 1 implementation begins.

🤖 Generated with Claude Code

…ucinations

Adds docs/specs/polish-fact-check/ — an umbrella spec for a
four-phase intervention ladder that shifts polish-pass
verification work from human editorial review to automated checks.

Motivated by a regression fixture from attune-ai PR #351
(Smart-AI-Memory/attune-ai#351), where the ops-dashboard regen
produced factual errors in six failure classes in a single feature's docs:
  - 1 hallucinated CLI flag (--allow-run, real flag is --read-only)
  - 2 hallucinated private module paths (attune.ops._readers)
  - 4 hallucinated cross-references
  - 1 hallucinated count (498 templates vs real 259)
  - 2 wrong route paths
  - 1 insecure example (0.0.0.0 without auth callout)

Three of the six (CLI flag, private imports, wrong routes) would
actively break readers who follow the docs literally.

Four phases, each shipping as its own PR:
  Phase 1: AST-based post-generation fact-check (Python refs,
           CLI flags, Markdown links, numeric claims) — catches
           5 of 6 fixture errors. Cheapest, no LLM cost.
  Phase 2: Ground-truth context injection into polish prompt
           (CLI --help output, __all__, dataclass fields).
  Phase 3: Adapt attune-rag faithfulness judge as a polish
           post-step. Catches the 6th fixture error
           (missing-security-callout).
  Phase 4: Static analysis of tutorial code samples (mypy +
           ast.parse). Execution tiers explicitly deferred to
           Phase 4.2 for security reasons.

Files:
  - requirements.md — problem statement, scope, acceptance
  - design.md       — architecture, per-phase API shapes,
                      open design questions
  - tasks.md        — numbered tasks per phase, exit checklists
  - decisions.md    — pre-committed decision matrix
                      (introduces a spec-file convention)

Status: draft. Awaiting review/approval before Phase 1
implementation begins.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
