test: harden demo oracles and clarify results docs#124

Merged
rlippmann merged 17 commits into main from codex/0.6.15
May 6, 2026

@rlippmann
Owner

What changed

  • Hardened demo oracle/checker logic against false failures triggered by harmless wording variants in scored demos:
    • Demo 02 prohibited-content negation handling
    • Demo 03 stale-premise negation handling
    • Demo 04 tool tag normalization robustness
    • Demo 05/07 premise-tag normalization robustness
  • Expanded oracle-focused tests (including Hypothesis/property coverage) and baseline/comparison parity checks across demo paths.
  • Updated documentation to make demo evidence easier to discover while reducing duplication:
    • Added a concise top-level "Does it work?" summary in README.md
    • Added canonical demo-results reference page: docs/demos-results.md
    • Linked README.md and demos/README.md to canonical results
    • Replaced weak Demo 05 example with a real long-context drift example snippet
    • Removed redundant/stale evidence summaries from README.md
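
As an illustration of the negation handling mentioned for Demo 02 and Demo 03, here is a minimal sketch of a wording-tolerant check. It is not the repository's actual oracle: the function name `mentions_prohibited`, the cue list `NEGATION_CUES`, and the four-word look-back window are all assumptions made for this example. The idea is that a prohibited phrase counts as a violation only when no negation cue appears shortly before it in the same sentence, so "I will not share the password" passes while "I will share the password" still fails.

```python
import re

# Hypothetical cue list; the PR does not show the oracle's real one.
NEGATION_CUES = frozenset(
    {"not", "never", "no", "cannot", "can't", "won't", "don't", "refuse", "without"}
)


def mentions_prohibited(text: str, phrase: str, window: int = 4) -> bool:
    """Return True if `phrase` occurs at least once without a nearby negation.

    A match is treated as negated when one of NEGATION_CUES appears among the
    `window` words that precede it within the same sentence.
    """
    lowered = text.lower()
    for match in re.finditer(re.escape(phrase.lower()), lowered):
        start = match.start()
        # Look back only to the previous sentence boundary, so a "no" in an
        # earlier sentence cannot excuse an affirmative statement in this one.
        sentence_start = max(lowered.rfind(ch, 0, start) for ch in ".!?\n") + 1
        preceding = re.findall(r"[a-z']+", lowered[sentence_start:start])[-window:]
        if not any(word in NEGATION_CUES for word in preceding):
            return True  # affirmative occurrence: still a strict failure
    return False
```

Limiting the look-back to the current sentence is one way to keep the check strict: a reply such as "No problem. I will share the password." is still flagged, because the "No" belongs to an earlier sentence.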

Why

  • Live runs exposed repeated false failures caused by brittle wording-sensitive oracles rather than engine behavior.
  • The oracle hardening and property tests reduce regressions from harmless text variants while preserving strict failure behavior for genuinely unsafe/incorrect outputs.
  • Documentation updates provide a single source of truth for results and improve discoverability without duplicating matrices across files.
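
The combination of tag normalization and property-style testing described above can be sketched roughly as follows. This is an assumption-laden illustration, not the project's code: the `key:value` tag shape, the names `normalize_tag`, `random_variant`, and `check_variants`, and the set of tolerated variations (case, padding, brackets, separator spacing) are all hypothetical. The PR mentions Hypothesis for property coverage; to keep the sketch dependency-free, a seeded `random.Random` generator stands in for a Hypothesis strategy here.

```python
import random
import re


def _canon(part: str) -> str:
    """Lowercase, trim, and collapse runs of spaces or hyphens to one underscore."""
    return re.sub(r"[\s\-]+", "_", part.strip().lower())


def normalize_tag(raw: str) -> str:
    """Canonicalize a tag such as ' [Tool: Web-Search] ' to 'tool:web_search'."""
    inner = raw.strip().strip("[]<>").strip()
    key, sep, value = inner.partition(":")
    return f"{_canon(key)}:{_canon(value)}" if sep else _canon(inner)


def random_variant(canonical: str, rng: random.Random) -> str:
    """Produce a harmless rewording of a canonical 'key:value' tag."""
    key, value = canonical.split(":")

    def recase(text: str) -> str:
        return "".join(c.upper() if rng.random() < 0.5 else c for c in text)

    joiner = rng.choice(["_", " ", "-"])       # word separator inside the value
    sep = rng.choice([":", ": ", " : "])       # spacing around the colon
    left, right = rng.choice([("", ""), ("[", "]"), ("<", ">")])
    pad = " " * rng.randint(0, 2)
    body = f"{recase(key)}{sep}{recase(value.replace('_', joiner))}"
    return f"{pad}{left}{body}{right}{pad}"


def check_variants(canonical: str, trials: int = 200, seed: int = 0) -> bool:
    """Property: every generated variant normalizes back to the canonical tag."""
    rng = random.Random(seed)
    return all(
        normalize_tag(random_variant(canonical, rng)) == canonical
        for _ in range(trials)
    )
```

A property like this guards against only one direction of error (harmless variants being rejected); pairing it with a check that a genuinely different tag, for example `tool:calculator`, does not normalize to `tool:web_search` keeps the strict failure behavior intact.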

Checklist

  • pre-commit run (uv run pre-commit run --all-files)
  • tests pass (uv run pytest)

@rlippmann rlippmann merged commit 1ba708c into main May 6, 2026
12 checks passed
@rlippmann rlippmann deleted the codex/0.6.15 branch May 6, 2026 05:43