Add seeded-perturbation benchmark for ground-truth evaluation#34

Open
r-uben wants to merge 2 commits into ChicagoHAI:main from r-uben:feat/seeded-benchmark
Conversation


r-uben (Contributor) commented Mar 10, 2026

Summary

Addresses the evaluation gap identified in the ROADMAP: "One of the big problems in this space is that there is no public benchmark for what thorough reviews should look like."

The existing Refine benchmark uses expert comments as ground truth, but matching is noisy and the errors are subjective. This PR adds a complementary seeded-perturbation benchmark with known ground truth: take a clean paper, programmatically inject specific errors, run the reviewer, and measure exactly what it catches.

How it works

  1. Start with a clean, math-heavy academic paper (Nakamura & Steinsson 2018, QJE, 48 pages)
  2. Inject 12 known errors across 5 categories:
    • Sign flips (3): flipped minus to plus in Euler equation, removed leading negative in habits equation, changed kappa*omega to kappa/omega
    • Parameter errors (5): beta 0.99->0.95, sigma 0.5->5, habit b 0.9->0.09, phi_pi 0.01->1.5, labor share 2/3->1/3
    • Definition errors (2): output gap y_t->c_t, information fraction 1-psi->1+psi
    • Subscript swap (1): lambda_t^n -> lambda_s^n
    • Overstated claim (1): "persistent" -> "permanent"
  3. Run the reviewer on the perturbed paper
  4. Score: which injected errors were detected? How many false positives?
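The inject-then-score loop above can be sketched as follows. This is an illustrative simplification, not the actual API of benchmarks/seed_errors.py: the function names (`inject`, `score`) and the exact `SeededError` field semantics are assumptions.

```python
# Hypothetical sketch of the seeded-perturbation loop: replace known
# strings in a clean paper, then check which injected errors any
# reviewer comment mentions. Names are illustrative, not the real API.
from dataclasses import dataclass

@dataclass
class SeededError:
    id: str
    category: str
    original: str   # exact string in the clean paper
    perturbed: str  # string it is replaced with
    hint: str       # keyword a detecting comment should contain

def inject(text: str, errors: list[SeededError]) -> str:
    """Apply each perturbation once; fail loudly if the target is absent."""
    for e in errors:
        assert e.original in text, f"{e.id}: original text not found"
        text = text.replace(e.original, e.perturbed, 1)
    return text

def score(comments: list[str], errors: list[SeededError]):
    """An error counts as detected if any comment mentions its hint."""
    detected = {e.id for e in errors
                if any(e.hint.lower() in c.lower() for c in comments)}
    return detected, len(detected) / len(errors)
```

A dry run (step 4's scoring) then reduces to calling `score` on the reviewer's comment list and comparing the detected set against all 12 injected IDs.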

Results: Zero-Shot (Claude Opus 4.6)

| Metric | Value |
| --- | --- |
| Injected errors | 12 |
| Detected | 11 (92%) |
| Missed | 1 |
| Total comments | 10 |
| False positives | 5 (50%) |

| Category | Recall |
| --- | --- |
| Sign flips | 3/3 (100%) |
| Parameter errors | 5/5 (100%) |
| Definition errors | 2/2 (100%) |
| Subscript swaps | 1/1 (100%) |
| Overstated claims | 0/1 (0%) |

The only miss was the claim overstatement ("persistent" -> "permanent"), which requires economic judgment rather than mathematical consistency checking. Of the 5 "false positives," several flag real issues in the original paper (e.g., a standard error discrepancy between text and table).

What's included

  • benchmarks/seed_errors.py: error injection + automated scoring framework
  • benchmarks/REPORT.md: updated with methodology, results, and analysis

Extending

The framework is designed to be reusable:

  • Each error is a simple SeededError(id, category, original, perturbed, hint) dataclass
  • Adding errors for a new paper = defining new entries in the ERRORS list
  • Scoring is automated via keyword matching between comments and injected error regions
  • Supports all review methods (--method zero_shot/progressive/progressive_full)
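Extending to a new paper then looks roughly like the sketch below. The field layout mirrors the `SeededError(id, category, original, perturbed, hint)` dataclass named above, but the entries and the keyword matcher here are simplified stand-ins, not the framework's actual code.

```python
# Hypothetical ERRORS entries for a new paper, plus a simplified
# keyword matcher standing in for the framework's scorer.
from dataclasses import dataclass

@dataclass
class SeededError:
    id: str
    category: str
    original: str
    perturbed: str
    hint: str  # keyword a detecting comment should contain

# Adding errors for a new paper = defining new entries like these.
ERRORS = [
    SeededError("sign_01", "sign_flip",
                r"c_t = -\sigma r_t", r"c_t = \sigma r_t", "sign"),
    SeededError("param_01", "parameter",
                r"\beta = 0.99", r"\beta = 0.95", "0.95"),
]

def detected(comment: str, error: SeededError) -> bool:
    """Keyword matching between a reviewer comment and one injected error."""
    return error.hint.lower() in comment.lower()
```

Because scoring is plain string matching, the main cost of a new paper is choosing `original` strings unique enough to replace safely and `hint` keywords specific enough to avoid inflating recall.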

Test plan

  • Dry run: all 12/12 errors injected successfully
  • Zero-shot benchmark: 92% recall, results reproducible
  • Progressive benchmark: run in progress (will update)

Injects 12 known errors into a clean paper (Nakamura & Steinsson 2018)
and measures reviewer recall against ground truth. Error categories:
sign flips, parameter errors, definition inconsistencies, subscript
swaps, and overstated claims.

Zero-shot result: 92% recall (11/12 detected), only missing an
overstated claim. All math and parameter errors caught with substantive
explanations.

Includes:
- benchmarks/seed_errors.py: injection + scoring framework
- benchmarks/REPORT.md: updated with seeded benchmark methodology and results

Progressive catches the claim overstatement that zero-shot missed but loses
two parameter errors during consolidation. Documents the complementary
strengths of both methods and motivates adversarial adjudication (ChicagoHAI#35, ChicagoHAI#36).
