Add seeded-perturbation benchmark for ground-truth evaluation#34

Open
r-uben wants to merge 2 commits into ChicagoHAI:main from r-uben:feat/seeded-benchmark
Conversation


r-uben (Contributor) commented Mar 10, 2026

Summary

Addresses the evaluation gap identified in the ROADMAP: "One of the big problems in this space is that there is no public benchmark for what thorough reviews should look like."

The existing Refine benchmark uses expert comments as ground truth, but matching is noisy and the errors are subjective. This PR adds a complementary seeded-perturbation benchmark with known ground truth: take a clean paper, programmatically inject specific errors, run the reviewer, and measure exactly what it catches.

How it works

  1. Start with a clean, math-heavy academic paper (Nakamura & Steinsson 2018, QJE, 48 pages)
  2. Inject 12 known errors across 5 categories:
    • Sign flips (3): flipped minus to plus in Euler equation, removed leading negative in habits equation, changed kappa*omega to kappa/omega
    • Parameter errors (5): beta 0.99->0.95, sigma 0.5->5, habit b 0.9->0.09, phi_pi 0.01->1.5, labor share 2/3->1/3
    • Definition errors (2): output gap y_t->c_t, information fraction 1-psi->1+psi
    • Subscript swap (1): lambda_t^n -> lambda_s^n
    • Overstated claim (1): "persistent" -> "permanent"
  3. Run the reviewer on the perturbed paper
  4. Score: which injected errors were detected? How many false positives?
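The inject-then-score loop above can be sketched as follows. This is an illustrative simplification, not the actual API of benchmarks/seed_errors.py: the function names (`inject`, `score`) and the exact `SeededError` field semantics are assumptions.

```python
# Hypothetical sketch of the seeded-perturbation loop: replace known
# strings in a clean paper, then check which injected errors any
# reviewer comment mentions. Names are illustrative, not the real API.
from dataclasses import dataclass

@dataclass
class SeededError:
    id: str
    category: str
    original: str   # exact string in the clean paper
    perturbed: str  # string it is replaced with
    hint: str       # keyword a detecting comment should contain

def inject(text: str, errors: list[SeededError]) -> str:
    """Apply each perturbation once; fail loudly if the target is absent."""
    for e in errors:
        assert e.original in text, f"{e.id}: original text not found"
        text = text.replace(e.original, e.perturbed, 1)
    return text

def score(comments: list[str], errors: list[SeededError]):
    """An error counts as detected if any comment mentions its hint."""
    detected = {e.id for e in errors
                if any(e.hint.lower() in c.lower() for c in comments)}
    return detected, len(detected) / len(errors)
```

A dry run (step 4's scoring) then reduces to calling `score` on the reviewer's comment list and comparing the detected set against all 12 injected IDs.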

Results: Zero-Shot (Claude Opus 4.6)

| Metric | Value |
| --- | --- |
| Injected errors | 12 |
| Detected | 11 (92%) |
| Missed | 1 |
| Total comments | 10 |
| False positives | 5 (50%) |

| Category | Recall |
| --- | --- |
| Sign flips | 3/3 (100%) |
| Parameter errors | 5/5 (100%) |
| Definition errors | 2/2 (100%) |
| Subscript swaps | 1/1 (100%) |
| Overstated claims | 0/1 (0%) |

The only miss was the claim overstatement ("persistent" -> "permanent"), which requires economic judgment rather than mathematical consistency checking. Of the 5 "false positives," several flag real issues in the original paper (e.g., a standard error discrepancy between text and table).

What's included

  • benchmarks/seed_errors.py: error injection + automated scoring framework
  • benchmarks/REPORT.md: updated with methodology, results, and analysis

Extending

The framework is designed to be reusable:

  • Each error is a simple SeededError(id, category, original, perturbed, hint) dataclass
  • Adding errors for a new paper = defining new entries in the ERRORS list
  • Scoring is automated via keyword matching between comments and injected error regions
  • Supports all review methods (--method zero_shot/progressive/progressive_full)
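Extending to a new paper then looks roughly like the sketch below. The field layout mirrors the `SeededError(id, category, original, perturbed, hint)` dataclass named above, but the entries and the keyword matcher here are simplified stand-ins, not the framework's actual code.

```python
# Hypothetical ERRORS entries for a new paper, plus a simplified
# keyword matcher standing in for the framework's scorer.
from dataclasses import dataclass

@dataclass
class SeededError:
    id: str
    category: str
    original: str
    perturbed: str
    hint: str  # keyword a detecting comment should contain

# Adding errors for a new paper = defining new entries like these.
ERRORS = [
    SeededError("sign_01", "sign_flip",
                r"c_t = -\sigma r_t", r"c_t = \sigma r_t", "sign"),
    SeededError("param_01", "parameter",
                r"\beta = 0.99", r"\beta = 0.95", "0.95"),
]

def detected(comment: str, error: SeededError) -> bool:
    """Keyword matching between a reviewer comment and one injected error."""
    return error.hint.lower() in comment.lower()
```

Because scoring is plain string matching, the main cost of a new paper is choosing `original` strings unique enough to replace safely and `hint` keywords specific enough to avoid inflating recall.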

Test plan

  • Dry run: all 12/12 errors injected successfully
  • Zero-shot benchmark: 92% recall, results reproducible
  • Progressive benchmark: run in progress (will update)

Injects 12 known errors into a clean paper (Nakamura & Steinsson 2018)
and measures reviewer recall against ground truth. Error categories:
sign flips, parameter errors, definition inconsistencies, subscript
swaps, and overstated claims.

Zero-shot result: 92% recall (11/12 detected), only missing an
overstated claim. All math and parameter errors caught with substantive
explanations.

Includes:
- benchmarks/seed_errors.py: injection + scoring framework
- benchmarks/REPORT.md: updated with seeded benchmark methodology and results

Progressive catches the claim overstatement that zero-shot missed but loses
two parameter errors during consolidation. Documents the complementary
strengths of both methods and motivates adversarial adjudication (ChicagoHAI#35, ChicagoHAI#36).
