Add seeded-perturbation benchmark for ground-truth evaluation #34
Open

r-uben wants to merge 2 commits into ChicagoHAI:main from
Conversation
Injects 12 known errors into a clean paper (Nakamura & Steinsson 2018) and measures reviewer recall against ground truth. Error categories: sign flips, parameter errors, definition inconsistencies, subscript swaps, and overstated claims.

Zero-shot result: 92% recall (11/12 detected), missing only an overstated claim. All math and parameter errors were caught with substantive explanations.

Includes:

- benchmarks/seed_errors.py: injection + scoring framework
- benchmarks/REPORT.md: updated with seeded benchmark methodology and results
This was referenced Mar 10, 2026
Progressive catches claim overstatement that zero-shot missed but loses two parameter errors during consolidation. Documents the complementary strengths of both methods and motivates adversarial adjudication (ChicagoHAI#35, ChicagoHAI#36).
Summary
Addresses the evaluation gap identified in the ROADMAP: "One of the big problems in this space is that there is no public benchmark for what thorough reviews should look like."
The existing Refine benchmark uses expert comments as ground truth, but matching is noisy and the error labels are subjective. This PR adds a complementary seeded-perturbation benchmark with known ground truth: take a clean paper, programmatically inject specific errors, run the reviewer, and measure exactly what it catches.
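The inject-then-score loop can be sketched as follows. This is a minimal illustration, not the PR's actual implementation: the `SeededError` field names mirror the dataclass described later in this PR, while `inject` and `score` are hypothetical helpers, and hint-keyword matching is an assumed (simplified) detection criterion.

```python
from dataclasses import dataclass


@dataclass
class SeededError:
    id: str
    category: str   # e.g. "sign_flip", "parameter_error", "overstated_claim"
    original: str   # span as it appears in the clean paper
    perturbed: str  # span after the error is injected
    hint: str       # keyword a reviewer comment should mention to count as a catch


def inject(clean_text: str, errors: list[SeededError]) -> str:
    """Replace each original span with its perturbed version (first match only)."""
    text = clean_text
    for err in errors:
        text = text.replace(err.original, err.perturbed, 1)
    return text


def score(review_comments: list[str], errors: list[SeededError]) -> float:
    """Recall: fraction of seeded errors that some comment hints at."""
    detected = sum(
        any(err.hint.lower() in comment.lower() for comment in review_comments)
        for err in errors
    )
    return detected / len(errors)
```

Because the perturbations are known exactly, recall is computed against hard ground truth rather than noisy expert-comment matching.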
How it works
Results: Zero-Shot (Claude Opus 4.6)
The only miss was the claim overstatement ("persistent" -> "permanent"), which requires economic judgment rather than mathematical consistency checking. Of the 5 "false positives," several flag real issues in the original paper (e.g., a standard error discrepancy between text and table).
What's included
- benchmarks/seed_errors.py: error injection + automated scoring framework
- benchmarks/REPORT.md: updated with methodology, results, and analysis

Extending
The framework is designed to be reusable:
- SeededError(id, category, original, perturbed, hint) dataclass
- ERRORS list
- --method flag (zero_shot/progressive/progressive_full)

Test plan
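Adding a new seeded error would look roughly like this. A hedged sketch: it assumes the `SeededError` dataclass and `ERRORS` list from benchmarks/seed_errors.py, with field names taken from the PR description; the specific error values are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class SeededError:
    id: str
    category: str
    original: str
    perturbed: str
    hint: str


# In seed_errors.py this list holds the 12 seeded perturbations;
# extending the benchmark is just appending another entry.
ERRORS: list[SeededError] = []

ERRORS.append(SeededError(
    id="subscript_swap_example",      # hypothetical example entry
    category="subscript_swap",
    original="x_{i,t}",
    perturbed="x_{t,i}",
    hint="subscript",
))
```

The scoring harness then picks the new entry up automatically, since injection and recall are both driven by the `ERRORS` list.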