# Add progressive_debate method: adversarial adjudication replaces consolidation #37
r-uben wants to merge 2 commits into ChicagoHAI:main
…olidation

Instead of a single LLM call to prune duplicates (which loses 43% of real findings), each comment goes through a challenge-verdict debate:

1. Challenger (small model): argues the finding is not a real issue
2. Verdict (large model): weighs both sides, decides keep/drop
3. Merge: cluster surviving duplicates

Addresses ChicagoHAI#35 (consolidation quality) and ChicagoHAI#36 (multi-agent debate proposal).
… drops

Debate matches zero-shot recall (11/12) while using progressive discovery. Key result: `param_sigma` and `param_labor_share`, which consolidation pruned, are correctly kept by adversarial adjudication. 16/54 comments (30%) were dropped by targeted challenge-verdict pairs, vs 72% dropped by monolithic consolidation.
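As a sketch, the per-comment challenge-verdict loop from the commit above might look like the following. `call_challenger` and `call_judge` are hypothetical stand-ins for the small- and large-model clients (stubbed here so the control flow runs), not the PR's actual API:

```python
def call_challenger(finding: str) -> str:
    # Hypothetical small-model call: argues the finding is not a real issue.
    return f"Challenge: '{finding}' may restate a caveat the paper already makes."

def call_judge(finding: str, challenge: str) -> str:
    # Hypothetical large-model call: weighs finding vs. challenge, answers KEEP or DROP.
    return "KEEP"

def debate_filter(comments: list[dict]) -> list[dict]:
    """Keep only the comments whose findings survive a challenge-verdict pair."""
    survivors = []
    for comment in comments:
        challenge = call_challenger(comment["text"])
        verdict = call_judge(comment["text"], challenge)
        if verdict.strip().upper().startswith("KEEP"):
            survivors.append(comment)
    return survivors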
## Broader Vision: Multi-Agent Debate for Academic Review

This PR is a minimal implementation: one challenger, one judge, sequential processing. But the core idea (adversarial pressure improves calibration) has strong support in the literature and could be pushed much further.

### What the literature says

Du et al. (2023), "Improving Factuality and Reasoning in Language Models through Multiagent Debate", showed that multiple LLM instances proposing, critiquing, and revising answers over several rounds significantly improve factual accuracy and reasoning. Their "society of minds" approach reduces hallucinations, which is exactly our false-positive problem.

Chan et al. (2023), "ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate", applied multi-agent debate specifically to evaluation tasks. Key finding: diverse role prompts (different personas) are essential; using the same role description degrades performance. This suggests our Reviewer/Challenger/Judge should have genuinely different perspectives, not just different instructions.

Zhang, Kraska & Khattab (2025), "Recursive Language Models", tackle the context-length problem through recursive decomposition: processing documents in chunks with structured state passing between levels. Their RLM architecture maintains quality at 1M+ tokens where flat models degrade. Our progressive running summary is a crude version of this; a proper recursive architecture could handle much longer papers.

Tillman (2025), "Literature Review of Multi-Agent Debate for Problem Solving", synthesizes the field and identifies key open questions: how to terminate discussions, how systems scale with agents and rounds, and how communication structure affects performance. Our fixed 2-round protocol (challenge → verdict) is the simplest possible; the literature suggests dynamic termination (debate until convergence) and scaling to 3-5 agents improve results.

### Concrete improvements
The 92% recall on seeded errors is encouraging, but the real test is the Refine benchmark, where progressive consolidation drops from 44% to 25%. If debate can close that gap, it would be a significant step toward LLMs that catch what human reviewers catch.
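A minimal sketch of the dynamic-termination idea mentioned above: keep running challenge rounds until the judge's verdict stabilizes, rather than stopping after a fixed challenge-verdict pair. The function names, the toy agents, and the convergence rule (two identical consecutive verdicts) are illustrative assumptions, not part of this PR:

```python
def debate_until_convergence(finding, challenger, judge, max_rounds=5):
    """Run challenge rounds until the judge's verdict stabilizes or the cap is hit."""
    transcript = []  # accumulated challenger arguments
    verdict = None
    for _ in range(max_rounds):
        transcript.append(challenger(finding, transcript))
        new_verdict = judge(finding, transcript)
        if new_verdict == verdict:  # same verdict twice in a row: converged
            return new_verdict
        verdict = new_verdict
    return verdict  # round cap reached without convergence

# Toy agents: the judge flips to DROP once it has seen two challenges.
challenger = lambda finding, transcript: f"challenge #{len(transcript) + 1}"
judge = lambda finding, transcript: "KEEP" if len(transcript) < 2 else "DROP"
```

With these toy agents, `debate_until_convergence("finding", challenger, judge)` converges to `"DROP"` after three rounds.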
## Summary

Adds a `progressive_debate` review method that replaces monolithic consolidation with per-comment adversarial adjudication.

## Motivation

The current consolidation step in `review_progressive` loses 43% of real findings on the Refine benchmark (44% → 25% LLM recall). It operates on raw JSON without paper context, so it cannot distinguish true positives from noise. See #35 and #36.

## Benchmark Results (Seeded-Perturbation)
**Key result:** `param_sigma` and `param_labor_share`, which consolidation pruned, are correctly kept by adversarial adjudication. The debate dropped 16/54 comments (30%) through targeted challenge-verdict pairs, vs 72% dropped by monolithic consolidation.

## Usage
```bash
openaireview review paper.pdf --method debate
openaireview review paper.pdf --method debate_full   # pre-debate raw comments
openaireview review paper.pdf --method debate --small-model anthropic/claude-haiku-4-5
```

## Files Changed
- `src/reviewer/method_debate.py`: new method with challenge/verdict/merge prompts
- `src/reviewer/cli.py`: wire up `debate` and `debate_full` methods + `--small-model` flag
- `benchmarks/seed_errors.py`: add debate support to benchmark runner
- `benchmarks/REPORT.md`: document seeded-perturbation benchmark with all three methods

## Test plan
(`from reviewer.method_debate import review_progressive_debate`)