
Add progressive_debate method: adversarial adjudication replaces consolidation#37

Open
r-uben wants to merge 2 commits into ChicagoHAI:main from r-uben:feat/progressive-debate

Conversation


@r-uben r-uben commented Mar 11, 2026

Summary

  • Adds progressive_debate review method that replaces monolithic consolidation with per-comment adversarial adjudication
  • Discovery phase is identical to the progressive method (sequential passage-by-passage with running summary)
  • For each raw comment, a Challenger (small model) argues it's not a real issue, then a Verdict (large model) decides keep/drop
  • Surviving comments are merged to remove duplicates
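The adjudication flow described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation in `src/reviewer/method_debate.py`: `challenge` and `verdict` stand in for the small-model and large-model prompt calls, and the dedup step stands in for the merge prompt.

```python
def challenge(comment: str) -> str:
    # Stand-in for the small-model Challenger prompt: returns a counterargument.
    return f"'{comment}' may not be a real issue."

def verdict(comment: str, objection: str) -> bool:
    # Stand-in for the large-model Verdict prompt: here, keep unless the
    # comment is hedged. The real call weighs both sides with paper context.
    return "maybe" not in comment.lower()

def adjudicate(comments: list[str]) -> list[str]:
    survivors = []
    for c in comments:
        objection = challenge(c)       # Challenger argues "not a real issue"
        if verdict(c, objection):      # Judge decides keep/drop
            survivors.append(c)
    # Stand-in for the merge step: drop duplicates, preserving order.
    return list(dict.fromkeys(survivors))

print(adjudicate(["sigma mislabeled", "maybe unclear", "sigma mislabeled"]))
# → ['sigma mislabeled']
```

The point of the structure is that each comment gets its own keep/drop decision with an explicit counterargument on record, instead of one monolithic consolidation call over the raw JSON.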

Motivation

The current consolidation step in review_progressive loses 43% of real findings on the Refine benchmark (44% → 25% LLM recall). It operates on raw JSON without paper context, so it cannot distinguish true positives from noise. See #35 and #36.

Benchmark Results (Seeded-Perturbation)

| Metric | Zero-Shot | Progressive | Debate |
| --- | --- | --- | --- |
| Recall | 92% (11/12) | 83% (10/12) | 92% (11/12) |
| Total comments | 10 | 32 | 34 |
| False positives | 5 (50%) | 24 (75%) | 28 (82%) |

Key result: the param_sigma and param_labor_share findings that consolidation pruned are correctly kept by adversarial adjudication. The debate dropped 16/54 comments (30%) through targeted challenge-verdict pairs, versus the 72% pruned by monolithic consolidation.

Usage

```
openaireview review paper.pdf --method debate
openaireview review paper.pdf --method debate_full  # pre-debate raw comments
openaireview review paper.pdf --method debate --small-model anthropic/claude-haiku-4-5
```

Files Changed

  • src/reviewer/method_debate.py — new method with challenge/verdict/merge prompts
  • src/reviewer/cli.py — wire up debate and debate_full methods + --small-model flag
  • benchmarks/seed_errors.py — add debate support to benchmark runner
  • benchmarks/REPORT.md — document seeded-perturbation benchmark with all three methods

Test plan

  • Module imports correctly (from reviewer.method_debate import review_progressive_debate)
  • Benchmark on seeded paper: 92% recall (11/12), matching zero-shot
  • Run on Refine benchmark papers to compare against progressive consolidation
  • Test with different provider APIs (OpenAI, Anthropic native, Gemini)

r-uben added 2 commits March 11, 2026 01:09
…olidation

Instead of a single LLM call to prune duplicates (which loses 43% of real
findings), each comment goes through a challenge-verdict debate:
  1. Challenger (small model): argues the finding is not a real issue
  2. Verdict (large model): weighs both sides, decides keep/drop
  3. Merge: cluster surviving duplicates

Addresses ChicagoHAI#35 (consolidation quality) and ChicagoHAI#36 (multi-agent debate proposal).
… drops

Debate matches zero-shot recall (11/12) while using progressive discovery.
Key result: param_sigma and param_labor_share that consolidation pruned
are correctly kept by adversarial adjudication. 16/54 comments dropped
by targeted challenge-verdict pairs vs 72% by monolithic consolidation.

r-uben commented Mar 11, 2026

Broader Vision: Multi-Agent Debate for Academic Review

This PR is a minimal implementation — one challenger, one judge, sequential processing. But the core idea (adversarial pressure improves calibration) has strong support in the literature and could be pushed much further.

What the literature says

Du et al. (2023), "Improving Factuality and Reasoning in Language Models through Multiagent Debate" showed that multiple LLM instances proposing, critiquing, and revising answers over several rounds significantly improves factual accuracy and reasoning. Their "society of minds" approach reduces hallucinations — exactly our false positive problem.

Chan et al. (2023), "ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate" applied multi-agent debate specifically to evaluation tasks. Key finding: diverse role prompts (different personas) are essential — using the same role description degrades performance. This suggests our Reviewer/Challenger/Judge should have genuinely different perspectives, not just different instructions.

Zhang, Kraska & Khattab (2025), "Recursive Language Models" tackle the context length problem through recursive decomposition — processing documents in chunks with structured state passing between levels. Their RLM architecture maintains quality at 1M+ tokens where flat models degrade. Our progressive running summary is a crude version of this; a proper recursive architecture could handle much longer papers.

Tillman (2025), "Literature Review of Multi-Agent Debate for Problem Solving" synthesizes the field and identifies key open questions: how to terminate discussions, how systems scale with agents and rounds, and how communication structure affects performance. Our fixed 2-round protocol (challenge → verdict) is the simplest possible; the literature suggests dynamic termination (debate until convergence) and scaling to 3-5 agents improves results.

Concrete improvements

  1. Multi-round debate: Instead of challenge → verdict, allow rebuttal → counter-challenge → final verdict. The current miss (claim_persistent) might survive if the reviewer could defend its finding with additional evidence.

  2. Diverse agent personas: A Methodologist agent that focuses on statistical validity, a Domain Expert that checks against field conventions, and a Notation Checker that verifies mathematical consistency. ChatEval shows this diversity matters.

  3. Recursive context: The verdict currently gets a local window. A recursive architecture (RLM-style) could give it access to a compressed representation of the entire paper, fixing the context window limitation that caused the claim_persistent miss.

  4. Dynamic termination: Instead of fixed rounds, debate until the challenger can't find new counterarguments or the judge reaches high confidence. This naturally spends more compute on ambiguous cases and less on obvious errors.

  5. Confidence-calibrated output: The verdict already returns a confidence score (high/medium/low). This could drive which comments get additional debate rounds and how they're presented to the user.
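Items 1, 4, and 5 above compose naturally: rounds continue until the challenger repeats itself or the judge reaches high confidence. A hypothetical sketch of that termination logic, with stand-in `challenge` and `judge` functions (nothing in this PR implements multi-round debate yet):

```python
def challenge(comment: str, round_no: int) -> str:
    # Stand-in challenger: runs out of fresh objections after two rounds,
    # so later rounds repeat the round-1 objection.
    return f"objection {min(round_no, 1)} to '{comment}'"

def judge(comment: str, objection: str) -> tuple[bool, str]:
    # Stand-in judge: confidence rises on the second, stronger objection.
    return True, ("high" if "objection 1" in objection else "medium")

def debate_until_converged(comment: str, max_rounds: int = 5):
    seen, keep, confidence, rounds = set(), True, "low", 0
    for round_no in range(max_rounds):
        rounds = round_no + 1
        objection = challenge(comment, round_no)
        if objection in seen:
            break                      # no new counterargument: terminate
        seen.add(objection)
        keep, confidence = judge(comment, objection)
        if confidence == "high":
            break                      # judge is certain: terminate
    return keep, confidence, rounds
```

This is where the compute savings would come from: obvious errors converge in one or two rounds, while ambiguous cases like claim_persistent get the extra rounds (and the rebuttal opportunity of item 1).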

The 92% recall on seeded errors is encouraging, but the real test is the Refine benchmark where progressive consolidation drops from 44% to 25%. If debate can close that gap, it would be a significant step toward LLMs that catch what human reviewers catch.
