This rubric explains how to judge whether an agentic debugging system succeeds on this suite.
Agentic debugging ≠ code suggestion.
A system demonstrates agentic capability when it:
- Runs the code to observe failure (not just reads it)
- Reasons about root cause (not just pattern-matches errors)
- Applies a fix and verifies it works (not just suggests changes)
Score each project on six criteria, 10 points in total; a scoring sketch follows the table:
| Criterion | Points | What It Means |
|---|---|---|
| Reproduction | 1 | Executed the code/tests to see the actual error |
| Localization | 2 | Identified the correct file(s) and approximate location |
| Root Cause | 2 | Explained why the bug occurs, not just what to change |
| Minimal Fix | 2 | Changed only what's necessary (no over-engineering) |
| Verification | 2 | Re-ran tests/build to confirm the fix works |
| No Regressions | 1 | Didn't break other functionality |
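To make the arithmetic concrete, here is a minimal Python sketch of totaling these criteria. The `Evaluation` record and its field names are hypothetical, not part of any harness shipped with this suite.

```python
from dataclasses import dataclass

# Hypothetical record of what the system actually did on one project.
@dataclass
class Evaluation:
    reproduced: bool            # executed the code/tests to see the error
    localized: bool             # identified the correct file(s) and location
    explained_root_cause: bool  # explained why, not just what to change
    fix_is_minimal: bool        # changed only what was necessary
    verified: bool              # re-ran tests/build after the fix
    no_regressions: bool        # didn't break other functionality

def score(ev: Evaluation) -> int:
    """Total the six rubric criteria (max 10 points)."""
    return (
        1 * ev.reproduced
        + 2 * ev.localized
        + 2 * ev.explained_root_cause
        + 2 * ev.fix_is_minimal
        + 2 * ev.verified
        + 1 * ev.no_regressions
    )

# Correct fix, never verified: 1 + 2 + 2 + 2 + 0 + 1 = 8, matching the table below.
print(score(Evaluation(True, True, True, True, False, True)))
```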
Typical outcomes map to scores like this:
| Outcome | Score | Explanation |
|---|---|---|
| Perfect fix, verified | 10/10 | Full marks |
| Correct fix, didn't verify | 8/10 | Missing verification step |
| Fixed symptom, not root cause | 5/10 | Localized but shallow fix |
| Suggested fix without running | 3/10 | Not agentic, just reactive |
| Wrong diagnosis | 0-2/10 | Failed to understand the bug |
When comparing systems across the full suite, track these metrics (a computation sketch follows the table):
| Metric | Formula | What It Shows |
|---|---|---|
| Pass Rate | Projects with score ≥ 8 / Total | Reliability |
| Partial Rate | Projects with score 4-7 / Total | Potential (needs refinement) |
| Fail Rate | Projects with score < 4 / Total | Fundamental gaps |
| Median Fix Size | Median lines changed per fix | Efficiency / precision |
| Avg Tool Calls | Tool invocations per project | Operational efficiency |
| Cross-Domain Score | Score on cross_domain/ projects | Advanced reasoning |
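Under the same definitions, these suite-level metrics can be computed from per-project results as sketched here; the input lists are assumed to come from your own run logs.

```python
from statistics import median

def suite_metrics(scores: list[int], fix_sizes: list[int], tool_calls: list[int]) -> dict:
    """Aggregate per-project results into the suite-level metrics above."""
    total = len(scores)
    return {
        "pass_rate": sum(s >= 8 for s in scores) / total,
        "partial_rate": sum(4 <= s <= 7 for s in scores) / total,
        "fail_rate": sum(s < 4 for s in scores) / total,
        "median_fix_size": median(fix_sizes),       # lines changed per fix
        "avg_tool_calls": sum(tool_calls) / total,  # tool invocations per project
    }

# Example: three projects scored 9, 6, and 2.
print(suite_metrics([9, 6, 2], fix_sizes=[4, 12, 30], tool_calls=[15, 22, 40]))
```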
Not all projects are equal. Expectations scale with difficulty (a threshold check is sketched below):
| Difficulty | Expected Pass Rate | Notes |
|---|---|---|
| ⭐ | 95%+ | Baseline competence |
| ⭐⭐ | 80%+ | Normal daily bugs |
| ⭐⭐⭐ | 60%+ | Requires cross-file reasoning |
| ⭐⭐⭐⭐ | 40%+ | Needs hypothesis generation |
| ⭐⭐⭐⭐⭐ | Any success impressive | Platform + toolchain + code |
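A rough way to check a run against these expectations, assuming difficulty is encoded as 1-5 stars; the `EXPECTED` map simply mirrors the table above.

```python
# Expected minimum pass rate per difficulty tier. Five-star projects have
# no floor: any success at all is notable.
EXPECTED = {1: 0.95, 2: 0.80, 3: 0.60, 4: 0.40}

def meets_expectation(difficulty: int, pass_rate: float) -> bool:
    """True if the observed pass rate clears the tier's expected floor."""
    return pass_rate >= EXPECTED.get(difficulty, 0.0)

assert meets_expectation(1, 0.96)          # baseline competence
assert not meets_expectation(3, 0.50)      # below the 60% floor
assert meets_expectation(5, 0.10)          # any five-star success counts
```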
A full pass (score ≥ 8) requires all of the following; a verification sketch follows these lists:
- Tests pass (if the project has tests)
- Build succeeds (if the bug causes compile-time errors)
- Expected output matches the README description
- No new errors introduced
A partial result (score 4-7) typically means:
- Main bug fixed but edge cases remain
- Fix works but is not minimal
- Correct diagnosis but imperfect implementation
A failure (score < 4) shows up as:
- Wrong diagnosis
- Fix introduces new bugs
- Didn't run/verify the fix
- Gave up without meaningful progress
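A minimal sketch of the verification step these criteria imply, assuming the project under test is a Python project with a pytest suite; the command and working directory are assumptions, not suite requirements.

```python
import subprocess

def verify_fix(project_dir: str) -> bool:
    """Re-run the project's test suite after applying a fix.

    Returns True only if every test passes; a non-zero exit code means
    the fix is unverified or introduced a regression.
    """
    result = subprocess.run(
        ["python", "-m", "pytest", "--tb=short"],
        cwd=project_dir,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```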
For systems that go beyond basic fixing:
| Bonus | Description |
|---|---|
| Explanation Quality | Clear, accurate explanation a human could learn from |
| Hypothesis Reasoning | Explicitly formed and tested hypotheses |
| Confidence Scoring | Indicated certainty level before/after fix |
| Preventive Suggestions | Recommended how to avoid similar bugs |
These don't affect the core 0-10 score but indicate maturity.
Signs that a system isn't truly "agentic" (some are detectable from the diff alone, as sketched below):
| Anti-Pattern | Why It's Bad |
|---|---|
| Grep for "BUG:" comments | Cheating (this suite has none) |
| Fix without running | No verification = no confidence |
| Change test assertions | Hiding the bug, not fixing it |
| Massive refactors | Over-engineering, likely broke something |
| Give up after first error | No persistence or hypothesis iteration |
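Some of these anti-patterns can be flagged mechanically from the diff alone. A rough sketch, where the test-path heuristic and the 100-line threshold are illustrative assumptions:

```python
def diff_red_flags(changed_files: dict[str, int]) -> list[str]:
    """Flag diff-level anti-patterns from a map of file path -> lines changed."""
    flags = []
    if any("test" in path.lower() for path in changed_files):
        flags.append("modified test files (possibly hiding the bug)")
    if sum(changed_files.values()) > 100:
        flags.append("very large diff (possible over-engineering)")
    return flags

print(diff_red_flags({"src/parser.py": 6, "tests/test_parser.py": 3}))
```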
When publishing benchmark results:
## [System Name] Results
| Metric | Value |
|--------|-------|
| Projects Attempted | X/24 |
| Full Pass (≥8) | X |
| Partial (4-7) | X |
| Failed (<4) | X |
| Avg Score | X.X/10 |
### By Difficulty
| ⭐ | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
|----|------|--------|----------|------------|
| X% | X% | X% | X% | X% |
### Notable Results
- [Highlight any interesting successes or failures]

This rubric is intentionally simple and observable.
We don't measure:
- "Intelligence" (unmeasurable)
- "Creativity" (subjective)
- "Speed" (hardware-dependent)
We measure:
- Did it find the bug?
- Did it fix the bug?
- Did it verify the fix?
That's what matters in production.