Benchmark: blast radius (Track A scream + Track B agent comparison) #143

@jonathanpopham

Tracking the blast radius benchmark that will back the next deep-dive article.

Status (2026-04-23)

Track A — scream test (graph vs reality)

Done. A standalone Python harness mutates a function, runs Django's test suite, and compares the tests that break against the `/v1/analysis/impact` predictions.

  • Scope: django/contrib/auth/ — 18 measured targets
  • Macro F1: 56.9% (precision 44.2%, recall 94.9%)
  • Micro F1: 59.9% (precision 44.2%, recall 93.0%) — TP 159, FP 201, FN 12
  • Strongest performers: hashers.make_password / hashers.get_hasher — 100% recall, 85% precision, 92% F1
  • Known gap: 12 FN from login_required / user_passes_test decorators — parser appears to miss test methods decorated at call time
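
For reference, the micro figures above follow directly from the pooled TP/FP/FN counts. The sketch below is not `compare.py` itself; the `Counts` type and function names are hypothetical, and the macro variant assumes per-target F1 scores are averaged. That assumption is at least consistent with the reported 56.9%, which is lower than the roughly 60.3% harmonic mean of the macro precision and recall.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Confusion counts for one target (hypothetical container)."""
    tp: int
    fp: int
    fn: int


def prf(c: Counts) -> tuple[float, float, float]:
    """Return (precision, recall, F1), using 0.0 when a denominator is empty."""
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def micro_f1(per_target: list[Counts]) -> float:
    """Pool the counts across targets, then compute F1 once."""
    pooled = Counts(
        tp=sum(c.tp for c in per_target),
        fp=sum(c.fp for c in per_target),
        fn=sum(c.fn for c in per_target),
    )
    return prf(pooled)[2]


def macro_f1(per_target: list[Counts]) -> float:
    """Average per-target F1 (assumed definition, not confirmed by the source)."""
    return sum(prf(c)[2] for c in per_target) / len(per_target)


# Pooled counts reported in this issue.
p, r, f = prf(Counts(tp=159, fp=201, fn=12))
print(f"precision {p:.1%}, recall {r:.1%}, F1 {f:.1%}")
# prints: precision 44.2%, recall 93.0%, F1 59.9%
```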

Artifacts in benchmark/scream/:

  • scream_test.py — AST-based mutation loop
  • api_predict.py — batched call to /v1/analysis/impact
  • compare.py — precision/recall/F1 (scoped + test-case-only filtering)
  • summarize.py — metrics → markdown
  • results/metrics_fair.json + results/README.md
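
As an illustration of what an AST-based mutation step can look like, here is a minimal sketch. It is not the contents of `scream_test.py`: the class name, the choice of mutation (replace the body with a `raise`), and matching by bare function name are all assumptions made for the example. It needs Python 3.9+ for `ast.unparse`.

```python
import ast


class BreakFunction(ast.NodeTransformer):
    """Replace the body of every function named `target_name` with a raise.

    Matching on the bare name is a simplification: it also hits methods and
    same-named functions elsewhere in the module.
    """

    def __init__(self, target_name: str) -> None:
        self.target_name = target_name
        self.mutated = False

    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.AST:
        if node.name == self.target_name:
            # Build a single `raise` statement and use it as the whole body.
            stmt = ast.parse('raise RuntimeError("scream test mutation")').body[0]
            node.body = [stmt]
            self.mutated = True
            return node
        # Keep walking so methods inside classes are also reached.
        return self.generic_visit(node)


def mutate_source(source: str, target_name: str) -> str:
    """Return `source` with the target function broken, or raise if not found."""
    tree = ast.parse(source)
    transformer = BreakFunction(target_name)
    tree = transformer.visit(tree)
    if not transformer.mutated:
        raise ValueError(f"no function named {target_name!r} found")
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


SAMPLE = '''
def make_password(password):
    return "hashed:" + password

def other():
    return 1
'''

print(mutate_source(SAMPLE, "make_password"))
```

A harness built this way would write the mutated source back to disk, run the scoped tests, record which ones fail, and then restore the original file before moving to the next target.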

Track B — agent comparison (blast radius context → refactor outcome)

Pilot complete, scope problem identified.

Setup: Two Docker containers (bench-br-naked, bench-br-supermodel) built on Django 5.0.6 + Claude Code + Opus 4.7 (pinned via `--model claude-opus-4-7`). Task: make `authenticate(request=None, **credentials)` require `request`, keep tests green.
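
To make the task concrete, the sketch below shows why this signature change has a blast radius. The two functions are stand-ins with the same parameter shape as the task describes, not Django's implementation, and `call_without_request` is a hypothetical helper that mimics a caller relying on the old default.

```python
def authenticate_before(request=None, **credentials):
    """Stand-in for the current signature: `request` is optional."""
    return ("ok", request, credentials)


def authenticate_after(request, **credentials):
    """Stand-in for the refactored signature: `request` is required."""
    return ("ok", request, credentials)


def call_without_request(fn) -> str:
    """Call `fn` the way a caller that omits `request` would."""
    try:
        fn(username="alice", password="secret")
    except TypeError as exc:
        return f"breaks: {exc}"
    return "still works"


print(call_without_request(authenticate_before))
print(call_without_request(authenticate_after))
```

Every call site that omits `request` raises `TypeError` after the change, so the agent has to find and update each one to keep the tests green.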

Pilot run #1 (test scope = `auth_tests` only):

| condition | verdict | turns | tools | cost | duration | files |
|---|---|---|---|---|---|---|
| naked | PASS | 58 | 41 | $1.56 | 205s | 7 |
| supermodel | PASS | 59 | 43 | $1.77 | 203s | 7 |

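Effectively tied: both conditions passed with near-identical turns, tool calls, and duration. Opus 4.7 is capable enough to grep-and-iterate within a single subsystem. The BLAST_RADIUS.md context added cost ($1.77 vs $1.56, from the bigger input) without changing the outcome.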

Pilot run #2 in progress. Test scope expanded to 13 Django subsystems (3,293 tests): auth_tests, admin_views, admin_changelist, admin_utils, admin_inlines, sessions_tests, middleware_exceptions, generic_views, forms_tests, contenttypes_tests, handlers, view_tests, test_client_regress. Hypothesis: the naked agent will now miss callers outside auth/.
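
A sketch of how the expanded scope could be passed to Django's own test runner script, `tests/runtests.py`, which accepts test labels and a `--parallel` option. How `run-br.sh` actually invokes the tests is not shown in this issue, so the helper name and the default of `--parallel 1` (taken from the workaround listed under known issues) are assumptions.

```python
import shlex

# The 13 subsystems listed for pilot run #2.
SUBSYSTEMS = [
    "auth_tests", "admin_views", "admin_changelist", "admin_utils",
    "admin_inlines", "sessions_tests", "middleware_exceptions",
    "generic_views", "forms_tests", "contenttypes_tests", "handlers",
    "view_tests", "test_client_regress",
]


def build_runtests_command(labels: list[str], parallel: int = 1) -> list[str]:
    """Build an argv list for Django's tests/runtests.py (hypothetical helper)."""
    return ["python", "tests/runtests.py", *labels, "--parallel", str(parallel)]


cmd = build_runtests_command(SUBSYSTEMS)
# shlex.join gives a copy-pasteable shell line; pass `cmd` to subprocess.run to execute.
print(shlex.join(cmd))
```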

Artifacts in `benchmark/`:

  • `Dockerfile.br-naked`, `Dockerfile.br-supermodel`
  • `entrypoint.br.sh`
  • `CLAUDE.br-naked.md`, `CLAUDE.br-supermodel.md`
  • `br_task.md` — refactor spec
  • `BLAST_RADIUS.md` — pre-computed from Track A's `predicted.json`
  • `run-br.sh` — orchestrator
  • `compare-br.sh` — results table
  • `results/br/`

Known issues to resolve

  • Expanded-scope pilot may still tie. If so, pivot to:
    • Measure cross-subsystem risk acknowledgement in agent summary (qualitative)
    • Pick a task with polymorphic / getattr-based callers (harder for grep)
    • Pin a weaker model (Sonnet 4.6) to widen the gap
  • BLAST_RADIUS.md contains obvious noise (migration files listed as 'callers'). Tighten filtering in the rendering script.
  • Track A recall gaps on decorator targets — parser investigation.
  • Full Django test suite hits `TypeError: cannot pickle 'traceback' object` under `--parallel 4` on macOS. Workaround: `--parallel 1` and scoped subsystems. Worth filing upstream.
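
For the migration-file noise in BLAST_RADIUS.md, one possible filter is sketched below. The structure of `predicted.json` is not shown in this issue, so the example works on a plain list of file paths, and the function names are hypothetical. It matches only numbered files inside a `migrations` directory, so framework modules such as `django/db/migrations/executor.py` are kept.

```python
import re
from pathlib import PurePosixPath

# Django migration files are conventionally named like 0001_initial.py.
MIGRATION_FILE = re.compile(r"^\d{4}_\w+\.py$")


def is_migration_file(path: str) -> bool:
    """True for numbered files whose parent directory is named `migrations`."""
    p = PurePosixPath(path)
    return p.parent.name == "migrations" and bool(MIGRATION_FILE.match(p.name))


def filter_callers(paths: list[str]) -> list[str]:
    """Drop migration files from a list of predicted caller paths."""
    return [p for p in paths if not is_migration_file(p)]


# Illustrative paths only, not taken from the benchmark output.
callers = [
    "django/contrib/auth/backends.py",
    "django/contrib/auth/migrations/0001_initial.py",
    "django/db/migrations/executor.py",
]
print(filter_callers(callers))
```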

Article plan

Deep-dive: Your agent thinks it's editing 3 files. It's editing 145. — receipt-worthy "things that will break" number, Jenga/ripple analogy, push `/v1/analysis/impact` + `supermodel blast-radius`.

Standalone pullout (4th-grade brainrot): "Before you move one block, check who's standing on it."
