A benchmark harness that runs the same task through different agentic patterns and measures what actually matters: latency, cost, and quality.
Compare workflow patterns empirically to develop judgment about when to use each one.
Run the comparison:

```
uv run python examples/run_comparison.py
```

| Task | Winner | Why |
|---|---|---|
| Simple Q&A | single_call | 2.4x faster, 2.8x cheaper, same quality |
| Content Gen | evaluator_optimize | 2x faster, hit word target (chain rambled) |
| Research | orchestrator | 2x faster, more focused output |
**Use a single call when:**
- The task is straightforward (Q&A, simple generation)
- Quality doesn't improve with decomposition
- Latency matters
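For concreteness, here is a minimal sketch of the single-call baseline using the Anthropic Python SDK; the `MODEL` id is an assumption, not necessarily what the repo pins:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-sonnet-4-5"     # assumption: swap in whatever model the repo uses

def single_call(prompt: str) -> str:
    """One request, one response: the baseline every other pattern must beat."""
    response = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```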
**Use prompt chaining when:**
- The steps genuinely build on each other
- Intermediate output adds value
- NOT just to "think more carefully" (model already does this)
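A sketch of chaining under the same assumptions, reusing `single_call` from above; the outline/answer split is illustrative:

```python
def chain_call(topic: str) -> str:
    """Two sequential calls where step 2 consumes step 1's output."""
    outline = single_call(f"Write a brief outline for an answer about: {topic}")
    return single_call(f"Using this outline, write the final answer:\n\n{outline}")
```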
**Use evaluator-optimizer when:**
- The output has clear criteria to evaluate against
- Refinement actually improves quality
- One focused revision beats sprawling first drafts
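A sketch of the evaluator-optimizer loop with a single revision pass; the prompts and the `word_target` parameter are illustrative, not the repo's actual implementation:

```python
def evaluator_optimize(brief: str, word_target: int = 300) -> str:
    """Draft once, evaluate against explicit criteria, revise once."""
    draft = single_call(f"Write about {word_target} words on: {brief}")
    critique = single_call(
        f"Critique this draft against two criteria: stays near {word_target} words "
        f"and answers the brief directly, with no filler.\n\nDraft:\n{draft}"
    )
    return single_call(
        f"Revise the draft to address the critique, staying near {word_target} words."
        f"\n\nDraft:\n{draft}\n\nCritique:\n{critique}"
    )
```

The explicit criterion is the point: the evaluator checks the draft against a concrete target instead of hoping another generation step fixes the rambling.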
**Use an orchestrator when:**
- Subtasks need coordination
- Results must be synthesized coherently
- Parallel isn't always faster: coordination overhead trades against unfocused bulk
- Note: my "parallel" ran sequentially for simplicity; true parallelism would be faster but might still produce unfocused bulk
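A sketch of the orchestrator: plan focused subtasks, run a worker per subtask, then synthesize. Having the planner emit a JSON array is my assumption about the decomposition step, and the parse will fail if the model wraps it in prose:

```python
import json

def orchestrator_research(question: str) -> str:
    """Plan subtasks, run a worker per subtask, then synthesize one answer."""
    plan = single_call(
        "Break this research question into two or three focused subtasks. "
        f"Reply with a JSON array of strings only.\n\n{question}"
    )
    subtasks = json.loads(plan)  # assumes the model complied with the format
    findings = [single_call(f"Research this subtask concisely: {t}") for t in subtasks]
    return single_call(
        f"Synthesize these findings into one coherent answer to: {question}\n\n"
        + "\n\n".join(findings)
    )
```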
More patterns != better results.
The orchestrator beat parallel because coordination produces focused output. The evaluator beat chaining because targeted refinement beats hoping more steps help.
Measure before assuming complexity adds value.
```
src/
├── metrics.py       # TrackedClient for token/latency tracking
├── benchmark.py     # Benchmark harness with LLM-as-judge (5x Haiku voting)
└── tasks/
    ├── simple_qa.py    # single_call vs chain_call
    ├── content_gen.py  # chain_generate vs evaluator_optimize
    └── research.py     # parallel_research vs orchestrator_research
```
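A sketch of what TrackedClient in `metrics.py` plausibly does: wrap each API call, accumulate latency and token counts. The field names here are assumptions:

```python
import time
from dataclasses import dataclass, field

import anthropic

@dataclass
class TrackedClient:
    """Wraps the Anthropic client and accumulates latency and token usage."""
    client: anthropic.Anthropic = field(default_factory=anthropic.Anthropic)
    input_tokens: int = 0
    output_tokens: int = 0
    latency_s: float = 0.0

    def create(self, **kwargs):
        start = time.perf_counter()
        response = self.client.messages.create(**kwargs)
        self.latency_s += time.perf_counter() - start
        self.input_tokens += response.usage.input_tokens
        self.output_tokens += response.usage.output_tokens
        return response
```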
- Quality: LLM-as-judge with 5 Haiku votes averaged (reduces noise)
- Cost: Claude Sonnet 4.5 pricing ($3/M input, $15/M output)
- Outputs: saved to `results/outputs/` for manual inspection
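A sketch of the two scoring pieces: averaged Haiku votes for quality, and a cost function from the pricing above. The judge model id, prompt, and 1-10 scale are assumptions; `client` is the plain Anthropic client from the first sketch:

```python
import re

JUDGE_MODEL = "claude-3-5-haiku-latest"  # assumption: any cheap judge model works

def judge_quality(task: str, output: str, votes: int = 5) -> float:
    """Average several independent judge scores to reduce single-vote noise."""
    scores = []
    for _ in range(votes):
        reply = client.messages.create(
            model=JUDGE_MODEL,
            max_tokens=16,
            messages=[{
                "role": "user",
                "content": f"Rate this output from 1 to 10 for the task.\n\n"
                           f"Task: {task}\n\nOutput: {output}\n\n"
                           "Reply with a number only.",
            }],
        )
        match = re.search(r"\d+(?:\.\d+)?", reply.content[0].text)
        scores.append(float(match.group()) if match else 1.0)
    return sum(scores) / len(scores)

def cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Sonnet 4.5 list pricing: $3 per million input tokens, $15 per million output."""
    return (input_tokens * 3 + output_tokens * 15) / 1_000_000
```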