# Workflow Pattern Comparison

A benchmark harness that runs the same task through different agentic patterns and measures what actually matters: latency, cost, and quality.

Compare workflow patterns empirically to develop judgment about when to use each one.

## Quick Start

```bash
uv run python examples/run_comparison.py
```

## Results Summary

| Task | Winner | Why |
| --- | --- | --- |
| Simple Q&A | `single_call` | 2.4x faster, 2.8x cheaper, same quality |
| Content Gen | `evaluator_optimize` | 2x faster, hit word target (chain rambled) |
| Research | `orchestrator` | 2x faster, more focused output |

## Decision Framework

### Use Single Call When

- Task is straightforward (Q&A, simple generation)
- Quality doesn't improve with decomposition
- Latency matters
### Use Chaining When

- Steps genuinely build on each other
- Intermediate output adds value
- NOT just to "think more carefully" (the model already does this; see the sketch below)
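
For concreteness, the two shapes look roughly like this. A minimal sketch assuming the Anthropic Python SDK; the function names mirror `src/tasks/simple_qa.py`, but the prompts and model alias are illustrative:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-sonnet-4-5"     # assumed model alias

def single_call(prompt: str) -> str:
    # One round trip: the model answers directly.
    resp = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

def chain_call(question: str) -> str:
    # Two round trips: outline first, then answer from the outline.
    outline = single_call(f"List the key points needed to answer: {question}")
    return single_call(f"Using these points:\n{outline}\n\nNow answer: {question}")
```

The chain pays double the latency and cost for a second pass the model often performs implicitly, which is why `single_call` won the Simple Q&A task.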

### Use Evaluator-Optimizer When

- Output has clear criteria to evaluate against
- Refinement actually improves quality
- One focused revision beats sprawling first drafts (sketched below)
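
The loop is: generate, evaluate against explicit criteria, revise once if needed. A sketch reusing `single_call` from above (illustrative prompts; the real logic lives in `src/tasks/content_gen.py`):

```python
def evaluator_optimize(brief: str, criteria: str) -> str:
    draft = single_call(f"Write content for this brief:\n{brief}")
    # Evaluate the draft against explicit, checkable criteria.
    verdict = single_call(
        "Evaluate this draft against the criteria.\n"
        f"Criteria: {criteria}\n\nDraft:\n{draft}\n\n"
        "Reply PASS if it meets them, otherwise list concrete fixes."
    )
    if verdict.strip().startswith("PASS"):
        return draft
    # One focused revision driven by the evaluator's feedback.
    return single_call(
        f"Revise the draft to address this feedback:\n{verdict}\n\nDraft:\n{draft}"
    )
```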

### Use Orchestrator (Not Parallel) When

- Subtasks need coordination
- Results must be synthesized coherently

Parallel isn't automatically better: coordination has overhead, but unfocused bulk has its own cost. (Note: my "parallel" implementation ran sequentially for simplicity; true parallelism would be faster but might still produce unfocused bulk.) The orchestrator shape is sketched below.
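
Again reusing `single_call` (the actual implementation is in `src/tasks/research.py`; the decomposition prompt is illustrative):

```python
def orchestrator_research(topic: str) -> str:
    # First decompose the topic into focused subquestions...
    plan = single_call(
        f"Break this research topic into 3 focused subquestions, one per line:\n{topic}"
    )
    subquestions = [line.strip() for line in plan.splitlines() if line.strip()]
    # ...answer each one separately (sequentially here, as in the benchmark)...
    findings = [single_call(f"Answer concisely: {q}") for q in subquestions]
    # ...then synthesize the findings into one coherent report.
    return single_call(
        f"Synthesize these findings into a focused report on {topic}:\n\n"
        + "\n\n".join(findings)
    )
```

The synthesis step is what plain parallel fan-out lacks: each subanswer gets combined into one focused report rather than dumped side by side.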

## Key Insight

More patterns != better results.

The orchestrator beat parallel because coordination produces focused output. The evaluator beat chaining because targeted refinement beats hoping more steps help.

Measure before assuming complexity adds value.

## Project Structure

```
src/
├── metrics.py      # TrackedClient for token/latency tracking
├── benchmark.py    # Benchmark harness with LLM-as-judge (5x Haiku voting)
└── tasks/
    ├── simple_qa.py    # single_call vs chain_call
    ├── content_gen.py  # chain_generate vs evaluator_optimize
    └── research.py     # parallel_research vs orchestrator_research
```
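
The tracking wrapper is the load-bearing piece: every pattern runs through it, so latency and tokens are measured identically. A sketch of the idea (not the exact `src/metrics.py` implementation):

```python
import time
import anthropic

class TrackedClient:
    """Wraps the Anthropic client, accumulating tokens and latency per run."""

    def __init__(self) -> None:
        self.client = anthropic.Anthropic()
        self.input_tokens = 0
        self.output_tokens = 0
        self.latency_s = 0.0

    def call(self, model: str, prompt: str, max_tokens: int = 1024) -> str:
        start = time.perf_counter()
        resp = self.client.messages.create(
            model=model,
            max_tokens=max_tokens,
            messages=[{"role": "user", "content": prompt}],
        )
        self.latency_s += time.perf_counter() - start
        # Usage comes back on every response, so cost falls out for free.
        self.input_tokens += resp.usage.input_tokens
        self.output_tokens += resp.usage.output_tokens
        return resp.content[0].text
```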

## Benchmark Details

- Quality: LLM-as-judge with 5 Haiku votes averaged (reduces noise)
- Cost: Claude Sonnet 4.5 pricing ($3/M input tokens, $15/M output tokens)
- Outputs: saved to `results/outputs/` for manual inspection
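
Cost and quality then reduce to simple arithmetic over the tracked usage. Illustrative only: the judge prompt and score parsing in `src/benchmark.py` will differ, and the Haiku model alias is an assumption:

```python
import anthropic

judge = anthropic.Anthropic()
JUDGE_MODEL = "claude-3-5-haiku-latest"  # assumed Haiku alias

def cost_usd(input_tokens: int, output_tokens: int) -> float:
    # Sonnet 4.5 pricing from above: $3/M input, $15/M output.
    return (input_tokens * 3 + output_tokens * 15) / 1_000_000

def judge_quality(task: str, output: str, votes: int = 5) -> float:
    # Average several independent 1-10 scores to reduce single-judge noise.
    scores = []
    for _ in range(votes):
        resp = judge.messages.create(
            model=JUDGE_MODEL,
            max_tokens=8,
            messages=[{
                "role": "user",
                "content": (
                    "Rate this answer from 1 to 10. Reply with only the number.\n"
                    f"Task: {task}\nAnswer: {output}"
                ),
            }],
        )
        scores.append(float(resp.content[0].text.strip()))
    return sum(scores) / len(scores)
```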
