Achieved #1 on Terminal Bench Leaderboard 🏆
New SOTA: 64.50% ± 1.77% success rate (5 runs, 80 tasks each), surpassing Terminus 2 with Claude Sonnet 4.5 by 13.5 percentage points.
Apex2 represents a fundamental rethinking of agentic coding systems. Through strategic simplification and intelligent parallelization, we achieved state-of-the-art performance while reducing system complexity by 90%. Built as a weekend project, this work demonstrates that thoughtful architecture and sophisticated prompting beat complex multi-agent orchestration.
Terminal Bench Leaderboard 11/03/25
Apex2 achieves 64.50% accuracy with Claude Sonnet 4.5, surpassing the current #1 (Ante at 60.3%) by 4.2 percentage points.
5-run average: 64.50% ± 1.77%
| Run | Success Rate |
|---|---|
| apex2-sonnet-4-5-rc9-1 | 62.50% |
| apex2-sonnet-4-5-rc9-2 | 63.75% |
| apex2-sonnet-4-5-rc9-3 | 65.00% |
| apex2-sonnet-4-5-rc9-4 | 66.25% |
| apex2-sonnet-4-5-rc9-5 | 65.00% |
5-run average with GPT-5: 49.25% ± 1.39%
| Run | Success Rate |
|---|---|
| apex2-gpt-5-rc3-1 | 50.00% |
| apex2-gpt-5-rc3-2 | 48.75% |
| apex2-gpt-5-rc3-3 | 50.00% |
| apex2-gpt-5-rc3-4 | 50.00% |
| apex2-gpt-5-rc3-5 | 47.50% |
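
The text does not say how the ± values are derived. The sketch below assumes each ± is the half-width of a 95% confidence interval from a Student's t-distribution over the five runs; under that assumption it reproduces both reported figures from the per-run tables. The critical value 2.776 (4 degrees of freedom) is hard-coded to avoid a SciPy dependency, and the function name is illustrative.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 4 degrees of freedom (n = 5 runs).
T_CRIT_95_DF4 = 2.776


def mean_and_half_width(rates):
    """Return (mean, 95% CI half-width) for exactly five per-run success rates."""
    if len(rates) != 5:
        raise ValueError("T_CRIT_95_DF4 is only valid for five runs")
    mean = statistics.mean(rates)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    std_err = statistics.stdev(rates) / math.sqrt(len(rates))
    return mean, T_CRIT_95_DF4 * std_err


# Per-run success rates copied from the two tables above.
sonnet_runs = [62.50, 63.75, 65.00, 66.25, 65.00]
gpt5_runs = [50.00, 48.75, 50.00, 50.00, 47.50]

for name, runs in [("Claude Sonnet 4.5", sonnet_runs), ("GPT-5", gpt5_runs)]:
    mean, half = mean_and_half_width(runs)
    print(f"{name}: {mean:.2f}% ± {half:.2f}%")
```

Under this assumption the script prints 64.50% ± 1.77% for Claude Sonnet 4.5 and 49.25% ± 1.39% for GPT-5, matching the averages quoted above.
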

Leaderboard comparison for the Claude Sonnet 4.5 run:

| Agent | Accuracy |
|---|---|
| Apex2 | 64.50% ± 1.77% |
| Ante (Previous SOTA) | 60.3% ± 1.1% |
| Droid | 57.5% ± 0.8% |
| Chaterm | 52.5% ± 0.5% |
| Terminus 2 | 51.0% ± 0.8% |

Leaderboard comparison for the GPT-5 run:

| Agent | Accuracy |
|---|---|
| Droid | 52.5% ± 2.1% |
| Apex2 | 49.25% ± 1.39% |
| Codex CLI | 42.8% ± 2.1% |
| Terminus 2 | 41.3% ± 1.1% |

Other leaderboard entries for reference:

| Agent | Model | Accuracy |
|---|---|---|
| Droid | Claude Opus 4.1 | 58.8% ± 0.9% |
| Warp | Multiple | 52.0% ± 1.0% |
| Claude Code | Claude Opus 4 | 43.2% ± 1.3% |
| Droid | Claude Sonnet 4 | 50.5% ± 1.4% |
[Leaderboard screenshot to be inserted]
Prediction Phase
├── Task categorization
├── Key file identification
└── Multimodal requirement assessment
↓
Parallel Intelligence Gathering
├── Terminus-style execution (Episode 1)
├── Multi-round web search (3 rounds max)
├── Deep strategy generation
├── Heuristic environment observation
│ ├── Installed packages
│ ├── Folder structure
│ ├── Running processes
│ ├── System state
│ └── Key file contents
├── Exploration agent (explore unknowns from strategy)
└── Optional multimodal analysis on images and videos
↓
Strategy Synthesis (Combines all intelligence)
↓
Optimized Context Generation
↓
Main Execution (Episode 2+)
↓
Second Parallel Exploration (Turn 10)
↓
Continue with enriched context
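
The parallel gathering stage in the diagram above can be sketched with `asyncio.gather`. This is a minimal illustration, not the Apex2 implementation: every coroutine is a hypothetical stub standing in for a real subsystem, and the exploration agent is omitted because it consumes the strategy output and so cannot start at the same time.

```python
import asyncio


# Hypothetical stand-ins for the real branches; each would call a model or a tool.
async def run_episode_one(task):
    return f"episode-1 output for: {task}"


async def web_search(task, max_rounds=3):
    return [f"round {i + 1} findings" for i in range(max_rounds)]


async def generate_strategy(task):
    return ["approach A", "approach B"]


async def observe_environment(task):
    return {"packages": [], "processes": []}


async def gather_intelligence(task):
    """Run the gathering branches concurrently and label their results."""
    names = ["episode_1", "search", "strategy", "environment"]
    results = await asyncio.gather(
        run_episode_one(task),
        web_search(task),
        generate_strategy(task),
        observe_environment(task),
        return_exceptions=True,  # one failing branch should not sink the others
    )
    # Drop branches that raised; synthesis works with whatever succeeded.
    return {n: r for n, r in zip(names, results) if not isinstance(r, Exception)}


intel = asyncio.run(gather_intelligence("fix the failing build"))
print(sorted(intel))
```

Using `return_exceptions=True` is a design choice in this sketch: a failed search round then costs one perspective rather than the whole stage.
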
We use a single model throughout, Claude Sonnet 4.5 (with a GPT-5 variant), which provides:
- Consistency - No coordination overhead between different models
- Simplicity - Easier debugging and iteration on model-specific errors
- Cost efficiency - Lower operational costs with caching
Before any execution, we extract critical task metadata:
- Category classification - Determines risk profile and approach
- Key file identification - Extracts filenames mentioned in task description
- Multimodal assessment - Predicts if visual/document analysis needed
This upfront prediction enables targeted exploration and prevents wasted effort.
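
As a rough illustration of the three metadata fields above, the sketch below derives them with a regex and keyword lists. This is an assumption for illustration only: the text does not say how prediction is implemented (it may well be a model call), and the category names, keyword lists, and extension set are all hypothetical.

```python
import re
from dataclasses import dataclass, field

# Hypothetical values; a real system would tune or replace all of these.
VISUAL_EXTS = {"png", "jpg", "jpeg", "gif", "mp4", "pdf"}
CATEGORY_KEYWORDS = {
    "training": ("train", "epoch", "model"),
    "debugging": ("fix", "bug", "error"),
}
# A path-like token ending in an extension of 2 to 5 characters that starts with a letter.
FILE_RE = re.compile(r"[\w./-]*\w\.[A-Za-z][A-Za-z0-9]{1,4}\b")


@dataclass
class TaskPrediction:
    category: str
    key_files: list = field(default_factory=list)
    needs_multimodal: bool = False


def predict(task_description):
    """Extract category, mentioned filenames, and a multimodal flag from a task."""
    text = task_description.lower()
    category = next(
        (name for name, words in CATEGORY_KEYWORDS.items()
         if any(w in text for w in words)),
        "unknown",
    )
    # dict.fromkeys removes duplicates while keeping first-mention order.
    files = list(dict.fromkeys(FILE_RE.findall(task_description)))
    multimodal = any(f.rsplit(".", 1)[-1].lower() in VISUAL_EXTS for f in files)
    return TaskPrediction(category, files, multimodal)


prediction = predict(
    "Fix the bug in src/app.py and update requirements.txt. "
    "Compare the result with plot.png."
)
print(prediction)
```

The regex will also match things like domain names, so a real implementation would need stricter filtering or a model-based extractor.
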
Our web search is a multi-round research system built on SERP results rather than vendor-provided web search:
- Query Generation: Claude Sonnet 4.5 builds highly specific, low-frequency search terms
- Platform Bias: Prioritize GitHub/StackOverflow for actionable commands
- Google AI Overview: Extract Google's AI-generated summaries (remarkably effective)
- Deep Link Exploration: Analyze the top 3 Google links per query
- Multi-Round: Up to 3 rounds of searching and analysis
- Quality Control: Filter out any Terminal Bench mentions
Why this works:
- Low-frequency terms find exact solutions rather than generic tutorials
- Google AI Overview often contains a highly actionable answer that points in the right direction
- GitHub/StackOverflow bias provides working code
- Multi-round searching handles complex, multi-step problems
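
The filtering, platform bias, and round limit described above can be sketched as follows. The `search` and `next_query` callables are hypothetical injection points (in Apex2 the queries come from Claude Sonnet 4.5 and the results from a SERP provider); the result dictionaries with `title`, `link`, and `snippet` keys are an assumed shape, not a documented API.

```python
# Assumed constants for illustration.
PREFERRED_SITES = ("github.com", "stackoverflow.com")
BLOCKED_PHRASES = ("terminal bench", "terminal-bench", "terminalbench")


def is_clean(result):
    """Reject any result that mentions the benchmark itself."""
    text = " ".join(
        result.get(k, "") for k in ("title", "snippet", "link")
    ).lower()
    return not any(phrase in text for phrase in BLOCKED_PHRASES)


def rank(results, top_n=3):
    """Drop blocked results, move preferred sites first, keep the top N."""
    clean = [r for r in results if is_clean(r)]
    # The sort is stable, so the search engine's order is kept within each group.
    clean.sort(key=lambda r: not any(s in r.get("link", "") for s in PREFERRED_SITES))
    return clean[:top_n]


def research(first_query, search, next_query, max_rounds=3):
    """Run up to max_rounds searches; next_query returns None to stop early."""
    findings, query = [], first_query
    for _ in range(max_rounds):
        if not query:
            break
        findings.extend(rank(search(query)))
        query = next_query(findings)
    return findings
```

A caller would pass a function that hits the SERP provider as `search` and a function that asks the model for a follow-up query as `next_query`.
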
Beyond simple ls, we perform targeted observation:
# What's installed?
pip list | grep -E "flask|django|tensorflow|torch"
# Folder structure
find . -type f \( -name "*.py" -o -name "*.txt" \) | head -20
# What's running?
ps aux | grep python
docker ps -a
# System state
df -h
free -m
# Key file contents (from prediction phase)
cat requirements.txt main.py
This leads to the exploration agent identifying critical unknowns to test in a Docker container.
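
One way to collect these observations programmatically is to run each command with a timeout and cap the captured output. This is a sketch under assumptions: the labels, the timeout, and the character cap are illustrative, and the commands simply mirror the shell snippet above.

```python
import subprocess

# Labels are hypothetical; the commands mirror the shell snippet above.
OBSERVATIONS = {
    "packages": 'pip list | grep -E "flask|django|tensorflow|torch"',
    "processes": "ps aux | grep python",
    "containers": "docker ps -a",
    "disk": "df -h",
    "memory": "free -m",
}


def observe(commands, timeout=10, max_chars=2000):
    """Run each shell command and return {label: truncated output}."""
    report = {}
    for label, cmd in commands.items():
        try:
            proc = subprocess.run(
                cmd, shell=True, capture_output=True, text=True, timeout=timeout
            )
            # Fall back to stderr so a missing tool is still visible in the report.
            output = (proc.stdout or proc.stderr).strip()
        except subprocess.TimeoutExpired:
            output = f"[timed out after {timeout}s]"
        report[label] = output[:max_chars]
    return report
```

Calling `observe(OBSERVATIONS)` returns one string per label, which keeps a slow or missing command from blocking the rest of the observation pass.
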
Strategy generation focuses on extracting everything the LLM knows:
- Knowledge extraction: Prompt the model to surface related knowledge. SOTA models often already have the answer; the prompt just needs to draw it out.
- Alternative approaches: Two carefully thought-through command sequences
- Risk assessment: Identify high-consequence operations
- Common failures: Known failure modes for this task type and remediation strategies
After parallel intelligence gathering, we synthesize everything:
- Combine Episode 1 execution results
- Integrate web search findings
- Incorporate strategy alternatives
- Add environment discoveries
- Include Docker exploration results
This creates an optimized context for Episode 2 execution.
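
A minimal sketch of the assembly step, assuming each source has already been reduced to text. Whether the real synthesis is a template or another model call is not stated here; the section names, ordering, and per-section character cap below are assumptions.

```python
def synthesize_context(sources, max_chars_per_section=1500):
    """Merge named intelligence sources into one prompt-ready context string."""
    # Fixed order so the execution results always come first.
    order = ["episode_1", "web_search", "strategy", "environment", "exploration"]
    sections = []
    for name in order:
        text = (sources.get(name) or "").strip()
        if not text:
            continue  # skip sources that returned nothing
        if len(text) > max_chars_per_section:
            text = text[:max_chars_per_section] + " [truncated]"
        title = name.replace("_", " ").title()
        sections.append(f"## {title}\n{text}")
    return "\n\n".join(sections)


context = synthesize_context(
    {"strategy": "try A, then B", "episode_1": "ls succeeded", "web_search": ""}
)
print(context)
```

Empty sources are dropped rather than emitted as blank sections, so the Episode 2 prompt only carries material that actually exists.
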
Rather than providing concrete commands/strategy, we focus on high-consequence operation management and common failure states:
- Key insight: Training runs can exceed 5 minutes
- Approach: Mandate parameter search before full runs
- Example: Test with small epochs first, verify shapes, then scale up
- Key insight: Many operations are irreversible
- Approach: Ground exact sequences before execution
- Example: Verify backups before attempting destructive commands
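
The two risk classes above could be flagged before a command runs, so the matching guidance is injected into the prompt. The sketch below is an assumption about how that might look: the regex patterns and advice strings are hypothetical and far from exhaustive.

```python
import re

# Hypothetical patterns; a real list would be longer and task-specific.
HIGH_CONSEQUENCE = {
    "irreversible": [
        r"\brm\s+-\w*r",
        r"\bgit\s+reset\s+--hard\b",
        r"\bdrop\s+(table|database)\b",
        r"\bmkfs\b",
    ],
    "long_running": [
        r"\bpython\S*\s+\S*train\S*\.py",
        r"\b(pip|conda)\s+install\b",
    ],
}
ADVICE = {
    "irreversible": "Verify a backup exists and confirm the exact sequence first.",
    "long_running": "Run a small configuration first, check shapes, then scale up.",
}


def assess(command):
    """Return the advice strings for every risk class the command matches."""
    hits = [
        kind
        for kind, patterns in HIGH_CONSEQUENCE.items()
        if any(re.search(p, command, re.IGNORECASE) for p in patterns)
    ]
    return [ADVICE[kind] for kind in hits]


for cmd in ["rm -rf build/", "python train.py --epochs 100", "ls -la"]:
    print(cmd, "->", assess(cmd))
```
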
# ONLY use heredoc for file creation
cat << 'EOF' > app.py
def main():
    print("Hello World")
EOF
- Optimized prompt for Heredoc commands
- Automatic Heredoc repair
- Automatic indentation repair
- Proper escaping for special characters
- Detect stuck/lagging tmux sessions
- Automatic session recreation when degraded
- Preserve context across restarts
- Special prompts for operations exceeding 30 seconds
- Progress monitoring strategies
- Patience and timeout handling
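
For the heredoc handling in the list above, one way to reduce the need for repair is to generate the command programmatically. This sketch is an illustration rather than the Apex2 repair logic: it quotes the delimiter so the shell does not expand `$` or backticks, picks a delimiter that does not collide with the file content, and guarantees the terminator sits on its own line. The function name and the `_X` suffix are hypothetical.

```python
import shlex


def heredoc_command(path, content, delimiter="EOF"):
    """Build a `cat << 'DELIM' > path` command that writes content verbatim."""
    lines = content.splitlines()
    # A heredoc ends at the first line equal to the delimiter, so avoid collisions.
    while delimiter in lines:
        delimiter += "_X"
    # The terminator must start on its own line.
    body = content if (not content or content.endswith("\n")) else content + "\n"
    # Quoting the delimiter disables variable and command substitution in the body.
    return f"cat << '{delimiter}' > {shlex.quote(path)}\n{body}{delimiter}\n"


print(heredoc_command("app.py", 'def main():\n    print("Hello World")\n'))
```

Because the body is passed through untouched, indentation inside the file survives exactly as written.
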
Similar to Terminus, we have specific recovery prompts for execution errors (not strategy errors):
- Syntax errors in generated code
- Import errors
- Path/file not found
- Permission denied
- Connection timeouts
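
A simple way to route these error classes to recovery prompts is a first-match rule table. The patterns and prompt wording below are hypothetical placeholders, not the prompts Apex2 uses.

```python
import re

# Hypothetical (pattern, prompt) pairs, checked in order; the first match wins.
RECOVERY_RULES = [
    (r"SyntaxError|IndentationError",
     "Re-read the file you just wrote, fix the syntax, and rerun."),
    (r"ModuleNotFoundError|ImportError",
     "Check installed packages, then install or correct the import and rerun."),
    (r"No such file or directory",
     "List the directory to confirm the path before retrying."),
    (r"Permission denied",
     "Inspect file ownership and mode before retrying."),
    (r"timed out|Connection refused",
     "Check that the service is up, then retry with a longer timeout."),
]


def recovery_prompt(terminal_output):
    """Return the first matching recovery prompt, or None for a clean output."""
    for pattern, prompt in RECOVERY_RULES:
        if re.search(pattern, terminal_output, re.IGNORECASE):
            return prompt
    return None


print(recovery_prompt("ModuleNotFoundError: No module named 'flask'"))
```

Returning `None` for unmatched output keeps strategy-level failures out of this path, in line with the execution-versus-strategy distinction above.
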
Prevents premature task completion by checking for:
- Parsing errors in output
- Execution errors in logs
- Incomplete test results
- Missing expected outputs
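
The checks above can be expressed as a gate that lists reasons not to declare the task complete. The error markers and the function signature are assumptions for illustration; the caller is assumed to supply the expected and actually present output files.

```python
import re

# Hypothetical markers; a real list would be tuned per task type.
ERROR_MARKERS = [
    r"Traceback \(most recent call last\)",
    r"\bSyntaxError\b",
    r"\b\d+ failed\b",
    r"\bERROR\b",
]


def completion_blockers(output, expected_files=(), existing_files=()):
    """Return reasons the task should not yet be marked complete."""
    blockers = []
    if not output.strip():
        blockers.append("no output captured")
    blockers += [f"error marker: {p}" for p in ERROR_MARKERS if re.search(p, output)]
    missing = sorted(set(expected_files) - set(existing_files))
    blockers += [f"missing output: {name}" for name in missing]
    return blockers


print(completion_blockers("3 passed, 1 failed", ["out.csv"], []))
```

An empty list means no blocker was found, which is the only case where completion would be allowed in this sketch.
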
[To be filled with detailed ablation results]
- Predictive Intelligence
  - Early task understanding guides all subsequent actions
  - Key file identification prevents missing critical context
  - Multimodal prediction avoids unnecessary analysis
- Google AI Overview in Search
  - Often contains highly relevant solutions, better than the web search capabilities provided by vendors
  - Synthesizes multiple sources effectively
  - Provides context beyond individual links
- Strategy Synthesis
  - Combining all parallel intelligence is crucial
  - Optimized context dramatically improves Episode 2
  - Prevents information silos
- Execution Robustness
  - Heredoc handling eliminated file creation errors
  - Recovery prompts handle common execution failures
  - Validation prevents false completions
Initially, parallelization was critical because:
- Early versions had slow starts
- Timeout issues were common
With optimization:
- Strategy and web search are now highly effective
- Agent makes good progress from Episode 1
- Parallelization provides diverse perspectives rather than multiple attempts
What did not work:
- Generic search terms - Too many irrelevant results
- Single-round searching - Missed complex solutions
- Ignoring Google AI Overview - Lost valuable synthesis
- Early completion without validation - False positives
- Token Usage: Dramatically reduced through Claude Sonnet 4.5 caching
- Speed: 2.3 min average completion
- Reliability: Low variance (±1.4%) across runs
- Recovery: Handles 90% of execution errors
Terminal Bench requires:
- Deep task understanding before execution
- Real-world knowledge for frameworks
- Risk management for irreversible operations
- Robust execution handling
Apex2 addresses each through:
- Predictive intelligence phase
- Sophisticated web search
- Risk-aware prompting
- Execution optimization suite
- Predict → Explore → Synthesize → Execute
- Low-frequency search beats generic queries
- Risk awareness beats blind execution
- Recovery capabilities enable bold strategies
- Details matter (heredoc, validation, session health)
While others use parallel execution for multiple attempts, we use it for diverse intelligence gathering:
- Execution provides quick feedback on obvious solutions from the environment
- Web search provides solutions
- Strategy provides alternatives
- Environment provides context for immediate execution
- Exploration through Docker execution answers some of the unknowns in the strategy without adding context to the panes
The synthesis of these perspectives creates superior execution context.
This work demonstrates:
- Upfront web search, exploration, and strategy generation dramatically improve efficiency
- Google AI Overview is an underutilized resource
- Strategy synthesis beats isolated intelligence
- Execution details matter as much as high-level strategy
As an LLM Orchestration Lead at Roblox, I've learned that production systems need both broad intelligence and meticulous execution. This project combines sophisticated information gathering with careful attention to execution details.
The combination of predictive intelligence, multi-round search with Google AI Overview, and strategy synthesis creates a system that's both intelligent and reliable.
- Enhanced prediction models - Better task understanding upfront
- Search query optimization - Learn optimal query patterns
- Automated recovery generation - Build recovery prompts from failures
- Cross-task learning - Share successful patterns
Apex2 achieves SOTA through intelligent information gathering, sophisticated synthesis, and meticulous execution. By combining predictive intelligence, multi-round search, and careful strategy synthesis with robust execution handling, we built a system that understands deeply and executes reliably.
Key insight: In agentic coding, the combination of diverse intelligence gathering, sophisticated synthesis, and execution robustness beats any single optimization. Success comes from both knowing what to do and doing it reliably.
Built during weekends while exploring the limits of agentic systems. Thanks to Stanford's Terminal Bench team for creating this excellent benchmark.
