
feat: add Terminal-Bench benchmark integration #427

Merged
neubig merged 8 commits into main from openhands/terminal-bench-integration
Mar 8, 2026

Conversation

@neubig
Contributor

@neubig neubig commented Feb 17, 2026

Summary

This PR integrates Terminal-Bench evaluation into the benchmarks repository using Harbor with the openhands-sdk agent.

Fixes #48

Changes

New Module: benchmarks/terminalbench/

  • run_infer.py: Runs Terminal-Bench evaluation via Harbor CLI with openhands-sdk agent

    • Supports full dataset evaluation or specific task selection
    • Converts ATIF trajectories to standard JSONL format
    • Configurable timeouts and parallelism
  • eval_infer.py: Processes results and generates evaluation reports

    • Aggregates metrics (cost, tokens, pass rate)
    • Produces SWE-Bench-style report JSON
  • config.py: Default configuration for Terminal-Bench evaluation

  • README.md: Documentation with usage examples
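
For orientation, the sketch below shows roughly what the ATIF-to-JSONL conversion in run_infer.py boils down to. It is illustrative only: the field names (instance_id, test_result, metrics) and the shape of Harbor's result.json are assumptions, not the module's exact schema.

# Hedged sketch: turn one Harbor trial result into a JSONL entry.
# Field and key names here are illustrative assumptions.
import json
from pathlib import Path

def trial_to_entry(result_file: Path) -> dict:
    result = json.loads(result_file.read_text())
    return {
        "instance_id": result.get("trial_name"),          # e.g. "chess-best-move"
        "test_result": {"reward": result.get("reward")},   # 1.0 = pass, 0.0 = fail
        "metrics": {"total_cost_usd": result.get("cost")},
    }

# One JSON object per line in output.jsonl:
# with open("output.jsonl", "a") as f:
#     f.write(json.dumps(trial_to_entry(path)) + "\n")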

CLI Entrypoints

  • terminalbench-infer: Run Terminal-Bench inference
  • terminalbench-eval: Process and report results

Tests

Added comprehensive tests covering:

  • Result processing and aggregation
  • Metrics calculation
  • Error handling
  • ATIF trajectory parsing

Usage

# Run full evaluation
uv run terminalbench-infer .llm_config/claude.json

# Run specific tasks
uv run terminalbench-infer .llm_config/claude.json --task-id hello-world

# Evaluate results
uv run terminalbench-eval ./evaluation_outputs/.../output.jsonl

Architecture

The integration uses:

  • Harbor: Official Terminal-Bench harness for task management
  • openhands-sdk agent: Runs inside task containers (merged in harbor#738)
  • ATIF format: Agent Trajectory Interchange Format for results

Testing

All tests pass:

tests/test_terminalbench.py: 9 passed

Pre-commit checks pass:

  • Ruff format ✅
  • Ruff lint ✅
  • PEP8 style ✅
  • Pyright type check ✅



Evidence

✅ Full E2E Run Verified (3 tasks, real LLM calls, verified results)

Ran 3 Terminal-Bench sample tasks with claude-sonnet-4-20250514 via the LiteLLM proxy. All 3 completed with real agent execution — no auth errors, no exceptions, and verifier results are confirmed correct.

1. terminalbench-infer — Harbor runs agent in Docker containers:

$ uv run terminalbench-infer llm_config.json \
    --dataset terminal-bench-sample --num-workers 1 \
    --task-id chess-best-move --task-id polyglot-c-py --task-id sqlite-with-gcov

INFO  Harbor evaluation completed successfully
INFO  stdout: 3/3 Mean: 0.000
      Trials: 3 | Errors: 0 | Mean: 0.000
INFO  Found 3 trial results
INFO  Processed trial chess-best-move: reward=0.0
INFO  Processed trial polyglot-c-py: reward=0.0
INFO  Processed trial sqlite-with-gcov: reward=0.0
INFO  Wrote 3 successful + 0 failed entries to .../output.jsonl

2. Agent logs confirm real LLM calls:

Task               Input Tokens   Output Tokens   Cost    Reward
chess-best-move    63,364         2,322           $0.09   0.0
polyglot-c-py      2,247,246      16,692          $1.05   0.0
sqlite-with-gcov   637,198        4,346           $0.34   0.0

3. Verifier results confirmed correct (all are genuine agent failures):

  • chess-best-move: Agent wrote d2d4 but the correct checkmate-in-one moves are g2g4 and e2e4. Wrong move → reward=0.0
  • polyglot-c-py: Agent created a working polyglot file but left a compiled cmain binary in /app/polyglot/. Verifier expected only main.py.c in that directory → reward=0.0
  • sqlite-with-gcov: Agent compiled SQLite with gcov to /app/sqlite/ but didn't add it to PATH. Verifier runs sqlite3 directly → FileNotFoundError → reward=0.0

4. terminalbench-eval processes results correctly:

$ uv run terminalbench-eval .../output.jsonl
INFO  Total instances: 3
INFO  Completed instances: 3
INFO  Resolved instances: 0
INFO  Error instances: 0
INFO  Success rate: 0.0%
INFO  Total cost: $1.4845
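
These totals are consistent with the per-task table above: the rounded per-task costs sum to $0.09 + $1.05 + $0.34 = $1.48, matching the $1.4845 total before rounding, and 0 resolved out of 3 completed instances gives the 0.0% success rate.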

5. All 15 unit tests pass:

$ uv run pytest tests/test_terminalbench.py -v
TestProcessTerminalbenchResults (8 tests) — all PASSED
TestConvertHarborToEvalOutput (7 tests) — all PASSED
15 passed

6. Pre-commit checks pass:

  • Ruff format ✅ | Ruff lint ✅ | PEP8 style ✅ | Pyright type check ✅

Changes from review (commit 54a3d53)

  • Fixed harbor CLI flags: --output-dir → --jobs-dir, --max-workers → --n-concurrent, --task-id → --task-name
  • Pass LLM credentials via --ae (agent env) flags instead of subprocess env
  • Use absolute path for --jobs-dir to avoid CWD-relative path issues
  • Removed unused --timeout parameter
  • Rewrote convert_harbor_to_eval_output for actual harbor output structure (TIMESTAMP/TRIAL_NAME/result.json)
  • Updated all tests to match new harbor output structure
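
To make the flag changes above concrete, here is a rough sketch of the kind of Harbor invocation run_infer.py now builds. It is a sketch under assumptions: the dataset and task values are taken from the E2E run above, and the environment variable names passed via --ae are placeholders, not necessarily what the module uses.

# Hedged sketch of the corrected Harbor command line (not the module's exact code).
import subprocess
from pathlib import Path

jobs_dir = Path("evaluation_outputs/terminalbench").resolve()  # absolute, not CWD-relative

cmd = [
    "harbor", "run",
    "-d", "terminal-bench-sample",
    "-a", "openhands-sdk",
    "--jobs-dir", str(jobs_dir),
    "--n-concurrent", "1",
    "--task-name", "chess-best-move",
    # Credentials reach the agent container via --ae, not the subprocess environment.
    # The variable names below are placeholders.
    "--ae", "LLM_API_KEY=***",
    "--ae", "LLM_BASE_URL=***",
]
subprocess.run(cmd, check=True)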

@openhands-ai openhands-ai Bot mentioned this pull request Feb 17, 2026
@neubig neubig marked this pull request as ready for review March 1, 2026 13:24
Collaborator

@all-hands-bot all-hands-bot left a comment


Taste Rating: 🟡 Acceptable - Core logic is sound, but has data structure inefficiencies and over-engineering for imaginary problems.

@neubig neubig marked this pull request as draft March 1, 2026 13:28
@neubig neubig marked this pull request as ready for review March 1, 2026 13:52
Collaborator

@all-hands-bot all-hands-bot left a comment


Taste Rating: 🟡 Acceptable - Solid implementation with good error handling. Previous review issues properly addressed. Only minor cleanup needed.

@neubig neubig marked this pull request as draft March 1, 2026 14:13
@neubig neubig marked this pull request as ready for review March 1, 2026 14:19
Collaborator

@all-hands-bot all-hands-bot left a comment


Taste Rating: 🟡 Acceptable - Previous review issues properly addressed. Solid implementation with good error handling and appropriate data structures. Only minor edge cases remain.

@neubig neubig marked this pull request as draft March 1, 2026 21:25
@neubig neubig marked this pull request as ready for review March 1, 2026 21:29
Collaborator

@all-hands-bot all-hands-bot left a comment


Taste Rating: 🟡 Acceptable - Previous review issues properly addressed. Solid implementation with good error handling and appropriate data structures. Only minor cleanup opportunity remains.

VERDICT: ✅ Worth merging - Core integration is pragmatic and well-tested.

KEY INSIGHT: This is a clean, straightforward benchmark integration that avoids over-engineering. The use of sets for deduplication and fail-fast error handling shows good engineering judgment.
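
(For readers skimming the thread, a minimal, illustrative version of the set-based pattern referred to here, with made-up record shapes rather than the module's actual code:)

# Illustrative only: sets give O(1) membership checks and free deduplication;
# sort once at the end so the report output is deterministic.
entries = [
    {"instance_id": "chess-best-move", "resolved": False},
    {"instance_id": "polyglot-c-py", "resolved": False},
]

completed_ids: set[str] = set()
resolved_ids: set[str] = set()
for entry in entries:
    completed_ids.add(entry["instance_id"])
    if entry.get("resolved"):
        resolved_ids.add(entry["instance_id"])

report = {
    "completed_ids": sorted(completed_ids),
    "resolved_ids": sorted(resolved_ids),
}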

@neubig neubig marked this pull request as draft March 2, 2026 01:52
@neubig neubig marked this pull request as ready for review March 2, 2026 01:58
Collaborator

@all-hands-bot all-hands-bot left a comment


Taste Rating: 🟡 Acceptable - Solid integration, but solving imaginary problems with unused config.

VERDICT: ✅ Worth merging - Core logic is sound. Optional cleanup of dead config suggested.

KEY INSIGHT: You're maintaining config for hypothetical future features instead of building what you need today. Good code removes unused abstractions.

@neubig neubig marked this pull request as draft March 2, 2026 02:24
@neubig neubig marked this pull request as ready for review March 2, 2026 03:36
Collaborator

@all-hands-bot all-hands-bot left a comment


Clean, pragmatic benchmark integration. Previous reviews addressed all major concerns - the code is now production-ready with solid test coverage and appropriate error handling. ✅

@neubig neubig marked this pull request as draft March 2, 2026 12:22
Integrate Terminal-Bench evaluation using Harbor with openhands-sdk agent.

- Add benchmarks/terminalbench module with:
  - run_infer.py: Runs Terminal-Bench via Harbor CLI with openhands-sdk agent
  - eval_infer.py: Processes results and generates evaluation reports
  - config.py: Default configuration for Terminal-Bench evaluation
  - README.md: Documentation with usage examples

- Add CLI entrypoints:
  - terminalbench-infer: Run Terminal-Bench inference
  - terminalbench-eval: Process and report results

- Add tests covering:
  - Result processing and aggregation
  - Metrics calculation
  - Error handling
  - ATIF trajectory parsing

The integration uses the openhands-sdk agent adapter in Harbor (PR #738)
to run tasks in Terminal-Bench containers and collects ATIF trajectories.

Closes #48

Co-authored-by: openhands <openhands@all-hands.dev>
- Refactor eval_infer.py to use sets instead of redundant list+set pattern
  - O(1) lookups and automatic deduplication
  - Convert to sorted lists only in final report for consistent output

- Simplify trajectory discovery in run_infer.py
  - Fail fast if Harbor trials/ directory not found
  - Remove fallback rglob scanning that masks errors

- Improve error handling in convert_harbor_to_eval_output
  - Record errors per trajectory instead of silent skip
  - Fail loudly if all trajectories fail to parse
  - Include error entries in output for visibility

- Remove unused TASK_CATEGORIES from config.py

- Add comprehensive tests for convert_harbor_to_eval_output:
  - test_successful_trajectory_parsing
  - test_malformed_trajectory_handling
  - test_mixed_valid_invalid_trajectories
  - test_empty_harbor_output_directory
  - test_missing_trials_directory
  - test_trajectory_discovery_finds_all_trials

Co-authored-by: openhands <openhands@all-hands.dev>
- Use harbor_executable from config instead of hardcoded 'harbor'
- Extract instruction from first user step instead of assuming steps[0]

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig force-pushed the openhands/terminal-bench-integration branch from 6431d1d to 32dc5a1 on March 2, 2026 12:29
- Removed EVAL_DEFAULTS (never used)
- Removed unused INFER_DEFAULTS entries: split, n_limit, max_iterations,
  max_attempts, max_retries, workspace
- Added comment explaining error file handling purpose

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig force-pushed the openhands/terminal-bench-integration branch from 32dc5a1 to 8a89633 on March 2, 2026 12:33
@neubig neubig marked this pull request as ready for review March 2, 2026 12:35
Collaborator

@all-hands-bot all-hands-bot left a comment


Taste Rating: 🟡 Acceptable - Previous reviews addressed major concerns. One critical bug remains in metrics aggregation.

VERDICT: ⚠️ Needs fix before merging - The or fallback will corrupt cost tracking when values are legitimately zero.

@neubig
Contributor Author

neubig commented Mar 2, 2026

@OpenHands run the full harbor eval run on one eval instance (make sure to start dockerd first), and also resolve all review comments using the GitHub GraphQL API

@openhands-ai

openhands-ai Bot commented Mar 2, 2026

I'm on it! neubig can track my progress at all-hands.dev

- Fix logic bug in eval_infer.py: use explicit None check instead of 'or'
  fallback for cost/token aggregation (breaks when values are legitimately 0)
- Move SecretStr import to top of run_infer.py per project guidelines

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig marked this pull request as draft March 2, 2026 12:46
@neubig neubig marked this pull request as ready for review March 2, 2026 12:47
Collaborator

@all-hands-bot all-hands-bot left a comment


Taste Rating: 🟢 Good taste - Clean, pragmatic integration that avoids over-engineering.

VERDICT: ✅ Worth merging - Previous reviews addressed all major concerns. Code is production-ready with solid test coverage.

KEY INSIGHT: After multiple review rounds, this is now a textbook example of clean benchmark integration - fail-fast error handling, appropriate data structures (sets for O(1) deduplication), and comprehensive tests of real behavior rather than mocks.

@neubig neubig marked this pull request as draft March 2, 2026 12:51
Add try/except blocks around LaminarService.update_evaluation_scores()
and generate_cost_report() calls to ensure multi-hour evaluations
succeed even if these non-critical telemetry utilities fail.

Co-authored-by: openhands <openhands@all-hands.dev>
- Wrap Laminar telemetry and cost reporting in try/except blocks so expensive
  multi-hour evaluations don't fail at the telemetry step after completing
- Remove redundant None checks after explicit None handling on cost/tokens

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig marked this pull request as ready for review March 2, 2026 13:02
Collaborator

@all-hands-bot all-hands-bot left a comment


Taste Rating: 🟡 Acceptable - Solid integration after extensive refinement. One runtime risk remains.

VERDICT: ⚠️ Recommend validation before merge - Credential propagation mechanism needs E2E confirmation.

KEY INSIGHT: After 15+ review rounds, the code is clean and well-tested, but the Harbor container credential passing is untested and could cause silent auth failures.

@openhands-ai

openhands-ai Bot commented Mar 2, 2026

Final Summary

Changes Made and Pushed (commit f8beb28)

1. Fixed Critical Logic Bug in eval_infer.py (lines 119-141)

  • Changed from or fallback pattern to explicit None checks for cost/token aggregation
  • The previous code cost = metrics.get("total_cost_usd") or final_metrics.get(...) would incorrectly fall back when values were legitimately 0
  • Now uses if cost is None: cost = final_metrics.get(...) pattern
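
Schematically, with the elided fallback key filled in only for illustration:

# Before (buggy): a legitimate 0.0 is falsy, so the fallback silently overwrites it.
cost = metrics.get("total_cost_usd") or final_metrics.get("total_cost_usd")  # fallback key assumed

# After (fixed): fall back only when the value is actually missing.
cost = metrics.get("total_cost_usd")
if cost is None:
    cost = final_metrics.get("total_cost_usd")  # fallback key assumed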

2. Moved Inline Import in run_infer.py

  • Moved from pydantic import SecretStr from line 98 (inside function) to the top of the file with other imports
  • Follows project guideline: "Place all imports at the top of the file"

Review Thread Resolution

Harbor Evaluation Attempt

  • Successfully started dockerd and ran harbor run -d terminal-bench-sample -a openhands-sdk --n-tasks 1
  • Harbor correctly recognized the openhands-sdk agent
  • Task container failed during setup due to network restrictions (apt-get couldn't reach Ubuntu repos)
  • This is an infrastructure limitation, not a code issue

Checklist

  • Fixed logic bug for zero value handling
  • Moved inline import to top of file
  • All tests pass (15/15)
  • Pre-commit checks pass (ruff format, ruff lint, pycodestyle, pyright)
  • Changes committed and pushed to openhands/terminal-bench-integration
  • Both review threads replied to and resolved
  • Attempted harbor evaluation (infra limitations prevented full completion)

The changes are minimal and focused - only 20 insertions and 14 deletions addressing exactly what the review comments requested.

- Replace --output-dir with --jobs-dir (harbor's actual flag)
- Replace --max-workers with --n-concurrent
- Replace --task-id with --task-name
- Pass LLM credentials via --ae (agent env) instead of process env
- Use absolute path for --jobs-dir to avoid CWD issues
- Remove unused --timeout param (harbor uses --timeout-multiplier)
- Rewrite convert_harbor_to_eval_output for actual harbor structure:
  harbor stores trial results in TIMESTAMP/TRIAL_NAME/result.json,
  not trials/*/trajectory.json
- Update tests to match new harbor output structure

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig merged commit 7b4b34c into main Mar 8, 2026
2 checks passed
@neubig
Contributor Author

neubig commented Mar 8, 2026

I confirmed that this works in my local terminal, merging in.
