feat: add Terminal-Bench benchmark integration #427
Conversation
all-hands-bot left a comment
Taste Rating: 🟡 Acceptable - Core logic is sound, but has data structure inefficiencies and over-engineering for imaginary problems.
all-hands-bot left a comment
Taste Rating: 🟡 Acceptable - Solid implementation with good error handling. Previous review issues properly addressed. Only minor cleanup needed.
all-hands-bot left a comment
Taste Rating: 🟡 Acceptable - Previous review issues properly addressed. Solid implementation with good error handling and appropriate data structures. Only minor edge cases remain.
all-hands-bot left a comment
Taste Rating: 🟡 Acceptable - Previous review issues properly addressed. Solid implementation with good error handling and appropriate data structures. Only minor cleanup opportunity remains.
VERDICT: ✅ Worth merging - Core integration is pragmatic and well-tested.
KEY INSIGHT: This is a clean, straightforward benchmark integration that avoids over-engineering. The use of sets for deduplication and fail-fast error handling shows good engineering judgment.
all-hands-bot left a comment
Taste Rating: 🟡 Acceptable - Solid integration, but solving imaginary problems with unused config.
VERDICT: ✅ Worth merging - Core logic is sound. Optional cleanup of dead config suggested.
KEY INSIGHT: You're maintaining config for hypothetical future features instead of building what you need today. Good code removes unused abstractions.
all-hands-bot left a comment
Clean, pragmatic benchmark integration. Previous reviews addressed all major concerns - the code is now production-ready with solid test coverage and appropriate error handling. ✅
Integrate Terminal-Bench evaluation using Harbor with the openhands-sdk agent.

- Add benchmarks/terminalbench module with:
  - run_infer.py: Runs Terminal-Bench via Harbor CLI with openhands-sdk agent
  - eval_infer.py: Processes results and generates evaluation reports
  - config.py: Default configuration for Terminal-Bench evaluation
  - README.md: Documentation with usage examples
- Add CLI entrypoints:
  - terminalbench-infer: Run Terminal-Bench inference
  - terminalbench-eval: Process and report results
- Add tests covering:
  - Result processing and aggregation
  - Metrics calculation
  - Error handling
  - ATIF trajectory parsing

The integration uses the openhands-sdk agent adapter in Harbor (PR #738) to run tasks in Terminal-Bench containers and collects ATIF trajectories.

Closes #48

Co-authored-by: openhands <openhands@all-hands.dev>
- Refactor eval_infer.py to use sets instead of a redundant list+set pattern:
  - O(1) lookups and automatic deduplication
  - Convert to sorted lists only in the final report for consistent output
- Simplify trajectory discovery in run_infer.py:
  - Fail fast if the Harbor trials/ directory is not found
  - Remove fallback rglob scanning that masks errors
- Improve error handling in convert_harbor_to_eval_output:
  - Record errors per trajectory instead of silently skipping
  - Fail loudly if all trajectories fail to parse
  - Include error entries in the output for visibility
- Remove unused TASK_CATEGORIES from config.py
- Add comprehensive tests for convert_harbor_to_eval_output:
  - test_successful_trajectory_parsing
  - test_malformed_trajectory_handling
  - test_mixed_valid_invalid_trajectories
  - test_empty_harbor_output_directory
  - test_missing_trials_directory
  - test_trajectory_discovery_finds_all_trials

Co-authored-by: openhands <openhands@all-hands.dev>
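The set-vs-list refactor is easy to illustrate. A minimal sketch of the pattern (the function and field names here are illustrative, not the actual eval_infer.py code):

```python
# Sketch of the set-based deduplication described above. `build_report`,
# `task_id`, and `reward` are illustrative names, not the real identifiers.

def build_report(trial_results: list[dict]) -> dict:
    resolved: set[str] = set()  # O(1) membership checks, automatic dedup
    failed: set[str] = set()

    for result in trial_results:
        task_id = result["task_id"]
        if result.get("reward", 0.0) > 0:
            resolved.add(task_id)
        else:
            failed.add(task_id)

    # Convert to sorted lists only at the output boundary so the final
    # report is deterministic across runs.
    return {"resolved": sorted(resolved), "failed": sorted(failed)}
```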
- Use harbor_executable from config instead of the hardcoded 'harbor'
- Extract the instruction from the first user step instead of assuming steps[0]

Co-authored-by: openhands <openhands@all-hands.dev>
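The "first user step" change guards against trajectories where steps[0] is a system or agent step. A minimal sketch, assuming ATIF steps are dicts with role/content fields (the actual schema is defined by the trajectory format):

```python
# Sketch: take the instruction from the first step authored by the user
# instead of blindly reading steps[0]. Field names are assumptions about
# the ATIF step schema.

def extract_instruction(steps: list[dict]) -> str | None:
    for step in steps:
        if step.get("role") == "user":
            return step.get("content")
    return None  # no user step found; the caller decides how to handle this
```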
Force-pushed from 6431d1d to 32dc5a1
- Removed EVAL_DEFAULTS (never used)
- Removed unused INFER_DEFAULTS entries: split, n_limit, max_iterations, max_attempts, max_retries, workspace
- Added a comment explaining the purpose of error file handling

Co-authored-by: openhands <openhands@all-hands.dev>
Force-pushed from 32dc5a1 to 8a89633
all-hands-bot left a comment
Taste Rating: 🟡 Acceptable - Previous reviews addressed major concerns. One critical bug remains in metrics aggregation.
VERDICT: The `or` fallback will corrupt cost tracking when values are legitimately zero.
@OpenHands run the full harbor eval on one eval instance (make sure to start dockerd first), and also resolve all review comments using the GitHub GraphQL API
I'm on it! @neubig can track my progress at all-hands.dev
- Fix logic bug in eval_infer.py: use an explicit None check instead of an `or` fallback for cost/token aggregation (breaks when values are legitimately 0)
- Move the SecretStr import to the top of run_infer.py per project guidelines

Co-authored-by: openhands <openhands@all-hands.dev>
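The aggregation bug is a classic falsy-zero trap. A minimal sketch of the failure mode and the fix (variable names are illustrative):

```python
# A legitimately free or zero-token trial has cost 0.0, which is falsy.
metrics = {"cost": 0.0}
default_cost = 0.02

# Buggy: `or` treats the real 0.0 as "missing" and substitutes the default.
buggy = metrics.get("cost") or default_cost          # -> 0.02 (wrong)

# Fixed: fall back only when the value is actually absent.
cost = metrics.get("cost")
fixed = cost if cost is not None else default_cost   # -> 0.0 (correct)
```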
all-hands-bot left a comment
Taste Rating: 🟢 Good taste - Clean, pragmatic integration that avoids over-engineering.
VERDICT: ✅ Worth merging - Previous reviews addressed all major concerns. Code is production-ready with solid test coverage.
KEY INSIGHT: After multiple review rounds, this is now a textbook example of clean benchmark integration - fail-fast error handling, appropriate data structures (sets for O(1) deduplication), and comprehensive tests of real behavior rather than mocks.
Add try/except blocks around LaminarService.update_evaluation_scores() and generate_cost_report() calls to ensure multi-hour evaluations succeed even if these non-critical telemetry utilities fail. Co-authored-by: openhands <openhands@all-hands.dev>
- Wrap Laminar telemetry and cost reporting in try/except blocks so expensive multi-hour evaluations don't fail at the telemetry step after completing
- Remove redundant None checks after explicit None handling on cost/tokens

Co-authored-by: openhands <openhands@all-hands.dev>
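The guard pattern from these two commits can be factored into a small helper. A sketch only: LaminarService and generate_cost_report are names from this PR, while run_safely is hypothetical and the callables are assumed to be imported elsewhere:

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)

def run_safely(step_name: str, fn: Callable[..., Any], *args: Any) -> None:
    # Telemetry and cost reporting are nice-to-have; a failure here must
    # not discard a multi-hour evaluation run that already completed.
    try:
        fn(*args)
    except Exception:
        logger.exception("%s failed; continuing", step_name)

# Usage sketch, assuming the PR's telemetry utilities are in scope:
# run_safely("Laminar scores", LaminarService.update_evaluation_scores, results)
# run_safely("cost report", generate_cost_report, results)
```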
all-hands-bot left a comment
Taste Rating: 🟡 Acceptable - Solid integration after extensive refinement. One runtime risk remains.
VERDICT:
KEY INSIGHT: After 15+ review rounds, the code is clean and well-tested, but the Harbor container credential passing is untested and could cause silent auth failures.
Final Summary

Changes made and pushed (commit 54a3d53):
- Replace --output-dir with --jobs-dir (harbor's actual flag)
- Replace --max-workers with --n-concurrent
- Replace --task-id with --task-name
- Pass LLM credentials via --ae (agent env) instead of the process env
- Use an absolute path for --jobs-dir to avoid CWD issues
- Remove the unused --timeout param (harbor uses --timeout-multiplier)
- Rewrite convert_harbor_to_eval_output for the actual harbor structure: harbor stores trial results in TIMESTAMP/TRIAL_NAME/result.json, not trials/*/trajectory.json
- Update tests to match the new harbor output structure

Co-authored-by: openhands <openhands@all-hands.dev>
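Given the layout named in this commit, result discovery reduces to a two-level glob. A sketch under that assumption (not the actual convert_harbor_to_eval_output code; the error-entry shape is illustrative):

```python
import json
from pathlib import Path

def discover_harbor_results(jobs_dir: Path) -> list[dict]:
    # Harbor writes <jobs_dir>/TIMESTAMP/TRIAL_NAME/result.json, i.e. two
    # directory levels below the jobs dir.
    results: list[dict] = []
    for result_file in sorted(jobs_dir.glob("*/*/result.json")):
        try:
            results.append(json.loads(result_file.read_text()))
        except (OSError, json.JSONDecodeError) as exc:
            # Record the failure instead of silently skipping, matching the
            # fail-loudly error handling adopted earlier in this PR.
            results.append({"error": str(exc), "path": str(result_file)})
    if not results:
        raise FileNotFoundError(f"no result.json files under {jobs_dir}")
    return results
```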
I confirmed that this works in my local terminal, merging in.
Summary
This PR integrates Terminal-Bench evaluation into the benchmarks repository using Harbor with the openhands-sdk agent.
Fixes #48
Changes
New Module:
- benchmarks/terminalbench/run_infer.py: Runs Terminal-Bench evaluation via Harbor CLI with openhands-sdk agent
- eval_infer.py: Processes results and generates evaluation reports
- config.py: Default configuration for Terminal-Bench evaluation
- README.md: Documentation with usage examples

CLI Entrypoints
- terminalbench-infer: Run Terminal-Bench inference
- terminalbench-eval: Process and report results

Tests
Added comprehensive tests covering:
- Result processing and aggregation
- Metrics calculation
- Error handling
- ATIF trajectory parsing
Usage
Architecture
The integration uses:
- The Harbor CLI with the openhands-sdk agent adapter (PR #738)
- Terminal-Bench containers for task execution
- ATIF trajectories collected from each run
Testing
All tests pass:
Pre-commit checks pass:
@neubig can click here to continue refining the PR
Evidence
✅ Full E2E Run Verified (3 tasks, real LLM calls, verified results)
Ran 3 Terminal-Bench sample tasks with claude-sonnet-4-20250514 via the LiteLLM proxy. All 3 completed with real agent execution — no auth errors, no exceptions, and verifier results are confirmed correct.

1. terminalbench-infer — Harbor runs the agent in Docker containers:
2. Agent logs confirm real LLM calls:
3. Verifier results confirmed correct (all are genuine agent failures):
   - Chess task: the agent played d2d4, but the correct checkmate-in-one moves are g2g4 and e2e4. Wrong move → reward=0.0 ✅
   - Polyglot task: the agent compiled a main binary in /app/polyglot/, but the verifier expected only main.py.c in that directory → reward=0.0 ✅
   - SQLite task: the agent built sqlite3 into /app/sqlite/ but didn't add it to PATH; the verifier runs sqlite3 directly → FileNotFoundError → reward=0.0 ✅
4. terminalbench-eval processes results correctly:
5. All 15 unit tests pass:
6. Pre-commit checks pass:
Changes from review (commit 54a3d53)
- Renamed flags to harbor's actual names: --output-dir → --jobs-dir, --max-workers → --n-concurrent, --task-id → --task-name
- Pass LLM credentials via --ae (agent env) flags instead of the subprocess env
- Use an absolute path for --jobs-dir to avoid CWD-relative path issues
- Removed the unused --timeout parameter
- Rewrote convert_harbor_to_eval_output for the actual harbor output structure (TIMESTAMP/TRIAL_NAME/result.json)
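For reference, the flag changes above combine into a command builder along these lines. A sketch only: the --jobs-dir, --task-name, and --ae flags are taken from this PR, while the `harbor run` subcommand and the credential key names are assumptions:

```python
import os

def build_harbor_command(jobs_dir: str, task_name: str) -> list[str]:
    # Flags per this PR: --jobs-dir (absolute, to avoid CWD-relative paths),
    # --task-name, and --ae KEY=VALUE to pass credentials into the agent's
    # container env rather than relying on the subprocess env.
    cmd = [
        "harbor", "run",                       # subcommand is an assumption
        "--jobs-dir", os.path.abspath(jobs_dir),
        "--task-name", task_name,
    ]
    for key in ("LLM_API_KEY", "LLM_BASE_URL"):  # illustrative key names
        value = os.environ.get(key)
        if value is not None:
            cmd += ["--ae", f"{key}={value}"]
    return cmd
```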