feat: add Terminal-Bench benchmark integration #427
Conversation
all-hands-bot left a comment
Taste Rating: 🟡 Acceptable - Core logic is sound, but has data structure inefficiencies and over-engineering for imaginary problems.
all-hands-bot left a comment
Taste Rating: 🟡 Acceptable - Solid implementation with good error handling. Previous review issues properly addressed. Only minor cleanup needed.
all-hands-bot left a comment
Taste Rating: 🟡 Acceptable - Previous review issues properly addressed. Solid implementation with good error handling and appropriate data structures. Only minor edge cases remain.
all-hands-bot left a comment
Taste Rating: 🟡 Acceptable - Previous review issues properly addressed. Solid implementation with good error handling and appropriate data structures. Only minor cleanup opportunity remains.
VERDICT: ✅ Worth merging - Core integration is pragmatic and well-tested.
KEY INSIGHT: This is a clean, straightforward benchmark integration that avoids over-engineering. The use of sets for deduplication and fail-fast error handling shows good engineering judgment.
all-hands-bot left a comment
Taste Rating: 🟡 Acceptable - Solid integration, but solving imaginary problems with unused config.
VERDICT: ✅ Worth merging - Core logic is sound. Optional cleanup of dead config suggested.
KEY INSIGHT: You're maintaining config for hypothetical future features instead of building what you need today. Good code removes unused abstractions.
all-hands-bot left a comment
Clean, pragmatic benchmark integration. Previous reviews addressed all major concerns - the code is now production-ready with solid test coverage and appropriate error handling. ✅
Integrate Terminal-Bench evaluation using Harbor with the openhands-sdk agent.

- Add benchmarks/terminalbench module with:
  - run_infer.py: Runs Terminal-Bench via Harbor CLI with openhands-sdk agent
  - eval_infer.py: Processes results and generates evaluation reports
  - config.py: Default configuration for Terminal-Bench evaluation
  - README.md: Documentation with usage examples
- Add CLI entrypoints:
  - terminalbench-infer: Run Terminal-Bench inference
  - terminalbench-eval: Process and report results
- Add tests covering:
  - Result processing and aggregation
  - Metrics calculation
  - Error handling
  - ATIF trajectory parsing

The integration uses the openhands-sdk agent adapter in Harbor (PR #738) to run tasks in Terminal-Bench containers and collects ATIF trajectories.

Closes #48

Co-authored-by: openhands <openhands@all-hands.dev>
- Refactor eval_infer.py to use sets instead of a redundant list+set pattern:
  - O(1) lookups and automatic deduplication
  - Convert to sorted lists only in the final report for consistent output
- Simplify trajectory discovery in run_infer.py:
  - Fail fast if the Harbor trials/ directory is not found
  - Remove fallback rglob scanning that masks errors
- Improve error handling in convert_harbor_to_eval_output:
  - Record errors per trajectory instead of silently skipping
  - Fail loudly if all trajectories fail to parse
  - Include error entries in the output for visibility
- Remove unused TASK_CATEGORIES from config.py
- Add comprehensive tests for convert_harbor_to_eval_output:
  - test_successful_trajectory_parsing
  - test_malformed_trajectory_handling
  - test_mixed_valid_invalid_trajectories
  - test_empty_harbor_output_directory
  - test_missing_trials_directory
  - test_trajectory_discovery_finds_all_trials

Co-authored-by: openhands <openhands@all-hands.dev>
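The set-vs-list refactor is easy to illustrate. A minimal sketch of the pattern (the function and field names here are illustrative, not the actual eval_infer.py code):

```python
# Sketch of the set-based deduplication described above. `build_report`,
# `task_id`, and `reward` are illustrative names, not the real identifiers.

def build_report(trial_results: list[dict]) -> dict:
    resolved: set[str] = set()  # O(1) membership checks, automatic dedup
    failed: set[str] = set()

    for result in trial_results:
        task_id = result["task_id"]
        if result.get("reward", 0.0) > 0:
            resolved.add(task_id)
        else:
            failed.add(task_id)

    # Convert to sorted lists only at the output boundary so the final
    # report is deterministic across runs.
    return {"resolved": sorted(resolved), "failed": sorted(failed)}
```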
- Use harbor_executable from config instead of the hardcoded 'harbor'
- Extract the instruction from the first user step instead of assuming steps[0]

Co-authored-by: openhands <openhands@all-hands.dev>
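The "first user step" change guards against trajectories where steps[0] is a system or agent step. A minimal sketch, assuming ATIF steps are dicts with role/content fields (the actual schema is defined by the trajectory format):

```python
# Sketch: take the instruction from the first step authored by the user
# instead of blindly reading steps[0]. Field names are assumptions about
# the ATIF step schema.

def extract_instruction(steps: list[dict]) -> str | None:
    for step in steps:
        if step.get("role") == "user":
            return step.get("content")
    return None  # no user step found; the caller decides how to handle this
```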
Force-pushed from 6431d1d to 32dc5a1
- Removed EVAL_DEFAULTS (never used)
- Removed unused INFER_DEFAULTS entries: split, n_limit, max_iterations, max_attempts, max_retries, workspace
- Added a comment explaining the purpose of error file handling

Co-authored-by: openhands <openhands@all-hands.dev>
Force-pushed from 32dc5a1 to 8a89633
all-hands-bot left a comment
Taste Rating: 🟡 Acceptable - Previous reviews addressed major concerns. One critical bug remains in metrics aggregation.
VERDICT: The `or` fallback will corrupt cost tracking when values are legitimately zero.
@OpenHands run the full harbor eval on one eval instance (make sure to start dockerd first), and also resolve all review comments using the GitHub GraphQL API
I'm on it! @neubig can track my progress at all-hands.dev
- Fix logic bug in eval_infer.py: use an explicit None check instead of an `or` fallback for cost/token aggregation (breaks when values are legitimately 0)
- Move the SecretStr import to the top of run_infer.py per project guidelines

Co-authored-by: openhands <openhands@all-hands.dev>
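The aggregation bug is a classic falsy-zero trap. A minimal sketch of the failure mode and the fix (variable names are illustrative):

```python
# A legitimately free or zero-token trial has cost 0.0, which is falsy.
metrics = {"cost": 0.0}
default_cost = 0.02

# Buggy: `or` treats the real 0.0 as "missing" and substitutes the default.
buggy = metrics.get("cost") or default_cost          # -> 0.02 (wrong)

# Fixed: fall back only when the value is actually absent.
cost = metrics.get("cost")
fixed = cost if cost is not None else default_cost   # -> 0.0 (correct)
```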
all-hands-bot left a comment
Taste Rating: 🟢 Good taste - Clean, pragmatic integration that avoids over-engineering.
VERDICT: ✅ Worth merging - Previous reviews addressed all major concerns. Code is production-ready with solid test coverage.
KEY INSIGHT: After multiple review rounds, this is now a textbook example of clean benchmark integration - fail-fast error handling, appropriate data structures (sets for O(1) deduplication), and comprehensive tests of real behavior rather than mocks.
Add try/except blocks around LaminarService.update_evaluation_scores() and generate_cost_report() calls to ensure multi-hour evaluations succeed even if these non-critical telemetry utilities fail. Co-authored-by: openhands <openhands@all-hands.dev>
- Wrap Laminar telemetry and cost reporting in try/except blocks so expensive multi-hour evaluations don't fail at the telemetry step after completing
- Remove redundant None checks after explicit None handling on cost/tokens

Co-authored-by: openhands <openhands@all-hands.dev>
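The guard pattern from these two commits can be factored into a small helper. A sketch only: LaminarService and generate_cost_report are names from this PR, while run_safely is hypothetical and the callables are assumed to be imported elsewhere:

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)

def run_safely(step_name: str, fn: Callable[..., Any], *args: Any) -> None:
    # Telemetry and cost reporting are nice-to-have; a failure here must
    # not discard a multi-hour evaluation run that already completed.
    try:
        fn(*args)
    except Exception:
        logger.exception("%s failed; continuing", step_name)

# Usage sketch, assuming the PR's telemetry utilities are in scope:
# run_safely("Laminar scores", LaminarService.update_evaluation_scores, results)
# run_safely("cost report", generate_cost_report, results)
```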
all-hands-bot left a comment
Taste Rating: 🟡 Acceptable - Solid integration after extensive refinement. One runtime risk remains.
VERDICT:
KEY INSIGHT: After 15+ review rounds, the code is clean and well-tested, but the Harbor container credential passing is untested and could cause silent auth failures.
Final Summary

Changes made and pushed (commit 54a3d53):
- Replace --output-dir with --jobs-dir (harbor's actual flag)
- Replace --max-workers with --n-concurrent
- Replace --task-id with --task-name
- Pass LLM credentials via --ae (agent env) instead of the process env
- Use an absolute path for --jobs-dir to avoid CWD issues
- Remove the unused --timeout param (harbor uses --timeout-multiplier)
- Rewrite convert_harbor_to_eval_output for the actual harbor structure: harbor stores trial results in TIMESTAMP/TRIAL_NAME/result.json, not trials/*/trajectory.json
- Update tests to match the new harbor output structure

Co-authored-by: openhands <openhands@all-hands.dev>
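Given the layout named in this commit, result discovery reduces to a two-level glob. A sketch under that assumption (not the actual convert_harbor_to_eval_output code; the error-entry shape is illustrative):

```python
import json
from pathlib import Path

def discover_harbor_results(jobs_dir: Path) -> list[dict]:
    # Harbor writes <jobs_dir>/TIMESTAMP/TRIAL_NAME/result.json, i.e. two
    # directory levels below the jobs dir.
    results: list[dict] = []
    for result_file in sorted(jobs_dir.glob("*/*/result.json")):
        try:
            results.append(json.loads(result_file.read_text()))
        except (OSError, json.JSONDecodeError) as exc:
            # Record the failure instead of silently skipping, matching the
            # fail-loudly error handling adopted earlier in this PR.
            results.append({"error": str(exc), "path": str(result_file)})
    if not results:
        raise FileNotFoundError(f"no result.json files under {jobs_dir}")
    return results
```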
I confirmed that this works in my local terminal, merging in.
Summary
This PR integrates Terminal-Bench evaluation into the benchmarks repository using Harbor with the openhands-sdk agent.
Fixes #48
Changes
New Module:
- benchmarks/terminalbench/run_infer.py: Runs Terminal-Bench evaluation via Harbor CLI with openhands-sdk agent
- eval_infer.py: Processes results and generates evaluation reports
- config.py: Default configuration for Terminal-Bench evaluation
- README.md: Documentation with usage examples

CLI Entrypoints
- terminalbench-infer: Run Terminal-Bench inference
- terminalbench-eval: Process and report results

Tests
Added comprehensive tests covering:
- Result processing and aggregation
- Metrics calculation
- Error handling
- ATIF trajectory parsing
Usage
Architecture
The integration uses:
- The Harbor CLI with the openhands-sdk agent adapter (PR #738)
- Terminal-Bench containers for task execution
- ATIF trajectories collected from each run
Testing
All tests pass:
Pre-commit checks pass:
@neubig can click here to continue refining the PR
Evidence
✅ Full E2E Run Verified (3 tasks, real LLM calls, verified results)
Ran 3 Terminal-Bench sample tasks with claude-sonnet-4-20250514 via the LiteLLM proxy. All 3 completed with real agent execution — no auth errors, no exceptions, and verifier results are confirmed correct.

1. terminalbench-infer — Harbor runs the agent in Docker containers:
2. Agent logs confirm real LLM calls:
3. Verifier results confirmed correct (all are genuine agent failures):
   - Chess task: the agent played d2d4, but the correct checkmate-in-one moves are g2g4 and e2e4. Wrong move → reward=0.0 ✅
   - Polyglot task: the agent compiled a main binary in /app/polyglot/, but the verifier expected only main.py.c in that directory → reward=0.0 ✅
   - SQLite task: the agent built sqlite3 into /app/sqlite/ but didn't add it to PATH; the verifier runs sqlite3 directly → FileNotFoundError → reward=0.0 ✅
4. terminalbench-eval processes results correctly:
5. All 15 unit tests pass:
6. Pre-commit checks pass:
Changes from review (commit 54a3d53)
- Renamed flags to harbor's actual names: --output-dir → --jobs-dir, --max-workers → --n-concurrent, --task-id → --task-name
- Pass LLM credentials via --ae (agent env) flags instead of the subprocess env
- Use an absolute path for --jobs-dir to avoid CWD-relative path issues
- Removed the unused --timeout parameter
- Rewrote convert_harbor_to_eval_output for the actual harbor output structure (TIMESTAMP/TRIAL_NAME/result.json)
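For reference, the flag changes above combine into a command builder along these lines. A sketch only: the --jobs-dir, --task-name, and --ae flags are taken from this PR, while the `harbor run` subcommand and the credential key names are assumptions:

```python
import os

def build_harbor_command(jobs_dir: str, task_name: str) -> list[str]:
    # Flags per this PR: --jobs-dir (absolute, to avoid CWD-relative paths),
    # --task-name, and --ae KEY=VALUE to pass credentials into the agent's
    # container env rather than relying on the subprocess env.
    cmd = [
        "harbor", "run",                       # subcommand is an assumption
        "--jobs-dir", os.path.abspath(jobs_dir),
        "--task-name", task_name,
    ]
    for key in ("LLM_API_KEY", "LLM_BASE_URL"):  # illustrative key names
        value = os.environ.get(key)
        if value is not None:
            cmd += ["--ae", f"{key}={value}"]
    return cmd
```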