# Altimate Code — Leaderboard Submission
- **Agent name:** Altimate Code
- **Project page:** altimate.sh
- **Backbone LLM:** Claude Sonnet 4.6 (via OpenRouter: `openrouter/anthropic/claude-sonnet-4.6`)
- **Hints:** yes (`db_description_withhint.txt` injected into the user prompt)
- **Trials:** 5 per query (270 trials total across 12 datasets, 54 queries)
## Result
The same 270 trial answers were scored against two validator versions of the benchmark — the version we ran against, and the post-relaxation version on main at submission time.
| Metric | Original validators (`9031c68ad`) | Relaxed validators (`5ec934595`) |
|---|---|---|
| Stratified Pass@1 (leaderboard metric) | 0.6187 | 0.6710 |
| Micro Pass@1 (passes / trials) | 0.6963 | 0.7407 |
| Pass count | 188/270 | 200/270 |
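For clarity, the two aggregates can be sketched as follows, assuming "stratified" means an unweighted mean of per-dataset pass rates and "micro" means pooled passes over pooled trials (the dataset counts below are illustrative, not the benchmark's actual counts):

```python
# Sketch of the two aggregate metrics. Dataset names and counts are
# illustrative only; results maps dataset -> (passes, trials).

def micro_pass_at_1(results):
    """Pooled passes / trials across all datasets."""
    passes = sum(p for p, _ in results.values())
    trials = sum(t for _, t in results.values())
    return passes / trials

def stratified_pass_at_1(results):
    """Unweighted mean of per-dataset pass rates; every dataset weighs equally."""
    rates = [p / t for p, t in results.values()]
    return sum(rates) / len(rates)

results = {"small_ds": (5, 5), "large_ds": (0, 10)}
print(micro_pass_at_1(results))       # pooled: 5/15
print(stratified_pass_at_1(results))  # (1.0 + 0.0) / 2
```

The two diverge exactly when datasets have different sizes or difficulty, which is why the leaderboard's stratified number (0.6187) sits below the micro number (0.6963) here: the weak datasets drag the per-dataset mean down regardless of their trial counts.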
**Note on validator versions.** Our trials executed when `vendor/DataAgentBench` was at commit `9031c68ad`. Upstream subsequently merged commits `16ccc3cbd` ("Relax 16 validators to accept semantically-correct answers") and `7c94cbf4c` ("Relax 3 more validators"), which together updated 17 `validate.py` files across 6 datasets. Re-running the scoring step (no agent re-execution) against the relaxed validators lifted 12 trials from fail to pass. Both numbers are reproducible from the same trial answers in `submission.json`.
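Re-scoring without re-execution amounts to replaying saved answers through the new validators. A minimal sketch of that step, with an assumed `submission.json` layout and validator interface (the benchmark's actual schemas may differ):

```python
# Hypothetical re-scoring sketch: replay saved trial answers through a
# validator set without re-running the agent. The submission.json layout
# (a list of {dataset, query_id, answer}) and the validator signature
# are assumptions for illustration, not the benchmark's actual API.
import json

def rescore(submission_path, validators):
    """validators: dataset name -> callable(query_id, answer) -> bool."""
    with open(submission_path) as f:
        trials = json.load(f)
    passes = sum(
        1 for t in trials
        if validators[t["dataset"]](t["query_id"], t["answer"])
    )
    return passes, len(trials)
```

Running this twice with the old and new validator callables, on the same file, yields the two columns above.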
## Architecture
A heterogeneous-warehouse data agent built on top of altimate-code, an open-source TypeScript agent runtime, that:

- Reads each dataset's `db_description_withhint.txt` (injected into the user prompt) for cross-DB join keys, term codes, and output-format guidance.
- Uses native data tools (`schema_index`, `schema_search`, `schema_inspect`, `sql_execute`, `warehouse_list`) to introspect schemas and run queries against PostgreSQL, SQLite, DuckDB, and MongoDB.
- Invokes validation skills (`sql-review`, `query-optimize`, `lineage-diff`, `sql-translate`) to catch SQL anti-patterns and trace column provenance before committing an answer.
- Iterates against errors; when max turns is reached in headless mode, the agent commits its best-guess answer to `ANSWER` rather than producing a meta-summary.
- Writes one `solve.py` per query and iterates in place (Edit, not rewrite) until convergence; the final answer goes to `ANSWER`.
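The iterate-in-place loop with its max-turns fallback can be sketched as below; `run_solve` and `repair_in_place` are hypothetical stand-ins for the agent's actual execute/Edit tooling, not the altimate-code API:

```python
# Simplified sketch of the solve loop described above. run_solve returns
# (ok, result); repair_in_place edits the existing solve.py rather than
# rewriting it. Both are illustrative stand-ins.

def solve_query(run_solve, repair_in_place, max_turns=8):
    best_guess = None
    for _turn in range(max_turns):
        ok, result = run_solve()
        if ok:
            return result            # converged: commit to ANSWER
        best_guess = result or best_guess
        repair_in_place(result)      # edit in place, don't start over
    return best_guess                # max turns: commit best guess, never a meta-summary
```

The key design choice is the last line: a wrong-but-concrete answer can still pass a relaxed validator, whereas a meta-summary never can.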
## Per-dataset Pass@1
| Dataset | Original | Relaxed | Δ |
|---|---|---|---|
| bookreview | 1.000 | 1.000 | 0.000 |
| yelp | 0.886 | 0.914 | +0.029 |
| stockindex | 0.867 | 0.933 | +0.066 |
| crmarenapro | 0.862 | 0.862 | 0.000 |
| PANCANCER_ATLAS | 0.800 | 0.800 | 0.000 |
| agnews | 0.800 | 0.800 | 0.000 |
| stockmarket | 0.760 | 0.960 | +0.200 |
| music_brainz_20k | 0.400 | 0.733 | +0.333 |
| googlelocal | 0.600 | 0.600 | 0.000 |
| GITHUB_REPOS | 0.350 | 0.350 | 0.000 |
| DEPS_DEV_V1 | 0.100 | 0.100 | 0.000 |
| PATENTS | 0.000 | 0.000 | 0.000 |
## Note on PATENTS
PATENTS scores 0.000 under both validator sets. Our agent produced well-formed CSV answers on every PATENTS trial but reached a different subset of CPC codes than the reference; the failure mode is query interpretation (specifically, the EMA initialization convention and the CPC hierarchy-level definition are not pinned down by the question), not format or harness.
We chose not to add per-dataset hand-tuning to lift this number, in keeping with our principle of only using general-purpose agent improvements.
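To illustrate the EMA-initialization ambiguity in the abstract (this is a generic example, not the PATENTS query): two common conventions, seeding the EMA with the first observation versus seeding it with a simple average of the first `span` observations, produce different series, so an under-specified question admits multiple defensible answers.

```python
# Two common EMA initialization conventions, alpha = 2 / (span + 1).
# Which one a question intends is a genuine ambiguity: the resulting
# series differ in length and in their early values.

def ema_seed_first(values, span):
    """EMA seeded with the first observation."""
    alpha = 2 / (span + 1)
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out

def ema_seed_sma(values, span):
    """EMA seeded with the simple mean of the first `span` observations."""
    alpha = 2 / (span + 1)
    out = [sum(values[:span]) / span]
    for v in values[span:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out

prices = [10, 12, 11, 13, 14, 13]
print(ema_seed_first(prices, span=3))  # the two conventions disagree
print(ema_seed_sma(prices, span=3))    # on length and early values
```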
Hi @sahrizvi — thank you for your contribution!
Would you mind also sharing the traces for all runs across all queries (for example, as a zip file)? We’ll use them for validation checks. Once they’re available, we’ll re-run the verification and post the Pass@1 result here.
The full set of per-trial traces (270 trials, 54 queries × 5) is attached as a 13.4 MB zip. The layout is described in the included `README.md`, and a copy of `submission.json` is bundled at the archive root for self-contained verification. Note that we used the `dab-improvements-integration` branch of Altimate-Code for this run.

Happy to provide additional metadata or adjust the layout if anything's harder to verify than expected.

Attachment: dab-traces-altimate-code-n5.zip