
[Leaderboard] Altimate Code — Claude Sonnet 4.6 — Pass@1 0.671 (relaxed validators), 0.6187 (Original validators) #44

Open
sahrizvi wants to merge 1 commit into ucbepic:main from sahrizvi:submission/altimate-code-sonnet-46-n5

Conversation


@sahrizvi sahrizvi commented May 3, 2026

Altimate Code — Leaderboard Submission

Agent name: Altimate Code
Project page: altimate.sh
Backbone LLM: Claude Sonnet 4.6 (via OpenRouter — openrouter/anthropic/claude-sonnet-4.6)
Hints: Yes (db_description_withhint.txt injected into the user prompt)
Trials: 5 per query (270 trials total across 12 datasets, 54 queries)

Result

The same 270 trial answers were scored against two validator versions of the benchmark — the version we ran against, and the post-relaxation version on main at submission time.

| Metric | Original validators (9031c68ad) | Relaxed validators (5ec934595) |
| --- | --- | --- |
| Stratified Pass@1 (leaderboard metric) | 0.6187 | 0.6710 |
| Micro Pass@1 (passes / trials) | 0.6963 | 0.7407 |
| Pass count | 188/270 | 200/270 |

Note on validator versions. Our trials executed when vendor/DataAgentBench was at commit 9031c68ad. Upstream subsequently merged commits 16ccc3cbd ("Relax 16 validators to accept semantically-correct answers") and 7c94cbf4c ("Relax 3 more validators"), which together updated 17 validate.py files across 6 datasets. Re-running the scoring step (no agent re-execution) against the relaxed validators lifted 12 trials from fail to pass. Both numbers are reproducible from the same trial answers in submission.json.
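
For reproducibility, the re-scoring step can be illustrated with a minimal sketch. The submission.json layout and the per-query validate.py invocation shown below are assumptions for illustration only; the actual interfaces live in vendor/DataAgentBench.

```python
import json
import subprocess
from pathlib import Path

def rescore(submission_path: str, bench_root: str) -> float:
    # Assumed layout: submission.json is a list of per-trial records with
    # "dataset", "query_id", and "answer_file" fields (illustrative only).
    trials = json.loads(Path(submission_path).read_text())
    passes = 0
    for trial in trials:
        validator = Path(bench_root) / trial["dataset"] / trial["query_id"] / "validate.py"
        # Feed the saved answer to the query's validator; treat exit code 0 as a pass.
        result = subprocess.run(["python", str(validator), trial["answer_file"]])
        passes += result.returncode == 0
    return passes / len(trials)

# Scoring the same 270 saved answers against both benchmark checkouts:
#   rescore("submission.json", "checkouts/9031c68ad")  # -> 0.6963 micro Pass@1
#   rescore("submission.json", "checkouts/5ec934595")  # -> 0.7407 micro Pass@1
```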

Architecture

The submission is a heterogeneous-warehouse data agent built on top of altimate-code, an open-source TypeScript agent runtime. The agent:

  1. Reads each dataset's db_description_withhint.txt (injected into the user prompt) for cross-DB join keys, term codes, and output-format guidance.
  2. Uses native data tools (schema_index, schema_search, schema_inspect, sql_execute, warehouse_list) to introspect schemas and run queries against PostgreSQL, SQLite, DuckDB, and MongoDB.
  3. Reaches for validation skills (sql-review, query-optimize, lineage-diff, sql-translate) to catch SQL anti-patterns and trace column provenance before committing an answer.
  4. Iterates on execution errors; when max turns is reached in headless mode, the agent commits its best-guess answer to ANSWER rather than producing a meta-summary.
  5. Writes one solve.py per query and iterates in place (Edit, not rewrite) until convergence; the final answer goes to ANSWER (see the sketch after this list).
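
A minimal sketch of the per-query loop from steps 4–5. run_agent_turn() is a hypothetical stand-in for one agent turn; only the solve.py/ANSWER file convention and the 75-turn budget are taken from this write-up, and the real loop in the altimate-code runtime may differ.

```python
import subprocess
from pathlib import Path

MAX_TURNS = 75  # matches the Configuration section below

def run_agent_turn(workdir: Path) -> None:
    """Hypothetical placeholder: one LLM turn that edits solve.py in place."""
    ...

def solve_query(workdir: Path) -> str:
    solve_py = workdir / "solve.py"
    answer = workdir / "ANSWER"
    for _ in range(MAX_TURNS):
        run_agent_turn(workdir)  # the agent edits solve.py in place (Edit, not rewrite)
        result = subprocess.run(["python", str(solve_py)], cwd=workdir, capture_output=True)
        if result.returncode == 0 and answer.exists():
            return answer.read_text()  # converged: commit the final answer
        # otherwise the error output is fed back into the next agent turn
    # at max turns the agent still commits its best guess rather than a meta-summary
    return answer.read_text() if answer.exists() else ""
```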

Per-dataset Pass@1

| Dataset | Original | Relaxed | Δ |
| --- | --- | --- | --- |
| bookreview | 1.000 | 1.000 | 0.000 |
| yelp | 0.886 | 0.914 | +0.029 |
| stockindex | 0.867 | 0.933 | +0.066 |
| crmarenapro | 0.862 | 0.862 | 0.000 |
| PANCANCER_ATLAS | 0.800 | 0.800 | 0.000 |
| agnews | 0.800 | 0.800 | 0.000 |
| stockmarket | 0.760 | 0.960 | +0.200 |
| music_brainz_20k | 0.400 | 0.733 | +0.333 |
| googlelocal | 0.600 | 0.600 | 0.000 |
| GITHUB_REPOS | 0.350 | 0.350 | 0.000 |
| DEPS_DEV_V1 | 0.100 | 0.100 | 0.000 |
| PATENTS | 0.000 | 0.000 | 0.000 |
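
As a consistency check, assuming the stratified metric is the unweighted mean of the per-dataset Pass@1 values above (our reading of the leaderboard metric, not a definition quoted from the benchmark here), the column averages reproduce the headline numbers:

```python
# Per-dataset Pass@1 values from the table above, in row order.
original = [1.000, 0.886, 0.867, 0.862, 0.800, 0.800,
            0.760, 0.400, 0.600, 0.350, 0.100, 0.000]
relaxed  = [1.000, 0.914, 0.933, 0.862, 0.800, 0.800,
            0.960, 0.733, 0.600, 0.350, 0.100, 0.000]

print(sum(original) / len(original))  # ≈ 0.6187 (headline, original validators)
print(sum(relaxed) / len(relaxed))    # ≈ 0.6710 (headline, relaxed validators)
```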

Note on PATENTS

PATENTS scores 0.000 under both validator sets. Our agent produced well-formed CSV answers on every PATENTS trial but reached a different subset of CPC codes than the reference; the failure mode is query interpretation (specifically, the EMA initialization convention and the CPC hierarchy-level definition are not pinned down by the question), not format or harness.
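
As a generic illustration of the EMA ambiguity (not the actual PATENTS query), two common initialization conventions, seeding with the first observation versus seeding with a simple moving average of the first n points, already produce different series, so any ranking derived from the EMA can differ as well:

```python
def ema(values, n, seed_with_sma=False):
    """Exponential moving average under two common initialization conventions."""
    alpha = 2 / (n + 1)
    if seed_with_sma:
        prev = sum(values[:n]) / n  # seed with the simple moving average of the first n points
        start = n
    else:
        prev = values[0]            # seed with the first observation
        start = 1
    out = [prev]
    for v in values[start:]:
        prev = alpha * v + (1 - alpha) * prev
        out.append(prev)
    return out

series = [10, 12, 11, 15, 14, 16, 18]
print(ema(series, n=3))                      # first-value seed
print(ema(series, n=3, seed_with_sma=True))  # SMA seed: different values and offset
```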

We chose not to add per-dataset hand-tuning to lift this number, in keeping with our principle of only using general-purpose agent improvements.

Configuration

  • Max turns: 75 per trial
  • Per-trial timeout: 2000s
  • Concurrency: 4 trials in parallel
  • Wall-clock: ~4h 2m for the full 270-trial run
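
A minimal sketch of how these settings could be wired into a trial runner, using a hypothetical run_trial() placeholder; the actual harness in altimate-code may differ.

```python
import asyncio

MAX_CONCURRENCY = 4      # trials in parallel
TRIAL_TIMEOUT_S = 2000   # per-trial timeout

async def run_trial(query_id: str, trial_idx: int) -> dict:
    """Hypothetical placeholder: execute one agent trial and return its result."""
    ...

async def run_all(queries: list[str], trials_per_query: int = 5) -> list[dict]:
    sem = asyncio.Semaphore(MAX_CONCURRENCY)

    async def guarded(query_id: str, trial_idx: int) -> dict:
        async with sem:  # at most MAX_CONCURRENCY trials run at once
            try:
                return await asyncio.wait_for(
                    run_trial(query_id, trial_idx), timeout=TRIAL_TIMEOUT_S
                )
            except asyncio.TimeoutError:
                return {"query": query_id, "trial": trial_idx, "status": "timeout"}

    tasks = [guarded(q, i) for q in queries for i in range(trials_per_query)]
    return await asyncio.gather(*tasks)
```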

@sahrizvi sahrizvi changed the title [Leaderboard] Altimate Code — Claude Sonnet 4.6 — Pass@1 0.671 (relaxed validators) [Leaderboard] Altimate Code — Claude Sonnet 4.6 — Pass@1 0.671 (relaxed validators), 0.6187 (Original validators) May 3, 2026
@Ruiying-Ma (Collaborator) commented

Hi @sahrizvi — thank you for your contribution!
Would you mind also sharing the traces for all runs across all queries (for example, as a zip file)? We’ll use them for validation checks. Once they’re available, we’ll re-run the verification and post the Pass@1 result here.

@sahrizvi sahrizvi (Author) commented May 5, 2026

Hi @Ruiying-Ma! Thanks for the quick turnaround.

The full set of per-trial traces (270 trials, 54 queries × 5) is attached as a 13.4 MB zip. The layout is described in the included README.md, and a copy of submission.json is bundled at the archive root for self-contained verification. We used the dab-improvements-integration branch of Altimate-Code for this run.

Happy to provide additional metadata or adjust the layout if anything's harder to verify than expected.
Attachment: dab-traces-altimate-code-n5.zip
