
[Leaderboard] Altimate Code — Claude Sonnet 4.6 — Pass@1 0.671 (relaxed validators), 0.6187 (Original validators) #44

Open
sahrizvi wants to merge 1 commit into ucbepic:main from sahrizvi:submission/altimate-code-sonnet-46-n5

Conversation


@sahrizvi sahrizvi commented May 3, 2026

Altimate Code — Leaderboard Submission

Agent name: Altimate Code
Project page: altimate.sh
Backbone LLM: Claude Sonnet 4.6 (via OpenRouter — openrouter/anthropic/claude-sonnet-4.6)
Hints: Yes (db_description_withhint.txt injected into the user prompt)
Trials: 5 per query (270 trials total across 12 datasets, 54 queries)

Result

The same 270 trial answers were scored against two validator versions of the benchmark — the version we ran against, and the post-relaxation version on main at submission time.

| Metric | Original validators (9031c68ad) | Relaxed validators (5ec934595) |
| --- | --- | --- |
| Stratified Pass@1 (leaderboard metric) | 0.6187 | 0.6710 |
| Micro Pass@1 (passes / trials) | 0.6963 | 0.7407 |
| Pass count | 188/270 | 200/270 |

Note on validator versions. Our trials executed when vendor/DataAgentBench was at commit 9031c68ad. Upstream subsequently merged commits 16ccc3cbd ("Relax 16 validators to accept semantically-correct answers") and 7c94cbf4c ("Relax 3 more validators"), which together updated 17 validate.py files across 6 datasets. Re-running the scoring step (no agent re-execution) against the relaxed validators lifted 12 trials from fail to pass. Both numbers are reproducible from the same trial answers in submission.json.
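
For reproducibility, the re-scoring step can be illustrated with a minimal sketch. The submission.json layout and the per-query validate.py invocation shown below are assumptions for illustration only; the actual interfaces live in vendor/DataAgentBench.

```python
import json
import subprocess
from pathlib import Path

def rescore(submission_path: str, bench_root: str) -> float:
    # Assumed layout: submission.json is a list of per-trial records with
    # "dataset", "query_id", and "answer_file" fields (illustrative only).
    trials = json.loads(Path(submission_path).read_text())
    passes = 0
    for trial in trials:
        validator = Path(bench_root) / trial["dataset"] / trial["query_id"] / "validate.py"
        # Feed the saved answer to the query's validator; treat exit code 0 as a pass.
        result = subprocess.run(["python", str(validator), trial["answer_file"]])
        passes += result.returncode == 0
    return passes / len(trials)

# Scoring the same 270 saved answers against both benchmark checkouts:
#   rescore("submission.json", "checkouts/9031c68ad")  # -> 0.6963 micro Pass@1
#   rescore("submission.json", "checkouts/5ec934595")  # -> 0.7407 micro Pass@1
```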

Architecture

The submission is a heterogeneous-warehouse data agent built on top of altimate-code, an open-source TypeScript agent runtime. The agent:

  1. Reads each dataset's db_description_withhint.txt (injected into the user prompt) for cross-DB join keys, term codes, and output-format guidance.
  2. Uses native data tools (schema_index, schema_search, schema_inspect, sql_execute, warehouse_list) to introspect schemas and run queries against PostgreSQL, SQLite, DuckDB, and MongoDB.
  3. Reaches for validation skills (sql-review, query-optimize, lineage-diff, sql-translate) to catch SQL anti-patterns and trace column provenance before committing an answer.
  4. Iterates on execution errors; when max turns is reached in headless mode, the agent commits its best-guess answer to ANSWER rather than producing a meta-summary.
  5. Writes one solve.py per query and iterates in place (Edit, not rewrite) until convergence; the final answer goes to ANSWER (see the sketch after this list).
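
A minimal sketch of the per-query loop from steps 4–5. run_agent_turn() is a hypothetical stand-in for one agent turn; only the solve.py/ANSWER file convention and the 75-turn budget are taken from this write-up, and the real loop in the altimate-code runtime may differ.

```python
import subprocess
from pathlib import Path

MAX_TURNS = 75  # matches the Configuration section below

def run_agent_turn(workdir: Path) -> None:
    """Hypothetical placeholder: one LLM turn that edits solve.py in place."""
    ...

def solve_query(workdir: Path) -> str:
    solve_py = workdir / "solve.py"
    answer = workdir / "ANSWER"
    for _ in range(MAX_TURNS):
        run_agent_turn(workdir)  # the agent edits solve.py in place (Edit, not rewrite)
        result = subprocess.run(["python", str(solve_py)], cwd=workdir, capture_output=True)
        if result.returncode == 0 and answer.exists():
            return answer.read_text()  # converged: commit the final answer
        # otherwise the error output is fed back into the next agent turn
    # at max turns the agent still commits its best guess rather than a meta-summary
    return answer.read_text() if answer.exists() else ""
```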

Per-dataset Pass@1

| Dataset | Original | Relaxed | Δ |
| --- | --- | --- | --- |
| bookreview | 1.000 | 1.000 | 0.000 |
| yelp | 0.886 | 0.914 | +0.029 |
| stockindex | 0.867 | 0.933 | +0.066 |
| crmarenapro | 0.862 | 0.862 | 0.000 |
| PANCANCER_ATLAS | 0.800 | 0.800 | 0.000 |
| agnews | 0.800 | 0.800 | 0.000 |
| stockmarket | 0.760 | 0.960 | +0.200 |
| music_brainz_20k | 0.400 | 0.733 | +0.333 |
| googlelocal | 0.600 | 0.600 | 0.000 |
| GITHUB_REPOS | 0.350 | 0.350 | 0.000 |
| DEPS_DEV_V1 | 0.100 | 0.100 | 0.000 |
| PATENTS | 0.000 | 0.000 | 0.000 |
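
As a consistency check, assuming the stratified metric is the unweighted mean of the per-dataset Pass@1 values above (our reading of the leaderboard metric, not a definition quoted from the benchmark here), the column averages reproduce the headline numbers:

```python
# Per-dataset Pass@1 values from the table above, in row order.
original = [1.000, 0.886, 0.867, 0.862, 0.800, 0.800,
            0.760, 0.400, 0.600, 0.350, 0.100, 0.000]
relaxed  = [1.000, 0.914, 0.933, 0.862, 0.800, 0.800,
            0.960, 0.733, 0.600, 0.350, 0.100, 0.000]

print(sum(original) / len(original))  # ≈ 0.6187 (headline, original validators)
print(sum(relaxed) / len(relaxed))    # ≈ 0.6710 (headline, relaxed validators)
```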

Note on PATENTS

PATENTS scores 0.000 under both validator sets. Our agent produced well-formed CSV answers on every PATENTS trial but reached a different subset of CPC codes than the reference; the failure mode is query interpretation (specifically, the EMA initialization convention and the CPC hierarchy-level definition are not pinned down by the question), not format or harness.
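
As a generic illustration of the EMA ambiguity (not the actual PATENTS query), two common initialization conventions, seeding with the first observation versus seeding with a simple moving average of the first n points, already produce different series, so any ranking derived from the EMA can differ as well:

```python
def ema(values, n, seed_with_sma=False):
    """Exponential moving average under two common initialization conventions."""
    alpha = 2 / (n + 1)
    if seed_with_sma:
        prev = sum(values[:n]) / n  # seed with the simple moving average of the first n points
        start = n
    else:
        prev = values[0]            # seed with the first observation
        start = 1
    out = [prev]
    for v in values[start:]:
        prev = alpha * v + (1 - alpha) * prev
        out.append(prev)
    return out

series = [10, 12, 11, 15, 14, 16, 18]
print(ema(series, n=3))                      # first-value seed
print(ema(series, n=3, seed_with_sma=True))  # SMA seed: different values and offset
```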

We chose not to add per-dataset hand-tuning to lift this number, in keeping with our principle of only using general-purpose agent improvements.

Configuration

  • Max turns: 75 per trial
  • Per-trial timeout: 2000s
  • Concurrency: 4 trials in parallel
  • Wall-clock: ~4h 2m for the full 270-trial run
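
A minimal sketch of how these settings could be wired into a trial runner, using a hypothetical run_trial() placeholder; the actual harness in altimate-code may differ.

```python
import asyncio

MAX_CONCURRENCY = 4      # trials in parallel
TRIAL_TIMEOUT_S = 2000   # per-trial timeout

async def run_trial(query_id: str, trial_idx: int) -> dict:
    """Hypothetical placeholder: execute one agent trial and return its result."""
    ...

async def run_all(queries: list[str], trials_per_query: int = 5) -> list[dict]:
    sem = asyncio.Semaphore(MAX_CONCURRENCY)

    async def guarded(query_id: str, trial_idx: int) -> dict:
        async with sem:  # at most MAX_CONCURRENCY trials run at once
            try:
                return await asyncio.wait_for(
                    run_trial(query_id, trial_idx), timeout=TRIAL_TIMEOUT_S
                )
            except asyncio.TimeoutError:
                return {"query": query_id, "trial": trial_idx, "status": "timeout"}

    tasks = [guarded(q, i) for q in queries for i in range(trials_per_query)]
    return await asyncio.gather(*tasks)
```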

@sahrizvi sahrizvi changed the title [Leaderboard] Altimate Code — Claude Sonnet 4.6 — Pass@1 0.671 (relaxed validators) [Leaderboard] Altimate Code — Claude Sonnet 4.6 — Pass@1 0.671 (relaxed validators), 0.6187 (Original validators) May 3, 2026
@Ruiying-Ma (Collaborator) commented

Hi @sahrizvi — thank you for your contribution!
Would you mind also sharing the traces for all runs across all queries (for example, as a zip file)? We’ll use them for validation checks. Once they’re available, we’ll re-run the verification and post the Pass@1 result here.

@sahrizvi sahrizvi (Author) commented May 5, 2026

Hi @Ruiying-Ma! Thanks for the quick turnaround.

The full set of per-trial traces (270 trials, 54 queries × 5) is attached as a 13.4 MB zip. The layout is described in the included README.md, and a copy of submission.json is bundled at the archive root for self-contained verification. We used the dab-improvements-integration branch of Altimate-Code for this run.

Happy to provide additional metadata or adjust the layout if anything's harder to verify than expected.
Attachment: dab-traces-altimate-code-n5.zip
