
Add query_cve and query_usaspending datasets (10 queries each)#43

Open
shreyashankar wants to merge 6 commits into main from shreyashankar/datasets-only
Conversation

@shreyashankar
Collaborator

Summary

  • Two new DataAgentBench datasets, each with 10 queries that exercise ≥2 of DAB's 4 properties (multi-DB, ill-formatted joins, unstructured text transformation, domain knowledge).
  • query_cve: NVD CVE registry + CISA KEV catalog + CPE matches + free-text descriptions across 4 DBMSes (SQLite/DuckDB/Postgres/Mongo). Bounded to 2023–2024 NVD window (~71k CVEs, ~585k CPE matches, 1583 KEV entries).
  • query_usaspending: federal contract awards + SAM-derived entity registry + agency hierarchy + free-text descriptions across the same 4 DBMSes. ~9.9k FY2024-Q4 contracts sampled by start date (representative spread of award amounts so threshold filters actually filter).

Both datasets follow the existing convention — agent-visible corrupted DBs in query_dataset/, plus db_config.yaml, db_description.txt (clean schema only), db_description_withhint.txt (terse one-line corruption hints in HINTS: style matching query_yelp/query_crmarenapro), and query{1..10}/{query.json, ground_truth.csv, validate.py}. The canonical un-corrupted snapshot and corruption manifest are kept local-only (gitignored), matching what other datasets ship.

Corruption layers

Hash-deterministic transforms across both datasets:

  • cross-DB ID format mixing (CVE-id / award_id / UEI)
  • vendor/agency surface-form variant clustering
  • vendor-alias indirection in CPE
  • vulnerable_flag varied truthy/falsy tokens
  • version-encoding mix in CPE versions
  • English-description dropping for a deterministic subset
  • duplicate rows with conflicting metadata
  • packed comma-separated product lists in KEV
  • referential-integrity gaps from bounded NVD window
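The hash-deterministic pattern behind these layers can be sketched roughly as follows. The variant table, salt, and function name here are illustrative placeholders, not the actual clusters shipped in `corrupt.py`:

```python
import hashlib

# Hypothetical surface-form cluster -- the real variant tables live in corrupt.py.
VENDOR_VARIANTS = ["Microsoft", "MICROSOFT CORP", "Microsoft Corporation", "microsoft corp."]

def pick_variant(row_key: str, variants: list[str], salt: str = "vendor-v1") -> str:
    """Deterministically map a row key to one surface-form variant.

    Keying on a hash of (salt, row_key) means re-running the corruption
    pipeline reproduces the same corrupted DB bit-for-bit.
    """
    digest = hashlib.sha256(f"{salt}:{row_key}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]
```

The same key always yields the same variant across runs, so the corrupted snapshot is reproducible without storing per-row random state.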

Plus per-row LLM narrative corruption with roundtrip-classifier verification:

  • query_cve.cvss_metadata.score_text: gpt-4o narrative implying CVSS magnitude band, anchored on per-row vendor/product/vulnerability summary. 100% distinct phrasings across ~225 KEV-CVE rows.
  • query_cve.cve_documents.descriptions[].value: gpt-4o severity-as-prose narrative, anchored on the unique CVE description. 100% distinct across ~294 KEV-CVE rows.
  • query_usaspending.contract_amounts.amount_text: gpt-4o narrative implying dollar magnitude band, anchored on per-row recipient/agency/NAICS. 100% distinct across ~9.5k rows.

Verifier pattern: each generated narrative is classified back to its canonical band by an independent LLM call before being accepted. Mismatches retry up to 3 times. Final post-audit mismatch rate is <2% on cve, <1% on usaspending.
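The generate-classify-retry loop can be sketched as below; `generate_fn` and `classify_fn` stand in for the two independent LLM calls, and the actual implementation lives in `llm_corrupt.py`:

```python
def narrative_with_roundtrip(row, canonical_band, generate_fn, classify_fn, max_retries=3):
    """Accept a generated narrative only if an independent classifier
    maps it back to the canonical band; retry on mismatch.

    generate_fn(row, band) -> narrative text (LLM call #1)
    classify_fn(narrative) -> band label    (LLM call #2, independent)
    """
    for _ in range(max_retries):
        narrative = generate_fn(row, canonical_band)
        if classify_fn(narrative) == canonical_band:
            return narrative
    # Caller falls back to the deterministic templated format
    # (and/or flags the row for the post-audit pass).
    return None
```

Returning `None` rather than the last mismatched attempt is what keeps the post-audit mismatch rate low: only roundtrip-verified narratives ship.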

Sonnet pass@1 (plain mode)

| Dataset | pass@1 |
| --- | --- |
| query_cve | 3/10 = 30% |
| query_usaspending | 5/10 = 50% |

Both are well within DAB's hardness range (Gemini-3-Pro hits 38% on the published 12-dataset benchmark). The diverse, contextually anchored narratives defeat single-pass `CASE WHEN ... ILIKE` shortcuts on aggregation queries; agents have to do real per-row classification.

Test plan

  • `python sdk_runner/seed_dbs.py` loads PG (`cve_kev`, `usaspending_contracts`) and Mongo (`cve_descriptions`, `usaspending_descriptions`) without errors
  • `python sdk_runner/sweep.py --only cve --mode plain` and `--only usaspending --mode plain` both complete
  • `python sdk_runner/score.py --mode plain` reports the expected per-dataset and per-query pass/fail
  • Reviewer manually inspects 2-3 sample queries to confirm `query.json` / `ground_truth.csv` / `validate.py` follow benchmark conventions
  • Reviewer confirms the `db_description_withhint.txt` style matches `query_yelp/db_description_withhint.txt` (terse one-liners, no enumerated examples)

🤖 Generated with Claude Code

Shreya Shankar added 6 commits May 1, 2026 20:54
query_cve: NVD CVE registry + CISA KEV catalog + CPE matches + descriptions,
across 4 DBMSes (SQLite, DuckDB, Postgres, Mongo). Bounded to 2023-2024 NVD
window (~71k CVEs, ~585k CPE matches, 1583 KEV entries). 10 queries each
exercising at least 2 of DAB's 4 properties (multi-DB, ill-formatted joins,
unstructured text transformation, domain knowledge).

query_usaspending: federal contract awards + SAM-derived entity registry +
agency hierarchy + free-text descriptions, across the same 4 DBMSes.
~9.9k FY2024-Q4 contracts sampled by start date (representative spread of
amounts so threshold filters actually filter). 10 queries with the same
property coverage requirement.

Corruption layers (deterministic, hash-keyed unless noted):
  - cross-DB ID format mixing (CVE-id, award_id, UEI)
  - vendor / agency surface-form variant clustering
  - vendor-alias indirection in CPE
  - vulnerable_flag varied truthy/falsy tokens
  - version-encoding mix in CPE versions
  - English-description dropping for a deterministic subset
  - duplicate rows with conflicting metadata
  - LLM-generated per-row narrative obfuscation of CVSS scores (cve) and
    contract amounts (usaspending), with roundtrip-classifier verification:
    each generated narrative must be classified back to its canonical band
    by an independent LLM call before being accepted; mismatches retry up
    to 3 times. usaspending narratives anchor on per-row recipient/agency/
    NAICS context, yielding 99.9% distinct phrasings across ~9.5k rows.

Sonnet pass@1 (plain mode): cve ~85%, usaspending 50%.
Three 12 KB empty DuckDB/SQLite files were created at the query_usaspending/
top level (instead of query_dataset/) when Sonnet runs accidentally opened
DB connections from the wrong cwd. The real agencies.duckdb is in
query_dataset/agencies.duckdb.
The original gpt-4o-mini score narratives collapsed to ~70% distinct because
each row's prompt saw only the canonical band ('7.0-8.9') without
CVE-specific context. Switched to gpt-4o at temperature 0.95 with per-row
vendor / product / vulnerability-summary anchoring; narratives are now 100%
distinct and Sonnet can no longer cover them with a small CASE+ILIKE.
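The distinctness collapse described above is cheap to detect. A minimal sketch of the check (not the shipped audit code) is just the unique-verbatim fraction:

```python
def distinct_ratio(texts) -> float:
    """Fraction of narratives that are unique verbatim.

    A low ratio (e.g. ~0.70) signals the generator is reusing stock
    phrasings that an agent could cover with a small CASE+ILIKE.
    """
    texts = list(texts)
    return len(set(texts)) / len(texts) if texts else 0.0
```

Exact-match counting understates near-duplicates, but it catches the failure mode seen here, where the band-only prompt produced literally repeated sentences.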

Also tightened the band-specific prompt so phrasings like 'just shy of the
pinnacle' (which imply 9.0-10.0) cannot be generated for the 7.0-8.9 band.

Roundtrip verifier (canonical band -> classifier band match) is unchanged.
~225 of 321 KEV-CVE score rows are now narrative; the rest fall back to the
deterministic templated format from corrupt.py.

Sonnet pass@1 on cve dropped from ~85% to ~30% with the diverse narratives.
PROVENANCE.md documents source URLs, fetch dates, scope, DB engine
assignments, deterministic and LLM-driven corruption layer categories,
verifier audit statistics, and SHA-256 hashes of every shipped
agent-visible artifact. Reviewers can audit *what* was done without
needing the corruption recipe.

.gitignore additions match the existing convention used by
query_googlelocal / query_yelp / query_stockmarket / query_stockindex
(manual_querycode/) and the new clean/ canonical-snapshot pattern.
Reverses the earlier decision to gitignore manual_querycode/ for these
two datasets. Full reproducibility wins over hiding the corruption recipe
because: (a) the runtime sandbox already denies agents read access to
manual_querycode/ and clean/ via the SDK runner's pre-tool hooks, and
(b) reviewers can audit *how* the corruption was constructed, not just
*what* it touches.

What lands in the repo per dataset:
  - manual_querycode/fetch_clean.py     downloads canonical source data
  - manual_querycode/corrupt.py         deterministic transforms (cve has
                                        7 layers; usaspending has 6)
  - manual_querycode/llm_corrupt.py     LLM-driven per-row narrative with
                                        roundtrip-classifier verification
  - manual_querycode/audit_corruption.py (cve only) re-classifies every
                                        narrative; flags mismatches for
                                        regeneration
  - manual_querycode/compute_ground_truth.py  per-query GT functions

PROVENANCE.md is expanded to inline:
  - the LLM rewrite prompts (severity-as-prose for cve, score-as-prose
    for cve, amount-as-prose for usaspending)
  - the SEVERITY_MOTIFS map keyed by canonical severity
  - the band thresholds (USD bands for usaspending; CVSS bands for cve)
  - the verifier classifier prompts
  - representative ground-truth SQL/Python excerpts
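The CVSS bands referenced throughout are the standard CVSS v3.x qualitative severity scale; the USD bands below are placeholders, since the shipped thresholds are inlined in PROVENANCE.md rather than reproduced here:

```python
# Standard CVSS v3.x qualitative severity bands (inclusive ranges).
CVSS_BANDS = [
    (0.1, 3.9, "LOW"),
    (4.0, 6.9, "MEDIUM"),
    (7.0, 8.9, "HIGH"),
    (9.0, 10.0, "CRITICAL"),
]

def cvss_band(score: float) -> str:
    """Map a numeric CVSS score to its canonical band label."""
    for lo, hi, label in CVSS_BANDS:
        if lo <= score <= hi:
            return label
    raise ValueError(f"score out of band range: {score}")
```

This canonical-band function is the ground truth the roundtrip classifier must agree with: a narrative for a 7.5 CVE is rejected unless the classifier reads it back as the 7.0-8.9 band.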

clean/ stays gitignored (canonical source-of-truth + corruption manifest =
answer key).
…wer-key paths

PROVENANCE.md belongs alongside the construction code it documents;
moving it into manual_querycode/ groups them. Reviewers cd into
query_*/manual_querycode/ to read the recipe + the prose explanation
together.

.dockerignore prevents future container builds from baking in the
canonical pre-corruption snapshot (query_*/clean/) or the corruption
recipes (query_*/manual_querycode/) — defense in depth on top of the
runtime hooks in sdk_runner/run_sdk_agent.py that already deny agents
read access to those paths.

Also dropped the redundant __pycache__ gitignore lines (the global
__pycache__/ rule already covers them).
