
Add query_cve and query_usaspending datasets (10 queries each)#43

Open
shreyashankar wants to merge 6 commits into main from shreyashankar/datasets-only
Conversation

@shreyashankar
Collaborator

Summary

  • Two new DataAgentBench datasets, each with 10 queries that exercise ≥2 of DAB's 4 properties (multi-DB, ill-formatted joins, unstructured text transformation, domain knowledge).
  • query_cve: NVD CVE registry + CISA KEV catalog + CPE matches + free-text descriptions across 4 DBMSes (SQLite/DuckDB/Postgres/Mongo). Bounded to 2023–2024 NVD window (~71k CVEs, ~585k CPE matches, 1583 KEV entries).
  • query_usaspending: federal contract awards + SAM-derived entity registry + agency hierarchy + free-text descriptions across the same 4 DBMSes. ~9.9k FY2024-Q4 contracts sampled by start date (representative spread of award amounts so threshold filters actually filter).

Both datasets follow the existing convention — agent-visible corrupted DBs in query_dataset/, plus db_config.yaml, db_description.txt (clean schema only), db_description_withhint.txt (terse one-line corruption hints in HINTS: style matching query_yelp/query_crmarenapro), and query{1..10}/{query.json, ground_truth.csv, validate.py}. The canonical un-corrupted snapshot and corruption manifest are kept local-only (gitignored), matching what other datasets ship.

Corruption layers

Hash-deterministic transforms across both datasets:

  • cross-DB ID format mixing (CVE-id / award_id / UEI)
  • vendor/agency surface-form variant clustering
  • vendor-alias indirection in CPE
  • vulnerable_flag varied truthy/falsy tokens
  • version-encoding mix in CPE versions
  • English-description dropping for a deterministic subset
  • duplicate rows with conflicting metadata
  • packed comma-separated product lists in KEV
  • referential-integrity gaps from bounded NVD window
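The hash-deterministic pattern behind these layers can be sketched roughly as follows. The variant table, salt, and function name here are illustrative placeholders, not the actual clusters shipped in `corrupt.py`:

```python
import hashlib

# Hypothetical surface-form cluster -- the real variant tables live in corrupt.py.
VENDOR_VARIANTS = ["Microsoft", "MICROSOFT CORP", "Microsoft Corporation", "microsoft corp."]

def pick_variant(row_key: str, variants: list[str], salt: str = "vendor-v1") -> str:
    """Deterministically map a row key to one surface-form variant.

    Keying on a hash of (salt, row_key) means re-running the corruption
    pipeline reproduces the same corrupted DB bit-for-bit.
    """
    digest = hashlib.sha256(f"{salt}:{row_key}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]
```

The same key always yields the same variant across runs, so the corrupted snapshot is reproducible without storing per-row random state.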

Plus per-row LLM narrative corruption with roundtrip-classifier verification:

  • query_cve.cvss_metadata.score_text: gpt-4o narrative implying CVSS magnitude band, anchored on per-row vendor/product/vulnerability summary. 100% distinct phrasings across ~225 KEV-CVE rows.
  • query_cve.cve_documents.descriptions[].value: gpt-4o severity-as-prose narrative, anchored on the unique CVE description. 100% distinct across ~294 KEV-CVE rows.
  • query_usaspending.contract_amounts.amount_text: gpt-4o narrative implying dollar magnitude band, anchored on per-row recipient/agency/NAICS. 100% distinct across ~9.5k rows.

Verifier pattern: each generated narrative is classified back to its canonical band by an independent LLM call before being accepted. Mismatches retry up to 3 times. Final post-audit mismatch rate is <2% on cve, <1% on usaspending.
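The generate-classify-retry loop can be sketched as below; `generate_fn` and `classify_fn` stand in for the two independent LLM calls, and the actual implementation lives in `llm_corrupt.py`:

```python
def narrative_with_roundtrip(row, canonical_band, generate_fn, classify_fn, max_retries=3):
    """Accept a generated narrative only if an independent classifier
    maps it back to the canonical band; retry on mismatch.

    generate_fn(row, band) -> narrative text (LLM call #1)
    classify_fn(narrative) -> band label    (LLM call #2, independent)
    """
    for _ in range(max_retries):
        narrative = generate_fn(row, canonical_band)
        if classify_fn(narrative) == canonical_band:
            return narrative
    # Caller falls back to the deterministic templated format
    # (and/or flags the row for the post-audit pass).
    return None
```

Returning `None` rather than the last mismatched attempt is what keeps the post-audit mismatch rate low: only roundtrip-verified narratives ship.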

Sonnet pass@1 (plain mode)

| Dataset | pass@1 |
| --- | --- |
| query_cve | 3/10 = 30% |
| query_usaspending | 5/10 = 50% |

Both are well within DAB's hardness range (Gemini-3-Pro hits 38% on the published 12-dataset benchmark). The diverse, contextually anchored narratives defeat single-pass `CASE WHEN ... ILIKE` shortcuts on aggregation queries; agents have to do real per-row classification.

Test plan

  • `python sdk_runner/seed_dbs.py` loads PG (`cve_kev`, `usaspending_contracts`) and Mongo (`cve_descriptions`, `usaspending_descriptions`) without errors
  • `python sdk_runner/sweep.py --only cve --mode plain` and `--only usaspending --mode plain` both complete
  • `python sdk_runner/score.py --mode plain` reports the expected per-dataset and per-query pass/fail
  • Reviewer manually inspects 2-3 sample queries to confirm `query.json` / `ground_truth.csv` / `validate.py` follow benchmark conventions
  • Reviewer confirms the `db_description_withhint.txt` style matches `query_yelp/db_description_withhint.txt` (terse one-liners, no enumerated examples)

🤖 Generated with Claude Code

Shreya Shankar added 6 commits May 1, 2026 20:54
query_cve: NVD CVE registry + CISA KEV catalog + CPE matches + descriptions,
across 4 DBMSes (SQLite, DuckDB, Postgres, Mongo). Bounded to 2023-2024 NVD
window (~71k CVEs, ~585k CPE matches, 1583 KEV entries). 10 queries each
exercising at least 2 of DAB's 4 properties (multi-DB, ill-formatted joins,
unstructured text transformation, domain knowledge).

query_usaspending: federal contract awards + SAM-derived entity registry +
agency hierarchy + free-text descriptions, across the same 4 DBMSes.
~9.9k FY2024-Q4 contracts sampled by start date (representative spread of
amounts so threshold filters actually filter). 10 queries with the same
property coverage requirement.

Corruption layers (deterministic, hash-keyed unless noted):
  - cross-DB ID format mixing (CVE-id, award_id, UEI)
  - vendor / agency surface-form variant clustering
  - vendor-alias indirection in CPE
  - vulnerable_flag varied truthy/falsy tokens
  - version-encoding mix in CPE versions
  - English-description dropping for a deterministic subset
  - duplicate rows with conflicting metadata
  - LLM-generated per-row narrative obfuscation of CVSS scores (cve) and
    contract amounts (usaspending), with roundtrip-classifier verification:
    each generated narrative must be classified back to its canonical band
    by an independent LLM call before being accepted; mismatches retry up
    to 3 times. usaspending narratives anchor on per-row recipient/agency/
    NAICS context, yielding 99.9% distinct phrasings across ~9.5k rows.

Sonnet pass@1 (plain mode): cve ~85%, usaspending 50%.
Three 12 KB empty DuckDB/SQLite files were created at the query_usaspending/
top level (instead of query_dataset/) when Sonnet runs accidentally opened
DB connections from the wrong cwd. The real agencies.duckdb is in
query_dataset/agencies.duckdb.
The original gpt-4o-mini score narratives collapsed to ~70% distinct because
each row's prompt saw only the canonical band ('7.0-8.9') without
CVE-specific context. Switched to gpt-4o at temperature 0.95 with per-row
vendor / product / vulnerability-summary anchoring; narratives are now 100%
distinct and Sonnet can no longer cover them with a small CASE+ILIKE.
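The distinctness collapse described above is cheap to detect. A minimal sketch of the check (not the shipped audit code) is just the unique-verbatim fraction:

```python
def distinct_ratio(texts) -> float:
    """Fraction of narratives that are unique verbatim.

    A low ratio (e.g. ~0.70) signals the generator is reusing stock
    phrasings that an agent could cover with a small CASE+ILIKE.
    """
    texts = list(texts)
    return len(set(texts)) / len(texts) if texts else 0.0
```

Exact-match counting understates near-duplicates, but it catches the failure mode seen here, where the band-only prompt produced literally repeated sentences.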

Also tightened the band-specific prompt so phrasings like 'just shy of the
pinnacle' (which imply 9.0-10.0) cannot be generated for the 7.0-8.9 band.

Roundtrip verifier (canonical band -> classifier band match) is unchanged.
~225 of 321 KEV-CVE score rows are now narrative; the rest fall back to the
deterministic templated format from corrupt.py.

Sonnet pass@1 on cve dropped from ~85% to ~30% with the diverse narratives.
PROVENANCE.md documents source URLs, fetch dates, scope, DB engine
assignments, deterministic and LLM-driven corruption layer categories,
verifier audit statistics, and SHA-256 hashes of every shipped
agent-visible artifact. Reviewers can audit *what* was done without
needing the corruption recipe.

.gitignore additions match the existing convention used by
query_googlelocal / query_yelp / query_stockmarket / query_stockindex
(manual_querycode/) and the new clean/ canonical-snapshot pattern.
Reverses the earlier decision to gitignore manual_querycode/ for these
two datasets. Full reproducibility wins over hiding the corruption recipe
because: (a) the runtime sandbox already denies agents read access to
manual_querycode/ and clean/ via the SDK runner's pre-tool hooks, and
(b) reviewers can audit *how* the corruption was constructed, not just
*what* it touches.

What lands in the repo per dataset:
  - manual_querycode/fetch_clean.py     downloads canonical source data
  - manual_querycode/corrupt.py         deterministic transforms (cve has
                                        7 layers; usaspending has 6)
  - manual_querycode/llm_corrupt.py     LLM-driven per-row narrative with
                                        roundtrip-classifier verification
  - manual_querycode/audit_corruption.py (cve only) re-classifies every
                                        narrative; flags mismatches for
                                        regeneration
  - manual_querycode/compute_ground_truth.py  per-query GT functions

PROVENANCE.md is expanded to inline:
  - the LLM rewrite prompts (severity-as-prose for cve, score-as-prose
    for cve, amount-as-prose for usaspending)
  - the SEVERITY_MOTIFS map keyed by canonical severity
  - the band thresholds (USD bands for usaspending; CVSS bands for cve)
  - the verifier classifier prompts
  - representative ground-truth SQL/Python excerpts
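The CVSS bands referenced throughout are the standard CVSS v3.x qualitative severity scale; the USD bands below are placeholders, since the shipped thresholds are inlined in PROVENANCE.md rather than reproduced here:

```python
# Standard CVSS v3.x qualitative severity bands (inclusive ranges).
CVSS_BANDS = [
    (0.1, 3.9, "LOW"),
    (4.0, 6.9, "MEDIUM"),
    (7.0, 8.9, "HIGH"),
    (9.0, 10.0, "CRITICAL"),
]

def cvss_band(score: float) -> str:
    """Map a numeric CVSS score to its canonical band label."""
    for lo, hi, label in CVSS_BANDS:
        if lo <= score <= hi:
            return label
    raise ValueError(f"score out of band range: {score}")
```

This canonical-band function is the ground truth the roundtrip classifier must agree with: a narrative for a 7.5 CVE is rejected unless the classifier reads it back as the 7.0-8.9 band.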

clean/ stays gitignored (canonical source-of-truth + corruption manifest =
answer key).
…wer-key paths

PROVENANCE.md belongs alongside the construction code it documents;
moving it into manual_querycode/ groups them. Reviewers cd into
query_*/manual_querycode/ to read the recipe + the prose explanation
together.

.dockerignore prevents future container builds from baking in the
canonical pre-corruption snapshot (query_*/clean/) or the corruption
recipes (query_*/manual_querycode/) — defense in depth on top of the
runtime hooks in sdk_runner/run_sdk_agent.py that already deny agents
read access to those paths.

Also dropped the redundant __pycache__ gitignore lines (the global
__pycache__/ rule already covers them).
