Add query_cve and query_usaspending datasets (10 queries each) #43
Open
shreyashankar wants to merge 6 commits into main from
Conversation
added 6 commits
May 1, 2026 20:54
query_cve: NVD CVE registry + CISA KEV catalog + CPE matches + descriptions,
across 4 DBMSes (SQLite, DuckDB, Postgres, Mongo). Bounded to 2023-2024 NVD
window (~71k CVEs, ~585k CPE matches, 1583 KEV entries). 10 queries each
exercising at least 2 of DAB's 4 properties (multi-DB, ill-formatted joins,
unstructured text transformation, domain knowledge).
query_usaspending: federal contract awards + SAM-derived entity registry +
agency hierarchy + free-text descriptions, across the same 4 DBMSes.
~9.9k FY2024-Q4 contracts sampled by start date (representative spread of
amounts so threshold filters actually filter). 10 queries with the same
property coverage requirement.
Corruption layers (deterministic, hash-keyed unless noted):
- cross-DB ID format mixing (CVE-id, award_id, UEI)
- vendor / agency surface-form variant clustering
- vendor-alias indirection in CPE
- vulnerable_flag varied truthy/falsy tokens
- version-encoding mix in CPE versions
- English-description dropping for a deterministic subset
- duplicate rows with conflicting metadata
- LLM-generated per-row narrative obfuscation of CVSS scores (cve) and
contract amounts (usaspending), with roundtrip-classifier verification:
each generated narrative must be classified back to its canonical band
by an independent LLM call before being accepted; mismatches retry up
to 3 times. usaspending narratives anchor on per-row recipient/agency/NAICS
context, yielding 99.9% distinct phrasings across ~9.5k rows.
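The deterministic, hash-keyed layers above can be sketched as a pure function of the row key, so the same row always gets the same surface form without storing a per-row mapping. A minimal illustration (the function name, salt, and token pool are assumptions, not the repo's actual corrupt.py):

```python
import hashlib

# Illustrative token pool for the vulnerable_flag layer (assumption;
# the shipped corrupt.py may use a different set of variants)
FLAG_TOKENS = ["1", "0", "true", "false", "yes", "no", "Y", "N"]

def hash_pick(row_key: str, choices: list, salt: str = "vulnerable_flag") -> str:
    """Deterministically map a row key to one surface-form variant.

    Hashing (salt, key) makes the corruption reproducible: re-running
    the script produces byte-identical output with no stored state.
    """
    digest = hashlib.sha256(f"{salt}:{row_key}".encode()).digest()
    return choices[digest[0] % len(choices)]
```

Each layer would use a distinct salt so a given row's variants are independent across layers.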
Sonnet pass@1 (plain mode): cve ~85%, usaspending 50%.
Three 12KB empty DuckDB/SQLite files were created at the query_usaspending/ top level (instead of in query_dataset/) when Sonnet runs accidentally opened DB connections from the wrong cwd. The real agencies.duckdb lives at query_dataset/agencies.duckdb.
The original gpt-4o-mini score narratives collapsed to ~70% distinct because
each row's prompt only saw the canonical band ('7.0-8.9') without
CVE-specific context. Switched to gpt-4o at temp 0.95 with per-row vendor /
product / vulnerability-summary anchoring — narratives are now 100% distinct
and Sonnet can no longer cover them with a small CASE+ILIKE.
Also tightened the band-specific prompt so phrasings like 'just shy of the
pinnacle' (which imply 9.0-10.0) cannot be generated for the 7.0-8.9 band.
Roundtrip verifier (canonical band -> classifier band match) is unchanged.
~225 of 321 KEV-CVE score rows are now narrative; the rest fall back to the
deterministic templated format from corrupt.py.
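The accept/retry loop behind the roundtrip verifier can be sketched generically. Here `generate` and `classify` stand in for the two independent LLM calls (names and signature are illustrative, not the repo's llm_corrupt.py API):

```python
def generate_verified_narrative(generate, classify, canonical_band, max_retries=3):
    """Roundtrip-verified narrative generation (sketch).

    A narrative is accepted only if an independent classifier maps it
    back to the canonical band; otherwise regenerate, up to max_retries
    times. Returning None signals the caller to fall back to the
    deterministic templated format.
    """
    for _ in range(max_retries):
        text = generate(canonical_band)
        if classify(text) == canonical_band:
            return text
    return None
```

Keeping the generator and verifier as separate calls (ideally separate models or prompts) is what prevents the generator from grading its own output.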
Sonnet pass@1 on cve dropped from ~85% to ~30% with the diverse narratives.
PROVENANCE.md documents source URLs, fetch dates, scope, DB engine assignments, deterministic and LLM-driven corruption-layer categories, verifier audit statistics, and SHA-256 hashes of every shipped agent-visible artifact. Reviewers can audit *what* was done without needing the corruption recipe.

.gitignore additions match the existing convention used by query_googlelocal / query_yelp / query_stockmarket / query_stockindex (manual_querycode/) and the new clean/ canonical-snapshot pattern.
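A SHA-256 manifest like the one PROVENANCE.md ships could be produced along these lines (the function name and file patterns are assumptions, not the repo's actual artifact list):

```python
import hashlib
from pathlib import Path

def sha256_manifest(root: str, patterns=("*.sqlite", "*.duckdb", "*.csv", "*.json")):
    """Return {relative_path: sha256 hex digest} for agent-visible artifacts."""
    root_path = Path(root)
    manifest = {}
    for pattern in patterns:
        for path in sorted(root_path.rglob(pattern)):
            manifest[str(path.relative_to(root_path))] = hashlib.sha256(
                path.read_bytes()
            ).hexdigest()
    return manifest
```

Re-running the manifest after any regeneration and diffing it against PROVENANCE.md is how a reviewer confirms the shipped artifacts match the documented ones.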
Reverses the earlier decision to gitignore manual_querycode/ for these
two datasets. Full reproducibility wins over hiding the corruption recipe
because: (a) the runtime sandbox already blocks agents from reading
manual_querycode/ and clean/ via the SDK runner's pre-tool hooks, and
(b) reviewers can audit *how* the corruption was constructed, not just
*what* it touches.
What lands in the repo per dataset:
- manual_querycode/fetch_clean.py: downloads canonical source data
- manual_querycode/corrupt.py: deterministic transforms (cve has 7 layers; usaspending has 6)
- manual_querycode/llm_corrupt.py: LLM-driven per-row narratives with roundtrip-classifier verification
- manual_querycode/audit_corruption.py (cve only): re-classifies every narrative; flags mismatches for regeneration
- manual_querycode/compute_ground_truth.py: per-query GT functions
PROVENANCE.md is expanded to inline:
- the LLM rewrite prompts (severity-as-prose for cve, score-as-prose for cve, amount-as-prose for usaspending)
- the SEVERITY_MOTIFS map keyed by canonical severity
- the band thresholds (USD bands for usaspending; CVSS bands for cve)
- the verifier classifier prompts
- representative ground-truth SQL/Python excerpts
clean/ stays gitignored (canonical source-of-truth + corruption manifest =
answer key).
…wer-key paths

PROVENANCE.md belongs alongside the construction code it documents; moving it into manual_querycode/ groups them. Reviewers cd into query_*/manual_querycode/ to read the recipe and the prose explanation together.

.dockerignore prevents future container builds from baking in the canonical pre-corruption snapshot (query_*/clean/) or the corruption recipes (query_*/manual_querycode/): defense in depth on top of the runtime hooks in sdk_runner/run_sdk_agent.py that already block agents from reading those paths.

Also dropped the redundant __pycache__ gitignore lines (the global __pycache__/ rule already covers them).
Summary
query_cve: NVD CVE registry + CISA KEV catalog + CPE matches + free-text descriptions across 4 DBMSes (SQLite/DuckDB/Postgres/Mongo). Bounded to 2023–2024 NVD window (~71k CVEs, ~585k CPE matches, 1583 KEV entries).

query_usaspending: federal contract awards + SAM-derived entity registry + agency hierarchy + free-text descriptions across the same 4 DBMSes. ~9.9k FY2024-Q4 contracts sampled by start date (representative spread of award amounts so threshold filters actually filter).

Both datasets follow the existing convention: agent-visible corrupted DBs in query_dataset/, plus db_config.yaml, db_description.txt (clean schema only), db_description_withhint.txt (terse one-line corruption hints in HINTS: style matching query_yelp/query_crmarenapro), and query{1..10}/{query.json, ground_truth.csv, validate.py}. The canonical un-corrupted snapshot and corruption manifest are kept local-only (gitignored), matching what other datasets ship.

Corruption layers
Hash-deterministic transforms across both datasets:

- vulnerable_flag varied truthy/falsy tokens

Plus per-row LLM narrative corruption with roundtrip-classifier verification:

- query_cve.cvss_metadata.score_text: gpt-4o narrative implying CVSS magnitude band, anchored on per-row vendor/product/vulnerability summary. 100% distinct phrasings across ~225 KEV-CVE rows.
- query_cve.cve_documents.descriptions[].value: gpt-4o severity-as-prose narrative, anchored on the unique CVE description. 100% distinct across ~294 KEV-CVE rows.
- query_usaspending.contract_amounts.amount_text: gpt-4o narrative implying dollar magnitude band, anchored on per-row recipient/agency/NAICS. 100% distinct across ~9.5k rows.

Verifier pattern: each generated narrative is classified back to its canonical band by an independent LLM call before being accepted. Mismatches retry up to 3 times. Final post-audit mismatch rate is <2% on cve, <1% on usaspending.
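For the CVSS side of the canonical-band comparison, the bands are the standard NVD v3 qualitative severity ratings; a lookup like the following could serve as the canonical half of the verifier check (the USD bands for usaspending are not reproduced here, and this is a sketch, not the repo's code):

```python
# Standard CVSS v3 qualitative severity bands (per the NVD rating scale)
CVSS_BANDS = [
    (0.0, 0.0, "none"),
    (0.1, 3.9, "low"),
    (4.0, 6.9, "medium"),
    (7.0, 8.9, "high"),
    (9.0, 10.0, "critical"),
]

def cvss_band(score: float) -> str:
    """Map a CVSS v3 base score to its qualitative severity band."""
    for low, high, name in CVSS_BANDS:
        if low <= score <= high:
            return name
    raise ValueError(f"score out of range: {score}")
```

The verifier then accepts a narrative only when the classifier's band equals `cvss_band(canonical_score)`.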
Sonnet pass@1 (plain mode)
Both are well within DAB's hardness range (Gemini-3-Pro hits 38% on the published 12-dataset benchmark). The diverse, contextually anchored narratives defeat single-pass `CASE WHEN ... ILIKE` shortcuts on aggregation queries; agents have to do real per-row classification.
Test plan
🤖 Generated with Claude Code