ucbepic · shreyashankar · May 2, 2026
diff --git a/.dockerignore b/.dockerignore
@@ -0,0 +1,28 @@
+# Never bake the dataset answer keys or construction recipes into a runtime
+# container image — agents running inside the container must not be able to
+# read the corruption pipeline or the canonical pre-corruption snapshot.
+
+# Canonical un-corrupted snapshots and corruption manifests (= answer keys)
+query_*/clean/
+
+# Per-dataset construction code (corruption recipes, ground-truth SQL,
+# verifier prompts). The agent-visible DBs in query_*/query_dataset/ are
+# already corrupted; the manual_querycode/ dir holds the recipes that
+# produced them.
+query_*/manual_querycode/
+
+# Local results / traces / scratch
+sdk_runner/results/
+sdk_runner/results_backups/
+.venv/
+.codex/
+.claude/
+.env
+__pycache__/
+*.pyc
+
+# VCS / editor noise
+.git/
+.gitignore
+.gitattributes
+.DS_Store
diff --git a/.gitignore b/.gitignore
@@ -78,4 +78,9 @@ query_stockmarket/query_dataset/stockmarket_symboldefinition/
 query_stockmarket/metadata_mean.txt
 query_yelp/ground_truth_dataset/
 query_yelp/manual_querycode/
-query_yelp/query*/ground_truth.py
+query_yelp/query*/ground_truth.py
+# query_cve / query_usaspending: canonical pre-corruption snapshot + corruption
+# manifest are the answer key — keep local-only. Construction code in
+# manual_querycode/ IS shipped (see PROVENANCE.md) for full reproducibility.
+query_cve/clean/
+query_usaspending/clean/
diff --git a/query_cve/db_config.yaml b/query_cve/db_config.yaml
@@ -0,0 +1,15 @@
+db_clients:
+  vulns_db:
+    db_type: sqlite
+    db_path: query_dataset/vulns.db
+  cpe_db:
+    db_type: duckdb
+    db_path: query_dataset/cpe.duckdb
+  kev_db:
+    db_type: postgres
+    db_name: cve_kev
+    sql_file: query_dataset/kev.sql
+  descriptions_db:
+    db_type: mongo
+    db_name: cve_descriptions
+    dump_folder: query_dataset/descriptions
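Review note: the `db_path` values above are relative, so whatever loads this config has to resolve them against a dataset root. A minimal sketch of how the `vulns_db` entry could be opened read-only, so that a query cannot modify the database file. The helper name `open_sqlite_readonly` and the choice of `query_cve/` as the root are assumptions for illustration, not the repo's actual loader.

```python
import sqlite3
from pathlib import Path


def open_sqlite_readonly(dataset_root, db_path):
    """Open a SQLite file read-only (hypothetical helper, not from the repo).

    dataset_root: directory that the relative db_path is resolved against
                  (assumed here to be the query_cve/ directory).
    db_path:      value of db_clients.vulns_db.db_path from db_config.yaml.
    """
    path = (Path(dataset_root) / db_path).resolve()
    # mode=ro makes SQLite reject writes; uri=True enables URI filenames.
    return sqlite3.connect(f"{path.as_uri()}?mode=ro", uri=True)
```

With this, `open_sqlite_readonly("query_cve", "query_dataset/vulns.db")` would return a connection on which `INSERT`/`UPDATE` statements raise `sqlite3.OperationalError`.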
diff --git a/query_cve/db_description.txt b/query_cve/db_description.txt
@@ -0,0 +1,76 @@
+You are working with four databases to solve this query.
+
+Here are the descriptions of these four databases:
+
+1. vulns_db
+   - This database is stored in a SQLite database. It contains the structured NVD CVE registry: per-CVE publication metadata, CVSS v3 attack-vector information, and a sibling table holding the per-CVE CVSS score.
+   - This database consists of two tables:
+     - cves
+       - Core CVE registry.
+       - Fields:
+         - cve_id (str): CVE identifier (e.g. "CVE-2023-12345")
+         - published (str): ISO timestamp when the CVE was published
+         - last_modified (str): ISO timestamp of last NVD modification
+         - vuln_status (str): NVD workflow status (e.g. "Analyzed", "Modified")
+         - cvss3_attack_vector (str or null): NETWORK / ADJACENT_NETWORK / LOCAL / PHYSICAL
+     - cvss_metadata
+       - Per-CVE CVSS v3 base score.
+       - Fields:
+         - cve_id (str): CVE identifier
+         - score_text (str): Score and severity classification
+
+2. cpe_db
+   - This database is stored in a DuckDB database. It contains the Common Platform Enumeration (CPE) match list (which products and versions are affected by each CVE), plus a vendor reference table and a version-details table.
+   - This database consists of three tables:
+     - cpe_matches
+       - One row per (CVE, affected product configuration) pair.
+       - Fields:
+         - cve_id (str): CVE identifier
+         - criteria (str): Product identifier string
+         - vulnerable_flag (str): Indicates whether the configuration is vulnerable
+     - vendor_aliases
+       - Vendor reference table.
+       - Fields:
+         - alias (str)
+         - canonical_vendor (str): Canonical lowercase vendor name (e.g. "apache", "microsoft")
+     - cpe_version_details
+       - Per-(CVE, criteria) version information.
+       - Fields:
+         - cve_id (str): CVE identifier
+         - criteria (str): The CPE criteria string this version detail belongs to
+         - version_text (str or null): Affected version
+         - version_start_inc (str or null): Inclusive lower bound of affected versions
+         - version_start_exc (str or null): Exclusive lower bound of affected versions
+         - version_end_inc (str or null): Inclusive upper bound of affected versions
+         - version_end_exc (str or null): Exclusive upper bound of affected versions
+
+3. kev_db
+   - This database is stored in a PostgreSQL database. It contains the CISA Known Exploited Vulnerabilities (KEV) catalog — CVEs that have been exploited in the wild.
+   - This database consists of one table:
+     - kev_entries
+       - One row per KEV-listed vulnerability.
+       - Fields:
+         - cve_ref (str): CVE identifier as supplied by CISA
+         - vendor_project (str): Vendor or project responsible for the affected product
+         - products_csv (str): Affected product name(s)
+         - vulnerability_name (str): Short human-readable name
+         - date_added (str): Date the CVE was added to the KEV catalog
+         - short_description (str): Brief description of the vulnerability
+         - required_action (str): Action CISA requires affected agencies to take
+         - due_date (str): Deadline by which required action must be completed
+         - known_ransomware_use (str or null): "Known", "Unknown", or null
+         - notes (str or null): Additional notes
+
+4. descriptions_db
+   - This database is stored in a MongoDB database. It contains free-text descriptions and external references for each CVE.
+   - This database consists of one collection:
+     - cve_documents
+       - One document per CVE with embedded descriptions[] and references[].
+       - Fields:
+         - cve (str): CVE identifier
+         - descriptions (list of dict): Free-text descriptions in one or more languages, each with:
+           - language (str): ISO language code (e.g. "en", "es")
+           - value (str): Description prose
+         - references (list of dict): External reference URLs, each with:
+           - url (str): Reference URL
+           - source (str or null): Source organization that supplied the reference
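Review note: the `cve_documents` shape described above (a `descriptions` list of `{language, value}` dicts) can be read as in the sketch below. The function name `pick_description` and the fall-back-to-first-entry policy are assumptions for illustration; the description file does not prescribe how to choose among languages.

```python
def pick_description(doc, preferred="en"):
    """Return (language, value) for the preferred language.

    Falls back to the first available entry when the preferred language is
    absent, and to (None, None) when the document has no descriptions.
    Hypothetical helper; field names follow db_description.txt.
    """
    entries = doc.get("descriptions") or []
    for entry in entries:
        if (entry.get("language") or "").lower() == preferred:
            return entry["language"], entry.get("value")
    if entries:
        first = entries[0]
        return first.get("language"), first.get("value")
    return None, None
```

Returning the language alongside the text lets the caller tell whether it received the preferred language or a fallback.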
diff --git a/query_cve/db_description_withhint.txt b/query_cve/db_description_withhint.txt
@@ -0,0 +1,12 @@
+HINTS:
+- CVE identifiers appear under multiple surface-form variants across the four databases. Cross-DB joins on CVE id require canonicalization.
+- cpe_db.cpe_matches.criteria contains a vendor alias rather than the canonical vendor name; cpe_db.vendor_aliases resolves aliases to canonical vendor names.
+- cpe_db.cpe_matches.vulnerable_flag is free-text — multiple truthy and falsy tokens appear; normalization is needed.
+- cpe_db.cpe_version_details.version_text is stored in a mixed encoding; equivalent versions may not match by string equality.
+- kev_db.kev_entries.vendor_project may appear under multiple surface forms for the same canonical vendor; clustering is needed for grouping or counting.
+- kev_db.kev_entries.products_csv may contain comma-separated lists; split to recover individual products.
+- Some KEV entries reference CVEs that are not present in vulns_db.
+- vulns_db.cves contains a small subset of CVEs that appear in more than one row with conflicting attribute values.
+- vulns_db.cvss_metadata.score_text mixes templated and narrative encodings; the numeric score must be inferred from prose for narrative rows.
+- descriptions_db.cve_documents.descriptions[].value: severity classification words do not appear literally in any English description; severity is encoded either as an opaque tagline or implied by narrative content.
+- A subset of CVEs are missing the English description and only carry a non-English description.
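Review note: three of the hints above (CVE id canonicalization, `vulnerable_flag` normalization, `products_csv` splitting) amount to small string normalizers. A minimal sketch follows. The hints do not enumerate the actual surface forms or tokens, so the variants handled here (case differences, whitespace, `_` or space in place of `-`, and the truthy/falsy token sets) are assumptions, and all function names are hypothetical.

```python
import re

# Assumed id variants: any letter case, with '-', '_', whitespace, or nothing
# between the "CVE" prefix, the 4-digit year, and the sequence number.
_CVE_RE = re.compile(r"\bCVE[\s_\-]*(\d{4})[\s_\-]*(\d{4,})\b", re.IGNORECASE)

# Assumed token sets; the real data may use others.
_TRUTHY = {"true", "t", "yes", "y", "1"}
_FALSY = {"false", "f", "no", "n", "0"}


def canonical_cve_id(raw):
    """Map a CVE id variant to the 'CVE-YYYY-NNNN' form, or None if unparseable."""
    if raw is None:
        return None
    match = _CVE_RE.search(raw)
    if match is None:
        return None
    return f"CVE-{match.group(1)}-{match.group(2)}"


def normalize_vulnerable_flag(raw):
    """Return True/False for recognized tokens, None for anything else."""
    token = (raw or "").strip().lower()
    if token in _TRUTHY:
        return True
    if token in _FALSY:
        return False
    return None


def split_products(raw):
    """Split a products_csv value on commas, dropping empty pieces."""
    return [part.strip() for part in (raw or "").split(",") if part.strip()]
```

Returning `None` for unrecognized input, rather than guessing, keeps unexpected surface forms visible so they can be counted and handled explicitly before a cross-DB join.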