28 changes: 28 additions & 0 deletions .dockerignore
@@ -0,0 +1,28 @@
# Never bake the dataset answer keys or construction recipes into a runtime
# container image — agents running inside the container must not be able to
# read the corruption pipeline or the canonical pre-corruption snapshot.

# Canonical un-corrupted snapshots and corruption manifests (= answer keys)
query_*/clean/

# Per-dataset construction code (corruption recipes, ground-truth SQL,
# verifier prompts). The agent-visible DBs in query_*/query_dataset/ are
# already corrupted; the manual_querycode/ dir holds the recipes that
# produced them.
query_*/manual_querycode/

# Local results / traces / scratch
sdk_runner/results/
sdk_runner/results_backups/
.venv/
.codex/
.claude/
.env
__pycache__/
*.pyc

# VCS / editor noise
.git/
.gitignore
.gitattributes
.DS_Store
7 changes: 6 additions & 1 deletion .gitignore
@@ -78,4 +78,9 @@ query_stockmarket/query_dataset/stockmarket_symboldefinition/
query_stockmarket/metadata_mean.txt
query_yelp/ground_truth_dataset/
query_yelp/manual_querycode/
query_yelp/query*/ground_truth.py
query_yelp/query*/ground_truth.py
# query_cve / query_usaspending: canonical pre-corruption snapshot + corruption
# manifest are the answer key — keep local-only. Construction code in
# manual_querycode/ IS shipped (see PROVENANCE.md) for full reproducibility.
query_cve/clean/
query_usaspending/clean/
15 changes: 15 additions & 0 deletions query_cve/db_config.yaml
@@ -0,0 +1,15 @@
db_clients:
  vulns_db:
    db_type: sqlite
    db_path: query_dataset/vulns.db
  cpe_db:
    db_type: duckdb
    db_path: query_dataset/cpe.duckdb
  kev_db:
    db_type: postgres
    db_name: cve_kev
    sql_file: query_dataset/kev.sql
  descriptions_db:
    db_type: mongo
    db_name: cve_descriptions
    dump_folder: query_dataset/descriptions
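
Each client in this config names its on-disk artifact under a different key (`db_path`, `sql_file`, or `dump_folder`). A minimal sketch of resolving those artifacts follows. It assumes the paths are relative to the `query_cve/` directory, and `resolve_artifacts` and `ARTIFACT_KEY` are hypothetical names for illustration, not part of the repository.

```python
from pathlib import Path

# Mirrors query_cve/db_config.yaml as a plain dict so the sketch needs no YAML parser.
DB_CONFIG = {
    "db_clients": {
        "vulns_db": {"db_type": "sqlite", "db_path": "query_dataset/vulns.db"},
        "cpe_db": {"db_type": "duckdb", "db_path": "query_dataset/cpe.duckdb"},
        "kev_db": {
            "db_type": "postgres",
            "db_name": "cve_kev",
            "sql_file": "query_dataset/kev.sql",
        },
        "descriptions_db": {
            "db_type": "mongo",
            "db_name": "cve_descriptions",
            "dump_folder": "query_dataset/descriptions",
        },
    }
}

# Hypothetical lookup: which config key holds the artifact path for each db_type.
ARTIFACT_KEY = {
    "sqlite": "db_path",
    "duckdb": "db_path",
    "postgres": "sql_file",
    "mongo": "dump_folder",
}


def resolve_artifacts(config, dataset_root):
    """Return {client_name: (db_type, artifact_path)} for every configured client."""
    root = Path(dataset_root)  # assumption: config paths are relative to this directory
    resolved = {}
    for name, client in config["db_clients"].items():
        key = ARTIFACT_KEY[client["db_type"]]
        resolved[name] = (client["db_type"], root / client[key])
    return resolved
```

Calling `resolve_artifacts(DB_CONFIG, "query_cve")` gives one `(db_type, path)` pair per client, which a loader could then dispatch on.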
76 changes: 76 additions & 0 deletions query_cve/db_description.txt
@@ -0,0 +1,76 @@
You are working with four databases to solve this query.

Here are the descriptions of these four databases:

1. vulns_db
  - This database is stored in a SQLite database. It contains the structured NVD CVE registry: per-CVE publication metadata, CVSS v3 attack-vector information, and a sibling table holding the per-CVE CVSS score.
  - This database consists of two tables:
    - cves
      - Core CVE registry.
      - Fields:
        - cve_id (str): CVE identifier (e.g. "CVE-2023-12345")
        - published (str): ISO timestamp when the CVE was published
        - last_modified (str): ISO timestamp of last NVD modification
        - vuln_status (str): NVD workflow status (e.g. "Analyzed", "Modified")
        - cvss3_attack_vector (str or null): NETWORK / ADJACENT_NETWORK / LOCAL / PHYSICAL
    - cvss_metadata
      - Per-CVE CVSS v3 base score.
      - Fields:
        - cve_id (str): CVE identifier
        - score_text (str): Score and severity classification

2. cpe_db
  - This database is stored in a DuckDB database. It contains the Common Platform Enumeration (CPE) match list (which products and versions are affected by each CVE), plus a vendor reference table and a version-details table.
  - This database consists of three tables:
    - cpe_matches
      - One row per (CVE, affected product configuration) pair.
      - Fields:
        - cve_id (str): CVE identifier
        - criteria (str): Product identifier string
        - vulnerable_flag (str): Indicates whether the configuration is vulnerable
    - vendor_aliases
      - Vendor reference table.
      - Fields:
        - alias (str)
        - canonical_vendor (str): Canonical lowercase vendor name (e.g. "apache", "microsoft")
    - cpe_version_details
      - Per-(CVE, criteria) version information.
      - Fields:
        - cve_id (str): CVE identifier
        - criteria (str): The CPE criteria string this version detail belongs to
        - version_text (str or null): Affected version
        - version_start_inc (str or null): Inclusive lower bound of affected versions
        - version_start_exc (str or null): Exclusive lower bound of affected versions
        - version_end_inc (str or null): Inclusive upper bound of affected versions
        - version_end_exc (str or null): Exclusive upper bound of affected versions

3. kev_db
  - This database is stored in a PostgreSQL database. It contains the CISA Known Exploited Vulnerabilities (KEV) catalog: CVEs that have been exploited in the wild.
  - This database consists of one table:
    - kev_entries
      - One row per KEV-listed vulnerability.
      - Fields:
        - cve_ref (str): CVE identifier as supplied by CISA
        - vendor_project (str): Vendor or project responsible for the affected product
        - products_csv (str): Affected product name(s)
        - vulnerability_name (str): Short human-readable name
        - date_added (str): Date the CVE was added to the KEV catalog
        - short_description (str): Brief description of the vulnerability
        - required_action (str): Action CISA requires affected agencies to take
        - due_date (str): Deadline by which the required action must be completed
        - known_ransomware_use (str or null): "Known", "Unknown", or null
        - notes (str or null): Additional notes

4. descriptions_db
  - This database is stored in a MongoDB database. It contains free-text descriptions and external references for each CVE.
  - This database consists of one collection:
    - cve_documents
      - One document per CVE with embedded descriptions[] and references[].
      - Fields:
        - cve (str): CVE identifier
        - descriptions (list of dict): Free-text descriptions in one or more languages, each with:
          - language (str): ISO language code (e.g. "en", "es")
          - value (str): Description prose
        - references (list of dict): External reference URLs, each with:
          - url (str): Reference URL
          - source (str or null): Source organization that supplied the reference
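
The embedded `descriptions[]` layout above can be read with a small helper. This is a sketch only: `pick_description` is a hypothetical name, the sample document is invented for illustration, and falling back to the first available language is an assumption about how a caller might want to handle a missing preferred language.

```python
def pick_description(doc, preferred="en"):
    """Return the description text in the preferred language, else the first available one."""
    descriptions = doc.get("descriptions") or []
    for entry in descriptions:
        if entry.get("language") == preferred:
            return entry.get("value")
    # Assumption: any description is better than none when the preferred language is absent.
    return descriptions[0].get("value") if descriptions else None


# Invented sample shaped like a cve_documents entry (not real dataset content).
sample_doc = {
    "cve": "CVE-2023-12345",
    "descriptions": [
        {"language": "es", "value": "Descripcion de ejemplo."},
        {"language": "en", "value": "Example description."},
    ],
    "references": [{"url": "https://example.com/advisory", "source": None}],
}
```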
12 changes: 12 additions & 0 deletions query_cve/db_description_withhint.txt
@@ -0,0 +1,12 @@
HINTS:
- CVE identifiers appear under multiple surface-form variants across the four databases. Cross-DB joins on CVE id require canonicalization.
- cpe_db.cpe_matches.criteria contains a vendor alias rather than the canonical vendor name; cpe_db.vendor_aliases resolves aliases to canonical vendor names.
- cpe_db.cpe_matches.vulnerable_flag is free-text — multiple truthy and falsy tokens appear; normalization is needed.
- cpe_db.cpe_version_details.version_text is stored in a mixed encoding; equivalent versions may not match by string equality.
- kev_db.kev_entries.vendor_project may appear under multiple surface forms for the same canonical vendor; clustering is needed for grouping or counting.
- kev_db.kev_entries.products_csv may contain comma-separated lists; split to recover individual products.
- Some KEV entries reference CVEs that are not present in vulns_db.
- vulns_db.cves contains a small subset of CVEs that appear in more than one row with conflicting attribute values.
- vulns_db.cvss_metadata.score_text mixes templated and narrative encodings; the numeric score must be inferred from prose for narrative rows.
- descriptions_db.cve_documents.descriptions[].value: severity classification words do not appear literally in any English description; severity is encoded either as an opaque tagline or implied by narrative content.
- A subset of CVEs lack an English description and carry only a non-English one.
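
Several of the hints above reduce to small normalization helpers. The sketch below is illustrative only. The hints do not list the actual surface forms, so the regular expression (case, separator, and whitespace variants of `CVE-YYYY-NNNN`) and the truthy and falsy token sets are assumptions, and all function names are hypothetical.

```python
import re

# Assumption: variants differ only in case, separators, and surrounding text.
_CVE_RE = re.compile(r"cve\D*(\d{4})\D*(\d{4,})", re.IGNORECASE)

# Assumption: these token sets are guesses; inspect the real column before relying on them.
_TRUTHY = {"true", "t", "yes", "y", "1"}
_FALSY = {"false", "f", "no", "n", "0"}


def canonical_cve_id(raw):
    """Map a CVE id variant to 'CVE-YYYY-NNNN', or None if no id is recognized."""
    match = _CVE_RE.search(raw)
    if match is None:
        return None
    year, sequence = match.groups()
    return f"CVE-{year}-{sequence}"


def normalize_flag(raw):
    """Map a free-text vulnerable_flag token to True, False, or None if unrecognized."""
    token = raw.strip().lower()
    if token in _TRUTHY:
        return True
    if token in _FALSY:
        return False
    return None


def split_products(products_csv):
    """Split a products_csv value into trimmed, non-empty product names."""
    return [part.strip() for part in products_csv.split(",") if part.strip()]
```

Returning `None` for unrecognized input, rather than guessing, keeps unexpected variants visible so they can be counted and handled explicitly before a cross-DB join.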