Fully embedded distribution file dists.dss #10
Conversation
Add cmake/gen_dists.py, a Python script that runs at cmake configure time, parses dists.dss, and writes generated/dists_generated.c with: - 21 static set_member[] arrays with pre-computed cumulative weights (matching read_dist() logic exactly — no runtime parsing) - load_dists() that assigns pointers only: no malloc, no fopen, fully idempotent regardless of how many times it is called - dbgen_reset_seeds() (previously in tpch_init.c) CMakeLists.txt now runs the generator via execute_process() before add_subdirectory(third_party/dbgen) and registers CMAKE_CONFIGURE_DEPENDS on both dists.dss and gen_dists.py so cmake re-runs automatically when either file changes. third_party/dbgen/CMakeLists.txt replaces tpch_init.c with the generated file in the dbgen_objs source list. The dists.dss configure_file() copy is kept for users who invoke raw dbgen tools from the build directory, but the benchmark no longer requires the file to be present at runtime. Verified: all 8 TPC-H tables at SF=1 (--max-rows 0) produce correct row counts with dists.dss removed from the build directory. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
@codex review |
|
Codex Review: Didn't find any major issues. What shall we delve into next? ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
The compile-time embedding removed the ability to point load_dists() at a custom dists.dss file via the DSS_DIST environment variable. Restore it: if DSS_DIST is set, load_dists() sets the d_path global to the env-var value and calls read_dist() for every distribution, then restores d_path to NULL. Using d_path (rather than passing env_dist as the path argument) is the correct mechanism: read_dist() opens d_path directly when it is non-NULL, bypassing the CONFIG_DFLT prefix logic that would otherwise turn an absolute path into ".//abs/...". When DSS_DIST is not set the function falls through to the embedded static arrays as before — no file I/O, no malloc. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR embeds all TPC-H distribution data from dists.dss as static C arrays at CMake configure time, eliminating runtime file I/O and heap allocation for distribution loading. A Python script (cmake/gen_dists.py) parses dists.dss, pre-computes cumulative weights, and generates a C source file (dists_generated.c) that replaces the former tpch_init.c.
Changes:
- Add
cmake/gen_dists.py— a configure-time code generator that parsesdists.dssand emits 21 staticset_member[]arrays plusload_dists()anddbgen_reset_seeds()implementations - Update root
CMakeLists.txtto add afind_package(Python3 REQUIRED), run the generator viaexecute_process(), and registerCMAKE_CONFIGURE_DEPENDSon both input files - Update
third_party/dbgen/CMakeLists.txtto replacetpch_init.cwith the generateddists_generated.cin thedbgen_objssource list
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
cmake/gen_dists.py |
New Python script that parses dists.dss and generates the embedded C distribution data |
CMakeLists.txt |
Adds Python3 dependency, runs the generator at configure time, registers file dependencies |
third_party/dbgen/CMakeLists.txt |
Replaces tpch_init.c source with the generated dists_generated.c |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Add cmake/gen_dists.py, a Python script that runs at cmake configure time, parses dists.dss, and writes generated/dists_generated.c with:
CMakeLists.txt now runs the generator via execute_process() before add_subdirectory(third_party/dbgen) and registers CMAKE_CONFIGURE_DEPENDS on both dists.dss and gen_dists.py so cmake re-runs automatically when either file changes.
third_party/dbgen/CMakeLists.txt replaces tpch_init.c with the generated file in the dbgen_objs source list.
The dists.dss configure_file() copy is kept for users who invoke raw dbgen tools from the build directory, but the benchmark no longer requires the file to be present at runtime.
Verified: all 8 TPC-H tables at SF=1 (--max-rows 0) produce correct row counts with dists.dss removed from the build directory.