Skip to content

Fully embedded distribution file dists.dss #10

Merged
tsafin merged 2 commits intomasterfrom
tsafin/embedded_dss
Mar 3, 2026
Merged

Fully embedded distribution file dists.dss #10
tsafin merged 2 commits intomasterfrom
tsafin/embedded_dss

Conversation

@tsafin
Copy link
Owner

@tsafin tsafin commented Mar 3, 2026

Add cmake/gen_dists.py, a Python script that runs at cmake configure time, parses dists.dss, and writes generated/dists_generated.c with:

  • 21 static set_member[] arrays with pre-computed cumulative weights (matching read_dist() logic exactly — no runtime parsing)
  • load_dists() that assigns pointers only: no malloc, no fopen, fully idempotent regardless of how many times it is called
  • dbgen_reset_seeds() (previously in tpch_init.c)

CMakeLists.txt now runs the generator via execute_process() before add_subdirectory(third_party/dbgen) and registers CMAKE_CONFIGURE_DEPENDS on both dists.dss and gen_dists.py so cmake re-runs automatically when either file changes.

third_party/dbgen/CMakeLists.txt replaces tpch_init.c with the generated file in the dbgen_objs source list.

The dists.dss configure_file() copy is kept for users who invoke raw dbgen tools from the build directory, but the benchmark no longer requires the file to be present at runtime.

Verified: all 8 TPC-H tables at SF=1 (--max-rows 0) produce correct row counts with dists.dss removed from the build directory.

Add cmake/gen_dists.py, a Python script that runs at cmake configure
time, parses dists.dss, and writes generated/dists_generated.c with:
- 21 static set_member[] arrays with pre-computed cumulative weights
  (matching read_dist() logic exactly — no runtime parsing)
- load_dists() that assigns pointers only: no malloc, no fopen,
  fully idempotent regardless of how many times it is called
- dbgen_reset_seeds() (previously in tpch_init.c)

CMakeLists.txt now runs the generator via execute_process() before
add_subdirectory(third_party/dbgen) and registers CMAKE_CONFIGURE_DEPENDS
on both dists.dss and gen_dists.py so cmake re-runs automatically when
either file changes.

third_party/dbgen/CMakeLists.txt replaces tpch_init.c with the generated
file in the dbgen_objs source list.

The dists.dss configure_file() copy is kept for users who invoke raw
dbgen tools from the build directory, but the benchmark no longer
requires the file to be present at runtime.

Verified: all 8 TPC-H tables at SF=1 (--max-rows 0) produce correct
row counts with dists.dss removed from the build directory.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@tsafin tsafin requested a review from Copilot March 3, 2026 22:40
@tsafin
Copy link
Owner Author

tsafin commented Mar 3, 2026

@codex review

@chatgpt-codex-connector
Copy link

Codex Review: Didn't find any major issues. What shall we delve into next?

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

The compile-time embedding removed the ability to point load_dists() at
a custom dists.dss file via the DSS_DIST environment variable.

Restore it: if DSS_DIST is set, load_dists() sets the d_path global to
the env-var value and calls read_dist() for every distribution, then
restores d_path to NULL.  Using d_path (rather than passing env_dist as
the path argument) is the correct mechanism: read_dist() opens d_path
directly when it is non-NULL, bypassing the CONFIG_DFLT prefix logic
that would otherwise turn an absolute path into ".//abs/...".

When DSS_DIST is not set the function falls through to the embedded
static arrays as before — no file I/O, no malloc.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR embeds all TPC-H distribution data from dists.dss as static C arrays at CMake configure time, eliminating runtime file I/O and heap allocation for distribution loading. A Python script (cmake/gen_dists.py) parses dists.dss, pre-computes cumulative weights, and generates a C source file (dists_generated.c) that replaces the former tpch_init.c.

Changes:

  • Add cmake/gen_dists.py — a configure-time code generator that parses dists.dss and emits 21 static set_member[] arrays plus load_dists() and dbgen_reset_seeds() implementations
  • Update root CMakeLists.txt to add a find_package(Python3 REQUIRED), run the generator via execute_process(), and register CMAKE_CONFIGURE_DEPENDS on both input files
  • Update third_party/dbgen/CMakeLists.txt to replace tpch_init.c with the generated dists_generated.c in the dbgen_objs source list

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
cmake/gen_dists.py New Python script that parses dists.dss and generates the embedded C distribution data
CMakeLists.txt Adds Python3 dependency, runs the generator at configure time, registers file dependencies
third_party/dbgen/CMakeLists.txt Replaces tpch_init.c source with the generated dists_generated.c

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@tsafin tsafin merged commit 1cf293c into master Mar 3, 2026
55 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants