Conversation
tsafin
commented
Mar 6, 2026
- Modified third_party/tpcds for embedding scenarios
- Create wrappers for several first tables
Adds a complete tpcds_benchmark executable (built with -DTPCDS_ENABLE=ON) that generates TPC-DS store_sales and inventory tables into Parquet, CSV, ORC, Lance, Paimon, or Iceberg via the existing tpch::WriterInterface. Build infrastructure (third_party/dsdgen/): - CMakeLists.txt: mkheader → tables.h/streams.h/columns.h → distcomp → tpcds.idx → gen_dsts.py → dsts_generated.c → dsdgen_objs object library - dsdgen_stubs.c: thin file-I/O stubs (not needed in embedded mode) - tpcds_dsdgen.h: C++-safe forward declarations that bypass the fragile config.h/porting.h LINUX/HUGE_TYPE dependency chain; defines ds_key_t, decimal_t, ds_pricing_t, W_STORE_SALES_TBL, W_INVENTORY_TBL, and the EMBEDDED_DSDGEN callback globals - cmake/gen_dsts.py: embeds tpcds.idx as a C byte array (dsts_generated.c) C++ wrappers (include/tpch/, src/dsdgen/): - dsdgen_wrapper.hpp/.cpp: DSDGenWrapper class — initialises dsdgen from the embedded tpcds.idx (via mkstemp temp file), exposes Arrow schemas for STORE_SALES and INVENTORY, and drives generation via the EMBEDDED_DSDGEN callback trampoline (store_sales is master-detail: 8-16 line items per ticket; callback fires once per line item) - dsdgen_converter.hpp/.cpp: append_*_to_builders() helpers mapping the dsdgen C structs to Arrow array builders; dec_to_double() correctly converts decimal_t scaled integers (avoids buggy upstream dectoflt) Executable (src/tpcds_main.cpp): - CLI mirrors tpch_benchmark: --format, --table, --scale-factor, --output-dir, --max-rows, --verbose - Batched Arrow generation loop (10 000 rows/batch) → writer->write_batch third_party/tpcds submodule updated to branch tpcds_cpp_embedded which adds EMBEDDED_DSDGEN guards to w_store_sales.c (callback + suppressed file I/O for both store_sales and store_returns output files). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hers) Implements the following tables in tpcds_benchmark: catalog_sales — master-detail (callback-based, 4-14 line items/ticket) web_sales — master-detail (callback-based, 8-16 line items/ticket) customer — dimension table (simple, 100K rows at SF=1) item — dimension table (simple, ~18K rows at SF=1) date_dim — dimension table (simple, 73,049 fixed rows) store_returns — driven through store_sales ticket loop catalog_returns — driven through catalog_sales ticket loop web_returns — driven through web_sales ticket loop Changes: - tpcds_dsdgen.h: added struct definitions for all 8 new table row types, table ID constants, mk_w_* function declarations, and EMBEDDED_DSDGEN extern callback declarations for catalog_sales and web_sales - dsdgen_wrapper.hpp: added generate_* method declarations for 8 tables - dsdgen_wrapper.cpp: Arrow schemas, trampolines (catalog/web sales), generate_* implementations; returns tables are driven through their parent sales loop using a no-op callback to suppress file output while populating the global sales struct for valid return references - dsdgen_converter.hpp/.cpp: Arrow builder converters for all 8 tables; updated append_dsdgen_row_to_builders() dispatcher - tpcds_main.cpp: updated parse_table(), generation dispatch, and usage text to reflect Phase 3 completion - third_party/tpcds submodule: updated to include w_catalog_sales.c and w_web_sales.c EMBEDDED_DSDGEN callback patches All 10 implemented tables verified: correct row counts, sane values. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
@codex review |
There was a problem hiding this comment.
Pull request overview
Adds TPC-DS data generation support by vendoring the upstream tools, embedding distribution data, and exposing a new tpcds_benchmark executable.
Changes:
- Adds a
third_party/tpcdssubmodule and a CMake build for compiling/embedding TPC-DSdsdgen. - Introduces a C++
DSDGenWrapper+ Arrow conversion layer for several initial TPC-DS tables. - Adds a
TPCDS_ENABLEbuild flag and wires a newtpcds_benchmarktarget into the top-level build.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| third_party/tpcds | Adds the TPC-DS tools submodule pin for dsdgen sources. |
| third_party/dsdgen/tpcds_dsdgen.h | Provides a C++-friendly API surface for calling into dsdgen objects. |
| third_party/dsdgen/dsdgen_stubs.c | Placeholder TU for embed-mode stubs (currently comment-only). |
| third_party/dsdgen/CMakeLists.txt | Builds mkheader, distcomp, embeds tpcds.idx, and compiles dsdgen as objects. |
| src/tpcds_main.cpp | New tpcds_benchmark CLI that generates Arrow batches and writes using existing writer backends. |
| src/dsdgen/dsdgen_wrapper.cpp | Initializes embedded distributions + exposes per-table generation APIs with callbacks. |
| src/dsdgen/dsdgen_converter.cpp | Converts dsdgen structs into Arrow builders for supported tables. |
| include/tpch/dsdgen_wrapper.hpp | Public wrapper API + table enum + schema helpers. |
| include/tpch/dsdgen_converter.hpp | Public conversion function declarations. |
| cmake/gen_dsts.py | New generator to embed tpcds.idx as a uint8_t[] C array. |
| CMakeLists.txt | Adds TPCDS_ENABLE option and the tpcds_benchmark target when enabled. |
| .gitmodules | Registers the new third_party/tpcds submodule. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ${DSDGEN_SOURCE_DIR}/s_web_order.c | ||
| ${DSDGEN_SOURCE_DIR}/s_web_order_lineitem.c | ||
| ${DSDGEN_SOURCE_DIR}/s_web_page.c | ||
| ${DSDGEN_SOURCE_DIR}/s_web_promotinal_item.c |
There was a problem hiding this comment.
This source filename looks misspelled (promotinal). If the upstream file is actually s_web_promotional_item.c (common in TPC-DS dsdgen trees), this will fail the build at configure/generate time. Update the entry to the correct upstream filename (or gate it behind an existence check) so the dsdgen object library can be built reliably.
| ${DSDGEN_SOURCE_DIR}/s_web_promotinal_item.c | |
| ${DSDGEN_SOURCE_DIR}/s_web_promotional_item.c |
| std::vector<std::shared_ptr<arrow::Array>> arrays; | ||
| arrays.reserve(schema->num_fields()); | ||
| for (const auto& field : schema->fields()) { | ||
| arrays.push_back(builders[field->name()]->Finish().ValueOrDie()); |
There was a problem hiding this comment.
arrow::ArrayBuilder::Finish() is typically a Status API that writes into an out-parameter (and does not return a Result), so Finish().ValueOrDie() may not compile depending on the Arrow version. Consider switching to the Status Finish(std::shared_ptr<arrow::Array>* out) pattern (and handling Status) to ensure compatibility and avoid terminating the process on failure.
| arrays.push_back(builders[field->name()]->Finish().ValueOrDie()); | |
| auto builder_it = builders.find(field->name()); | |
| if (builder_it == builders.end()) { | |
| throw std::runtime_error("No Arrow builder found for field: " + field->name()); | |
| } | |
| std::shared_ptr<arrow::Array> array; | |
| arrow::Status st = builder_it->second->Finish(&array); | |
| if (!st.ok()) { | |
| throw std::runtime_error( | |
| "Failed to finish Arrow builder for field '" + field->name() + | |
| "': " + st.ToString()); | |
| } | |
| arrays.push_back(array); |
| static inline double dec_to_double(const decimal_t* d) { | ||
| if (d->precision == 0) return static_cast<double>(d->number); | ||
| double result = static_cast<double>(d->number); | ||
| for (int i = 0; i < d->precision; ++i) { | ||
| result /= 10.0; | ||
| } | ||
| return result; | ||
| } |
There was a problem hiding this comment.
TPC-DS decimal_t carries both precision and scale. The number of fractional digits is typically represented by scale, not precision. Dividing by 10^precision will produce incorrect values for fields where precision != scale. Update the conversion to divide by 10^scale (and keep precision for total digits metadata only).
| /** | ||
| * dsdgen_converter.cpp — Convert dsdgen C structs to Arrow array builders. | ||
| * | ||
| * Uses dectof() to convert decimal_t (scaled integer) fields to double. |
There was a problem hiding this comment.
The comment references dectof(), but the implementation uses dec_to_double(). Updating this comment will avoid confusion for future maintainers and makes it easier to grep/cross-reference the conversion logic.
| * Uses dectof() to convert decimal_t (scaled integer) fields to double. | |
| * Uses dec_to_double() to convert decimal_t (scaled integer) fields to double. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6b0d16b9ce
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| StoreSalesCtx ctx{&callback, max_rows, 0L}; | ||
| g_w_store_sales_callback = store_sales_trampoline; | ||
| g_w_store_sales_callback_ctx = &ctx; | ||
|
|
There was a problem hiding this comment.
Clear global callback pointers on exceptional exits
generate_store_sales stores a pointer to stack-local ctx in global callback state, but that state is only reset on the normal path after the loop. If the user callback throws (for example, run_generation calling writer->write_batch can throw on write failures), control exits before cleanup and g_w_store_sales_callback_ctx keeps a dangling pointer, so a later dsdgen invocation can dereference freed stack memory and crash or corrupt generation state; the same pattern is present in generate_catalog_sales and generate_web_sales.
Useful? React with 👍 / 👎.