Skip to content

Adding tpcds benchmark#12

Open
tsafin wants to merge 2 commits intomasterfrom
tpcds_cpp_embedded
Open

Adding tpcds benchmark#12
tsafin wants to merge 2 commits intomasterfrom
tpcds_cpp_embedded

Conversation

@tsafin
Copy link
Owner

@tsafin tsafin commented Mar 6, 2026

  • Modified third_party/tpcds for embedding scenarios
  • Create wrappers for several first tables

tsafin and others added 2 commits March 6, 2026 23:30
Adds a complete tpcds_benchmark executable (built with -DTPCDS_ENABLE=ON)
that generates TPC-DS store_sales and inventory tables into Parquet, CSV,
ORC, Lance, Paimon, or Iceberg via the existing tpch::WriterInterface.

Build infrastructure (third_party/dsdgen/):
- CMakeLists.txt: mkheader → tables.h/streams.h/columns.h → distcomp →
  tpcds.idx → gen_dsts.py → dsts_generated.c → dsdgen_objs object library
- dsdgen_stubs.c: thin file-I/O stubs (not needed in embedded mode)
- tpcds_dsdgen.h: C++-safe forward declarations that bypass the fragile
  config.h/porting.h LINUX/HUGE_TYPE dependency chain; defines ds_key_t,
  decimal_t, ds_pricing_t, W_STORE_SALES_TBL, W_INVENTORY_TBL, and the
  EMBEDDED_DSDGEN callback globals
- cmake/gen_dsts.py: embeds tpcds.idx as a C byte array (dsts_generated.c)

C++ wrappers (include/tpch/, src/dsdgen/):
- dsdgen_wrapper.hpp/.cpp: DSDGenWrapper class — initialises dsdgen from
  the embedded tpcds.idx (via mkstemp temp file), exposes Arrow schemas
  for STORE_SALES and INVENTORY, and drives generation via the
  EMBEDDED_DSDGEN callback trampoline (store_sales is master-detail:
  8-16 line items per ticket; callback fires once per line item)
- dsdgen_converter.hpp/.cpp: append_*_to_builders() helpers mapping the
  dsdgen C structs to Arrow array builders; dec_to_double() correctly
  converts decimal_t scaled integers (avoids buggy upstream dectoflt)

Executable (src/tpcds_main.cpp):
- CLI mirrors tpch_benchmark: --format, --table, --scale-factor,
  --output-dir, --max-rows, --verbose
- Batched Arrow generation loop (10 000 rows/batch) → writer->write_batch

third_party/tpcds submodule updated to branch tpcds_cpp_embedded which
adds EMBEDDED_DSDGEN guards to w_store_sales.c (callback + suppressed
file I/O for both store_sales and store_returns output files).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hers)

Implements the following tables in tpcds_benchmark:
  catalog_sales  — master-detail (callback-based, 4-14 line items/ticket)
  web_sales      — master-detail (callback-based, 8-16 line items/ticket)
  customer       — dimension table (simple, 100K rows at SF=1)
  item           — dimension table (simple, ~18K rows at SF=1)
  date_dim       — dimension table (simple, 73,049 fixed rows)
  store_returns  — driven through store_sales ticket loop
  catalog_returns — driven through catalog_sales ticket loop
  web_returns    — driven through web_sales ticket loop

Changes:
- tpcds_dsdgen.h: added struct definitions for all 8 new table row types,
  table ID constants, mk_w_* function declarations, and EMBEDDED_DSDGEN
  extern callback declarations for catalog_sales and web_sales
- dsdgen_wrapper.hpp: added generate_* method declarations for 8 tables
- dsdgen_wrapper.cpp: Arrow schemas, trampolines (catalog/web sales),
  generate_* implementations; returns tables are driven through their
  parent sales loop using a no-op callback to suppress file output while
  populating the global sales struct for valid return references
- dsdgen_converter.hpp/.cpp: Arrow builder converters for all 8 tables;
  updated append_dsdgen_row_to_builders() dispatcher
- tpcds_main.cpp: updated parse_table(), generation dispatch, and
  usage text to reflect Phase 3 completion
- third_party/tpcds submodule: updated to include w_catalog_sales.c
  and w_web_sales.c EMBEDDED_DSDGEN callback patches

All 10 implemented tables verified: correct row counts, sane values.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@tsafin tsafin requested a review from Copilot March 6, 2026 22:19
@tsafin
Copy link
Owner Author

tsafin commented Mar 6, 2026

@codex review

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds TPC-DS data generation support by vendoring the upstream tools, embedding distribution data, and exposing a new tpcds_benchmark executable.

Changes:

  • Adds a third_party/tpcds submodule and a CMake build for compiling/embedding TPC-DS dsdgen.
  • Introduces a C++ DSDGenWrapper + Arrow conversion layer for several initial TPC-DS tables.
  • Adds a TPCDS_ENABLE build flag and wires a new tpcds_benchmark target into the top-level build.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
third_party/tpcds Adds the TPC-DS tools submodule pin for dsdgen sources.
third_party/dsdgen/tpcds_dsdgen.h Provides a C++-friendly API surface for calling into dsdgen objects.
third_party/dsdgen/dsdgen_stubs.c Placeholder TU for embed-mode stubs (currently comment-only).
third_party/dsdgen/CMakeLists.txt Builds mkheader, distcomp, embeds tpcds.idx, and compiles dsdgen as objects.
src/tpcds_main.cpp New tpcds_benchmark CLI that generates Arrow batches and writes using existing writer backends.
src/dsdgen/dsdgen_wrapper.cpp Initializes embedded distributions + exposes per-table generation APIs with callbacks.
src/dsdgen/dsdgen_converter.cpp Converts dsdgen structs into Arrow builders for supported tables.
include/tpch/dsdgen_wrapper.hpp Public wrapper API + table enum + schema helpers.
include/tpch/dsdgen_converter.hpp Public conversion function declarations.
cmake/gen_dsts.py New generator to embed tpcds.idx as a uint8_t[] C array.
CMakeLists.txt Adds TPCDS_ENABLE option and the tpcds_benchmark target when enabled.
.gitmodules Registers the new third_party/tpcds submodule.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

${DSDGEN_SOURCE_DIR}/s_web_order.c
${DSDGEN_SOURCE_DIR}/s_web_order_lineitem.c
${DSDGEN_SOURCE_DIR}/s_web_page.c
${DSDGEN_SOURCE_DIR}/s_web_promotinal_item.c
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This source filename looks misspelled (promotinal). If the upstream file is actually s_web_promotional_item.c (common in TPC-DS dsdgen trees), this will fail the build at configure/generate time. Update the entry to the correct upstream filename (or gate it behind an existence check) so the dsdgen object library can be built reliably.

Suggested change
${DSDGEN_SOURCE_DIR}/s_web_promotinal_item.c
${DSDGEN_SOURCE_DIR}/s_web_promotional_item.c

Copilot uses AI. Check for mistakes.
std::vector<std::shared_ptr<arrow::Array>> arrays;
arrays.reserve(schema->num_fields());
for (const auto& field : schema->fields()) {
arrays.push_back(builders[field->name()]->Finish().ValueOrDie());
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

arrow::ArrayBuilder::Finish() is typically a Status API that writes into an out-parameter (and does not return a Result), so Finish().ValueOrDie() may not compile depending on the Arrow version. Consider switching to the Status Finish(std::shared_ptr<arrow::Array>* out) pattern (and handling Status) to ensure compatibility and avoid terminating the process on failure.

Suggested change
arrays.push_back(builders[field->name()]->Finish().ValueOrDie());
auto builder_it = builders.find(field->name());
if (builder_it == builders.end()) {
throw std::runtime_error("No Arrow builder found for field: " + field->name());
}
std::shared_ptr<arrow::Array> array;
arrow::Status st = builder_it->second->Finish(&array);
if (!st.ok()) {
throw std::runtime_error(
"Failed to finish Arrow builder for field '" + field->name() +
"': " + st.ToString());
}
arrays.push_back(array);

Copilot uses AI. Check for mistakes.
Comment on lines +30 to +37
static inline double dec_to_double(const decimal_t* d) {
if (d->precision == 0) return static_cast<double>(d->number);
double result = static_cast<double>(d->number);
for (int i = 0; i < d->precision; ++i) {
result /= 10.0;
}
return result;
}
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TPC-DS decimal_t carries both precision and scale. The number of fractional digits is typically represented by scale, not precision. Dividing by 10^precision will produce incorrect values for fields where precision != scale. Update the conversion to divide by 10^scale (and keep precision for total digits metadata only).

Copilot uses AI. Check for mistakes.
/**
* dsdgen_converter.cpp — Convert dsdgen C structs to Arrow array builders.
*
* Uses dectof() to convert decimal_t (scaled integer) fields to double.
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment references dectof(), but the implementation uses dec_to_double(). Updating this comment will avoid confusion for future maintainers and makes it easier to grep/cross-reference the conversion logic.

Suggested change
* Uses dectof() to convert decimal_t (scaled integer) fields to double.
* Uses dec_to_double() to convert decimal_t (scaled integer) fields to double.

Copilot uses AI. Check for mistakes.
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6b0d16b9ce

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +509 to +512
StoreSalesCtx ctx{&callback, max_rows, 0L};
g_w_store_sales_callback = store_sales_trampoline;
g_w_store_sales_callback_ctx = &ctx;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Clear global callback pointers on exceptional exits

generate_store_sales stores a pointer to stack-local ctx in global callback state, but that state is only reset on the normal path after the loop. If the user callback throws (for example, run_generation calling writer->write_batch can throw on write failures), control exits before cleanup and g_w_store_sales_callback_ctx keeps a dangling pointer, so a later dsdgen invocation can dereference freed stack memory and crash or corrupt generation state; the same pattern is present in generate_catalog_sales and generate_web_sales.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants