
Lance streaming optimization: pre-computed stats or cardinality sampling#7

Merged
tsafin merged 14 commits into master from tsafin/lance_stream
Feb 24, 2026
Conversation


@tsafin tsafin commented Feb 23, 2026

Summary

  • Phase 3.1: Add --lance-cardinality-sample-rate CLI option to sample a fraction of values for HyperLogLogPlus cardinality estimation, reducing CPU overhead while preserving stats quality
  • Phase 3.2: Inject pre-computed TPC-H cardinalities via Arrow field metadata "lance.cardinality", bypassing HyperLogLogPlus entirely for bounded-cardinality columns
  • Lance FFI: Full rewrite of lance_writer with proper Arrow C Data Interface, null bitmap handling, and streaming write support
  • Benchmark results: SF=5 lineitem 323–426K rows/sec vs 217K baseline (+49–96%)

Key Changes

Lance encoding (bundled third_party/lance)

  • statistics.rs: CARDINALITY_HINT thread-local + set_cardinality_hint() public API; VariableWidthBlock::cardinality() returns pre-computed value when hint is set, skipping all hash operations
  • statistics.rs: LCG sampling with u64 threshold (fixes f32 precision bug at rate=0.2)
  • encodings/logical/primitive.rs: reads field.metadata["lance.cardinality"] before DataBlock::from_arrays() and sets hint
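
The LCG sampling change above can be sketched as follows. This is an illustrative reconstruction, not the bundled Lance code: the key idea is comparing the raw LCG state against a pre-scaled `u64` threshold, instead of converting to `f32` and comparing against the rate, which loses precision at rates like 0.2.

```rust
// Illustrative sketch of LCG-based value sampling with an integer
// threshold. All names are hypothetical; only the technique (u64
// threshold comparison instead of an f32 ratio test) mirrors the PR.
struct LcgSampler {
    state: u64,
    threshold: u64, // accept a value when the next LCG state < threshold
}

impl LcgSampler {
    fn new(seed: u64, rate: f64) -> Self {
        let rate = rate.clamp(0.0, 1.0);
        // Scale the rate to the full u64 range once, up front; the hot
        // path then needs only one integer comparison per value.
        let threshold = (rate * u64::MAX as f64) as u64;
        LcgSampler { state: seed, threshold }
    }

    fn sample(&mut self) -> bool {
        // Knuth's MMIX LCG constants.
        self.state = self
            .state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.state < self.threshold
    }
}
```

At `rate = 1.0` the threshold saturates to `u64::MAX`, so effectively every value feeds the HLL; at `rate = 0.0` nothing does.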

TPC-H schemas (src/dbgen/dbgen_wrapper.cpp)

  • tpch_field() helper attaches "lance.cardinality" metadata to known-cardinality string columns
  • Annotated columns per TPC-H spec / dists.dss: l_returnflag(3), l_linestatus(2), l_shipinstruct(4), l_shipmode(7), date fields(2556), o_orderstatus(3), o_orderpriority(5), c_mktsegment(5), p_mfgr(5), p_brand(25), p_type(150), p_container(40), etc.
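
On the read side, the hint lookup described above amounts to parsing one key out of the field's key/value metadata. A minimal sketch (the `"lance.cardinality"` key is the one this PR attaches; the function name and surrounding code are assumptions):

```rust
use std::collections::HashMap;

// Hypothetical sketch: extract a pre-computed cardinality hint from an
// Arrow field's string metadata. Missing or unparseable values fall
// back to None, i.e. the normal HyperLogLogPlus path.
fn cardinality_hint(metadata: &HashMap<String, String>) -> Option<u64> {
    metadata
        .get("lance.cardinality")
        .and_then(|v| v.parse::<u64>().ok())
}
```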

Lance FFI (third_party/lance-ffi/src/lib.rs, src/writers/lance_writer.cpp)

  • Streaming write path using RecordBatchReader channel
  • Proper null bitmap and Arrow C Data Interface ownership semantics
  • Support for Float32, Date32, Boolean, nullable arrays

tsafin and others added 5 commits February 20, 2026 21:25
Benchmarked all formats (Parquet, ORC, CSV, Lance, Paimon, Iceberg) at SF=5
with zero-copy optimization. Key findings:

- ORC wins: 905K rows/sec average (17% faster than Parquet)
- Parquet: 771K rows/sec, excellent stability
- Lance/Paimon/Iceberg: High variance (60-177%), not production-ready
- CSV: 235K rows/sec, I/O bound

Added:
- Comprehensive benchmark table with maximum values in README
- Performance and stability visualization charts
- Benchmark scripts for automated testing and visualization
- Detailed analysis reports in benchmark-results/

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Includes two Lance commits:
- 381feaff: light stats/compression tuning and fixed-width single-array zero-copy path
- caac148a: binary/utf8 single-array zero-copy path when offsets are normalized
Pulls Lance commit ddf0ee5d to add safety-aware array retention in the encoder accumulation queue without env overrides.
Phase 3.1 - Cardinality sampling via CLI:
- Add --lance-cardinality-sample-rate <0.01-1.0> option to control what
  fraction of values are fed into HyperLogLogPlus cardinality estimation
- Sets LANCE_CARDINALITY_SAMPLE_RATE env var before Lance writer init
- Reduces HLL CPU overhead at the cost of minor cardinality accuracy

Phase 3.2 - Pre-computed TPC-H cardinality hints:
- Add tpch_field() helper in dbgen_wrapper.cpp that attaches
  "lance.cardinality" Arrow field metadata for known-cardinality columns
- Lance reads this metadata and skips HyperLogLogPlus entirely for those
  columns, returning the exact pre-computed cardinality instead
- Lineitem: l_returnflag(3), l_linestatus(2), l_shipinstruct(4),
  l_shipmode(7), date fields(2556); orders/customer/part similarly annotated
- Measured result: SF=5 lineitem 323-426K rows/sec vs 217K baseline (+49-96%)

Update bundled Lance: pre-computed cardinality hints and LCG sampling fix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@tsafin tsafin changed the title Lance streaming write optimization: pre-computed stats and cardinality sampling Lance streaming optimization: pre-computed stats or cardinality sampling Feb 23, 2026
tsafin and others added 5 commits February 23, 2026 14:53
Points to https://github.com/tsafin/lance.git (tsafin/lance_stream branch)
which contains our custom optimizations:
- LCG cardinality sampling (Phase 3.1)
- Pre-computed cardinality hints via Arrow field metadata (Phase 3.2)
- Light-stats mode, zero-copy fast paths, auto-retention streaming

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
docker-images.yml:
- Add type=ref,event=branch so every docker build also publishes a stable
  branch tag (e.g. tsafin-lance_stream) alongside the SHA-pinned tag.
  :latest is still only pushed on the default branch (master).

ci.yml:
- Add resolve-images job that sanitizes the current branch name, checks
  whether a branch-specific image exists in GHCR (docker manifest inspect),
  and outputs the correct image tag per config (base/orc/lance).
  Falls back to :latest when no branch-specific image is available.
- build-matrix, benchmark-suite, optimization-benchmarks all now use
  needs.resolve-images.outputs.{base,orc,lance}_image instead of :latest.

This fixes CI builds on PR branches where third_party/lance-ffi has changed:
the branch image contains the freshly compiled liblance_ffi.a while :latest
still carries the master version with potentially missing FFI symbols.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The resolve-images job was incorrectly converting underscores to dashes,
producing 'tsafin-lance-stream-' instead of the actual published tag
'tsafin-lance_stream'. docker/metadata-action type=ref,event=branch only
replaces '/' with '-' and lowercases; underscores are preserved.
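
The corrected sanitization rule is small enough to state in code. A sketch matching the behavior described above (replace `/` with `-`, lowercase, preserve underscores); the function name is illustrative:

```rust
// Branch-name → image-tag sanitization matching what
// docker/metadata-action type=ref,event=branch actually does:
// '/' becomes '-', everything is lowercased, '_' is kept.
// The earlier CI bug also mapped '_' to '-', producing a tag
// that was never published.
fn branch_tag(branch: &str) -> String {
    branch.replace('/', "-").to_lowercase()
}
```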

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace utf8() Arrow arrays with dictionary(int8(), utf8()) for 10 columns
with known-bounded cardinality from TPC-H spec/dists.dss. Lance routes
these to DictionaryDataBlock where compute_stat() is a literal no-op {}
— zero HLL/XXH3 overhead vs Phase 3.2's cardinality-hint approach.

Columns converted (10 total):
  lineitem: returnflag(3), linestatus(2), shipinstruct(4), shipmode(7)
  orders:   orderstatus(3), orderpriority(5)
  customer: mktsegment(5)
  part:     mfgr(5), brand(25), container(40)

Encode functions use O(1) switch on first character(s) — no hashing,
no strcmp needed. Both conversion paths updated: zero-copy batch path
(zero_copy_converter.cpp) and row-by-row path (dbgen_converter.cpp).
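
The "switch on first character" encoders live in the C++ converters; the same idea in a Rust sketch (values are from the TPC-H spec, the function shape is hypothetical):

```rust
// Illustrative first-byte dictionary encoder for l_returnflag, whose
// only legal values are "A" | "N" | "R". One byte uniquely identifies
// the dictionary index — no hashing, no string comparison.
fn encode_returnflag(value: &str) -> i8 {
    match value.as_bytes().first() {
        Some(b'A') => 0,
        Some(b'N') => 1,
        Some(b'R') => 2,
        _ => -1, // unexpected value; real code would assert instead
    }
}
```

Columns whose values collide on the first byte (e.g. p_container's 40 values) would need to inspect a second character, but the lookup stays O(1).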

lance-ffi/lib.rs: apply_compression_metadata() now skips dict fields
(Lance handles DictionaryDataBlock encoding internally).

Results SF=5 lineitem (30M rows, clean runs):
  Phase 3.2 baseline: ~410-440K r/s
  Phase 3.3:          645-648K r/s (+57% over 3.2, +197% vs 217K original)
  Stability:          <0.5% variance (much better than 3.2)
  File size:          4.0 GB vs 4.24 GB (-5.7%)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Eliminate remaining VariableWidthBlock::compute_stat (HLL/XXH3) overhead
by adding scale-factor-derived cardinality hints to all high-cardinality
utf8 columns.  Perf profile showed 17.7% of samples in compute_stat for
l_comment and similar fields that had no hint — now all utf8 fields skip
HLL computation.

get_schema() now accepts scale_factor (default 1.0) and sets hints:
- Comment fields (l/o/c/p/ps/s_comment): hint = table_row_count * SF
- High-cardinality name/address/phone fields: hint = table_row_count * SF
- o_clerk: hint = 1000 * SF (Clerk#XXXXXXXXX, spec-defined cardinality)
- nation/region comments: small fixed counts (25, 5)
All dict8 columns already had zero-overhead compute_stat (no-op).
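
The hint rules above can be sketched as one small helper. This is a hypothetical reconstruction of the dbgen_wrapper logic, not the actual C++ code; `base_rows` is assumed to be the table's SF=1 row count:

```rust
// Illustrative mapping from column name to cardinality hint, following
// the rules in this commit: per-row fields scale with table size,
// o_clerk is spec-bounded at 1000 * SF, nation/region are tiny fixed
// tables. Names and the fallback arm are assumptions.
fn hint_for(column: &str, base_rows: u64, sf: f64) -> u64 {
    match column {
        "o_clerk" => (1000.0 * sf) as u64,
        "n_comment" => 25,
        "r_comment" => 5,
        // comment/name/address/phone fields: ~one distinct value per row
        _ => (base_rows as f64 * sf) as u64,
    }
}
```

For lineitem (~6M rows at SF=1) this yields a 30M hint at SF=5, matching the 30M-row benchmark runs quoted elsewhere in this PR.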

Also: Tokio runtime tuned for streaming path (worker_threads(1) vs
default multi-thread N-worker pool) to reduce idle thread overhead.

Benchmark SF=5 lineitem (non-zero-copy), window peaks up to 922K r/s
vs ~650K before this phase.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

tsafin commented Feb 23, 2026

@copilot review


Copilot AI commented Feb 23, 2026

@tsafin I've opened a new pull request, #8, to work on those changes. Once the pull request is ready, I'll request review from you.


tsafin commented Feb 23, 2026

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1afbdb8a46


tsafin and others added 4 commits February 24, 2026 03:05
…ule)

Update third_party/lance submodule to include Phase 3.5 changes:
- previous::PrimitiveFieldEncoder gets known_cardinality cached at encoder
  creation; set_cardinality_hint() called in each tokio encode task so
  VariableWidthBlock::cardinality() skips HyperLogLogPlus entirely.
- PrimitiveStructuralEncoder same pattern for V2.1+ consistency.
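
The thread-local pattern described above can be sketched as follows. This mirrors the names in the PR description (`CARDINALITY_HINT`, `set_cardinality_hint()`) but is an illustrative reconstruction, not the bundled Lance source:

```rust
use std::cell::Cell;

// The encoder caches known_cardinality at creation and publishes it via
// a thread-local at the start of each encode task; the stats path then
// reads the hint and skips HyperLogLog entirely when it is set.
thread_local! {
    static CARDINALITY_HINT: Cell<Option<u64>> = Cell::new(None);
}

fn set_cardinality_hint(hint: Option<u64>) {
    CARDINALITY_HINT.with(|c| c.set(hint));
}

// Stand-in for VariableWidthBlock::cardinality(): use the hint if
// present, otherwise fall back to the expensive computation.
fn cardinality_or_compute(compute: impl FnOnce() -> u64) -> u64 {
    CARDINALITY_HINT.with(|c| c.get()).unwrap_or_else(compute)
}
```

Because the hint is thread-local, it must be re-set inside each tokio encode task (tasks may run on different worker threads), which is exactly what the commit describes.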

Perf result: compute_stat 17.7% → ~0.05%, window peaks 300K → 1.7M+ r/s
on SF=5 lineitem zero-copy path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Arrow dict8<int8, utf8> arrays (used for low-cardinality TPC-H columns
since Phase 3.3) were not handled in non-Lance writers:

- csv_writer: fell back to array->ToString() producing corrupt column
  representation per cell; fix does index→value lookup via GetString()
- orc_writer: arrow_type_to_orc_type_name() threw "Unsupported type" on
  DICTIONARY; copy_array_to_orc_column() also threw; fix maps DICTIONARY
  to ORC "string" and expands indices via GetView() into dict buffer
- paimon_writer: arrow_type_to_paimon_type() threw on DICTIONARY; fix
  maps to "string" (data path via parquet::arrow::WriteTable is fine)
- iceberg_writer: arrow_type_to_iceberg_type() threw on DICTIONARY; fix
  maps to "string" (data path via parquet::arrow::WriteTable is fine)

Parquet and Lance handle DICTIONARY natively through Arrow — no changes
needed. ORC fix tested via code review only (requires -DTPCH_ENABLE_ORC).
CSV fix verified: l_returnflag/linestatus/shipinstruct/shipmode output
correct string values ("N", "O", "DELIVER IN PERSON", "TRUCK").

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the Rust streaming task exits early (disk error, write failure),
push() would block indefinitely: the queue stays full, nothing notifies
not_full_cv_, and the C++ producer hangs forever.

Fix:
- Add lance_writer_stream_is_alive() Rust FFI: polls JoinHandle::is_finished()
  to detect whether the background task is still consuming the stream.
- Add StreamState::set_health_check(fn): stores a callback invoked when
  a push() timeout fires without a real queue-drain notification.
- Rewrite the stall wait from bare wait() to a wait_for(500ms) loop:
  on each timeout, call the health check; if the task is gone, call
  set_error() which unblocks both not_full_cv_ and not_empty_cv_.
- Wire up in initialize_lance_dataset(): register a lambda that calls
  lance_writer_stream_is_alive(rust_writer_) after stream start.

The 500 ms poll interval means producers unblock within half a second of
a Rust task crash, rather than hanging until process kill. Normal
operation (no task failure) adds zero overhead: the health check is only
reached on a wait_for timeout, which only fires when the queue stays full
for 500 ms without a pop — i.e., already a stall event.
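
The actual fix is in C++ (`lance_writer.cpp`), but the wait-with-timeout-plus-health-check pattern translates directly to Rust's `Condvar::wait_timeout`; the sketch below uses hypothetical types, with `consumer_alive` standing in for `lance_writer_stream_is_alive()`:

```rust
use std::sync::{Condvar, Mutex};
use std::time::Duration;

struct QueueState {
    len: usize,
    cap: usize,
    error: bool,
}

struct Queue {
    state: Mutex<QueueState>,
    not_full: Condvar,
}

impl Queue {
    // Instead of a bare wait on "queue not full", wait with a 500 ms
    // timeout; on each timeout, poll the consumer. If it has died, set
    // the error flag (which a real implementation would pair with
    // notify_all on both condvars) so producers fail fast instead of
    // hanging forever.
    fn push(&self, consumer_alive: impl Fn() -> bool) -> Result<(), &'static str> {
        let mut st = self.state.lock().unwrap();
        while st.len == st.cap && !st.error {
            let (guard, timeout) = self
                .not_full
                .wait_timeout(st, Duration::from_millis(500))
                .unwrap();
            st = guard;
            if timeout.timed_out() && !consumer_alive() {
                st.error = true; // dead consumer: abort the wait
            }
        }
        if st.error {
            return Err("stream consumer is gone");
        }
        st.len += 1;
        Ok(())
    }
}
```

As in the C++ fix, the health check is only reached when a 500 ms timeout fires on a full queue, so the happy path pays nothing.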

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two bugs identified by copilot-swe-agent in PR #8:

- lib.rs: fix memory leak on ArrowArrayStreamReader::from_raw() failure.
  The original code called libc::free() after the ? early-return, so the
  malloc'd ArrowArrayStream struct leaked if from_raw() failed. Fix:
  free unconditionally before propagating the error.

- lance_writer.cpp: remove dead no-op else branch in StreamState::push().
  The else path (queue not full) called not_full_cv_.wait() with a
  predicate already satisfied, making it a guaranteed immediate return.
  Removing it clarifies the backpressure logic.
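
The shape of the leak fix is worth spelling out. In this sketch the types are hypothetical and `Box` stands in for the real `malloc`/`libc::free` pair; the point is binding the fallible result first, freeing unconditionally, and only then applying `?`:

```rust
struct Reader;

// Stand-in for ArrowArrayStreamReader::from_raw(): fallible, and does
// not free the struct it was handed.
unsafe fn reader_from_raw(raw: *mut u8) -> Result<Reader, String> {
    if raw.is_null() {
        Err("null stream".into())
    } else {
        Ok(Reader)
    }
}

fn open_stream(raw: *mut u8) -> Result<Reader, String> {
    // Bind the result instead of writing `reader_from_raw(raw)?` — the
    // `?` would return early and skip the free below, leaking `raw`.
    let result = unsafe { reader_from_raw(raw) };
    if !raw.is_null() {
        // Free unconditionally, success or failure.
        drop(unsafe { Box::from_raw(raw) });
    }
    let reader = result?; // now safe to propagate the error
    Ok(reader)
}
```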

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@tsafin tsafin merged commit b16981d into master Feb 24, 2026
3 of 4 checks passed