
Lance streaming optimization: pre-computed stats or cardinality sampling#7

Merged
tsafin merged 14 commits into master from tsafin/lance_stream
Feb 24, 2026
Conversation


@tsafin tsafin commented Feb 23, 2026

Summary

  • Phase 3.1: Add --lance-cardinality-sample-rate CLI option to sample a fraction of values for HyperLogLogPlus cardinality estimation, reducing CPU overhead while preserving stats quality
  • Phase 3.2: Inject pre-computed TPC-H cardinalities via Arrow field metadata "lance.cardinality", bypassing HyperLogLogPlus entirely for bounded-cardinality columns
  • Lance FFI: Full rewrite of lance_writer with proper Arrow C Data Interface, null bitmap handling, and streaming write support
  • Benchmark results: SF=5 lineitem 323–426K rows/sec vs 217K baseline (+49–96%)

Key Changes

Lance encoding (bundled third_party/lance)

  • statistics.rs: CARDINALITY_HINT thread-local + set_cardinality_hint() public API; VariableWidthBlock::cardinality() returns pre-computed value when hint is set, skipping all hash operations
  • statistics.rs: LCG sampling with u64 threshold (fixes f32 precision bug at rate=0.2)
  • encodings/logical/primitive.rs: reads field.metadata["lance.cardinality"] before DataBlock::from_arrays() and sets hint
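
The LCG sampling change above can be sketched as follows. This is an illustrative reconstruction, not the bundled Lance code: the key idea is comparing the raw LCG state against a pre-scaled `u64` threshold, instead of converting to `f32` and comparing against the rate, which loses precision at rates like 0.2.

```rust
// Illustrative sketch of LCG-based value sampling with an integer
// threshold. All names are hypothetical; only the technique (u64
// threshold comparison instead of an f32 ratio test) mirrors the PR.
struct LcgSampler {
    state: u64,
    threshold: u64, // accept a value when the next LCG state < threshold
}

impl LcgSampler {
    fn new(seed: u64, rate: f64) -> Self {
        let rate = rate.clamp(0.0, 1.0);
        // Scale the rate to the full u64 range once, up front; the hot
        // path then needs only one integer comparison per value.
        let threshold = (rate * u64::MAX as f64) as u64;
        LcgSampler { state: seed, threshold }
    }

    fn sample(&mut self) -> bool {
        // Knuth's MMIX LCG constants.
        self.state = self
            .state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.state < self.threshold
    }
}
```

At `rate = 1.0` the threshold saturates to `u64::MAX`, so effectively every value feeds the HLL; at `rate = 0.0` nothing does.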

TPC-H schemas (src/dbgen/dbgen_wrapper.cpp)

  • tpch_field() helper attaches "lance.cardinality" metadata to known-cardinality string columns
  • Annotated columns per TPC-H spec / dists.dss: l_returnflag(3), l_linestatus(2), l_shipinstruct(4), l_shipmode(7), date fields(2556), o_orderstatus(3), o_orderpriority(5), c_mktsegment(5), p_mfgr(5), p_brand(25), p_type(150), p_container(40), etc.
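
On the read side, the hint lookup described above amounts to parsing one key out of the field's key/value metadata. A minimal sketch (the `"lance.cardinality"` key is the one this PR attaches; the function name and surrounding code are assumptions):

```rust
use std::collections::HashMap;

// Hypothetical sketch: extract a pre-computed cardinality hint from an
// Arrow field's string metadata. Missing or unparseable values fall
// back to None, i.e. the normal HyperLogLogPlus path.
fn cardinality_hint(metadata: &HashMap<String, String>) -> Option<u64> {
    metadata
        .get("lance.cardinality")
        .and_then(|v| v.parse::<u64>().ok())
}
```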

Lance FFI (third_party/lance-ffi/src/lib.rs, src/writers/lance_writer.cpp)

  • Streaming write path using RecordBatchReader channel
  • Proper null bitmap and Arrow C Data Interface ownership semantics
  • Support for Float32, Date32, Boolean, nullable arrays

tsafin and others added 5 commits February 20, 2026 21:25
Benchmarked all formats (Parquet, ORC, CSV, Lance, Paimon, Iceberg) at SF=5
with zero-copy optimization. Key findings:

- ORC wins: 905K rows/sec average (17% faster than Parquet)
- Parquet: 771K rows/sec, excellent stability
- Lance/Paimon/Iceberg: High variance (60-177%), not production-ready
- CSV: 235K rows/sec, I/O bound

Added:
- Comprehensive benchmark table with maximum values in README
- Performance and stability visualization charts
- Benchmark scripts for automated testing and visualization
- Detailed analysis reports in benchmark-results/

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Includes two Lance commits:
- 381feaff: light stats/compression tuning and fixed-width single-array zero-copy path
- caac148a: binary/utf8 single-array zero-copy path when offsets are normalized
Pulls Lance commit ddf0ee5d to add safety-aware array retention in the encoder accumulation queue without env overrides.
Phase 3.1 - Cardinality sampling via CLI:
- Add --lance-cardinality-sample-rate <0.01-1.0> option to control what
  fraction of values are fed into HyperLogLogPlus cardinality estimation
- Sets LANCE_CARDINALITY_SAMPLE_RATE env var before Lance writer init
- Reduces HLL CPU overhead at the cost of minor cardinality accuracy

Phase 3.2 - Pre-computed TPC-H cardinality hints:
- Add tpch_field() helper in dbgen_wrapper.cpp that attaches
  "lance.cardinality" Arrow field metadata for known-cardinality columns
- Lance reads this metadata and skips HyperLogLogPlus entirely for those
  columns, returning the exact pre-computed cardinality instead
- Lineitem: l_returnflag(3), l_linestatus(2), l_shipinstruct(4),
  l_shipmode(7), date fields(2556); orders/customer/part similarly annotated
- Measured result: SF=5 lineitem 323-426K rows/sec vs 217K baseline (+49-96%)

Update bundled Lance: pre-computed cardinality hints and LCG sampling fix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@tsafin tsafin changed the title Lance streaming write optimization: pre-computed stats and cardinality sampling Lance streaming optimization: pre-computed stats or cardinality sampling Feb 23, 2026
tsafin and others added 5 commits February 23, 2026 14:53
Points to https://github.com/tsafin/lance.git (tsafin/lance_stream branch)
which contains our custom optimizations:
- LCG cardinality sampling (Phase 3.1)
- Pre-computed cardinality hints via Arrow field metadata (Phase 3.2)
- Light-stats mode, zero-copy fast paths, auto-retention streaming

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
docker-images.yml:
- Add type=ref,event=branch so every docker build also publishes a stable
  branch tag (e.g. tsafin-lance_stream) alongside the SHA-pinned tag.
  :latest is still only pushed on the default branch (master).

ci.yml:
- Add resolve-images job that sanitizes the current branch name, checks
  whether a branch-specific image exists in GHCR (docker manifest inspect),
  and outputs the correct image tag per config (base/orc/lance).
  Falls back to :latest when no branch-specific image is available.
- build-matrix, benchmark-suite, optimization-benchmarks all now use
  needs.resolve-images.outputs.{base,orc,lance}_image instead of :latest.

This fixes CI builds on PR branches where third_party/lance-ffi has changed:
the branch image contains the freshly compiled liblance_ffi.a while :latest
still carries the master version with potentially missing FFI symbols.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The resolve-images job was incorrectly converting underscores to dashes,
producing 'tsafin-lance-stream-' instead of the actual published tag
'tsafin-lance_stream'. docker/metadata-action type=ref,event=branch only
replaces '/' with '-' and lowercases; underscores are preserved.
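
The corrected sanitization rule is small enough to state in code. A sketch matching the behavior described above (replace `/` with `-`, lowercase, preserve underscores); the function name is illustrative:

```rust
// Branch-name → image-tag sanitization matching what
// docker/metadata-action type=ref,event=branch actually does:
// '/' becomes '-', everything is lowercased, '_' is kept.
// The earlier CI bug also mapped '_' to '-', producing a tag
// that was never published.
fn branch_tag(branch: &str) -> String {
    branch.replace('/', "-").to_lowercase()
}
```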

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace utf8() Arrow arrays with dictionary(int8(), utf8()) for 10 columns
with known-bounded cardinality from TPC-H spec/dists.dss. Lance routes
these to DictionaryDataBlock where compute_stat() is a literal no-op {}
— zero HLL/XXH3 overhead vs Phase 3.2's cardinality-hint approach.

Columns converted (10 total):
  lineitem: returnflag(3), linestatus(2), shipinstruct(4), shipmode(7)
  orders:   orderstatus(3), orderpriority(5)
  customer: mktsegment(5)
  part:     mfgr(5), brand(25), container(40)

Encode functions use O(1) switch on first character(s) — no hashing,
no strcmp needed. Both conversion paths updated: zero-copy batch path
(zero_copy_converter.cpp) and row-by-row path (dbgen_converter.cpp).
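
The "switch on first character" encoders live in the C++ converters; the same idea in a Rust sketch (values are from the TPC-H spec, the function shape is hypothetical):

```rust
// Illustrative first-byte dictionary encoder for l_returnflag, whose
// only legal values are "A" | "N" | "R". One byte uniquely identifies
// the dictionary index — no hashing, no string comparison.
fn encode_returnflag(value: &str) -> i8 {
    match value.as_bytes().first() {
        Some(b'A') => 0,
        Some(b'N') => 1,
        Some(b'R') => 2,
        _ => -1, // unexpected value; real code would assert instead
    }
}
```

Columns whose values collide on the first byte (e.g. p_container's 40 values) would need to inspect a second character, but the lookup stays O(1).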

lance-ffi/lib.rs: apply_compression_metadata() now skips dict fields
(Lance handles DictionaryDataBlock encoding internally).

Results SF=5 lineitem (30M rows, clean runs):
  Phase 3.2 baseline: ~410-440K r/s
  Phase 3.3:          645-648K r/s (+57% over 3.2, +197% vs 217K original)
  Stability:          <0.5% variance (much better than 3.2)
  File size:          4.0 GB vs 4.24 GB (-5.7%)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Eliminate remaining VariableWidthBlock::compute_stat (HLL/XXH3) overhead
by adding scale-factor-derived cardinality hints to all high-cardinality
utf8 columns.  Perf profile showed 17.7% of samples in compute_stat for
l_comment and similar fields that had no hint — now all utf8 fields skip
HLL computation.

get_schema() now accepts scale_factor (default 1.0) and sets hints:
- Comment fields (l/o/c/p/ps/s_comment): hint = table_row_count * SF
- High-cardinality name/address/phone fields: hint = table_row_count * SF
- o_clerk: hint = 1000 * SF (Clerk#XXXXXXXXX, spec-defined cardinality)
- nation/region comments: small fixed counts (25, 5)
All dict8 columns already had zero-overhead compute_stat (no-op).
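
The hint rules above can be sketched as one small helper. This is a hypothetical reconstruction of the dbgen_wrapper logic, not the actual C++ code; `base_rows` is assumed to be the table's SF=1 row count:

```rust
// Illustrative mapping from column name to cardinality hint, following
// the rules in this commit: per-row fields scale with table size,
// o_clerk is spec-bounded at 1000 * SF, nation/region are tiny fixed
// tables. Names and the fallback arm are assumptions.
fn hint_for(column: &str, base_rows: u64, sf: f64) -> u64 {
    match column {
        "o_clerk" => (1000.0 * sf) as u64,
        "n_comment" => 25,
        "r_comment" => 5,
        // comment/name/address/phone fields: ~one distinct value per row
        _ => (base_rows as f64 * sf) as u64,
    }
}
```

For lineitem (~6M rows at SF=1) this yields a 30M hint at SF=5, matching the 30M-row benchmark runs quoted elsewhere in this PR.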

Also: Tokio runtime tuned for streaming path (worker_threads(1) vs
default multi-thread N-worker pool) to reduce idle thread overhead.

Benchmark SF=5 lineitem (non-zero-copy), window peaks up to 922K r/s
vs ~650K before this phase.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

tsafin commented Feb 23, 2026

@copilot review


Copilot AI commented Feb 23, 2026

@tsafin I've opened a new pull request, #8, to work on those changes. Once the pull request is ready, I'll request review from you.


tsafin commented Feb 23, 2026

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1afbdb8a46


tsafin and others added 4 commits February 24, 2026 03:05
…ule)

Update third_party/lance submodule to include Phase 3.5 changes:
- previous::PrimitiveFieldEncoder gets known_cardinality cached at encoder
  creation; set_cardinality_hint() called in each tokio encode task so
  VariableWidthBlock::cardinality() skips HyperLogLogPlus entirely.
- PrimitiveStructuralEncoder same pattern for V2.1+ consistency.
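
The thread-local pattern described above can be sketched as follows. This mirrors the names in the PR description (`CARDINALITY_HINT`, `set_cardinality_hint()`) but is an illustrative reconstruction, not the bundled Lance source:

```rust
use std::cell::Cell;

// The encoder caches known_cardinality at creation and publishes it via
// a thread-local at the start of each encode task; the stats path then
// reads the hint and skips HyperLogLog entirely when it is set.
thread_local! {
    static CARDINALITY_HINT: Cell<Option<u64>> = Cell::new(None);
}

fn set_cardinality_hint(hint: Option<u64>) {
    CARDINALITY_HINT.with(|c| c.set(hint));
}

// Stand-in for VariableWidthBlock::cardinality(): use the hint if
// present, otherwise fall back to the expensive computation.
fn cardinality_or_compute(compute: impl FnOnce() -> u64) -> u64 {
    CARDINALITY_HINT.with(|c| c.get()).unwrap_or_else(compute)
}
```

Because the hint is thread-local, it must be re-set inside each tokio encode task (tasks may run on different worker threads), which is exactly what the commit describes.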

Perf result: compute_stat 17.7% → ~0.05%, window peaks 300K → 1.7M+ r/s
on SF=5 lineitem zero-copy path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Arrow dict8<int8, utf8> arrays (used for low-cardinality TPC-H columns
since Phase 3.3) were not handled in non-Lance writers:

- csv_writer: fell back to array->ToString() producing corrupt column
  representation per cell; fix does index→value lookup via GetString()
- orc_writer: arrow_type_to_orc_type_name() threw "Unsupported type" on
  DICTIONARY; copy_array_to_orc_column() also threw; fix maps DICTIONARY
  to ORC "string" and expands indices via GetView() into dict buffer
- paimon_writer: arrow_type_to_paimon_type() threw on DICTIONARY; fix
  maps to "string" (data path via parquet::arrow::WriteTable is fine)
- iceberg_writer: arrow_type_to_iceberg_type() threw on DICTIONARY; fix
  maps to "string" (data path via parquet::arrow::WriteTable is fine)

Parquet and Lance handle DICTIONARY natively through Arrow — no changes
needed. ORC fix tested via code review only (requires -DTPCH_ENABLE_ORC).
CSV fix verified: l_returnflag/linestatus/shipinstruct/shipmode output
correct string values ("N", "O", "DELIVER IN PERSON", "TRUCK").

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the Rust streaming task exits early (disk error, write failure),
push() would block indefinitely: the queue stays full, nothing notifies
not_full_cv_, and the C++ producer hangs forever.

Fix:
- Add lance_writer_stream_is_alive() Rust FFI: polls JoinHandle::is_finished()
  to detect whether the background task is still consuming the stream.
- Add StreamState::set_health_check(fn): stores a callback invoked when
  a push() timeout fires without a real queue-drain notification.
- Rewrite the stall wait from bare wait() to a wait_for(500ms) loop:
  on each timeout, call the health check; if the task is gone, call
  set_error() which unblocks both not_full_cv_ and not_empty_cv_.
- Wire up in initialize_lance_dataset(): register a lambda that calls
  lance_writer_stream_is_alive(rust_writer_) after stream start.

The 500 ms poll interval means producers unblock within half a second of
a Rust task crash, rather than hanging until process kill. Normal
operation (no task failure) adds zero overhead: the health check is only
reached on a wait_for timeout, which only fires when the queue stays full
for 500 ms without a pop — i.e., already a stall event.
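
The actual fix is in C++ (`lance_writer.cpp`), but the wait-with-timeout-plus-health-check pattern translates directly to Rust's `Condvar::wait_timeout`; the sketch below uses hypothetical types, with `consumer_alive` standing in for `lance_writer_stream_is_alive()`:

```rust
use std::sync::{Condvar, Mutex};
use std::time::Duration;

struct QueueState {
    len: usize,
    cap: usize,
    error: bool,
}

struct Queue {
    state: Mutex<QueueState>,
    not_full: Condvar,
}

impl Queue {
    // Instead of a bare wait on "queue not full", wait with a 500 ms
    // timeout; on each timeout, poll the consumer. If it has died, set
    // the error flag (which a real implementation would pair with
    // notify_all on both condvars) so producers fail fast instead of
    // hanging forever.
    fn push(&self, consumer_alive: impl Fn() -> bool) -> Result<(), &'static str> {
        let mut st = self.state.lock().unwrap();
        while st.len == st.cap && !st.error {
            let (guard, timeout) = self
                .not_full
                .wait_timeout(st, Duration::from_millis(500))
                .unwrap();
            st = guard;
            if timeout.timed_out() && !consumer_alive() {
                st.error = true; // dead consumer: abort the wait
            }
        }
        if st.error {
            return Err("stream consumer is gone");
        }
        st.len += 1;
        Ok(())
    }
}
```

As in the C++ fix, the health check is only reached when a 500 ms timeout fires on a full queue, so the happy path pays nothing.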

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two bugs identified by copilot-swe-agent in PR #8:

- lib.rs: fix memory leak on ArrowArrayStreamReader::from_raw() failure.
  The original code called libc::free() after the ? early-return, so the
  malloc'd ArrowArrayStream struct leaked if from_raw() failed. Fix:
  free unconditionally before propagating the error.

- lance_writer.cpp: remove dead no-op else branch in StreamState::push().
  The else path (queue not full) called not_full_cv_.wait() with a
  predicate already satisfied, making it a guaranteed immediate return.
  Removing it clarifies the backpressure logic.
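
The shape of the leak fix is worth spelling out. In this sketch the types are hypothetical and `Box` stands in for the real `malloc`/`libc::free` pair; the point is binding the fallible result first, freeing unconditionally, and only then applying `?`:

```rust
struct Reader;

// Stand-in for ArrowArrayStreamReader::from_raw(): fallible, and does
// not free the struct it was handed.
unsafe fn reader_from_raw(raw: *mut u8) -> Result<Reader, String> {
    if raw.is_null() {
        Err("null stream".into())
    } else {
        Ok(Reader)
    }
}

fn open_stream(raw: *mut u8) -> Result<Reader, String> {
    // Bind the result instead of writing `reader_from_raw(raw)?` — the
    // `?` would return early and skip the free below, leaking `raw`.
    let result = unsafe { reader_from_raw(raw) };
    if !raw.is_null() {
        // Free unconditionally, success or failure.
        drop(unsafe { Box::from_raw(raw) });
    }
    let reader = result?; // now safe to propagate the error
    Ok(reader)
}
```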

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@tsafin tsafin merged commit b16981d into master Feb 24, 2026
3 of 4 checks passed