Lance streaming optimization: pre-computed stats or cardinality sampling#7
Merged
Benchmarked all formats (Parquet, ORC, CSV, Lance, Paimon, Iceberg) at SF=5 with zero-copy optimization.

Key findings:
- ORC wins: 905K rows/sec average (17% faster than Parquet)
- Parquet: 771K rows/sec, excellent stability
- Lance/Paimon/Iceberg: high variance (60-177%), not production-ready
- CSV: 235K rows/sec, I/O bound

Added:
- Comprehensive benchmark table with maximum values in README
- Performance and stability visualization charts
- Benchmark scripts for automated testing and visualization
- Detailed analysis reports in benchmark-results/

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Includes two Lance commits:
- 381feaff: light stats/compression tuning and fixed-width single-array zero-copy path
- caac148a: binary/utf8 single-array zero-copy path when offsets are normalized
Pulls Lance commit ddf0ee5d to add safety-aware array retention in the encoder accumulation queue without env overrides.
Phase 3.1 - Cardinality sampling via CLI:
- Add --lance-cardinality-sample-rate <0.01-1.0> option to control what fraction of values are fed into HyperLogLogPlus cardinality estimation
- Sets LANCE_CARDINALITY_SAMPLE_RATE env var before Lance writer init
- Reduces HLL CPU overhead at the cost of minor cardinality accuracy

Phase 3.2 - Pre-computed TPC-H cardinality hints:
- Add tpch_field() helper in dbgen_wrapper.cpp that attaches "lance.cardinality" Arrow field metadata for known-cardinality columns
- Lance reads this metadata and skips HyperLogLogPlus entirely for those columns, returning the exact pre-computed cardinality instead
- Lineitem: l_returnflag(3), l_linestatus(2), l_shipinstruct(4), l_shipmode(7), date fields(2556); orders/customer/part similarly annotated
- Measured result: SF=5 lineitem 323-426K rows/sec vs 217K baseline (+49-96%)

Update bundled Lance: pre-computed cardinality hints and LCG sampling fix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Points to https://github.com/tsafin/lance.git (tsafin/lance_stream branch), which contains our custom optimizations:
- LCG cardinality sampling (Phase 3.1)
- Pre-computed cardinality hints via Arrow field metadata (Phase 3.2)
- Light-stats mode, zero-copy fast paths, auto-retention streaming

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
docker-images.yml:
- Add type=ref,event=branch so every docker build also publishes a stable
branch tag (e.g. tsafin-lance_stream) alongside the SHA-pinned tag.
:latest is still only pushed on the default branch (master).
ci.yml:
- Add resolve-images job that sanitizes the current branch name, checks
whether a branch-specific image exists in GHCR (docker manifest inspect),
and outputs the correct image tag per config (base/orc/lance).
Falls back to :latest when no branch-specific image is available.
- build-matrix, benchmark-suite, optimization-benchmarks all now use
needs.resolve-images.outputs.{base,orc,lance}_image instead of :latest.
This fixes CI builds on PR branches where third_party/lance-ffi has changed:
the branch image contains the freshly compiled liblance_ffi.a while :latest
still carries the master version with potentially missing FFI symbols.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The resolve-images job was incorrectly converting underscores to dashes, producing 'tsafin-lance-stream-' instead of the actual published tag 'tsafin-lance_stream'. docker/metadata-action type=ref,event=branch only replaces '/' with '-' and lowercases; underscores are preserved. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace utf8() Arrow arrays with dictionary(int8(), utf8()) for 10 columns
with known-bounded cardinality from TPC-H spec/dists.dss. Lance routes
these to DictionaryDataBlock where compute_stat() is a literal no-op {}
— zero HLL/XXH3 overhead vs Phase 3.2's cardinality-hint approach.
Columns converted (10 total):
lineitem: returnflag(3), linestatus(2), shipinstruct(4), shipmode(7)
orders: orderstatus(3), orderpriority(5)
customer: mktsegment(5)
part: mfgr(5), brand(25), container(40)
Encode functions use O(1) switch on first character(s) — no hashing,
no strcmp needed. Both conversion paths updated: zero-copy batch path
(zero_copy_converter.cpp) and row-by-row path (dbgen_converter.cpp).
lance-ffi/lib.rs: apply_compression_metadata() now skips dict fields
(Lance handles DictionaryDataBlock encoding internally).
Results SF=5 lineitem (30M rows, clean runs):
Phase 3.2 baseline: ~410-440K r/s
Phase 3.3: 645-648K r/s (+57% over 3.2, +197% vs 217K original)
Stability: <0.5% variance (much better than 3.2)
File size: 4.0 GB vs 4.24 GB (-5.7%)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Eliminate remaining VariableWidthBlock::compute_stat (HLL/XXH3) overhead by adding scale-factor-derived cardinality hints to all high-cardinality utf8 columns. Perf profile showed 17.7% of samples in compute_stat for l_comment and similar fields that had no hint; now all utf8 fields skip HLL computation.

get_schema() now accepts scale_factor (default 1.0) and sets hints:
- Comment fields (l/o/c/p/ps/s_comment): hint = table_row_count * SF
- High-cardinality name/address/phone fields: hint = table_row_count * SF
- o_clerk: hint = 1000 * SF (Clerk#XXXXXXXXX, spec-defined cardinality)
- nation/region comments: small fixed counts (25, 5)

All dict8 columns already had zero-overhead compute_stat (no-op).

Also: Tokio runtime tuned for the streaming path (worker_threads(1) vs the default multi-threaded N-worker pool) to reduce idle thread overhead.

Benchmark SF=5 lineitem (non-zero-copy): window peaks up to 922K r/s vs ~650K before this phase.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@copilot review
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1afbdb8a46
Update third_party/lance submodule to include Phase 3.5 changes:
- previous::PrimitiveFieldEncoder gets known_cardinality cached at encoder creation; set_cardinality_hint() is called in each tokio encode task so VariableWidthBlock::cardinality() skips HyperLogLogPlus entirely.
- PrimitiveStructuralEncoder: same pattern, for V2.1+ consistency.

Perf result: compute_stat 17.7% → ~0.05%, window peaks 300K → 1.7M+ r/s on SF=5 lineitem zero-copy path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Arrow dict8<int8, utf8> arrays (used for low-cardinality TPC-H columns
since Phase 3.3) were not handled in non-Lance writers:
- csv_writer: fell back to array->ToString() producing corrupt column
representation per cell; fix does index→value lookup via GetString()
- orc_writer: arrow_type_to_orc_type_name() threw "Unsupported type" on
DICTIONARY; copy_array_to_orc_column() also threw; fix maps DICTIONARY
to ORC "string" and expands indices via GetView() into dict buffer
- paimon_writer: arrow_type_to_paimon_type() threw on DICTIONARY; fix
maps to "string" (data path via parquet::arrow::WriteTable is fine)
- iceberg_writer: arrow_type_to_iceberg_type() threw on DICTIONARY; fix
maps to "string" (data path via parquet::arrow::WriteTable is fine)
Parquet and Lance handle DICTIONARY natively through Arrow — no changes
needed. ORC fix tested via code review only (requires -DTPCH_ENABLE_ORC).
CSV fix verified: l_returnflag/linestatus/shipinstruct/shipmode output
correct string values ("N", "O", "DELIVER IN PERSON", "TRUCK").
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the Rust streaming task exits early (disk error, write failure), push() would block indefinitely: the queue stays full, nothing notifies not_full_cv_, and the C++ producer hangs forever.

Fix:
- Add lance_writer_stream_is_alive() Rust FFI: polls JoinHandle::is_finished() to detect whether the background task is still consuming the stream.
- Add StreamState::set_health_check(fn): stores a callback invoked when a push() timeout fires without a real queue-drain notification.
- Rewrite the stall wait from a bare wait() to a wait_for(500ms) loop: on each timeout, call the health check; if the task is gone, call set_error(), which unblocks both not_full_cv_ and not_empty_cv_.
- Wire up in initialize_lance_dataset(): register a lambda that calls lance_writer_stream_is_alive(rust_writer_) after stream start.

The 500 ms poll interval means producers unblock within half a second of a Rust task crash rather than hanging until process kill. Normal operation (no task failure) adds zero overhead: the health check is only reached on a wait_for timeout, which only fires when the queue stays full for 500 ms without a pop, i.e. already a stall event.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two bugs identified by copilot-swe-agent in PR #8:
- lib.rs: fix memory leak on ArrowArrayStreamReader::from_raw() failure. The original code called libc::free() after the ? early-return, so the malloc'd ArrowArrayStream struct leaked if from_raw() failed. Fix: free unconditionally before propagating the error.
- lance_writer.cpp: remove dead no-op else branch in StreamState::push(). The else path (queue not full) called not_full_cv_.wait() with a predicate already satisfied, making it a guaranteed immediate return. Removing it clarifies the backpressure logic.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
- --lance-cardinality-sample-rate CLI option to sample a fraction of values for HyperLogLogPlus cardinality estimation, reducing CPU overhead while preserving stats quality
- "lance.cardinality" Arrow field metadata hints, bypassing HyperLogLogPlus entirely for bounded-cardinality columns

Key Changes

Lance encoding (bundled third_party/lance):
- statistics.rs: CARDINALITY_HINT thread-local + set_cardinality_hint() public API; VariableWidthBlock::cardinality() returns the pre-computed value when the hint is set, skipping all hash operations
- statistics.rs: LCG sampling with u64 threshold (fixes f32 precision bug at rate=0.2)
- encodings/logical/primitive.rs: reads field.metadata["lance.cardinality"] before DataBlock::from_arrays() and sets the hint

TPC-H schemas (src/dbgen/dbgen_wrapper.cpp):
- tpch_field() helper attaches "lance.cardinality" metadata to known-cardinality string columns: l_returnflag(3), l_linestatus(2), l_shipinstruct(4), l_shipmode(7), date fields(2556), o_orderstatus(3), o_orderpriority(5), c_mktsegment(5), p_mfgr(5), p_brand(25), p_type(150), p_container(40), etc.

Lance FFI (third_party/lance-ffi/src/lib.rs, src/writers/lance_writer.cpp):
- RecordBatchReader channel