Phase 3.6: io_uring write path for Lance (1.5–1.7× speedup on large writes) by tsafin · Pull Request #9 · tsafin/tpch-cpp

tsafin · 2026-02-25T23:27:32Z

Summary

Adds --lance-io-uring CLI flag that replaces the default tokio::fs write path with a custom io_uring-backed ObjectStore wrapper
1.5–1.7× wall-time speedup on SF=10 lineitem (8.47 GB): 183.9 s → 107–122 s
Eliminates the worst-case single-stall events caused by Linux dirty-page writeback throttling; total stall time drops from 144.6 s to 82–97 s, while the stall ratio stays roughly constant (78.6% → 76.5–79.5%)

Design

IoUringWrapper / IoUringStore — implements WrappingObjectStore, injected via ObjectStoreParams::object_store_wrapper at Lance dataset open time. Writes go through io_uring; reads/metadata delegate to the inner LocalFileSystem.

Persistent ring per file — one IoUring ring lives on a dedicated std::thread for the entire multipart upload, eliminating ~800 io_uring_setup() syscalls for a 4 GB file.

Async channel — tokio::sync::mpsc with an awaited send replaces std::sync::mpsc, whose blocking send stalled the single Tokio worker thread whenever the channel was full (this caused the v1 benchmark to hang).

AtomicU64 offset pre-claim — each concurrent put_part() pre-claims its file range with fetch_add before async dispatch, so parts can be written out-of-order without locking.
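
The pre-claim idea can be sketched with the standard library alone. This is a minimal illustration, not the code from io_uring_store.rs: `OffsetAllocator` and `claim_concurrently` are hypothetical names, plain threads stand in for concurrent put_part() calls, and no file I/O is performed.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical helper: hands out disjoint byte ranges of one output file.
pub struct OffsetAllocator {
    next: AtomicU64,
}

impl OffsetAllocator {
    pub fn new() -> Self {
        Self { next: AtomicU64::new(0) }
    }

    /// Atomically reserve `len` bytes and return the start offset of the range.
    /// `fetch_add` returns the previous value, so no two callers see the same start.
    pub fn claim(&self, len: u64) -> u64 {
        self.next.fetch_add(len, Ordering::Relaxed)
    }
}

/// Claim one range per part from separate threads (stand-ins for put_part calls).
/// Returns `(offset, len)` pairs sorted by offset.
pub fn claim_concurrently(part_lens: &[u64]) -> Vec<(u64, u64)> {
    let alloc = Arc::new(OffsetAllocator::new());
    let handles: Vec<_> = part_lens
        .iter()
        .map(|&len| {
            let alloc = Arc::clone(&alloc);
            thread::spawn(move || (alloc.claim(len), len))
        })
        .collect();
    let mut ranges: Vec<(u64, u64)> =
        handles.into_iter().map(|h| h.join().unwrap()).collect();
    ranges.sort();
    ranges
}
```

Whatever order the threads run in, the claimed ranges tile the file without gaps or overlap, which is what lets each part be written at its own offset independently of the others.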

IORING_SETUP_ATTACH_WQ — all rings after the first attach to a leaked anchor ring, sharing one kernel async-worker thread pool. Reduces Hyper-V IOThread scheduler pressure on WSL2.

Sysfs queue-depth calibration — QD is derived from /sys/block/<dev>/queue/nr_requests of the actual output device (nr_requests / 2, clamped to [8, 128]). Replaces an earlier /dev/urandom probe which bypasses the block layer entirely (character device, measures CSPRNG not storage). Falls back to QD=64 if sysfs is unavailable.
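
The calibration rule above (nr_requests / 2, clamped to [8, 128], fallback 64) is small enough to sketch directly. The function and constant names below are illustrative assumptions, not the identifiers used in the PR, and the caller is assumed to already know the device's sysfs queue directory.

```rust
use std::fs;
use std::path::Path;

const QD_MIN: u32 = 8;
const QD_MAX: u32 = 128;
const QD_FALLBACK: u32 = 64;

/// Map an optional `nr_requests` value to an io_uring queue depth:
/// half the block-layer queue size, clamped to [QD_MIN, QD_MAX],
/// or QD_FALLBACK when the value could not be read.
pub fn queue_depth_from_nr_requests(nr_requests: Option<u32>) -> u32 {
    match nr_requests {
        Some(n) => (n / 2).clamp(QD_MIN, QD_MAX),
        None => QD_FALLBACK,
    }
}

/// Read `<queue_dir>/nr_requests`; returns None if the file is missing
/// or does not contain an unsigned integer.
pub fn read_nr_requests(queue_dir: &Path) -> Option<u32> {
    let text = fs::read_to_string(queue_dir.join("nr_requests")).ok()?;
    text.trim().parse().ok()
}
```

With the benchmark device's nr_requests=633, the rule gives 633 / 2 = 316, which the clamp reduces to 128.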

Intentional omissions:

No O_DIRECT — on WSL2/VirtIO each O_DIRECT write synchronously awaits a disk ACK, removing page-cache pipelining benefit
No IORING_SETUP_SQPOLL — creates a kernel busy-poll thread causing continuous Windows scheduler context-switches on WSL2

Graceful fallback — if IoUring::new() fails (old kernel, io_uring disabled), put_multipart_opts transparently falls back to the inner LocalFileSystem.

Benchmark results

SF=10 lineitem, 8.47 GB, WSL2 VirtIO-blk (sdd, nr_requests=633):

| Run | Wall time | Avg BW | Stall time | Stall ratio | Speedup |
| --- | --- | --- | --- | --- | --- |
| Baseline (`tokio spawn_blocking`) | 183.9 s | 43.9 MB/s | 144.6 s | 78.6% | 1.00× |
| io_uring v3 (QD=256 hardcoded) | 107.4 s | 75.3 MB/s | 82.2 s | 76.5% | 1.71× |
| io_uring v5 (QD=128 via sysfs) | 121.7 s | 66.4 MB/s | 96.8 s | 79.5% | 1.51× |

v3 vs v5 difference is within WSL2 run-to-run variance (stall ratio fluctuates ±3%).
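
The derived columns can be re-checked from the raw timings. The sketch below assumes speedup is baseline wall time divided by run wall time and stall ratio is stall time divided by wall time; the helper names are made up for illustration.

```rust
/// Benchmark rows taken from the results above: (label, wall seconds, stalled seconds).
pub const RUNS: [(&str, f64, f64); 3] = [
    ("baseline", 183.9, 144.6),
    ("io_uring v3", 107.4, 82.2),
    ("io_uring v5", 121.7, 96.8),
];

/// Speedup of a run relative to the baseline wall time.
pub fn speedup(baseline_wall: f64, wall: f64) -> f64 {
    baseline_wall / wall
}

/// Share of wall time spent stalled, in percent.
pub fn stall_ratio_pct(stall: f64, wall: f64) -> f64 {
    100.0 * stall / wall
}

/// Round `x` to the given number of decimal places.
pub fn round_to(x: f64, decimals: i32) -> f64 {
    let scale = 10f64.powi(decimals);
    (x * scale).round() / scale
}
```

Under those definitions the reported 1.71× and 1.51× speedups and the 78.6%, 76.5% and 79.5% stall ratios all follow from the wall and stall times.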

Files changed

| File | Change |
| --- | --- |
| `third_party/lance-ffi/src/io_uring_store.rs` | New: `IoUringWrapper`, `IoUringStore`, `IoUringMultipartUpload`, sysfs calibration, ATTACH_WQ anchor |
| `third_party/lance-ffi/src/lib.rs` | `use_io_uring` field in `WriteParamsConfig`, `lance_writer_enable_io_uring()` FFI export |
| `third_party/lance-ffi/Cargo.toml` | `io-uring = "0.7"` as default feature |
| `include/tpch/lance_ffi.h` | `lance_writer_enable_io_uring()` declaration |
| `include/tpch/lance_writer.hpp` | `enable_io_uring()` method + `use_io_uring_` member |
| `src/writers/lance_writer.cpp` | Wire FFI call in `initialize_lance_dataset()` |
| `src/main.cpp` | `--lance-io-uring` flag + `OPT_LANCE_IO_URING = 1007` |
| `CMakeLists.txt` | `io_uring_store.rs` added to cargo DEPENDS |

Test plan

Build with TPCH_ENABLE_LANCE=ON — cmake --build --target lance_ffi --clean-first (forces cargo rebuild)
Smoke test: ./tpch_benchmark --use-dbgen --format lance --table customer --scale-factor 1 --lance-io-uring — verify calibration log line and correct row count
Verify file size matches non-io_uring run: du -sh /tmp/customer.lance (must not be 2× — always rm -rf /tmp/*.lance before each run, Lance appends fragments)
Full benchmark: --scale-factor 10 --table lineitem --max-rows 0 --zero-copy --lance-io-uring — expect ≥1.4× vs baseline
Without flag: baseline path unchanged, no regression

…ration

Add --lance-io-uring CLI flag that replaces the default tokio::fs write path with a custom io_uring-backed ObjectStore wrapper, giving 1.5–1.7× speedup on SF=10 lineitem (183.9 s → 107–122 s) by keeping the VirtIO-blk virtqueue saturated during kernel dirty-page writeback throttle periods.

Design:
- IoUringWrapper implements WrappingObjectStore, injected via ObjectStoreParams::object_store_wrapper at dataset open time
- One persistent IoUring ring per file on a dedicated worker std::thread (eliminates ~800 io_uring_setup() syscalls for a 4 GB multipart upload)
- tokio::sync::mpsc channel (async send) replaces std::sync::mpsc (blocked the single Tokio worker thread when channel was full)
- AtomicU64 offset pre-claim gives each concurrent put_part() a distinct file range with no locking
- IORING_SETUP_ATTACH_WQ: all rings after the first share one kernel async-worker pool via a leaked anchor ring (reduces Hyper-V IOThread pressure)
- Queue depth calibrated from /sys/block/<dev>/queue/nr_requests of the actual output device (nr_requests/2, clamped [8,128]); falls back to QD=64 if sysfs unavailable. Replaces earlier fake /dev/urandom probe (character device, bypasses block layer entirely, measures CSPRNG not storage)
- No O_DIRECT: on WSL2/VirtIO each O_DIRECT write synchronously awaits disk ACK, removing the page-cache pipelining benefit
- No SQPOLL: creates a kernel busy-poll thread causing Windows scheduler context-switches on WSL2
- Graceful fallback to inner LocalFileSystem if IoUring::new() fails

Benchmark results (SF=10 lineitem, 8.47 GB, WSL2 VirtIO-blk QD=633):
- Baseline (tokio spawn_blocking): 183.9 s, 43.9 MB/s, 144.6 s stalled (78.6%)
- io_uring v3 (QD=256): 107.4 s, 75.3 MB/s, 82.2 s stalled (76.5%)
- io_uring v5 (QD=128 via sysfs): 121.7 s, 66.4 MB/s, 96.8 s stalled (79.5%)

Files:
- third_party/lance-ffi/src/io_uring_store.rs (new)
- third_party/lance-ffi/src/lib.rs (use_io_uring flag + FFI export)
- third_party/lance-ffi/Cargo.toml (io-uring = "0.7" default feature)
- include/tpch/lance_ffi.h (lance_writer_enable_io_uring())
- include/tpch/lance_writer.hpp (enable_io_uring() method)
- src/writers/lance_writer.cpp (wire FFI call)
- src/main.cpp (--lance-io-uring flag)
- CMakeLists.txt (io_uring_store.rs in DEPENDS)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

tsafin · 2026-02-25T23:29:50Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4829dc8f14

third_party/lance-ffi/src/io_uring_store.rs

Copilot

Pull request overview

This PR implements an io_uring-based write path for Lance dataset writes, providing a 1.5-1.7× speedup for large writes on Linux systems. The implementation adds a custom ObjectStore wrapper that intercepts write operations and uses io_uring for asynchronous I/O instead of the default tokio spawn_blocking approach.

Changes:

New io_uring-backed ObjectStore implementation with persistent rings per file, async channels, and sysfs-based queue depth calibration
FFI integration with runtime activation via --lance-io-uring CLI flag
Graceful fallback to standard LocalFileSystem when io_uring is unavailable

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| third_party/lance-ffi/src/io_uring_store.rs | Core io_uring implementation: IoUringStore, IoUringMultipartUpload, sysfs calibration, and shared kernel worker pool |
| third_party/lance-ffi/src/lib.rs | FFI integration: adds use_io_uring field and lance_writer_enable_io_uring export |
| third_party/lance-ffi/Cargo.toml | Adds io-uring 0.7 dependency with default feature flag |
| include/tpch/lance_ffi.h | FFI declaration for lance_writer_enable_io_uring function |
| include/tpch/lance_writer.hpp | C++ API: adds enable_io_uring method and use_io_uring_ member |
| src/writers/lance_writer.cpp | Wires FFI call in initialize_lance_dataset |
| src/main.cpp | Adds --lance-io-uring CLI flag and option handling |
| CMakeLists.txt | Adds io_uring_store.rs to cargo DEPENDS for rebuild detection |

src/writers/lance_writer.cpp

third_party/lance-ffi/src/io_uring_store.rs

third_party/lance-ffi/src/lib.rs

third_party/lance-ffi/src/io_uring_store.rs

Copilot · 2026-02-25T23:37:48Z

third_party/lance-ffi/src/io_uring_store.rs

+    let anchor_fd = ANCHOR_FD.get_or_init(|| {
+        match IoUring::new(qd) {
+            Ok(ring) => {
+                let fd = ring.as_raw_fd();
+                // Leak the ring: Drop would close the fd, killing the pool.
+                std::mem::forget(ring);
+                eprintln!("Lance FFI: io_uring anchor ring created (fd={}, QD={})", fd, qd);
+                fd
+            }
+            Err(_) => -1, // sentinel: ring creation failed
+        }
+    });
+
+    if *anchor_fd >= 0 {
+        // Attach new ring to the shared kernel worker pool.
+        IoUring::builder()
+            .setup_attach_wq(*anchor_fd)
+            .build(qd)


The anchor ring is created with the QD from the first file write (line 129), but subsequent rings created with different QD values (line 145 build(qd)) may have mismatched queue depths. While IORING_SETUP_ATTACH_WQ shares the kernel worker pool, each ring should still have its own appropriately-sized submission/completion queues. If different files target devices with very different nr_requests values, using the first device's QD for the anchor could be suboptimal. Consider either: (1) using a fixed conservative QD for the anchor (e.g., 32), or (2) storing the anchor QD and documenting that all rings share this constraint.

Fair point. In our benchmark setup there is only one block device, so the first file's calibrated QD is always representative. For a multi-device scenario (e.g. writing to two different SSDs simultaneously) the anchor's QD would be whichever device happened to write first, which could be suboptimal. Using a fixed conservative QD (e.g. 32) for the anchor would be safer. Added a comment in 5cd06d1 documenting this limitation; a follow-up could use a fixed anchor QD independent of the per-file calibration.

third_party/lance-ffi/src/io_uring_store.rs

…_URING CI uses a pre-compiled liblance_ffi.a that lacks the new symbol. Gate the declaration, call sites, and method behind #ifdef TPCH_LANCE_IO_URING, which CMake only defines when building lance-ffi from source. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- to_fs_path: don't prepend '/' if path already starts with '/' to avoid double-slash paths like '//tmp/lineitem.lance'
- try_sysfs_queue_depth: use /sys/dev/block/MAJOR:MINOR symlink instead of scanning /sys/block/ by major number only; correctly handles partitions (e.g. sda1 vs sdb1 sharing major=8) by resolving the symlink and walking up to the parent disk device that owns queue/nr_requests
- spawn_writer_thread: move raw_fd extraction inside the closure, after 'file' is established, making the lifetime relationship explicit
- put_part: SeqCst → Relaxed for offset fetch_add; each part writes to its own non-overlapping range so no cross-thread happens-before is needed
- lance_writer.cpp: check return value of lance_writer_enable_io_uring and throw on failure, consistent with lance_writer_set_write_params handling
- lib.rs: gate io_uring module on target_os = "linux" in addition to the feature flag; the module uses Unix-only APIs (AsRawFd, MetadataExt, sysfs)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

tsafin · 2026-03-02T21:33:39Z

Addressed the Copilot review comments in the follow-up commit (5cd06d1). Summary:

Fixed:

to_fs_path double-slash (P1): now checks if the path string already starts with / before prepending, avoiding paths like //tmp/lineitem.lance.
Major-only device matching: replaced the /sys/block/*/dev scan (which matched any disk sharing the same major number) with a /sys/dev/block/MAJOR:MINOR symlink lookup. This correctly identifies the device even for partition paths (e.g. sda1 resolves to sda's queue settings by walking up the sysfs tree).
raw_fd extraction: moved inside the worker thread closure, after file is moved in, so the lifetime relationship is unambiguous.
SeqCst → Relaxed: each multipart part pre-claims its own non-overlapping file range, so no cross-thread happens-before ordering is needed for the offset counter.
Return value of lance_writer_enable_io_uring: now checked; throws std::runtime_error on failure, consistent with lance_writer_set_write_params.
#[cfg(target_os = "linux")]: added alongside the feature gate on the io_uring_store module and its use sites in lib.rs, since the module uses Unix-only APIs (AsRawFd, MetadataExt, sysfs).
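
The MAJOR:MINOR lookup described in the list above could look roughly like the following. This is a sketch under assumptions, not the PR's try_sysfs_queue_depth: all function names are hypothetical, the `dev` value is assumed to come from `std::os::unix::fs::MetadataExt::dev()` on the output path, and the bit layout follows glibc's `gnu_dev_major` / `gnu_dev_minor`.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Split a Linux `dev_t` into (major, minor) using the glibc bit layout.
pub fn major_minor(dev: u64) -> (u64, u64) {
    let major = ((dev >> 32) & 0xffff_f000) | ((dev >> 8) & 0x0000_0fff);
    let minor = ((dev >> 12) & 0xffff_ff00) | (dev & 0x0000_00ff);
    (major, minor)
}

/// Path of the sysfs symlink that identifies exactly one block device or partition.
pub fn sysfs_dev_link(dev: u64) -> PathBuf {
    let (major, minor) = major_minor(dev);
    PathBuf::from(format!("/sys/dev/block/{}:{}", major, minor))
}

/// Starting at `start`, walk towards the filesystem root and return the first
/// `queue` directory that contains `nr_requests`. For a partition directory this
/// finds the queue of the parent disk.
pub fn find_queue_dir(start: &Path) -> Option<PathBuf> {
    start
        .ancestors()
        .map(|dir| dir.join("queue"))
        .find(|queue| queue.join("nr_requests").is_file())
}

/// Resolve the symlink for `dev` and read the owning disk's `nr_requests`.
pub fn nr_requests_for_dev(dev: u64) -> Option<u32> {
    let resolved = fs::canonicalize(sysfs_dev_link(dev)).ok()?;
    let queue = find_queue_dir(&resolved)?;
    fs::read_to_string(queue.join("nr_requests")).ok()?.trim().parse().ok()
}
```

Keying on both major and minor avoids the ambiguity of the earlier major-only scan, and the upward walk is what makes a partition such as sda1 resolve to sda's queue settings.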

Not fixed (with rationale):

fsync before close: intentionally omitted. Lance's object-store commit protocol provides durability guarantees at the dataset level (manifest write). Adding fsync per data file would eliminate the throughput advantage this PR is benchmarking. For production use, a higher-level sync_data on Lance commit would be the right place.
Short writes: the existing post-loop written < total check already surfaces the error to the caller. Per-SQE retry would require tracking partial-write offsets across the io_uring completion queue, which is a meaningful refactor. Short buffered writes to regular files on Linux are rare in practice (they mainly occur when the disk fills up, a file-size limit is hit, or a signal interrupts the call), so this is deferred.
PutOptions / PutMultipartOptions: attributes and tags are not used by Lance for local columnar data writes. Renamed unused opts to _opts with a comment; the put_multipart_opts fallback path correctly passes opts to the inner store.
Anchor ring QD: the anchor ring's sole purpose is to share the kernel worker pool, not to submit I/O. Its QD only determines the size of the shared pool — using the first file's calibrated QD (rather than a fixed constant) is fine for our single-device benchmark setup. Documented in a comment.
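
For reference, the bookkeeping that a short-write retry needs is shown below with synchronous positional writes rather than io_uring completions. This is a swapped-in, simplified technique to illustrate the offset tracking only; `write_all_at_offset` is a hypothetical name, and the standard library's `FileExt::write_all_at` already provides equivalent behavior for the synchronous case.

```rust
use std::fs::File;
use std::io::{self, ErrorKind};
use std::os::unix::fs::FileExt;

/// Write all of `buf` at `offset`, resuming after short writes.
/// Each iteration advances both the remaining slice and the file offset
/// by the number of bytes the kernel reported as written.
pub fn write_all_at_offset(file: &File, mut buf: &[u8], mut offset: u64) -> io::Result<()> {
    while !buf.is_empty() {
        match file.write_at(buf, offset) {
            // A zero-byte write would loop forever, so report it as an error.
            Ok(0) => {
                return Err(io::Error::new(
                    ErrorKind::WriteZero,
                    "write_at returned 0 bytes",
                ))
            }
            Ok(n) => {
                buf = &buf[n..];
                offset += n as u64;
            }
            // Interrupted before any byte was written: retry the same range.
            Err(e) if e.kind() == ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(())
}
```

An io_uring version would have to carry the same (remaining slice, offset) pair per in-flight SQE and resubmit from the completion handler, which is the refactor the rationale above defers.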

tsafin requested a review from Copilot February 25, 2026 23:29

Copilot started reviewing on behalf of tsafin February 25, 2026 23:29

chatgpt-codex-connector bot reviewed Feb 25, 2026

Copilot AI reviewed Feb 25, 2026

tsafin and others added 2 commits February 26, 2026 02:41

tsafin merged commit b1cf2d0 into master Mar 2, 2026
55 checks passed

tsafin merged 3 commits into master from tsafin/lance_ui_uring
