Phase 3.6: io_uring write path for Lance (1.5–1.7× speedup on large writes)#9
Phase 3.6: io_uring write path for Lance (1.5–1.7× speedup on large writes)#9
Conversation
…ration Add --lance-io-uring CLI flag that replaces the default tokio::fs write path with a custom io_uring-backed ObjectStore wrapper, giving 1.5–1.7× speedup on SF=10 lineitem (183.9 s → 107–122 s) by keeping the VirtIO-blk virtqueue saturated during kernel dirty-page writeback throttle periods. Design: - IoUringWrapper implements WrappingObjectStore, injected via ObjectStoreParams::object_store_wrapper at dataset open time - One persistent IoUring ring per file on a dedicated worker std::thread (eliminates ~800 io_uring_setup() syscalls for a 4 GB multipart upload) - tokio::sync::mpsc channel (async send) replaces std::sync::mpsc (blocked the single Tokio worker thread when channel was full) - AtomicU64 offset pre-claim gives each concurrent put_part() a distinct file range with no locking - IORING_SETUP_ATTACH_WQ: all rings after the first share one kernel async-worker pool via a leaked anchor ring (reduces Hyper-V IOThread pressure) - Queue depth calibrated from /sys/block/<dev>/queue/nr_requests of the actual output device (nr_requests/2, clamped [8,128]); falls back to QD=64 if sysfs unavailable. Replaces earlier fake /dev/urandom probe (character device, bypasses block layer entirely, measures CSPRNG not storage) - No O_DIRECT: on WSL2/VirtIO each O_DIRECT write synchronously awaits disk ACK, removing the page-cache pipelining benefit - No SQPOLL: creates a kernel busy-poll thread causing Windows scheduler context-switches on WSL2 - Graceful fallback to inner LocalFileSystem if IoUring::new() fails Benchmark results (SF=10 lineitem, 8.47 GB, WSL2 VirtIO-blk QD=633): Baseline (tokio spawn_blocking): 183.9 s 43.9 MB/s 144.6 s stalled (78.6%) io_uring v3 (QD=256): 107.4 s 75.3 MB/s 82.2 s stalled (76.5%) io_uring v5 (QD=128 via sysfs): 121.7 s 66.4 MB/s 96.8 s stalled (79.5%) Files: third_party/lance-ffi/src/io_uring_store.rs (new) third_party/lance-ffi/src/lib.rs (use_io_uring flag + FFI export) third_party/lance-ffi/Cargo.toml (io-uring = "0.7" default feature) include/tpch/lance_ffi.h (lance_writer_enable_io_uring()) include/tpch/lance_writer.hpp (enable_io_uring() method) src/writers/lance_writer.cpp (wire FFI call) src/main.cpp (--lance-io-uring flag) CMakeLists.txt (io_uring_store.rs in DEPENDS) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4829dc8f14
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Pull request overview
This PR implements an io_uring-based write path for Lance dataset writes, providing a 1.5-1.7× speedup for large writes on Linux systems. The implementation adds a custom ObjectStore wrapper that intercepts write operations and uses io_uring for asynchronous I/O instead of the default tokio spawn_blocking approach.
Changes:
- New io_uring-backed ObjectStore implementation with persistent rings per file, async channels, and sysfs-based queue depth calibration
- FFI integration with runtime activation via
--lance-io-uringCLI flag - Graceful fallback to standard LocalFileSystem when io_uring is unavailable
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| third_party/lance-ffi/src/io_uring_store.rs | Core io_uring implementation: IoUringStore, IoUringMultipartUpload, sysfs calibration, and shared kernel worker pool |
| third_party/lance-ffi/src/lib.rs | FFI integration: adds use_io_uring field and lance_writer_enable_io_uring export |
| third_party/lance-ffi/Cargo.toml | Adds io-uring 0.7 dependency with default feature flag |
| include/tpch/lance_ffi.h | FFI declaration for lance_writer_enable_io_uring function |
| include/tpch/lance_writer.hpp | C++ API: adds enable_io_uring method and use_io_uring_ member |
| src/writers/lance_writer.cpp | Wires FFI call in initialize_lance_dataset |
| src/main.cpp | Adds --lance-io-uring CLI flag and option handling |
| CMakeLists.txt | Adds io_uring_store.rs to cargo DEPENDS for rebuild detection |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| let anchor_fd = ANCHOR_FD.get_or_init(|| { | ||
| match IoUring::new(qd) { | ||
| Ok(ring) => { | ||
| let fd = ring.as_raw_fd(); | ||
| // Leak the ring: Drop would close the fd, killing the pool. | ||
| std::mem::forget(ring); | ||
| eprintln!("Lance FFI: io_uring anchor ring created (fd={}, QD={})", fd, qd); | ||
| fd | ||
| } | ||
| Err(_) => -1, // sentinel: ring creation failed | ||
| } | ||
| }); | ||
|
|
||
| if *anchor_fd >= 0 { | ||
| // Attach new ring to the shared kernel worker pool. | ||
| IoUring::builder() | ||
| .setup_attach_wq(*anchor_fd) | ||
| .build(qd) |
There was a problem hiding this comment.
The anchor ring is created with the QD from the first file write (line 129), but subsequent rings created with different QD values (line 145 build(qd)) may have mismatched queue depths. While IORING_SETUP_ATTACH_WQ shares the kernel worker pool, each ring should still have its own appropriately-sized submission/completion queues. If different files target devices with very different nr_requests values, using the first device's QD for the anchor could be suboptimal. Consider either: (1) using a fixed conservative QD for the anchor (e.g., 32), or (2) storing the anchor QD and documenting that all rings share this constraint.
There was a problem hiding this comment.
Fair point. In our benchmark setup there is only one block device, so the first file's calibrated QD is always representative. For a multi-device scenario (e.g. writing to two different SSDs simultaneously) the anchor's QD would be whichever device happened to write first, which could be suboptimal. Using a fixed conservative QD (e.g. 32) for the anchor would be safer. Added a comment in 5cd06d1 documenting this limitation; a follow-up could use a fixed anchor QD independent of the per-file calibration.
…_URING CI uses a pre-compiled liblance_ffi.a that lacks the new symbol. Gate the declaration, call sites, and method behind #ifdef TPCH_LANCE_IO_URING, which CMake only defines when building lance-ffi from source. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- to_fs_path: don't prepend '/' if path already starts with '/' to avoid double-slash paths like '//tmp/lineitem.lance' - try_sysfs_queue_depth: use /sys/dev/block/MAJOR:MINOR symlink instead of scanning /sys/block/ by major number only; correctly handles partitions (e.g. sda1 vs sdb1 sharing major=8) by resolving the symlink and walking up to the parent disk device that owns queue/nr_requests - spawn_writer_thread: move raw_fd extraction inside the closure, after 'file' is established, making the lifetime relationship explicit - put_part: SeqCst → Relaxed for offset fetch_add; each part writes to its own non-overlapping range so no cross-thread happens-before is needed - lance_writer.cpp: check return value of lance_writer_enable_io_uring and throw on failure, consistent with lance_writer_set_write_params handling - lib.rs: gate io_uring module on target_os = "linux" in addition to the feature flag; the module uses Unix-only APIs (AsRawFd, MetadataExt, sysfs) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Addressed the Copilot review comments in the follow-up commit (5cd06d1). Summary: Fixed:
Not fixed (with rationale):
|
Summary
--lance-io-uringCLI flag that replaces the defaulttokio::fswrite path with a customio_uring-backedObjectStorewrapperDesign
IoUringWrapper/IoUringStore— implementsWrappingObjectStore, injected viaObjectStoreParams::object_store_wrapperat Lance dataset open time. Writes go through io_uring; reads/metadata delegate to the innerLocalFileSystem.Persistent ring per file — one
IoUringring lives on a dedicatedstd::threadfor the entire multipart upload, eliminating ~800io_uring_setup()syscalls for a 4 GB file.Async channel —
tokio::sync::mpscwith.awaitsend replacesstd::sync::mpscwhich blocked the single Tokio worker thread when the channel was full (causing the v1 benchmark to hang).AtomicU64offset pre-claim — each concurrentput_part()pre-claims its file range withfetch_addbefore async dispatch, so parts can be written out-of-order without locking.IORING_SETUP_ATTACH_WQ— all rings after the first attach to a leaked anchor ring, sharing one kernel async-worker thread pool. Reduces Hyper-V IOThread scheduler pressure on WSL2.Sysfs queue-depth calibration — QD is derived from
/sys/block/<dev>/queue/nr_requestsof the actual output device (nr_requests / 2, clamped to [8, 128]). Replaces an earlier/dev/urandomprobe which bypasses the block layer entirely (character device, measures CSPRNG not storage). Falls back to QD=64 if sysfs is unavailable.Intentional omissions:
O_DIRECT— on WSL2/VirtIO each O_DIRECT write synchronously awaits a disk ACK, removing page-cache pipelining benefitIORING_SETUP_SQPOLL— creates a kernel busy-poll thread causing continuous Windows scheduler context-switches on WSL2Graceful fallback — if
IoUring::new()fails (old kernel, io_uring disabled),put_multipart_optstransparently falls back to the innerLocalFileSystem.Benchmark results
SF=10 lineitem, 8.47 GB, WSL2 VirtIO-blk (
sdd,nr_requests=633):tokio spawn_blocking)v3 vs v5 difference is within WSL2 run-to-run variance (stall ratio fluctuates ±3%).
Files changed
third_party/lance-ffi/src/io_uring_store.rsIoUringWrapper,IoUringStore,IoUringMultipartUpload, sysfs calibration, ATTACH_WQ anchorthird_party/lance-ffi/src/lib.rsuse_io_uringfield inWriteParamsConfig,lance_writer_enable_io_uring()FFI exportthird_party/lance-ffi/Cargo.tomlio-uring = "0.7"as default featureinclude/tpch/lance_ffi.hlance_writer_enable_io_uring()declarationinclude/tpch/lance_writer.hppenable_io_uring()method +use_io_uring_membersrc/writers/lance_writer.cppinitialize_lance_dataset()src/main.cpp--lance-io-uringflag +OPT_LANCE_IO_URING = 1007CMakeLists.txtio_uring_store.rsadded to cargo DEPENDSTest plan
TPCH_ENABLE_LANCE=ON—cmake --build --target lance_ffi --clean-first(forces cargo rebuild)./tpch_benchmark --use-dbgen --format lance --table customer --scale-factor 1 --lance-io-uring— verify calibration log line and correct row countdu -sh /tmp/customer.lance(must not be 2× — alwaysrm -rf /tmp/*.lancebefore each run, Lance appends fragments)--scale-factor 10 --table lineitem --max-rows 0 --zero-copy --lance-io-uring— expect ≥1.4× vs baseline