Phase 3.6: io_uring write path for Lance (1.5–1.7× speedup on large writes) #9

Merged
tsafin merged 3 commits into master from tsafin/lance_ui_uring
Mar 2, 2026

@tsafin tsafin commented Feb 25, 2026

Summary

  • Adds --lance-io-uring CLI flag that replaces the default tokio::fs write path with a custom io_uring-backed ObjectStore wrapper
  • 1.5–1.7× wall-time speedup on SF=10 lineitem (8.47 GB): 183.9 s → 107–122 s
  • Eliminates the worst-case single-stall events caused by Linux dirty-page writeback throttling; total stall time drops from 144.6 s to ~82–97 s, while the stall ratio stays roughly flat (78.6% → 76.5–79.5%)

Design

IoUringWrapper / IoUringStore — implements WrappingObjectStore, injected via ObjectStoreParams::object_store_wrapper at Lance dataset open time. Writes go through io_uring; reads/metadata delegate to the inner LocalFileSystem.

Persistent ring per file — one IoUring ring lives on a dedicated std::thread for the entire multipart upload, eliminating ~800 io_uring_setup() syscalls for a 4 GB file.

Async channel: tokio::sync::mpsc with .await send replaces std::sync::mpsc, which blocked the single Tokio worker thread when the channel was full (causing the v1 benchmark to hang).

AtomicU64 offset pre-claim — each concurrent put_part() pre-claims its file range with fetch_add before async dispatch, so parts can be written out-of-order without locking.
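
The pre-claim idea can be illustrated with a minimal, std-only sketch. It is not the code from io_uring_store.rs: the names claim_range and claim_concurrently are hypothetical, and plain threads stand in for concurrent put_part() calls. It only shows that fetch_add hands every caller a disjoint, contiguous range.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Reserve `len` bytes of the output file and return the starting offset.
/// `fetch_add` returns the previous counter value, so concurrent callers
/// always receive disjoint, back-to-back ranges.
fn claim_range(next_offset: &AtomicU64, len: u64) -> u64 {
    next_offset.fetch_add(len, Ordering::Relaxed)
}

/// Hypothetical driver: one thread per part, each claiming its range
/// concurrently. Returns the claimed (offset, len) pairs sorted by offset.
fn claim_concurrently(part_sizes: &[u64]) -> Vec<(u64, u64)> {
    let next_offset = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = part_sizes
        .iter()
        .map(|&len| {
            let counter = Arc::clone(&next_offset);
            thread::spawn(move || (claim_range(&counter, len), len))
        })
        .collect();
    let mut ranges: Vec<(u64, u64)> = handles
        .into_iter()
        .map(|h| h.join().expect("claim thread panicked"))
        .collect();
    ranges.sort();
    ranges
}
```

Whichever order the threads run in, the sorted ranges tile the file with no gaps or overlaps, which is why the writes themselves need no lock.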

IORING_SETUP_ATTACH_WQ — all rings after the first attach to a leaked anchor ring, sharing one kernel async-worker thread pool. Reduces Hyper-V IOThread scheduler pressure on WSL2.

Sysfs queue-depth calibration — QD is derived from /sys/block/<dev>/queue/nr_requests of the actual output device (nr_requests / 2, clamped to [8, 128]). Replaces an earlier /dev/urandom probe which bypasses the block layer entirely (character device, measures CSPRNG not storage). Falls back to QD=64 if sysfs is unavailable.
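
As a rough sketch of the calibration rule stated above (nr_requests / 2, clamped to [8, 128], fallback QD=64), assuming the path to the nr_requests file has already been resolved. The function names qd_from_nr_requests and calibrate_qd are illustrative and not taken from the PR.

```rust
use std::fs;
use std::path::Path;

/// Fallback queue depth when sysfs cannot be read (value from the PR text).
const FALLBACK_QD: u32 = 64;

/// Half of the device's nr_requests, clamped to [8, 128].
fn qd_from_nr_requests(nr_requests: u32) -> u32 {
    (nr_requests / 2).clamp(8, 128)
}

/// Read an nr_requests file and derive the queue depth; any read or
/// parse failure yields the fallback instead of an error.
fn calibrate_qd(nr_requests_path: &Path) -> u32 {
    fs::read_to_string(nr_requests_path)
        .ok()
        .and_then(|s| s.trim().parse::<u32>().ok())
        .map(qd_from_nr_requests)
        .unwrap_or(FALLBACK_QD)
}
```

With the benchmark device's nr_requests=633 this yields 633 / 2 = 316, clamped to 128, which matches the "QD=128 via sysfs" run.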

Intentional omissions:

  • No O_DIRECT — on WSL2/VirtIO each O_DIRECT write synchronously awaits a disk ACK, removing page-cache pipelining benefit
  • No IORING_SETUP_SQPOLL — creates a kernel busy-poll thread causing continuous Windows scheduler context-switches on WSL2

Graceful fallback — if IoUring::new() fails (old kernel, io_uring disabled), put_multipart_opts transparently falls back to the inner LocalFileSystem.

Benchmark results

SF=10 lineitem, 8.47 GB, WSL2 VirtIO-blk (sdd, nr_requests=633):

| Run | Wall time | Avg BW | Stall time | Stall ratio | Speedup |
|---|---|---|---|---|---|
| Baseline (tokio spawn_blocking) | 183.9 s | 43.9 MB/s | 144.6 s | 78.6% | 1.00× |
| io_uring v3 (QD=256 hardcoded) | 107.4 s | 75.3 MB/s | 82.2 s | 76.5% | 1.71× |
| io_uring v5 (QD=128 via sysfs) | 121.7 s | 66.4 MB/s | 96.8 s | 79.5% | 1.51× |

v3 vs v5 difference is within WSL2 run-to-run variance (stall ratio fluctuates ±3%).
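
For readers checking the derived columns, the speedup and stall-ratio figures follow directly from the wall and stall times. A small sketch of that arithmetic, with illustrative function names:

```rust
/// Speedup relative to the baseline run: baseline wall time / run wall time.
fn speedup(baseline_wall_s: f64, wall_s: f64) -> f64 {
    baseline_wall_s / wall_s
}

/// Fraction of the wall time spent stalled.
fn stall_ratio(stall_s: f64, wall_s: f64) -> f64 {
    stall_s / wall_s
}
```

For example, speedup(183.9, 107.4) is about 1.71 and stall_ratio(96.8, 121.7) is about 0.795, in line with the v3 and v5 rows.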

Files changed

| File | Change |
|---|---|
| third_party/lance-ffi/src/io_uring_store.rs | New: IoUringWrapper, IoUringStore, IoUringMultipartUpload, sysfs calibration, ATTACH_WQ anchor |
| third_party/lance-ffi/src/lib.rs | use_io_uring field in WriteParamsConfig, lance_writer_enable_io_uring() FFI export |
| third_party/lance-ffi/Cargo.toml | io-uring = "0.7" as default feature |
| include/tpch/lance_ffi.h | lance_writer_enable_io_uring() declaration |
| include/tpch/lance_writer.hpp | enable_io_uring() method + use_io_uring_ member |
| src/writers/lance_writer.cpp | Wire FFI call in initialize_lance_dataset() |
| src/main.cpp | --lance-io-uring flag + OPT_LANCE_IO_URING = 1007 |
| CMakeLists.txt | io_uring_store.rs added to cargo DEPENDS |

Test plan

  • Build with TPCH_ENABLE_LANCE=ON: cmake --build --target lance_ffi --clean-first (forces cargo rebuild)
  • Smoke test: ./tpch_benchmark --use-dbgen --format lance --table customer --scale-factor 1 --lance-io-uring — verify calibration log line and correct row count
  • Verify file size matches non-io_uring run: du -sh /tmp/customer.lance (must not be 2× — always rm -rf /tmp/*.lance before each run, Lance appends fragments)
  • Full benchmark: --scale-factor 10 --table lineitem --max-rows 0 --zero-copy --lance-io-uring — expect ≥1.4× vs baseline
  • Without flag: baseline path unchanged, no regression

…ration

Add --lance-io-uring CLI flag that replaces the default tokio::fs write path
with a custom io_uring-backed ObjectStore wrapper, giving 1.5–1.7× speedup on
SF=10 lineitem (183.9 s → 107–122 s) by keeping the VirtIO-blk virtqueue
saturated during kernel dirty-page writeback throttle periods.

Design:
- IoUringWrapper implements WrappingObjectStore, injected via
  ObjectStoreParams::object_store_wrapper at dataset open time
- One persistent IoUring ring per file on a dedicated worker std::thread
  (eliminates ~800 io_uring_setup() syscalls for a 4 GB multipart upload)
- tokio::sync::mpsc channel (async send) replaces std::sync::mpsc (blocked
  the single Tokio worker thread when channel was full)
- AtomicU64 offset pre-claim gives each concurrent put_part() a distinct
  file range with no locking
- IORING_SETUP_ATTACH_WQ: all rings after the first share one kernel
  async-worker pool via a leaked anchor ring (reduces Hyper-V IOThread pressure)
- Queue depth calibrated from /sys/block/<dev>/queue/nr_requests of the actual
  output device (nr_requests/2, clamped [8,128]); falls back to QD=64 if sysfs
  unavailable. Replaces earlier fake /dev/urandom probe (character device,
  bypasses block layer entirely, measures CSPRNG not storage)
- No O_DIRECT: on WSL2/VirtIO each O_DIRECT write synchronously awaits disk
  ACK, removing the page-cache pipelining benefit
- No SQPOLL: creates a kernel busy-poll thread causing Windows scheduler
  context-switches on WSL2
- Graceful fallback to inner LocalFileSystem if IoUring::new() fails

Benchmark results (SF=10 lineitem, 8.47 GB, WSL2 VirtIO-blk nr_requests=633):
  Baseline (tokio spawn_blocking):  183.9 s  43.9 MB/s  144.6 s stalled (78.6%)
  io_uring v3 (QD=256):             107.4 s  75.3 MB/s   82.2 s stalled (76.5%)
  io_uring v5 (QD=128 via sysfs):   121.7 s  66.4 MB/s   96.8 s stalled (79.5%)

Files:
  third_party/lance-ffi/src/io_uring_store.rs  (new)
  third_party/lance-ffi/src/lib.rs             (use_io_uring flag + FFI export)
  third_party/lance-ffi/Cargo.toml             (io-uring = "0.7" default feature)
  include/tpch/lance_ffi.h                     (lance_writer_enable_io_uring())
  include/tpch/lance_writer.hpp                (enable_io_uring() method)
  src/writers/lance_writer.cpp                 (wire FFI call)
  src/main.cpp                                 (--lance-io-uring flag)
  CMakeLists.txt                               (io_uring_store.rs in DEPENDS)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

tsafin commented Feb 25, 2026

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4829dc8f14

Copilot AI left a comment

Pull request overview

This PR implements an io_uring-based write path for Lance dataset writes, providing a 1.5-1.7× speedup for large writes on Linux systems. The implementation adds a custom ObjectStore wrapper that intercepts write operations and uses io_uring for asynchronous I/O instead of the default tokio spawn_blocking approach.

Changes:

  • New io_uring-backed ObjectStore implementation with persistent rings per file, async channels, and sysfs-based queue depth calibration
  • FFI integration with runtime activation via --lance-io-uring CLI flag
  • Graceful fallback to standard LocalFileSystem when io_uring is unavailable

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 11 comments.

| File | Description |
|---|---|
| third_party/lance-ffi/src/io_uring_store.rs | Core io_uring implementation: IoUringStore, IoUringMultipartUpload, sysfs calibration, and shared kernel worker pool |
| third_party/lance-ffi/src/lib.rs | FFI integration: adds use_io_uring field and lance_writer_enable_io_uring export |
| third_party/lance-ffi/Cargo.toml | Adds io-uring 0.7 dependency with default feature flag |
| include/tpch/lance_ffi.h | FFI declaration for lance_writer_enable_io_uring function |
| include/tpch/lance_writer.hpp | C++ API: adds enable_io_uring method and use_io_uring_ member |
| src/writers/lance_writer.cpp | Wires FFI call in initialize_lance_dataset |
| src/main.cpp | Adds --lance-io-uring CLI flag and option handling |
| CMakeLists.txt | Adds io_uring_store.rs to cargo DEPENDS for rebuild detection |

Comment on lines +128 to +145
    let anchor_fd = ANCHOR_FD.get_or_init(|| {
        match IoUring::new(qd) {
            Ok(ring) => {
                let fd = ring.as_raw_fd();
                // Leak the ring: Drop would close the fd, killing the pool.
                std::mem::forget(ring);
                eprintln!("Lance FFI: io_uring anchor ring created (fd={}, QD={})", fd, qd);
                fd
            }
            Err(_) => -1, // sentinel: ring creation failed
        }
    });

    if *anchor_fd >= 0 {
        // Attach new ring to the shared kernel worker pool.
        IoUring::builder()
            .setup_attach_wq(*anchor_fd)
            .build(qd)
Copilot AI Feb 25, 2026

The anchor ring is created with the QD from the first file write (line 129), but subsequent rings created with different QD values (line 145 build(qd)) may have mismatched queue depths. While IORING_SETUP_ATTACH_WQ shares the kernel worker pool, each ring should still have its own appropriately-sized submission/completion queues. If different files target devices with very different nr_requests values, using the first device's QD for the anchor could be suboptimal. Consider either: (1) using a fixed conservative QD for the anchor (e.g., 32), or (2) storing the anchor QD and documenting that all rings share this constraint.

Owner Author

Fair point. In our benchmark setup there is only one block device, so the first file's calibrated QD is always representative. For a multi-device scenario (e.g. writing to two different SSDs simultaneously) the anchor's QD would be whichever device happened to write first, which could be suboptimal. Using a fixed conservative QD (e.g. 32) for the anchor would be safer. Added a comment in 5cd06d1 documenting this limitation; a follow-up could use a fixed anchor QD independent of the per-file calibration.
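
The follow-up mentioned here (a fixed anchor QD, independent of per-file calibration) might look roughly like the sketch below. It is an assumption-laden illustration, not the merged code: ANCHOR_QD = 32 is the value floated in the review, anchor_fd is a hypothetical helper, and ring creation is abstracted behind a closure so the sketch needs no io-uring dependency.

```rust
use std::io;
use std::sync::OnceLock;

/// Fixed, conservative queue depth for the anchor ring (value suggested
/// in the review; an assumption, not what the PR ships).
const ANCHOR_QD: u32 = 32;

/// File descriptor of the leaked anchor ring, or -1 if creation failed.
static ANCHOR_FD: OnceLock<i32> = OnceLock::new();

/// Create the anchor ring at most once, always with ANCHOR_QD, and return
/// its fd. `create_ring` stands in for "build a ring, leak it, return its fd".
fn anchor_fd<F>(create_ring: F) -> i32
where
    F: FnOnce(u32) -> io::Result<i32>,
{
    *ANCHOR_FD.get_or_init(|| create_ring(ANCHOR_QD).unwrap_or(-1))
}
```

Per-file rings would still be built with their own calibrated QD; only the anchor stops depending on whichever device happened to write first.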

tsafin and others added 2 commits February 26, 2026 02:41
…_URING

CI uses a pre-compiled liblance_ffi.a that lacks the new symbol.
Gate the declaration, call sites, and method behind #ifdef TPCH_LANCE_IO_URING,
which CMake only defines when building lance-ffi from source.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- to_fs_path: don't prepend '/' if path already starts with '/' to avoid
  double-slash paths like '//tmp/lineitem.lance'
- try_sysfs_queue_depth: use /sys/dev/block/MAJOR:MINOR symlink instead of
  scanning /sys/block/ by major number only; correctly handles partitions
  (e.g. sda1 vs sdb1 sharing major=8) by resolving the symlink and walking
  up to the parent disk device that owns queue/nr_requests
- spawn_writer_thread: move raw_fd extraction inside the closure, after
  'file' is established, making the lifetime relationship explicit
- put_part: SeqCst → Relaxed for offset fetch_add; each part writes to its
  own non-overlapping range so no cross-thread happens-before is needed
- lance_writer.cpp: check return value of lance_writer_enable_io_uring and
  throw on failure, consistent with lance_writer_set_write_params handling
- lib.rs: gate io_uring module on target_os = "linux" in addition to the
  feature flag; the module uses Unix-only APIs (AsRawFd, MetadataExt, sysfs)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

tsafin commented Mar 2, 2026

Addressed the Copilot review comments in the follow-up commit (5cd06d1). Summary:

Fixed:

  • to_fs_path double-slash (P1): now checks if the path string already starts with / before prepending, avoiding paths like //tmp/lineitem.lance.
  • Major-only device matching: replaced the /sys/block/*/dev scan (which matched any disk sharing the same major number) with a /sys/dev/block/MAJOR:MINOR symlink lookup. This correctly identifies the device even for partition paths (e.g. sda1 resolves to sda's queue settings by walking up the sysfs tree).
  • raw_fd extraction: moved inside the worker thread closure, after file is moved in, so the lifetime relationship is unambiguous.
  • SeqCst → Relaxed: each multipart part pre-claims its own non-overlapping file range, so no cross-thread happens-before ordering is needed for the offset counter.
  • Return value of lance_writer_enable_io_uring: now checked; throws std::runtime_error on failure, consistent with lance_writer_set_write_params.
  • #[cfg(target_os = "linux")]: added alongside the feature gate on the io_uring_store module and its use sites in lib.rs, since the module uses Unix-only APIs (AsRawFd, MetadataExt, sysfs).
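
The MAJOR:MINOR lookup from the list above can be sketched with std only. This is a simplified illustration rather than the actual try_sysfs_queue_depth: the helper names are hypothetical, the major/minor split assumes the glibc dev_t encoding, and the caller is assumed to have already resolved the /sys/dev/block symlink (for example with std::fs::canonicalize) before walking up.

```rust
use std::path::{Path, PathBuf};

/// Split a Linux `st_dev` value into (major, minor), glibc encoding.
fn major_minor(dev: u64) -> (u64, u64) {
    let major = ((dev >> 32) & 0xffff_f000) | ((dev >> 8) & 0x0000_0fff);
    let minor = ((dev >> 12) & 0xffff_ff00) | (dev & 0x0000_00ff);
    (major, minor)
}

/// Path of the sysfs symlink that identifies exactly this device.
fn sysfs_block_link(dev: u64) -> PathBuf {
    let (major, minor) = major_minor(dev);
    PathBuf::from(format!("/sys/dev/block/{major}:{minor}"))
}

/// Starting at a resolved device directory, walk up through its ancestors
/// and return the first `queue/nr_requests` file found. A partition
/// directory such as .../sda/sda1 has no queue/ of its own, so the walk
/// ends at the parent disk.
fn find_nr_requests(start: &Path) -> Option<PathBuf> {
    start
        .ancestors()
        .map(|dir| dir.join("queue").join("nr_requests"))
        .find(|candidate| candidate.is_file())
}
```

Matching on both major and minor is what distinguishes sda1 from sdb1, which share major 8.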

Not fixed (with rationale):

  • fsync before close: intentionally omitted. Lance's object-store commit protocol provides durability guarantees at the dataset level (manifest write). Adding fsync per data file would eliminate the throughput advantage this PR is benchmarking. For production use, a higher-level sync_data on Lance commit would be the right place.
  • Short writes: the existing post-loop written < total check already surfaces the error to the caller. Per-SQE retry would require tracking partial-write offsets across the io_uring completion queue — a meaningful refactor. Given that short writes on regular files on Linux are essentially a theoretical concern (the kernel guarantees full WRITE ops on buffered regular files), this is deferred.
  • PutOptions / PutMultipartOptions: attributes and tags are not used by Lance for local columnar data writes. Renamed unused opts to _opts with a comment; the put_multipart_opts fallback path correctly passes opts to the inner store.
  • Anchor ring QD: the anchor ring's sole purpose is to share the kernel worker pool, not to submit I/O. Its QD only determines the size of the shared pool — using the first file's calibrated QD (rather than a fixed constant) is fine for our single-device benchmark setup. Documented in a comment.
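
For reference, the partial-write bookkeeping that the deferred short-write item refers to looks like this in its synchronous form. The sketch uses std's positional write_at rather than io_uring, and write_fully_at is a hypothetical name; a per-SQE retry would need the same offset tracking spread across completion-queue entries.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;

/// Write all of `buf` at `offset`, resubmitting the unwritten tail after
/// every short write until nothing remains.
fn write_fully_at(file: &File, mut buf: &[u8], mut offset: u64) -> io::Result<()> {
    while !buf.is_empty() {
        match file.write_at(buf, offset) {
            // Zero bytes written with data remaining: report instead of spinning.
            Ok(0) => {
                return Err(io::Error::new(
                    io::ErrorKind::WriteZero,
                    "write_at returned 0 bytes",
                ))
            }
            // Short or full write: advance past the bytes that landed.
            Ok(n) => {
                buf = &buf[n..];
                offset += n as u64;
            }
            // Interrupted by a signal: retry the same range.
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(())
}
```

For purely synchronous code, std already provides FileExt::write_all_at with this behaviour; the loop is spelled out here only to show what would have to be tracked per submission.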

@tsafin tsafin merged commit b1cf2d0 into master Mar 2, 2026
55 checks passed