[All] Graceful close on arrow streams#258

Open
elenagaljak-db wants to merge 4 commits into main from arrow-graceful-stream-close

Conversation

@elenagaljak-db
Contributor

What changes are proposed in this pull request?

  • Adds stream_paused_max_wait_time_ms config option to ArrowStreamConfigurationOptions across all SDKs (Rust, Go, Python, TypeScript, Java)
  • Implements client-side graceful close state machine: when the server signals it will close the stream, the SDK enters a paused state, stops sending new batches but drains in-flight ack responses before triggering recovery
  • Graceful close recoveries don't count against the recovery_retries budget, since they're intentional server-side restarts (e.g. rolling deploys)
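As a rough illustration of the paused-wait semantics described in this PR, the effective wait could be computed as follows (a sketch only; `paused_wait` is an invented name, not an SDK function):

```rust
use std::time::Duration;

// Sketch only: how long the SDK might stay paused after a close signal.
// None = wait the full server-announced duration, Some(0) = recover
// immediately, Some(cap) = wait up to min(cap, server duration).
fn paused_wait(client_max_ms: Option<u64>, server_duration_ms: u64) -> Duration {
    match client_max_ms {
        None => Duration::from_millis(server_duration_ms),
        Some(cap) => Duration::from_millis(cap.min(server_duration_ms)),
    }
}
```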

How is this tested?

  • 5 new integration tests in rust/tests/src/arrow_tests.rs covering: full server duration wait, immediate recovery, client-capped wait, early exit when all batches acked, ingest accepted during grace period

Signed-off-by: elenagaljak-db <elena.galjak@databricks.com>
@elenagaljak-db elenagaljak-db force-pushed the arrow-graceful-stream-close branch from e64589d to e5de6d8 Compare April 29, 2026 14:47
));
}

is_paused.store(true, Ordering::Relaxed);
Contributor

I think the is_paused flag is set but never read on the send path. We flip it to true here once a close signal is received, but send_flight_data_internal never checks it, so ingest calls during the grace period continue to push FlightData onto the existing batch_tx mpsc channel and out over the DoPut stream that the server has already announced it is closing.

In send_flight_data_internal, after appending to pending_batches, we can skip the sender.send(...) loop while is_paused.load(Relaxed) is true (still return Ok(offset_id)). The supervisor will initialize a fresh batch_tx on recovery and the pending batches will be replayed.
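A rough sketch of the suggested guard (the signature and types here are invented for illustration; the real send_flight_data_internal differs):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::Sender;

// Sketch only: append to pending_batches unconditionally, but skip the
// channel send while the stream is paused. The supervisor is assumed to
// replay pending batches over a fresh channel after recovery.
fn send_flight_data_internal(
    is_paused: &AtomicBool,
    pending_batches: &mut Vec<Vec<u8>>,
    sender: &Sender<Vec<u8>>,
    batch: Vec<u8>,
) -> Result<usize, String> {
    pending_batches.push(batch.clone());
    let offset_id = pending_batches.len() - 1;
    if !is_paused.load(Ordering::Relaxed) {
        sender.send(batch).map_err(|e| e.to_string())?;
    }
    Ok(offset_id) // caller still gets an offset even while paused
}
```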

* Returns the maximum time in milliseconds to wait during graceful stream close.
*
* <p>When the server sends a close stream signal, the SDK enters a "paused" state where it stops
* accepting new batches but continues processing acknowledgments for in-flight batches.
Contributor


I think this is a bit confusingly worded: the SDK doesn't stop accepting batches, it just stops sending them.

Comment thread typescript/src/lib.rs
flush_timeout_ms: opts.flush_timeout_ms.map(|v| v as u64).unwrap_or(default.flush_timeout_ms),
connection_timeout_ms: opts.connection_timeout_ms.map(|v| v as u64).unwrap_or(default.connection_timeout_ms),
ipc_compression,
stream_paused_max_wait_time_ms: opts.stream_paused_max_wait_time_ms.map(|v| v as u64),
Contributor


Since the TS SDK depends on a pinned Rust SDK version, I don't think we should update TS here; instead, first release a new Rust SDK version and then pick up that version in TS.

Comment thread python/rust/src/arrow.rs
/// Maximum time in milliseconds to wait during graceful stream close.
/// None = wait full server duration, 0 = immediate recovery, >0 = wait up to min(this, server_duration).
#[pyo3(get, set)]
pub stream_paused_max_wait_time_ms: Option<i64>,
Contributor


Let's just add a negative check in to_rust, like we do for the other fields.
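Something along these lines (a sketch; the helper name and error text are illustrative, not the actual to_rust code):

```rust
// Sketch only: reject negative values when converting the Python-facing
// Option<i64> into the Rust config's Option<u64>.
fn validate_paused_wait(value: Option<i64>) -> Result<Option<u64>, String> {
    match value {
        Some(v) if v < 0 => Err(format!(
            "stream_paused_max_wait_time_ms must be non-negative, got {v}"
        )),
        Some(v) => Ok(Some(v as u64)),
        None => Ok(None),
    }
}
```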


// Capture whether this was a graceful close recovery before resetting.
// Graceful close recoveries should not consume the retry budget.
let was_graceful_close = is_paused.load(Ordering::Relaxed);
Contributor


The supervisor decides whether to skip retry-budget accounting by reading is_paused after process_acks returns. But for two of the three close-signal branches in process_acks, is_paused is never set to true before the function returns:

  • stream_paused_max_wait_time_ms == Some(0) - returns at line 1205 without touching is_paused.
  • wait_duration_ms == 0 (the server says "close now", or min(client_max, server_duration) == 0) - returns at line 1218 without touching is_paused.

In both cases was_graceful_close is observed as false, so the supervisor falls through to the else branch at line 836 and runs recovery_attempts.fetch_add(1, Relaxed), consuming a retry slot.
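One possible shape of the fix (a sketch with invented names, not the SDK's actual code): set the flag once at the top of the close-signal handling, before any early return, so every branch is observed as graceful:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Sketch only: mark the pause flag before computing the wait, so even the
// zero-wait branches leave is_paused == true for the supervisor to read.
fn on_close_signal(
    is_paused: &AtomicBool,
    client_max_ms: Option<u64>,
    server_duration_ms: u64,
) -> u64 {
    is_paused.store(true, Ordering::Relaxed); // before any early return
    match client_max_ms {
        Some(0) => 0,                             // immediate recovery
        Some(cap) => cap.min(server_duration_ms), // client-capped wait
        None => server_duration_ms,               // full server duration
    }
}
```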
