fix(container-runtime): block local submits during stashed-op apply#27297
fix(container-runtime): block local submits during stashed-op apply#27297anthony-murphy wants to merge 19 commits into
Conversation
|
Hi! Thank you for opening this PR. Want me to review it? Based on the diff (585 lines, 5 files), I've queued these reviewers:
How this works
|
PendingStateManager now exposes a one-way apply lifecycle that opens eagerly when stashed state is present and closes when initialMessages drains. ContainerRuntime aggregates this into isReadOnly() and fans it out via notifyReadOnlyState so compliant DDSes skip submits during stashed-op replay. If a DDS bypasses the readonly gate and submits anyway, submit() throws a fatal UsageError at the call site rather than letting a poison entry race the saved-op replay and mismatch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
isReadOnly() now uses optional-chain on pendingStateManager since the baseLogger evaluates this lazy property in a constructor window before the PSM is assigned, which would throw TypeError on error telemetry. applyStashedOpsAt() now tracks loop completion explicitly; a failure on the last stashed op no longer flips the lifecycle to "ended" (the failing message is shifted off initialMessages before apply runs, so the queue otherwise looks drained). New regression test covers the failure-on-last-op case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cce5286 to
7b813f4
Compare
…ings ESLint rule flagged indented continuation lines inside two block comments on PendingStateManager. Reflowed both: the hooks-interface comment uses separate paragraphs instead of a bulleted list with indented continuations, and the lifecycle comment runs as flat prose instead of indented bullets. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ger too The baseLogger now stamps isReadOnly on every error event. During construction, error events fire (e.g. layer-compat failures via validateLoaderCompatibility) before _deltaManager is assigned, so the prior fix's optional chain on pendingStateManager wasn't enough — the deltaManager.readOnlyInfo read threw TypeError and masked the layer incompat error, breaking runtimeLayerCompatValidation tests in CI. Switching to this._deltaManager?.readOnlyInfo lets the lazy property return false (well-defined) during the partial-construction window. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Remove `onBeforeFirstStashedOpApply` hook entirely. It was a no-op at
the only call site (channelCollection is not yet constructed when the
PSM constructor runs), and the misleading comment risked masking a
future regression. PSM still flips `_applyLifecycle` eagerly; no fanout
fires until the close hook.
- Allowlist `BlobAttach` and `IdAllocation` message types in the
submit-during-apply guard. `BlobAttach` is produced by
`sharePendingBlobs`, which runs before `applyStashedOpsAt` resolves.
`IdAllocation` is already excluded by a downstream assert but is
allowlisted here defensively.
- Add `Fluid.ContainerRuntime.DisableSubmitDuringStashedApplyThrow` kill
switch. The UsageError is always constructed; when the kill switch is
enabled it is sent as a `SubmitDuringStashedOpApply` error event
instead of thrown, leaving an off-switch for production rollout.
- Enforce the PSM-documented invariant ("never resubmit during apply
stashed ops") with an explicit early-return in `replayPendingStates`
when `isApplyingStashedOps` is true. Closes the connection-transition
mid-partial-drain hole where `reSubmit` for `Rejoin` / `GC` would have
hit the submit guard.
New tests cover BlobAttach during apply (allowlisted, no throw), the
kill switch (no throw, error event logged), and `replayPendingStates`
no-op during apply.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `replayPendingStates` apply-window early-return suppresses replay during apply, but `setConnectionState(true)` only fires `replayPendingStates` on the `canSendOps` *edge*. If that edge fires mid-apply (or while stashed entries are still being pushed into `pendingMessages` by the apply loop), the trigger is consumed without effect and nothing re-fires it after the window closes. Have `onAfterStashedOpsApplied` call `replayPendingStates` after `notifyReadOnlyState`. The method self-gates on `shouldSendOps()`, and the PSM sets `_applyLifecycle = "ended"` before firing the hook so the apply-window early-return won't fire either. New test asserts the runtime's `replayPendingStates` is invoked when the close hook fires. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Revert the `replayPendingStates` apply-window early-return and the `onAfterStashedOpsApplied` re-trigger. The canSendOps edge in `setConnectionStateCore` is the only call site for `replayPendingStates`, and the loader contract awaits `applyStashedOpsAt` before transitioning the runtime to a write-capable connection — so the mid-apply edge the defensive code was guarding against cannot occur. Replace the early-return with an assert that fails loudly if the invariant is ever violated by a future loader change, surfacing the contract rather than silently swallowing the edge. Remove the two now-obsolete tests (`replayPendingStates is a no-op during apply`, `close hook re-triggers replayPendingStates after apply drains`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- `submit()` now unconditionally sends `SubmitDuringStashedOpApply` as an error event on every bypass; the kill switch `DisableSubmitDuringStashedApplyThrow` only suppresses the throw + close. Telemetry attribution is identical whether or not the kill switch is engaged. - Existing throw test now also asserts the error event fires, so symmetry with the kill-switch test is explicit. - `PendingStateManager.applyStashedOpsAt` no longer uses a `loopCompleted` flag + `try/finally`. The close block (set `_applyLifecycle = "ended"`, fire hook) only ran when the loop completed without throwing — moving it inline after the loop preserves that property and drops the indirection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end coverage for the stashed-op apply submit guard: the original test in `container-runtime` only verified that `submit()` throws when `isApplyingStashedOps` is true. This drives the throw through a realistic listener pattern and asserts the resulting load rejection. Two SharedMaps live in the same data store. A `valueChanged` listener on the primary map performs a cascading `set` on the secondary map. Stashed offline ops replay on the primary during `applyStashedOpsAt`, firing `valueChanged`; the cascading `set` on the secondary reaches `ContainerRuntime.submit()` (the channel-level `stashedOpMd` capture only swallows submits from the channel currently in `applyStashedOp`, not cross-channel writes) and the guard rejects the load. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spell out in the factory's JSDoc that submitting ops from DDS op-event handlers is bad practice and that any cascading write must gate on both `IContainer.readOnlyInfo.readonly` and `IFluidDataStoreRuntime.activeLocalOperationActivity` to be safe during stashed-op replay and rollback. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DDS handlers should gate on `IFluidDataStoreRuntime.isReadOnly()` (with the `"readonly"` event as the live signal), not on `IContainer.readOnlyInfo.readonly` / `ContainerRuntime.isReadOnly()`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Don't list the conditions under which `isReadOnly()` returns `true` — the contract for handlers is simpler: if it's `true`, don't submit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the hand-rolled `LocalDocumentServiceFactory` / `LocalResolver` / `LocalCodeLoader` / `ContainerRuntimeFactoryWithDefaultDataStore` setup with `createLoader` from `src/utils.ts`. Same coverage, fewer imports, matches the convention in the rest of `local-server-tests`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the explicit `urlResolver` / `documentServiceFactory` pass-through on the second loader. Sharing the `deltaConnectionServer` is sufficient for URL resolution; each `createLoader` call can produce its own resolver and driver. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `jsdoc/check-indentation` ESLint rule disallows indented continuation lines (and numbered lists) inside JSDoc bodies. Flatten the isReadOnly / activeLocalOperationActivity bullet list into a single paragraph. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The submit-guard `throw` is outside the existing `submit()` try/catch, so without an explicit `closeFn` call the container only closes when the caller's promise chain wraps the rejection in `.catch(closeFn)`. The inline comment and the kill-switch contract both state "throw + close", but the code only delivered the throw — close happened transitively. Add `this.closeFn(error)` immediately before `throw error;`. `closeFn` is idempotent so a caller that also closes won't double-close. Extend the existing throw test to assert `closeFn` was invoked exactly once with the UsageError, so the contract is regression-protected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop `IdAllocation` from the stashed-op-apply guard allowlist. The
downstream assert 0x9a5 ("IdAllocation should be submitted directly to
outbox") already enforces that `IdAllocation` never reaches `submit()`
at all, so the allowlist entry was dead code. If that contract ever
shifts, the assert is the natural place a future author will see the
impact and decide whether to also allowlist for the apply window —
better than a forgotten reference in a comment block.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Deep ReviewReviewed commit Readiness: 9/10 — ALMOST READY All correctness concerns from the prior ten rounds are resolved at this HEAD; every one of the 14 posted inline threads is closed. Three Tier 3 polish items remain, all contained one-line edits — flagged inline. Path to Ready
Context for Reviewers
For human reviewer
Review history (10 prior reviews)
|
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Introduces a guarded “stashed-op apply window” so DDSes can’t submit local ops during stashed-op replay, preventing pending-local-state mismatches and surfacing violations as attributable errors.
Changes:
- Added a one-way apply lifecycle to
PendingStateManagerwith a close hook when stashed ops fully drain. - Updated
ContainerRuntimereadonly aggregation andsubmit()to detect/handle submits during apply (with allowlist + kill switch). - Added unit + end-to-end tests covering lifecycle, submit behavior, and the mismatch scenario.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/test/local-server-tests/src/test/submitDuringStashedOpApply.spec.ts | New end-to-end repro asserting load rejects when a listener submits during stashed-op apply. |
| packages/runtime/container-runtime/src/test/pendingStateManager.spec.ts | New lifecycle tests + adjusted existing test setup to avoid flush during apply window. |
| packages/runtime/container-runtime/src/test/containerRuntime.spec.ts | New runtime tests for submit-during-apply behavior, allowlist, and kill switch. |
| packages/runtime/container-runtime/src/pendingStateManager.ts | Adds apply lifecycle state + optional close hook and early return optimization. |
| packages/runtime/container-runtime/src/containerRuntime.ts | Aggregates readonly with apply-window state, reworks baseLogger tagging, adds submit guard + replay assert, adjusts readonly notifications. |
| * Synchronous: fires from a `finally` block where async behavior would | ||
| * complicate error propagation. |
| // recoverable state. | ||
| if (this._applyLifecycle === "applying" && this.initialMessages.isEmpty()) { | ||
| this._applyLifecycle = "ended"; | ||
| this.hooks.onAfterStashedOpsApplied?.(); |
| // `channelCollection` may be undefined when invoked from the PSM's | ||
| // open hook, which fires from `new PendingStateManager(...)` earlier | ||
| // in this constructor than `channelCollection` is assigned. That's | ||
| // fine — DDSes created after this point will read the runtime's | ||
| // `isReadOnly()` aggregation and start out in the correct state. |
|
|
||
| // 1. Create the container with two named SharedMaps, attach, write, | ||
| // disconnect, write offline, capture pending local state. | ||
| const goodFactory = new TestFluidObjectFactory( | ||
| [ | ||
| ["primary", SharedMap.getFactory()], | ||
| ["secondary", SharedMap.getFactory()], | ||
| ], | ||
| "default", | ||
| ); | ||
| const { | ||
| codeDetails, | ||
| loaderProps: goodLoaderProps, | ||
| urlResolver, | ||
| } = createLoader({ | ||
| deltaConnectionServer, | ||
| defaultDataStoreFactory: goodFactory, | ||
| }); | ||
|
|
||
| const container = asLegacyAlpha( | ||
| await createDetachedContainer({ codeDetails, ...goodLoaderProps }), | ||
| ); | ||
|
|
||
| const initialObject = (await container.getEntryPoint()) as ITestFluidObject; | ||
| const primary = await initialObject.getSharedObject<ISharedMap>("primary"); | ||
| primary.set("pre-attach", "value"); | ||
|
|
||
| await container.attach(urlResolver.createCreateNewRequest("submit-during-apply")); | ||
| primary.set("attached", "value"); | ||
|
|
||
| const url = await container.getAbsoluteUrl(""); | ||
| assert(url !== undefined, "container should have a URL after attach"); | ||
|
|
||
| container.disconnect(); | ||
| primary.set("offline", "value"); | ||
|
|
||
| const pendingLocalState = await container.getPendingLocalState(); | ||
| container.close(); | ||
|
|
||
| // 2. Build a separate loader that, on existing=true loads, wires up a | ||
| // valueChanged listener on `primary` that performs a cascading set | ||
| // on `secondary`. Share the resolver and driver so the URL produced | ||
| // above resolves on the new loader. | ||
| const { loaderProps: reactingLoaderProps } = createLoader({ | ||
| deltaConnectionServer, | ||
| defaultDataStoreFactory: new ReactingMapFactory(goodFactory), | ||
| }); | ||
|
|
||
| // 3. The stashed `offline` op fires `valueChanged` on `primary` during | ||
| // apply; the listener's `secondary.set` reaches the runtime's | ||
| // submit guard (a different channel from the one in applyStashedOp, | ||
| // so the channel-level stashedOpMd capture doesn't swallow it), and | ||
| // the load rejects. | ||
| await assert.rejects( | ||
| loadExistingContainer({ | ||
| ...reactingLoaderProps, | ||
| request: { url }, | ||
| pendingLocalState, | ||
| }), | ||
| (error: Error & { message?: string }) => | ||
| error.message?.includes("Local op submitted during stashed-op apply window") === true, | ||
| ); |
| pendingStateManager.pendingMessages.push({ | ||
| type: "message", | ||
| content: '{"type":"component"}', | ||
| referenceSequenceNumber: message.referenceSequenceNumber, | ||
| localOpMetadata: undefined, | ||
| opMetadata: undefined, | ||
| batchInfo: { | ||
| clientId: "CLIENT_ID", | ||
| batchStartCsn: 0, | ||
| length: 1, | ||
| staged: false, | ||
| }, | ||
| runtimeOp: message.runtimeOp, | ||
| }); |
Summary
PendingStateManagerexposes a one-way apply lifecycle (notStarted→applying→ended). The window opens eagerly in the constructor when stashed state is present and closes the first timeapplyStashedOpsAtdrainsinitialMessagessuccessfully. An apply error leaves the lifecycle atapplying.ContainerRuntime.isReadOnly()aggregatespendingStateManager.isApplyingStashedOpsand fans out vianotifyReadOnlyState, so compliant DDSes skip submits during stashed-op replay.ContainerRuntime.submit()always emits aSubmitDuringStashedOpApplyerror event on a bypass during the apply window, then throws a fatalUsageErrorunless the kill switchFluid.ContainerRuntime.DisableSubmitDuringStashedApplyThrowis set.BlobAttachandIdAllocationare allowlisted as legitimate runtime-internal traffic during apply. Checking at submit time (rather than atonFlushBatch) closes the gap where a deferred flush would land after the apply window closes and mask the misuse.ContainerRuntime.replayPendingStatesasserts that the apply window is not open. The only real caller issetConnectionStateCore'scanSendOpsedge, and the loader awaitsapplyStashedOpsAtbefore transitioning to a write-capable connection — the assert surfaces a contract violation rather than silently swallowing the edge.Background
When a host calls
loadContainerwithpendingLocalState, the runtime replays stashed ops viaapplyStashedOpsAt. If a DDS submits a local op during that replay — typically a realize-time write that doesn't consultreadOnly— the new op has no counterpart in the saved-op stream and the next saved-op replay throws "pending local message content mismatch". This change makes the runtime effectively readonly for the duration of the replay and converts the bypass case into an immediate, attributable failure, with a kill switch for production rollout safety.Risks & tradeoffs
baseLoggerreshape (incidental).ContainerRuntime.baseLoggeris now built viacreateChildLoggerwith always-onerrorproperties (inStagingMode,isApplyingStashedOps,isReadOnly). Every error event flowing throughbaseLogger— including downstream consumers — now carries these tags. Behavior change for telemetry pipelines reading rawbaseLoggeroutput.Fluid.ContainerRuntime.DisableSubmitDuringStashedApplyThrowis default-off (throw + close on bypass). When set, theUsageErroris still constructed and surfaced as an error event but is not thrown — preserves attribution while leaving an off-switch if a first- or third-party DDS in production quietly bypasses the readonly gate.Test plan
applyStashedOpsAtcalls, and lifecycle staying"applying"when the final apply throws.FluidDataStoreOpthrows + logs error event,BlobAttachallowlisted (no throw), kill switch suppresses throw and still logs, no throw outside the apply window.🤖 Generated with Claude Code — signed: Claude (Opus 4.7)