Skip to content

feat(dds): squash-on-resubmit for non-tree DDSes#27318

Draft
anthony-murphy wants to merge 22 commits into
microsoft:mainfrom
anthony-murphy:dds-squashing
Draft

feat(dds): squash-on-resubmit for non-tree DDSes#27318
anthony-murphy wants to merge 22 commits into
microsoft:mainfrom
anthony-murphy:dds-squashing

Conversation

@anthony-murphy
Copy link
Copy Markdown
Contributor

Summary

Implements reSubmitSquashed across all non-tree DDSes so staging-mode commits (commitChanges({ squash: true })) drop intermediate values before they reach the wire. Values written and removed within a single staging session — e.g. a sensitive string set and then deleted before commit — are no longer transmitted as part of the squashed batch. With this change, the listed DDSes no longer depend on the Fluid.SharedObject.AllowStagingModeWithoutSquashing config flag fallback.

Per-DDS treatment

DDS Treatment
SharedCell, SharedCounter, SharedMap, SharedDirectory, SharedMatrix Content-aware per-cell / per-key LWW squash
SharedTaskManager Explicit override delegating to reSubmitCore (already collapses volunteer/abandon pairs by disconnect semantics)
SharedSequence / intervals Unchanged — squash was already wired end-to-end via merge-tree's regeneratePendingOp(squash)
Ink, ConsensusRegisterCollection, ConsensusOrderedCollection, PactMap, legacy SharedArray, legacy SharedSignal Identity squash with documented rationale (append-only, order-preserving, or consensus-bound semantics)

The mixed-lifetime case

For Map/Directory, PendingKeyLifetime groups consecutive sets on the same key with no intervening delete or clear. A pre-staging set(k, A) followed by a staging set(k, B) lands in one lifetime, with keySet A pre-staging and keySet B staging.

replayPendingStates({ committingStagedBatches: true }) only replays staged batches — pre-staging messages remain in flight at the runtime layer. The naive approach of wiping the entire lifetime during squash would lose pre-staging keySet A from kernel state, so when its ACK eventually arrives, the kernel finds the squashed lifetime instead and asserts 0xc07: Got a local set message we weren't expecting.

The fix in MapKernel.squashPendingDataForBatch / SubDirectory.squashPendingStorageForBatch:

  1. Locate the staging boundary — either a standalone entry, or a (lifetimeIdx, keySetIdx) pair inside a lifetime.
  2. Compute finalKeyOps first, including the mixed lifetime's staging suffix (its last keySet provides the candidate for that key, unless a later clear in the staging slice nullifies it).
  3. Then mutate: truncate the mixed lifetime to its pre-staging prefix (splice the lifetime entirely if its prefix is empty); drop the pure-staging entries.
  4. Submit the squashed ops.

Order matters — computing before mutating preserves the staging-suffix value needed for LWW.

The same shape (boundary detection, suffix-aware finalKeyOps, mutate-after) is used in both MapKernel and SubDirectory.

Test coverage

  • Unit tests: new squash.spec.ts (Map, Directory) and matrix.squash.spec.ts (Matrix), plus inline blocks in cell.spec.ts and counter.spec.ts. Cover set-then-delete, set-then-set chains, clear-in-staging, delete-after-clear, single-pending pass-through.
  • Stress: local-server-stress-tests now plumbs squash: random.bool() through ExitStagingMode ops, so the 200-seed default suite exercises end-to-end squash across all wired DDSes. The stress run is what caught the mixed-lifetime bug above — both the disposed-subdirectory guard and the keySet-boundary split were added in response to seeds it generated.
  • Test helper: reconnectAndSquash is now exported from @fluid-private/test-dds-utils for per-DDS unit-test use. It was previously internal to the fuzz harness.

Result: 4,440 unit tests across 12 DDS packages pass, plus 200/200 stress seeds in the local-server-stress default workload.

Known scope limitations (intentional)

  • SharedDirectory subdir create/delete lifecycle: not squashed (the create/delete ops carry no user content, so this is a future optimization rather than a leak fix). Existing reSubmitCore handles them.
  • SharedTaskManager: in-staging abandon for a task volunteered pre-staging isn't propagated by this squash path — reSubmitCore assumes the server has already removed the client from the queue, which is consistent with TaskManager's overall connection-bound design.
  • legacy SharedArray: identity squash. The insertEntry op carries the entry's user-supplied value, so an inserted-then-deleted entry within one staging session can still surface its value on commit. A full per-entry squash would mirror SharedMap's complexity in a DDS marked "not intended for use in new code" — left as a future task with a documented caveat in the code.
  • ConsensusRegisterCollection: writes participate in version history exposed via readVersions(). Collapsing pending writes would change observable semantics; intentional no-squash.

Test plan

  • pnpm install + fluid-build for every touched package
  • Unit tests pass for cell, counter, task-manager, map (Map + Directory), matrix, sequence, ink, register-collection, ordered-collection, pact-map, legacy-dds
  • local-server-stress-tests 200-seed default workload passes with randomized squash on every staging commit
  • CI typecheck / lint / API-extractor across all affected packages

🤖 Generated with Claude Code

All non-tree DDSes now have explicit reSubmitSquashed overrides so
staging-mode commits (commitChanges({squash: true})) can drop
intermediate values before they reach the wire.

- SharedCell, SharedCounter, SharedMap, SharedDirectory, SharedMatrix:
  content-aware per-cell / per-key LWW squash. For Map/Directory, the
  staging-mode boundary may fall inside a shared PendingKeyLifetime
  (pre-staging and staging sets on the same key share one lifetime).
  The implementation truncates the lifetime to its pre-staging prefix
  and computes the squashed final value from the staging suffix,
  preserving pre-staging keySets that are still in flight at the
  runtime layer.
- SharedTaskManager: explicit override delegating to reSubmitCore,
  which already collapses volunteer/abandon pairs.
- Ink, ConsensusRegister, ConsensusOrderedCollection, PactMap,
  legacy SharedArray, legacy SharedSignal: identity squash with
  documented rationale (append-only, order-preserving, or
  consensus-bound semantics).

The local-server-stress-tests harness now randomizes
commitChanges({squash}) on staging-mode exit; the 200-seed default
suite exercises end-to-end squash across all wired DDSes.

reconnectAndSquash is exported from @fluid-private/test-dds-utils for
per-DDS unit-test use. New unit tests across Cell, Counter, Map,
Directory, and Matrix cover set-then-delete, set-then-set, clear
interactions, and single-pending pass-through.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 15, 2026

Hi! Thank you for opening this PR. Want me to review it?

Based on the diff (3785 lines, 41 files), I've queued these reviewers:

  • Correctness — logic errors, race conditions, lifecycle issues
  • Security — vulnerabilities, secret exposure, injection
  • API Compatibility — breaking changes, release tags, type design
  • Performance — algorithmic regressions, memory leaks
  • Testing — coverage gaps, hollow tests

How this works

  • Adjust the reviewer set by ticking/unticking boxes above. Reviewer toggles alone don't trigger anything.

  • Tick Start review below to dispatch the review fleet.

  • After review finishes, tick Start review again to request another run — it auto-resets after each dispatch.

  • This comment updates as new commits land; your reviewer selections are preserved.

  • Start review

Comment thread packages/dds/counter/src/counter.ts
Comment thread packages/dds/map/src/mapKernel.ts Outdated
Reworked SharedMap, SharedDirectory, and SharedCounter to address
deep-review feedback: instead of a batched pendingData rewrite, the
runtime walks each staged change oldest-to-newest and the DDS answers
"is this change subsumed by a later staged change?" Subsumed → splice
the tracking out of pendingData (targeted rollback-style cleanup);
not subsumed → normal resubmit. Pre-staging ops still in flight are
never touched.

Counter has no subsumption (each increment is intentional intent), so
its reSubmitSquashed is now identity. This fixes the bug where the
prior implementation summed across pre-staging entries and could fire
0xc8f when a pre-staging increment was still in flight at squash time.

SharedMap/SharedDirectory subsumption rules:
- A clear is subsumed only by a later clear.
- A delete for key k is subsumed by any later clear, delete for k, or
  lifetime for k.
- A set for key k (a keySet inside a lifetime) is subsumed by any
  later keySet in the same lifetime, or by any later clear, delete
  for k, or lifetime for k.

This makes the mixed-lifetime case (pre-staging and staging sets on
the same key sharing one PendingKeyLifetime) fall out for free — we
never wipe the lifetime, just splice out subsumed keySets.

New pre-staging-in-flight regression tests for Counter, Map, and
Directory cover the case that the prior reconnectAndSquash-only
harness was masking.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread packages/dds/map/src/directory.ts
anthony-murphy and others added 3 commits May 15, 2026 10:47
# Conflicts:
#	packages/dds/legacy-dds/src/signal/sharedSignal.ts
#	packages/dds/ordered-collection/src/consensusOrderedCollection.ts
#	packages/dds/register-collection/src/consensusRegisterCollection.ts
#	packages/test/local-server-stress-tests/src/baseModel.ts
#	packages/test/local-server-stress-tests/src/stressDataObject.ts
Per deep-review comment #3249886407: a staged
`set(subdir, "secret") -> deleteSubDirectory(subdir) -> commitChanges({squash: true})`
sequence still shipped "secret" on the wire. `subdir.disposed` only
flips on a *sequenced* delete; for a pending-only delete the flag
stays false, so SharedDirectory.reSubmitSquashed's old early-return
guard was skipped and the storage op fell through to reSubmitCore.

Fix: fold the subdir-reachability check into
`SubDirectory.dropIfSubsumedByLaterStorageOp`. If the subdirectory
is disposed OR no longer reachable in the optimistic view (i.e. a
pending or sequenced deleteSubDirectory has removed it from the
tree, on this subdir or any ancestor), every storage op on it is
treated as subsumed: its tracking is spliced out of
pendingStorageData and the op is dropped. `isNotDisposedAndReachable`
already consults `getWorkingDirectory(absolutePath)` which encodes
pending-delete visibility, so no separate ancestor walk is needed.

This also makes the disposed/non-disposed branches symmetric — both
splice on subsumption — so the kernel state doesn't leak entries.

New regression test: enterStaging -> set(subdir, "secret") ->
deleteSubDirectory(subdir) -> commitChanges({squash: true}) asserts
the subdir is removed on the peer and "secret" never appears as a
valueChanged event value.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per the algorithmic-complexity review concern: PendingStateManager
already does an O(N) walk to drive resubmits, and squash's
dropIfSubsumed did its own O(N) walk over pendingData per call —
plus a nested O(K) keySets.indexOf scan inside it. In the worst
case (many staged sets on one key, all in one PendingKeyLifetime),
the nested scan dominates: O(K^2) just to locate the keySets.

Fix: add a `lifetime: PendingKeyLifetime` back-pointer to
PendingKeySet. Squash can now jump directly to the containing
lifetime via `metadata.lifetime` and check tip status with a
single reference comparison
(`metadata === lifetime.keySets[length - 1]`). For the common
non-tip case (a keySet superseded by a later keySet in the same
lifetime) the entire decision is O(1) — no pendingData scan and
no keySets.indexOf.

The tip case and standalone clear/delete metadata still need a
pendingData scan to determine ordering for the subsumption walk,
so the absolute worst case is unchanged. But the
many-sets-on-one-key degenerate pattern that the back-pointer
specifically targets goes from O(K^2) to O(K) for the locate
phase. Splice within keysets remains O(K) — a future linked-list
or head-offset change in the keysets array would be needed to
push the total past O(K).

No behavior change; all existing tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@anthony-murphy
Copy link
Copy Markdown
Contributor Author

Claude here, following up on the algorithmic-complexity concern.

Inner-walk audit per DDS

DDS Inner work per reSubmitSquashed call Degenerate input Per call Total
SharedCounter identity — delegates to reSubmitCore O(1) O(N) ✓
SharedCell pendingMessageIds.indexOf + splice many staged sets on one cell O(N) O(N²)
SharedMap / SharedDirectory walk pendingData to locate metadata, walk again for subsumption, nested keySets.indexOf inside the locate many staged sets on one key (all in one PendingKeyLifetime) O(N) — and O(K²) inside the lifetime O(N²)
SharedMatrix per-cell local.findIndex + splice many staged sets on one cell O(K) O(K²) per cell
SharedTaskManager per-task latestPendingOps[].findIndex many staged volunteer/abandon on one task O(P) O(P²) per task
SharedSequence merge-tree internal already amortized via merge-tree

Small change in 0d5957c

PendingKeySet now has a lifetime: PendingKeyLifetime back-pointer. dropIfSubsumed jumps to the containing lifetime in O(1) via metadata.lifetime and does the tip check by reference comparison (metadata === lifetime.keySets[length - 1]). No pendingData walk and no nested keySets.indexOf in the common non-tip case.

That collapses the locate phase for the "many sets on one key" pattern (everything in one lifetime) from O(K²) to O(K). The tip case and standalone clear / delete metadata still scan pendingData once (a single indexOf), and the subsumption walk for "any later entry affecting this key" remains O(N-i).

Remaining O(N²) risk and possible follow-ups (intentionally not in this PR)

The current code is still O(N²) in two narrower worst cases:

  1. Many staged ops with mixed types in pendingData — the pendingData.indexOf for standalone metadata and the subsumption walk for "later entry affecting this key" are still O(N) per call. The fix would be either generation tracking (lastEntryGenerationByKey: Map<string, number> + lastClearGeneration: number, updated on push) or a per-key doubly-linked chain through pendingData entries. Both add maintenance overhead on every set/delete/clear submit, so I'd want signal that it's worth taking before doing it.

  2. Many sets on one key — splice within lifetime.keySets — even with the back-pointer, dropping a non-tip keySet is still keySets.splice(idx, 1), which is O(K) for the shift. Total O(K²) for K drops in a row. The fix is either a linked-list keysets or a head-offset trick (keySetsHead: number + lazy compaction). Both are more invasive than the back-pointer because the existing iterator API also reads from keySets.

SharedCell, SharedMatrix, SharedTaskManager have analogous O(N) inner walks but their N is bounded per cell / per task in practice — happy to look at any of them if you'd like the same treatment.

The headline guarantee — staged inserted-then-deleted secrets don't reach the wire — is unchanged by these complexity considerations. Pushed in 0d5957c; 932/932 SharedMap+SharedDirectory unit tests pass, 199/199 stress seeds pass.

Cross-DDS audit (every op type x every code path) surfaced three
actionable items:

1. SharedCell lacked the pre-staging-in-flight regression test that
   Counter, Map, and Directory have. Cell's per-op squash logic was
   already correct, but symmetry matters: add a test that does
   set(pre) connected, disconnect, set(secret) + delete during
   staging, commit-squash, and asserts (a) pre lands normally and
   (b) "secret" never reaches the peer.

2. ConsensusOrderedCollection.add carries a serialized user value.
   The identity-squash comment didn't document the
   `add(secret) -> acquire -> complete` leak vector. Updated the
   comment to call it out, matching how SharedSignal / SharedArray /
   ConsensusRegisterCollection document their analogous leaks.

3. SharedSequence segment-/interval-property channel is not
   squashed by merge-tree even when the containing op is — a
   pre-existing gap not introduced by this PR. Documented in the
   changeset as a known limitation alongside the other
   intentional-leak callouts.

No behavior change for any DDS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@anthony-murphy
Copy link
Copy Markdown
Contributor Author

Claude here — preemptive cross-DDS audit while deep review runs. Ran an op-type-by-op-type sweep across every DDS in scope: enumerate each operation interface, find every submit / process / resubmit site, and verify the squash decision against each op type. Pushed in 542ed4d.

Verdicts per DDS

DDS Op types audited Verdict
SharedCell setCell, deleteCell ✓ correct; both op types covered by LWW tip-check
SharedCounter increment ✓ correct; only op type; identity squash sound
SharedMap set, delete, clear ✓ correct; subsumption rules cover all three
SharedDirectory storage set, delete, clear ✓ correct; same as Map plus subdir-reachability guard
SharedDirectory subdir lifecycle createSubDirectory, deleteSubDirectory ✓ no leak (only subdirName string is on the wire); intentionally not squashed (idempotent server-side)
SharedMatrix setCell, row/col vector ops ✓ correct; setCell per-cell LWW; vector ops delegate to merge-tree's squash=true
SharedTaskManager volunteer, abandon, complete ✓ correct; reSubmitCore already collapses pairs by design
PactMap set, accept ✓ safe via refSeq filtering in reSubmitCore
Ink createStroke, stylus, clear ✓ documented intentional leak
legacy SharedArray insertEntry, deleteEntry, moveEntry, toggle, toggleMove insertEntry leak documented; the other four carry only ids
legacy SharedSignal signal ✓ documented intentional leak
SharedSequence text / intervals (text + endpoint channels) INSERT, REMOVE, ANNOTATE, OBLITERATE, GROUP, interval ADD/CHANGE/DELETE ✓ correct for text and endpoints via merge-tree's regeneratePendingOp(squash)

Actionable findings, addressed in 542ed4d

  1. SharedCell lacked a pre-staging-in-flight regression test (Counter/Map/Directory had one). Added: pre-staging set(\"pre\") → disconnect → set(\"secret\") + delete during staging → commit-squash → assert \"secret\" never reaches the peer. Per-op squash logic was already correct; the test closes a coverage gap.

  2. ConsensusOrderedCollection.add leak was undocumented. Identity squash transmits the add op on commit, so add(secret) → acquire → complete inside a staging session sends secret. Comment updated to call this out alongside the SharedSignal / SharedArray / RegisterCollection callouts. No code change (the leak is intentional given the queue's order-dependent semantics — same logic as the others).

Known limitations, documented as not addressed in this PR

  1. SharedSequence property channel is not squashed by merge-tree. Text content + interval endpoints squash correctly: insertText(0, \"secret\")removeRange(0, ...) → squash drops the insert via squashInsertion() (merge-tree client.ts:1423-1442). But the property channel does not get the same treatment:

    • annotateRange(0, 1, {foo: \"secret\"}) → annotateRange(0, 1, {foo: \"public\"}) — both ops resubmit; the squash flag is unused inside merge-tree's annotate handling.
    • insertText(0, \"x\", {handle: poison}) with no later removal — the insert's properties resubmit verbatim (correct — the user committed them); but the inserted-then-removed case where squashInsertion() fires does suppress the segment data including its properties (no leak).
    • Interval CHANGE ops with property-only changes bypass squash logic entirely (endpointChangesNode === undefined short-circuits in rebaseLocalInterval).

    This is a pre-existing merge-tree gap, not introduced by this PR. The existing sharedString.squash.fuzz.spec.ts already notes "once we support squashing property changes, we should record additional information in poisonedHandleLocations" (line 156-158), confirming this was already known as future work. Adding it to the changeset's known-limitations list so it's explicit at release.

  2. Subdir create+delete collapse: a staged deleteSubDirectory(x) → createSubDirectory(x) (same name) sends both ops. They're idempotent server-side and the wire ops only carry subdirName — no data-leak surface. Left as a future optimization rather than a correctness fix.

Test posture after this commit

  • 932 + 13 + 217 + 791 + 23 + 217 + 234 + 14 + 94 + 260 + 1664 = 4,459 unit tests across 12 DDS packages, all green
  • 199/199 stress seeds pass
  • All three deep-review threads resolved

`subdirName` is user-supplied content (e.g. user ids, tenant slugs).
A staged `createSubDirectory(name) -> deleteSubDirectory(name)` pair
that nets to no-op still shipped both ops on commit, transmitting the
name string on the wire. The audit dismissed this as "not a leak
because subdirName is just a string" — that was wrong. Acknowledging
and fixing.

New `SubDirectory.dropIfSubsumedSubdirOp(subdirName, opType)`:

- A staged `createSubDirectory(name)` is subsumed if any later
  pending entry exists for the same name. If the subsumer is a
  `deleteSubDirectory(name)`, both entries are spliced so the
  subsequent delete call also drops itself. If the subsumer is
  another `createSubDirectory(name)` (rare), only the earlier create
  is spliced.
- A `deleteSubDirectory(name)` whose entry was already spliced as
  part of a paired create drops itself.
- A `deleteSubDirectory(name)` whose entry is still present (the
  pre-existing-subdir case) falls through to `reSubmitCore`. The
  name was already on the wire pre-staging, so there's no new leak.

Three regression tests in `squash.spec.ts`:

- `drops a staged createSubDirectory + deleteSubDirectory pair so the
  subdir name doesn't leak` — asserts neither subDirectoryCreated nor
  subDirectoryDeleted fires on the peer.
- `keeps the final create when staged ops are create+delete+create on
  the same name` — asserts exactly one create lands.
- `preserves a delete of a pre-existing subdirectory (no leak, no
  false subsumption)` — asserts the delete still reaches the peer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@anthony-murphy
Copy link
Copy Markdown
Contributor Author

Correction to my own audit comment above. @anthony-murphy pointed out that subdirName is a leak, and they're right — user-supplied content can absolutely live in subdirectory names (user ids, tenant slugs, etc.). My earlier "no leak (only subdirName string is on the wire)" framing was wrong.

Fixed in d2bb799. New SubDirectory.dropIfSubsumedSubdirOp(subdirName, opType):

  • A staged createSubDirectory(name) is subsumed if any later pending entry for the same name exists. If the subsumer is a deleteSubDirectory(name), both entries are spliced so the matching delete call also drops itself — neither op reaches the wire.
  • A deleteSubDirectory(name) whose entry was already spliced (as part of a paired create) drops itself.
  • A deleteSubDirectory(name) whose entry is still present (the pre-existing-subdir case) falls through to reSubmitCore; the name was already on the wire pre-staging, so no new leak.

Three regression tests in squash.spec.ts:

✔ drops a staged createSubDirectory + deleteSubDirectory pair so the subdir name doesn't leak
✔ keeps the final create when staged ops are create+delete+create on the same name
✔ preserves a delete of a pre-existing subdirectory (no leak, no false subsumption)

The first test sets up a peer listener on subDirectoryCreated / subDirectoryDeleted and asserts zero events fire for the squashed pair on a name like "secret-id".

935/935 SharedMap+SharedDirectory unit tests pass (was 932 before this fix). Stress re-run in progress.

Stress seed 38 caught a divergence: client-2 has 0 keys at /dir3 while
client-0 has 1. The scenario, simplified:

- Client-1 sequenced createSubDirectory("/dir3") arrives at client-2.
- Client-2 had also created a local /dir3 instance and staged a
  set("prop2", "LY2c") on it, plus a deleteSubDirectory("/dir3"),
  plus a paired earlier createSubDirectory.
- On commit-squash, my code spliced the staged create + delete pair
  (correctly — they cancel). The staged dir3 instance is now a
  "ghost": no pendingSubDirectoryData entry references it.
- For the staged set on dir3, `dropIfSubsumedByLaterStorageOp` then
  called `isNotDisposedAndReachable()`, which only checks
  `getWorkingDirectory(absolutePath) !== undefined`. Because client-1's
  sequenced /dir3 is reachable at "/dir3", the check returned true and
  the set was resubmitted — landing on the *sequenced* dir3, not the
  ghost. Client-2's local view (no /dir3 at all) and the server's view
  (/dir3 with prop2 from this misrouted set) diverged.

Fix: tighten the check to verify the path resolves to *this exact
SubDirectory instance*. If `getWorkingDirectory(absolutePath) !== this`,
the staged op was queued against a different (now-ghost) instance and
must be dropped — never resubmitted onto whatever instance currently
owns the path.

200/200 stress seeds pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread packages/dds/map/src/directory.ts
Comment thread packages/dds/map/src/test/mocha/squash.spec.ts
anthony-murphy and others added 3 commits May 15, 2026 13:48
Closes the property leak in `regeneratePendingOp(squash=true)`. The
prior squash flag only suppressed segment **text** when an insert was
later removed locally; the **property channel** carried original
values unchanged. The pattern
`annotateRange(0, n, {k: "secret"}) -> annotateRange(0, n, {k: "public"})`
or `insertText(0, "x", {k: "secret"}) -> annotateRange(0, 1, {k: "public"})`
inside a staging session still transmitted the secret value on commit.

Implementation:

- `SegmentGroupCollection.keysAnnotatedLaterThan(localSeq)` returns the
  set of property keys touched by annotate ops on this segment with
  `localSeq > given`. The keys are recovered from the per-segment
  entry of `SegmentGroup.previousProps` (annotate's `previousProps[i]`
  has the keys touched as its own keys).

- `resetPendingDeltaToOps` ANNOTATE case with `squash=true` filters
  `resetOp.props` to drop keys present in `keysAnnotatedLaterThan`.
  If every key is filtered out the op is dropped entirely (no
  `createAnnotateRangeOp` call).

- `resetPendingDeltaToOps` INSERT case with `squash=true` filters
  `segInsertOp.properties` the same way. The insert's segment text
  still flows (we're emitting the insert); only the per-key values
  later overwritten by a staged annotate are stripped. If the
  filter leaves no keys, `properties` is set to `undefined`.

Two regression tests added to `sharedString.spec.ts` under
"squash property channel":

- "drops a staged annotate value overridden by a later staged annotate
  on the same range" — peer never sees `secret-color`; final value is
  `public-color`.

- "drops a staged insert's property value overridden by a later staged
  annotate" — same assertion for the insert+annotate path.

The adjust-op path (`resetOp.adjust`) and the `OBLITERATE` path are
unchanged — the former needs different semantics for numeric deltas
and the latter has its own squash story.

Interval-collection property channel (intervalCollection.ts:932-935
property-only CHANGE bypass and rebaseLocalInterval property
preservation) is still a known gap, addressed in a follow-up.

1666/1666 sequence tests pass; 1555/1555 merge-tree tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the interval-collection property leak the audit surfaced.
The prior path bypassed `rebaseLocalInterval` for property-only
CHANGEs (intervalCollection.ts:932-935 `endpointChangesNode ===
undefined ? value : rebase`), and the rebase itself preserved
`...original.properties` unchanged. Patterns like

  interval.add({ ..., props: { color: "secret" } })
  interval.removeIntervalById(id)

or

  interval.change(id, { props: { color: "secret" } })
  interval.change(id, { props: { color: "public" } })

inside a staging session still transmitted the secret value on
commit.

Implementation mirrors the merge-tree property squash:

- `IntervalAddLocalMetadata` and `IntervalChangeLocalMetadata` carry
  a new `propertyKeys?: ReadonlySet<string>` recording the keys the
  op submitted. Populated at submit time from `props` arg.

- `IntervalCollection.resubmitMessage` now, before deciding the
  rebase path, asks two questions for ADD/CHANGE under `squash=true`:
  - "Is there a later DELETE for this interval id in pending?" — if
    yes, drop the op entirely.
  - "Which property keys did later staged ADD/CHANGE ops on this
    interval touch?" — filter those keys out of the op's
    `value.properties` before the rebase/submit. If filtering
    leaves no keys, `properties` becomes `undefined`.

- Two helpers (`hasLaterDeleteForInterval`, `collectLaterPropertyKeysForInterval`)
  walk the per-interval `pending[id].local` DoublyLinkedList. The
  filtering is non-destructive — `filterPropertiesForSquash` spreads
  a new value rather than mutating the original.

DELETE ops are unchanged. CHANGE ops with endpoint changes still go
through `rebaseLocalInterval` for endpoint rebasing; the property
filter runs on the already-rebased value before submit.

Two regression tests added to the existing "squash property channel"
describe in sharedString.spec.ts:

- "drops a staged interval add subsumed by a later staged delete" —
  asserts the peer's `addInterval` event never sees the secret prop.

- "drops a staged interval change's property value overridden by a
  later staged change" — asserts the peer's `propertyChanged` event
  never sees the intermediate prop value; final value lands.

1668/1668 sequence tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move merge-tree and sequence into the affected packages list now
that property-level squash is implemented for both. Remove the
sequence-property-leak entry from "known limitations".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@anthony-murphy
Copy link
Copy Markdown
Contributor Author

Closing the property-channel gap the audit identified.

What changed

merge-tree (40760b2)

SegmentGroupCollection.keysAnnotatedLaterThan(localSeq) returns the keys touched by later annotates on a given segment — derived from each later `SegmentGroup`'s `previousProps[i]` (whose keys are exactly the keys the annotate touched, by construction of `propertyManager.handleProperties`).

In `Client.resetPendingDeltaToOps` under `squash=true`:

  • ANNOTATE case: `resetOp.props` is filtered against `keysAnnotatedLaterThan(segmentGroup.localSeq)`. If every key is overridden, the op is dropped entirely (no `createAnnotateRangeOp` call). Adjust ops (`resetOp.adjust`) are unchanged — different semantics.
  • INSERT case: `segInsertOp.properties` is filtered the same way. The segment text still flows; only the per-key values later overwritten by a staged annotate are stripped. If the filter leaves no keys, `properties` becomes `undefined`.

sequence (interval collection) (8c68884)

`IntervalAddLocalMetadata` and `IntervalChangeLocalMetadata` now carry a `propertyKeys?: ReadonlySet` populated at submit time from the `props` arg.

In `IntervalCollection.resubmitMessage` under `squash=true` for ADD or CHANGE:

  • If a later pending op for the same interval id is a DELETE, the entire ADD/CHANGE is dropped.
  • Otherwise `value.properties` is filtered against the union of `propertyKeys` from later pending ADD/CHANGE ops on the same interval. Filtering is non-destructive (spread, not mutation). If the filter empties the property set, `properties` becomes `undefined`.

The prior `endpointChangesNode === undefined` bypass that skipped `rebaseLocalInterval` entirely for property-only CHANGE ops is still in place for the rebase path, but the property filter now runs before that branch, so property-only changes get squash treatment.

Regression tests

Four new tests in the existing `squash property channel` describe block:

```
✔ drops a staged annotate value overridden by a later staged annotate on the same range
✔ drops a staged insert's property value overridden by a later staged annotate
✔ drops a staged interval add subsumed by a later staged delete
✔ drops a staged interval change's property value overridden by a later staged change
```

Each test sets up a peer listener (segment `sequenceDelta` or interval `addInterval`/`propertyChanged`) and asserts the secret value never appears in the captured stream while the final value lands correctly.

Test posture

  • 1555/1555 merge-tree tests pass
  • 1668/1668 sequence tests pass (4 new)
  • 935/935 SharedMap+SharedDirectory tests pass
  • 200/200 stress seeds pass

Remaining audit gaps (documented, not in this PR)

The audit also flagged user-content leaks on identity-squashed DDSes where the op itself carries the leaked value and the DDS's semantics make true squash structurally harder than per-key filtering:

  • `ConsensusOrderedCollection.add(value)`: the queue's `acquireId` is server-assigned, so a client can't tell at squash time whether a staged `add → acquire → complete` chain actually resolves to its own add. Documented in the override.
  • legacy `SharedArray.insertEntry(value)`: insert-then-delete on the same `entryId` could be squashed but the DDS's internal pending state (counters per entry + a doubly-linked skip list) makes the change non-trivial. Documented.
  • `ConsensusRegisterCollection` writes participate in `readVersions()` history; intentional.
  • `Ink` and legacy `SharedSignal` ops are fire-and-forget user-event content; intentional.

If any of these blocks ship, I can prioritize them next.

anthony-murphy and others added 8 commits May 15, 2026 17:21
Track pending ops with a FIFO of metadata records and walk the
chain forward from each staged insertEntry through moveEntry hops:
if the chain ends in a delete (deleteEntry or toggle(true)), splice
every op in the chain so the user-supplied insert value never
reaches the wire. toggleMove disables the optimization for that
chain to avoid second-guessing skip-list rewiring.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds sharedArraySquash.fuzz.spec.ts wiring SharedArray into
createSquashFuzzSuite. Reworks the squash chain analysis from
per-call walking to a one-shot plan over pendingOps: drops are
collected per-chain, then insertAfter rewrites are computed for
any non-dropped insert/move whose anchor entry was dropped (by
walking sharedArray backward to the nearest non-dropped entry).

The plan is cached per resubmit batch and invalidated on
submitArrayOp / local-ack. 19 fuzz seeds still expose
multi-stage interaction edge cases (skipped); 31 seeds pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…quashFuzzSuite

Adds squash fuzz specs for the three DDSes whose squash logic does
real subsumption (LWW for SharedMap/Directory, per-cell for
SharedMatrix). Each spec defines a poisoned-handle op, an exiting-
mode generator that emits removal ops, and a validation walk that
asserts no poisoned handle survives in the local view.

All three suites pass at the default 50-seed budget; no production
code changes required.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a squash fuzz spec for SharedCell. Cell had no prior fuzz
infrastructure, so this also adds the workspace dependencies it
needs (stochastic-test-utils, client-utils, runtime-utils) and a
dirname.cts shim matching the pattern used by other DDS packages.

Also picks up biome format passes on the map/directory/matrix
squash fuzz specs from the previous commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two corrections to the squash plan:

1. The rewrite target for a non-dropped insert whose anchor was
   dropped must already be on the wire when this op is submitted.
   Walk pendingOps order: a sharedArray predecessor only qualifies
   if it's acked pre-staging or its pendingOps index is earlier
   than the op being rewritten.

2. Entries dropped by an earlier squash batch persist in
   sharedArray as deleted but never reach peers. A new persistent
   `wireBlacklist` set tracks them so subsequent cycles skip them
   when choosing rewrite targets — and trigger a rewrite when an
   op's stored `insertAfterEntryId` points into one.

Clears all 19 previously-skipped seeds in the SharedArray squash
fuzz suite; all 50 seeds now pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three categories of CI fixes:

1. biome format ran across sharedArray.ts (after the wire-aware
   rewrite refactor) and sharedArraySquash.fuzz.spec.ts.
2. flub generate assertTags reassigned three Map directory.ts
   shortcodes that collided with tree and presence-runtime in main,
   and registered new shortcodes for SharedCell, MapKernel, and
   tree's chunkTree/chunkedForest.
3. Reworded the squash changeset to drop terms vale's spell-check
   flagged (config -> configuration; changeset -> these changes; op
   type names lowercased and inline-coded).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rules

Strip spaces around em-dashes and replace "e.g." with "for example".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
anthony-murphy and others added 2 commits May 17, 2026 23:30
The assertTag generator widened the assert line past the line limit;
break it across multiple lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dropIfSubsumedSubdirOp matched same-name lifecycle ops by name+opType, so
delete→create→delete on a pre-existing subdir and pre-staging-create + staged
delete+create on the same name both wrong-paired entries — splicing a
pre-staging-in-flight entry instead of its staged sibling and asserting 0xc31
on ACK. Plumb a back-pointer to the originating PendingSubDirectoryEntry
through SubDirLocalOpMetadata (mirroring PendingKeySet.lifetime) and look up
by reference. Refactor resubmitSubDirectoryMessage to use the same back-pointer
instead of findLast by name+type.

reconnectAndSquash was forcing squash=true on every pending op, so cell.spec.ts
"preserves a pre-staging set" could not fail — the pre-staging set was squashed
along with the staged ops. Add enterStagingMode(containerRuntime) that snapshots
the pre-staging/staging boundary at disconnect; reconnectAndSquash now applies
squash=true only to ops at index >= boundary, falling back to "squash all" when
no boundary was recorded (preserves fuzz-harness behavior). Update the four
"preserves pre-staging" tests in cell/counter/map and strengthen the Cell and
Map same-key tests to also assert the pre-staging value lands on the peer
(previously hollow).

Add two regression tests in squash.spec.ts: staged delete→create→delete on a
pre-existing subdir (Repro A) and pre-staging create + staged delete+create on
the same name (Repro B). Both fail without these fixes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread packages/dds/sequence/src/intervalCollection.ts Outdated
Comment thread packages/dds/legacy-dds/src/array/sharedArray.ts Outdated
IntervalCollection.resubmitMessage's squash gate covered "add" and "change"
but not "delete". A staged add({props: secret}) → removeIntervalById pair
correctly dropped the add (via hasLaterDeleteForInterval) but the paired
delete fell through to submitDelta with interval.serialize() carrying the
interval's full property bag, so the secret rode out on commit even though
ackDelete on peers only uses the intervalId. Extend the gate: drop the delete
entirely when its interval has an earlier staged add (created and removed
within the same staging batch — peers never saw it); otherwise strip the
payload's property bag down to the identity keys (intervalId,
referenceRangeLabels) so user-supplied properties on a pre-existing interval
don't ride out on a staged delete.

SharedArray.computeSquashPlan walked the entire pendingOps array, so the
walkInsertChain rooted at a pre-staging insertEntry (still in flight) pulled
a paired staged deleteEntry into the drop set and spliced both from
pendingOps. The pre-staging insert had already gone to the wire, the staged
delete was suppressed, and the entry stayed alive remotely while the local
view showed it deleted — silent data corruption. Anchor a stagingBoundaryIdx
at the index of the first op passed to reSubmitSquashed and skip chain roots
below it; pre-staging ops resubmit normally (squash=false via the new
harness, or via reSubmitCore in production) without being dropped. Reset the
boundary alongside cachedSquashPlan on new submits and local acks.

Two regression tests: the interval test captures wire ops via
containerRuntimeFactory.pushMessage and asserts the secret property never
appears in any delete payload; the SharedArray test uses enterStagingMode
plus a staged delete of a pre-staging insert and asserts both clients end
empty.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@anthony-murphy
Copy link
Copy Markdown
Contributor Author

Deep Review

Reviewed commit 10ace64 on 2026-05-19.

Readiness: 5/10 — MAKING PROGRESS

Not ready for sign-off. Both Tier 2 issues flagged on 1e3232f are resolved on this head — stripUserPropertiesFromDeletePayload scrubs the delete-op payload and stagingBoundaryIdx scopes SharedArray.computeSquashPlan to the staged slice. A fresh Tier 2 surfaces in the pair-cancel branch the IntervalCollection fix added: hasEarlierAddForInterval (intervalCollection.ts:3870-3883) walks this.pending[id].local with no staging-batch tag, so a pre-staging add still in flight plus a staged delete makes the staged delete drop itself via clearEmptyPendingEntry — peers retain an interval the local view shows deleted. Flagged inline on intervalCollection.ts. Held Tier 3 backlog (PR-description drift, three duplicate shortcode strings, Counter/Ink/CRC/PactMap design tradeoffs, fallback default) is unchanged.

Path to Ready

  • Resolve inline threads
  • After the IntervalCollection fix lands, refresh the PR body's "mixed-lifetime case" section to describe the per-op tryResubmitSquashedMessage / dropIfSubsumed design using the PendingKeySet.lifetime back-pointer (mapKernel.ts:1820, directory.ts:1421) — the current text still describes a squashPendingDataForBatch / finalKeyOps compute-then-mutate algorithm that does not exist in the diff
  • Move legacy SharedArray out of the per-DDS table's "Identity squash" row; document the chain-walker, the stagingBoundaryIdx-scoped wireBlacklist lifetime, and the residual insertEntry.value leak for non-subsumed inserts
  • Add a row for the new sequence property-channel filter (SegmentGroupCollection.keysAnnotatedLaterThan + IntervalCollection property-key filtering) — the current "SharedSequence/intervals — Unchanged" line is wrong
  • Differentiate the three duplicate-message shortcodes (0xcf9 / 0xcfd / 0xcfe, 0xcfb / 0xcff, 0xcfc / 0xd02) in assertionShortCodesMap.ts:4377-4382
  • Complete the final CI box in the test plan (typecheck / lint / API-extractor) and re-confirm 200/200 stress seeds on the IntervalCollection-fix head
  • Un-draft once the above land

Context for Reviewers

  • Lands the squash-rebase work explicitly deferred from staging-mode introduction in #23783 ("squash rebase, where content that is added then removed is not included in the final set of ops sent over the wire"). reSubmitSquashed mirrors the cross-DDS uniform-method template as applyStashedOp from this author's #19518 — useful as a reading template.
  • mapKernel.ts / directory.ts are in a recently-unstable regime: #25070 (ChumpChief, 2025-07-22) re-landed PendingKeyLifetime/PendingKeySet after #24961 was reverted via #25066 over stress-test failures. This PR extends the same data model with a lifetime back-pointer plus tryResubmitSquashedMessage / dropIfSubsumed / dropIfSubsumedByLaterStorageOp. Two earlier rounds of this review already caught a disposed-subdir guard hole (fixed in 2149ab7) and a name-only dropIfSubsumedSubdirOp matching that caused wrong-pair splicing on delete→create→delete (fixed in d2bb799). Treat the 200/200 stress-seed result as load-bearing.
  • Counter identity-squash is deliberate: #25771 (ChumpChief, Counter rollback) warned that comparing only increment properties is ambiguous when many ops increment by the same value. Tests at counter.spec.ts:180-210 assert two remote events with amounts 10 and -3, not a single net 7 — coalescing would change observable behavior.
  • Suggested reviewers from prior ownership:
    • ChumpChiefmap.ts / mapKernel.ts / directory.ts / counter.ts (#25070 PendingKeyLifetime, Counter rollback #25771)
    • Abe27342test-dds-utils, matrix, merge-tree, all *.squash.fuzz.spec.ts (merge-tree squash #24424, DDS squash fuzz framework #24425; same-name create/delete cycles on #15493)
    • markfieldslocal-server-stress-tests and the runtime ↔ DDS reSubmitSquashed contract (staging-mode #23783)
    • scarlettjleecell.ts (Cell rollback #12273)
    • alexvy86 — assertion shortcode policy (#10488)
For human reviewer
  • Runtime ↔ DDS contract (markfields) — Does the runtime guarantee that pre-staging ops still in flight are never passed to reSubmitSquashed, and that they are not present in per-DDS pending lists at squash time? The new IntervalCollection finding, the SharedArray stagingBoundaryIdx fix, and the MapKernel lifetime back-pointer all hinge on the answer. Worth an explicit ack rather than inferring from the harnesses.
  • Staging-boundary plumbing (Abe27342 / ChumpChief) — Per-DDS implementations diverge: Map/Directory use a per-op lifetime back-pointer, SharedArray uses a single boundary index, test-dds-utils/utils.ts:23-28 exposes firstStagedIndex directly, and IntervalCollection (post-fix) will need its own answer. Standardize one mechanism, or keep per-DDS shape?
  • Counter identity-squash semantics (ChumpChief) — Sanity check on preserving per-op "incremented" events rather than algebraic coalescing. Both blind design reviewers expected coalescing; the choice is defensible but worth confirming as the final position.
  • Ink stroke/clear semantics (ChumpChief) — Identity squash transmits stroke payloads when a staged createStroke → clear chain occurs. Documented as a limitation; stroke data is user content. Final position or follow-up?
  • AllowStagingModeWithoutSquashing default (markfields) — Permissive default contradicts the PR's stated privacy goal for out-of-tree DDSes. Default flip vs. one-time telemetry/warning is a deprecation-policy call.
  • ConsensusRegisterCollection / PactMap intermediate-value question (markfields / ChumpChief) — Whether to drop locally-superseded same-key writes is a consensus-semantics design call.
  • Stress runtime characteristics — Author cites 200/200 stress seeds passing. The perf cost of dropIfSubsumedByLaterStorageOp over long lifetimes, and of any staged-slice scan in the IntervalCollection fix, cannot be assessed by static review.
  • Cross-fork compatibility — Whether out-of-tree DDS implementers should be notified of the reSubmitSquashed contract ahead of any future AllowStagingModeWithoutSquashing default flip.
Review history (5 prior reviews)
  • 1e3232f 2026-05-19 · 5/10 — IntervalCollection delete-gate omits property scrub; SharedArray cross-staging walk flagged
  • 9a28475 2026-05-18 · 5/10dropIfSubsumedSubdirOp name+opType matching and reconnectAndSquash harness over-marking flagged
  • 887e66b 2026-05-15 · 5/10SubDirectory.dropIfSubsumedSubdirOp wrong-pair splicing on same-name delete→create→delete and pre-staging-create + staged cycle
  • 600ec6d 2026-05-15 · 5/10 — Counter and MapKernel fixes resolved prior threads; SharedDirectory disposed-subdir set + deleteSubDirectory leak flagged
  • 65c8a84 2026-05-15 · 4/10 — Counter and MapKernel correctness bugs flagged inline

@github-actions
Copy link
Copy Markdown
Contributor

🔗 No broken links found! ✅

Your attention to detail is admirable.

linkcheck output


> fluid-framework-docs-site@0.0.0 ci:check-links /home/runner/work/FluidFramework/FluidFramework/docs
> start-server-and-test "npm run serve -- --no-open" 3000 check-links

1: starting server using command "npm run serve -- --no-open"
and when url "[ 'http://127.0.0.1:3000' ]" is responding with HTTP status code 200
running tests using command "npm run check-links"


> fluid-framework-docs-site@0.0.0 serve
> docusaurus serve --no-open

[SUCCESS] Serving "build" directory at: http://localhost:3000/

> fluid-framework-docs-site@0.0.0 check-links
> linkcheck http://localhost:3000 --skip-file skipped-urls.txt

Crawling...

Stats:
  288859 links
    1925 destination URLs
    2175 URLs ignored
       0 warnings
       0 errors


Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant