Refactor/globalindex external kv #330

Open
maning00 wants to merge 2 commits into refactor/umbp-dual-scheme-abc from refactor/globalindex-external-kv

@maning00

No description provided.

maning00 added 2 commits May 20, 2026 12:33
Consolidate the parallel ExternalKvBlockIndex into GlobalBlockIndex by
adding a LocationOwner discriminator (UMBP_OWNED vs EXTERNAL_HICACHE)
on every Location entry.  Storage-backed blocks and external sglang
bind/unbind notifications now share one index, one lookup path, and
one eviction picker, removing a long-standing source of drift between
the two views.
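
The owner discriminator can be pictured with a small Python sketch. This is only an illustration of the idea, not the C++ implementation: the `UnifiedIndex` class, the `node_id` / `tier` fields, and the tier strings are hypothetical; only `LocationOwner`, `UMBP_OWNED`, and `EXTERNAL_HICACHE` come from the commit message.

```python
from dataclasses import dataclass
from enum import Enum


class LocationOwner(Enum):
    UMBP_OWNED = 1
    EXTERNAL_HICACHE = 2


@dataclass(frozen=True)
class Location:
    # Field names are assumptions for illustration only.
    node_id: str
    tier: str
    owner: LocationOwner


class UnifiedIndex:
    """Toy stand-in for GlobalBlockIndex: one map, one owner tag per entry."""

    def __init__(self):
        self._locations = {}  # block hash -> set of Location

    def add(self, block_hash, location):
        self._locations.setdefault(block_hash, set()).add(location)

    def lookup(self, block_hash, owner=None):
        """Return all locations for a hash, optionally filtered by owner."""
        found = self._locations.get(block_hash, set())
        if owner is None:
            return set(found)
        return {loc for loc in found if loc.owner is owner}


index = UnifiedIndex()
index.add("h1", Location("node-a", "DRAM", LocationOwner.UMBP_OWNED))
index.add("h1", Location("node-b", "HBM", LocationOwner.EXTERNAL_HICACHE))
external_only = index.lookup("h1", LocationOwner.EXTERNAL_HICACHE)
```

Both kinds of entry live in the same map, so a single lookup path serves either view by filtering on the owner tag.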

- types: add LocationOwner enum, Location::SameIdentity, FullSyncScope
- global_block_index: owner-aware ApplyEvents / ReplaceNodeLocations /
  MatchExternal / FindEvictionCandidates
- master_client: ack-retained bundle outbox with seq numbers;
  owner-scoped full sync; bind/unbind/clear/flush external hashes APIs
- master_server: drop ReportExternalKv{Add,Remove,Clear} RPC handlers;
  MatchExternalKv now reads from the unified GlobalBlockIndex
- pybind: expose bind_external_hashes / unbind_external_hashes /
  unbind_all_external_hashes_at_tier / flush_external_queue
- proto: remove deprecated mutation RPCs (events ship via heartbeat
  bundles instead)
- delete external_kv_block_index.{h,cpp} and its dedicated unit tests;
  refresh test_global_block_index_events / test_router_dedup /
  test_peer_dram_allocator coverage
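
As a rough picture of the "ack-retained bundle outbox with seq numbers" item above, the following Python sketch keeps every bundle until the master acknowledges its sequence number. The class name, method names, and the cumulative-ack behavior are assumptions made for illustration; the real master_client logic may differ.

```python
from collections import OrderedDict


class AckRetainedOutbox:
    """Hypothetical outbox: bundles stay queued until they are acked."""

    def __init__(self):
        self._next_seq = 1
        self._pending = OrderedDict()  # seq -> list of events

    def enqueue(self, events):
        """Assign the next sequence number to a bundle and retain it."""
        seq = self._next_seq
        self._next_seq += 1
        self._pending[seq] = list(events)
        return seq

    def bundles_to_send(self):
        """Everything not yet acked is (re)sent with the next heartbeat."""
        return list(self._pending.items())

    def ack(self, acked_seq):
        """Assumed cumulative ack: drop every bundle with seq <= acked_seq."""
        for seq in [s for s in self._pending if s <= acked_seq]:
            del self._pending[seq]


outbox = AckRetainedOutbox()
first = outbox.enqueue(["bind h1"])
second = outbox.enqueue(["unbind h2"])
outbox.ack(first)
remaining = outbox.bundles_to_send()
```

Because unacked bundles are retained, a lost heartbeat only delays delivery: the same bundle goes out again with its original sequence number.
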
Restore the pre-v2.5 ReportExternalKvBlocks / RevokeExternalKvBlocks /
RevokeAllExternalKvBlocksAtTier RPC surface as a thin compatibility
layer on top of the unified GlobalBlockIndex.  The deleted
ExternalKvBlockIndex class is NOT brought back; the restored handlers
delegate directly to GlobalBlockIndex::ApplyEvents with
LocationOwner::EXTERNAL_HICACHE, sharing the same backing store as the
v2.5 heartbeat bundle outbox path.

Two distinct surfaces, both consistent with pre-v2.5 signatures:

  * UMBPMasterClient — 3-arg explicit-node_id sync RPCs for
    schedulers / sidecars that report on behalf of a registered
    worker.  Report requires node_id alive in ClientRegistry; revoke
    paths skip the alive check (index delete is always allowed).
    Empty node_id / empty hashes return INVALID_ARGUMENT.

  * mori.cpp.UMBPClient — 2-arg implicit-self aliases that route
    through BindExternalHashes / UnbindExternalHashes /
    UnbindAllExternalHashesAtTier + FlushExternalQueue, so the
    entries are tracked in external_current_set_ and survive a
    subsequent FULL_SYNC_EXTERNAL_HICACHE replay.
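
The argument checks described for the 3-arg surface can be sketched as below. This is a simplified Python model, not the server code: the function names are hypothetical, `alive_nodes` stands in for ClientRegistry, and `NODE_NOT_ALIVE` is a placeholder because the commit message does not say which status a dead node_id produces.

```python
from enum import Enum


class Status(Enum):
    OK = 0
    INVALID_ARGUMENT = 1
    NODE_NOT_ALIVE = 2  # placeholder name; the actual status is not stated


def check_report(node_id, hashes, alive_nodes):
    """Report path: reject empty arguments, then require a live node."""
    if not node_id or not hashes:
        return Status.INVALID_ARGUMENT
    if node_id not in alive_nodes:
        return Status.NODE_NOT_ALIVE
    return Status.OK


def check_revoke(node_id, hashes):
    """Revoke path: reject empty arguments but skip the alive check."""
    if not node_id or not hashes:
        return Status.INVALID_ARGUMENT
    return Status.OK


alive = {"worker-1"}
report_dead = check_report("worker-2", ["h1"], alive)
revoke_dead = check_revoke("worker-2", ["h1"])
```

The asymmetry is deliberate in the description above: deleting from the index is always allowed, so a revoke for a worker that has already dropped out of the registry still succeeds.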

Other changes:

  * proto: restore 3 request/response messages and service entries
    byte-identical to pre-v2.5 (existing master and new master are
    wire-compatible for these RPCs and MatchExternalKv)
  * global_block_index: RemoveLocationsLocked now returns the count
    of removed locations; ApplyEvents accumulates it for
    CLEAR_AT_TIER so RevokeAllExternalKvBlocksAtTier reports a
    truthful BLOCKS_TOTAL metric
  * master_server: reuse existing MORI_UMBP_METRIC_EXT_KV_*
    constants instead of inventing new names; consistent with
    existing dashboards
  * IUMBPClient / DistributedClient / PoolClient / StandaloneClient:
    propagate the 2-arg API down the data-plane stack (no-op stub
    on standalone)
  * docs/api/umbp.rst: restore method-table rows and add a
    "Where to call from" subsection explaining the two write paths
    and their visibility / lifecycle differences
  * src/umbp/doc/design-external-kv-report-revoke-bc.md: full
    design notes (background, API surface, alive-check rationale,
    coexistence with bundle outbox, test plan)
  * tests: 12 new Python tests covering both surfaces, including a
    regression guard (test_umbpclient_two_arg_alias_survives_external_full_sync)
    that prevents the 2-arg alias from accidentally being changed
    to a non-outbox path
  * test_global_block_index_events: assert the new
    CLEAR_AT_TIER mutated-count return value
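
The CLEAR_AT_TIER counting change listed above can be illustrated with a small Python sketch. The data layout (a dict of sets of `(node_id, tier)` tuples), the event tuple shape, and the function names are assumptions for illustration; they only mirror the idea that the removal helper returns a count and the event loop sums it.

```python
def remove_locations(index, node_id, tier):
    """Remove every (node_id, tier) location and return how many were removed."""
    removed = 0
    for block_hash in list(index):
        before = len(index[block_hash])
        index[block_hash] = {
            loc for loc in index[block_hash] if loc != (node_id, tier)
        }
        removed += before - len(index[block_hash])
        if not index[block_hash]:
            del index[block_hash]  # drop hashes that have no locations left
    return removed


def apply_events(index, events):
    """Apply events and accumulate the number of locations actually removed."""
    mutated = 0
    for kind, node_id, tier in events:
        if kind == "CLEAR_AT_TIER":
            mutated += remove_locations(index, node_id, tier)
    return mutated


index = {
    "h1": {("node-a", "DRAM"), ("node-b", "DRAM")},
    "h2": {("node-a", "DRAM")},
    "h3": {("node-a", "SSD")},
}
cleared = apply_events(index, [("CLEAR_AT_TIER", "node-a", "DRAM")])
```

Returning the real count, rather than the number of events processed, is what lets a caller report an accurate total for a tier-wide clear.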