Fix server/client epoch skew #522

Draft
kmontemayor2-sc wants to merge 2 commits into main from kmonte/sync-epochs

@kmontemayor2-sc
Collaborator

kmontemayor2-sc commented Feb 27, 2026

Problem

In graph store mode, multiple GPUs on the same compute node share a single producer per storage server. The server's monotonic epoch guard
(if cur_epoch < epoch) means only the first GPU to call start_new_epoch_sampling for a given epoch triggers produce_all(). When a slow GPU
falls behind, the server silently skips its stale epoch request, so that GPU finds an empty buffer. The resulting StopIteration escapes
InfiniteIterator's single retry, and that rank exits training early, causing an NCCL deadlock.
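
The failure mode can be reproduced with a toy model. The sketch below is not GiGL code: GuardOnlyServer, its buffer, num_batches, and drain() are hypothetical stand-ins. Only the guard (cur_epoch < epoch) and the names start_new_epoch_sampling and produce_all come from the description above.

```python
from collections import deque


class GuardOnlyServer:
    """Hypothetical stand-in for a storage server with a monotonic epoch guard."""

    def __init__(self, num_batches):
        self.cur_epoch = -1          # no epoch has been produced yet
        self.num_batches = num_batches
        self.buffer = deque()        # shared by every GPU on the compute node
        self.produce_calls = 0

    def produce_all(self):
        # Fill the shared buffer with one epoch of (fake) batches.
        self.produce_calls += 1
        self.buffer.extend(range(self.num_batches))

    def start_new_epoch_sampling(self, epoch):
        # Monotonic guard: only a strictly newer epoch triggers production.
        # A stale request falls through silently and returns None.
        if self.cur_epoch < epoch:
            self.cur_epoch = epoch
            self.produce_all()


def drain(server):
    """Consume everything currently in the shared buffer."""
    batches = list(server.buffer)
    server.buffer.clear()
    return batches
```

If the server has already advanced to epoch 1 and its buffer has been consumed, a late call with epoch 0 produces nothing, so drain() returns an empty list. That empty result is the point at which the real loader would hit StopIteration.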

Fix (2 files)

gigl/distributed/graph_store/dist_server.py — Changed start_new_epoch_sampling to return tuple[int, bool] (server epoch + whether production
was triggered) instead of None. This gives the client the information it needs to detect and recover from epoch skew.
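
A minimal sketch of the new return contract, assuming the guard stays as described in the Problem section. ReportingServer and its attributes are hypothetical; only the method name and the tuple[int, bool] return type are taken from this PR.

```python
from collections import deque


class ReportingServer:
    """Hypothetical stand-in for the storage server; not the GiGL class."""

    def __init__(self, num_batches: int) -> None:
        self.cur_epoch = -1
        self.num_batches = num_batches
        self.buffer = deque()

    def produce_all(self) -> None:
        # Fill the shared buffer with one epoch of (fake) batches.
        self.buffer.extend(range(self.num_batches))

    def start_new_epoch_sampling(self, epoch: int) -> tuple[int, bool]:
        """Return (server epoch, whether this call triggered production)."""
        produced = False
        if self.cur_epoch < epoch:
            self.cur_epoch = epoch
            self.produce_all()
            produced = True
        # The caller can now compare its own epoch against the server's.
        return self.cur_epoch, produced
```

With this shape, a stale request is no longer indistinguishable from a successful one: the caller sees a server epoch greater than its own together with produced == False.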

gigl/distributed/base_dist_loader.py — Extracted the graph store branch of __iter__ into _request_new_epoch_production(), which handles two
cases:

  • Same epoch (self._epoch >= max_server_epoch): Another GPU already triggered production and the data is in the shared buffer. Return
    immediately without triggering extra production.
  • Behind (self._epoch < max_server_epoch): The client is genuinely stale. Fast-forward to max_server_epoch + 1 and retry, which guarantees
    produce_all() fires.
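
The two cases above can be sketched as a small retry loop. This is an illustration under assumptions, not the code in base_dist_loader.py: the function signature, the max_attempts limit, the StartEpochFn alias, and make_fake_server are all hypothetical. Each callable stands in for one storage server's start_new_epoch_sampling.

```python
from typing import Callable, Sequence

# Hypothetical alias: one callable per storage server, mirroring
# start_new_epoch_sampling(epoch) -> (server_epoch, produced).
StartEpochFn = Callable[[int], tuple[int, bool]]


def request_new_epoch_production(
    client_epoch: int,
    start_fns: Sequence[StartEpochFn],
    max_attempts: int = 3,  # assumed retry bound, not taken from the PR
) -> int:
    """Return the epoch the client ends up on after syncing with all servers."""
    epoch = client_epoch
    for _ in range(max_attempts):
        results = [start(epoch) for start in start_fns]
        max_server_epoch = max(server_epoch for server_epoch, _ in results)
        if epoch >= max_server_epoch:
            # Same epoch: production already happened for it (by this call
            # or by another GPU), so there is nothing more to request.
            return epoch
        # Behind: jump past every server so the guard is certain to pass.
        epoch = max_server_epoch + 1
    raise RuntimeError("could not synchronize epochs with the storage servers")


def make_fake_server(initial_epoch: int) -> StartEpochFn:
    """Build a fake server that only tracks its epoch (no real buffer)."""
    state = {"epoch": initial_epoch}

    def start(epoch: int) -> tuple[int, bool]:
        produced = state["epoch"] < epoch
        if produced:
            state["epoch"] = epoch
        return state["epoch"], produced

    return start
```

For a client at epoch 0 and a single server at epoch 1, the first call reports staleness and the retry uses epoch 2, which matches the fast-forward rule of max_server_epoch + 1.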

Test (1 file)

tests/integration/distributed/graph_store/graph_store_integration_test.py — Added test_epoch_skew_recovery that:

  1. Iterates a loader through 2 normal epochs (server epoch advances to 1)
  2. Resets loader._epoch = 0 to simulate a slow GPU
  3. Iterates again and asserts all 2708 batches are produced

The test log confirms the fix: "Epoch skew detected: client epoch 0 behind server epoch 1. Retrying with epoch 2 (attempt 1)." → 2708 batches
loaded successfully.
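
The shape of that test can be illustrated without a cluster. The sketch below is a toy, not the integration test: ToyServer, ToyLoader, and run_epoch_skew_scenario are hypothetical, and it uses 5 fake batches per epoch instead of the 2708 real ones.

```python
from collections import deque


class ToyServer:
    """Hypothetical server: monotonic guard plus the (epoch, produced) return."""

    def __init__(self, num_batches):
        self.cur_epoch = -1
        self.num_batches = num_batches
        self.buffer = deque()

    def start_new_epoch_sampling(self, epoch):
        produced = self.cur_epoch < epoch
        if produced:
            self.cur_epoch = epoch
            self.buffer.extend(range(self.num_batches))
        return self.cur_epoch, produced


class ToyLoader:
    """Hypothetical loader that tracks its own epoch counter in _epoch."""

    def __init__(self, server):
        self._server = server
        self._epoch = 0

    def __iter__(self):
        epoch = self._epoch
        server_epoch, _ = self._server.start_new_epoch_sampling(epoch)
        if epoch < server_epoch:
            # Stale client: fast-forward past the server and request again.
            epoch = server_epoch + 1
            self._server.start_new_epoch_sampling(epoch)
        self._epoch = epoch + 1
        while self._server.buffer:
            yield self._server.buffer.popleft()


def run_epoch_skew_scenario(num_batches=5):
    """Return the number of batches seen in each of the three iterations."""
    server = ToyServer(num_batches)
    loader = ToyLoader(server)
    counts = [len(list(loader)) for _ in range(2)]  # step 1: two normal epochs
    loader._epoch = 0                               # step 2: simulate a slow GPU
    counts.append(len(list(loader)))                # step 3: must still be full
    return counts
```

Without the fast-forward branch in __iter__, the third iteration in this toy would yield zero batches, which is the regression the real test guards against.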

@kmontemayor2-sc
Collaborator Author

/all_test

@github-actions
Contributor

github-actions bot commented Feb 27, 2026

GiGL Automation

@ 21:22:47UTC : 🔄 Python Unit Test started.

@ 22:31:59UTC : ✅ Workflow completed successfully.

@github-actions
Contributor

github-actions bot commented Feb 27, 2026

GiGL Automation

@ 21:22:48UTC : 🔄 Scala Unit Test started.

@ 21:32:13UTC : ✅ Workflow completed successfully.

@github-actions
Contributor

github-actions bot commented Feb 27, 2026

GiGL Automation

@ 21:22:49UTC : 🔄 Lint Test started.

@ 21:30:23UTC : ✅ Workflow completed successfully.

@github-actions
Contributor

github-actions bot commented Feb 27, 2026

GiGL Automation

@ 21:22:50UTC : 🔄 Integration Test started.

@ 22:40:57UTC : ✅ Workflow completed successfully.

@github-actions
Contributor

github-actions bot commented Feb 27, 2026

GiGL Automation

@ 21:22:51UTC : 🔄 E2E Test started.

@ 22:44:30UTC : ✅ Workflow completed successfully.

kmontemayor2-sc changed the title from "Kmonte/sync epochs" to "Fix server/client epoch skew" on Feb 27, 2026

2 participants