Draft
Conversation
Collaborator (Author): /all_test

Contributor (GiGL Automation) @ 21:22:47 UTC: 🔄 → @ 22:31:59 UTC: ✅ Workflow completed successfully.
Contributor (GiGL Automation) @ 21:22:48 UTC: 🔄 → @ 21:32:13 UTC: ✅ Workflow completed successfully.
Contributor (GiGL Automation) @ 21:22:49 UTC: 🔄 → @ 21:30:23 UTC: ✅ Workflow completed successfully.
Contributor (GiGL Automation) @ 21:22:50 UTC: 🔄 → @ 22:40:57 UTC: ✅ Workflow completed successfully.
Contributor (GiGL Automation) @ 21:22:51 UTC: 🔄 → @ 22:44:30 UTC: ✅ Workflow completed successfully.
Problem
In graph store mode, multiple GPUs on the same compute node share a single producer per storage server. The server's monotonic epoch guard (`if cur_epoch < epoch`) means only the first GPU to call `start_new_epoch_sampling` triggers `produce_all()`. When a slow GPU falls behind, its stale epoch is silently skipped: the GPU finds an empty buffer, the resulting `StopIteration` escapes `InfiniteIterator`'s single retry, and that rank exits training early, causing an NCCL deadlock for the remaining ranks.
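For illustration, here is a minimal sketch of the pre-fix guard described above; the class name, in-memory buffer, and the small demo at the end are illustrative stand-ins, not the actual `dist_server.py` code.

```python
# Minimal sketch of the pre-fix guard; PreFixDistServerSketch and its
# in-memory buffer are illustrative stand-ins, not the real GiGL code in
# gigl/distributed/graph_store/dist_server.py.
class PreFixDistServerSketch:
    def __init__(self) -> None:
        self.cur_epoch = -1
        self.buffer: list[int] = []

    def produce_all(self, epoch: int) -> None:
        # Fill the shared per-node buffer with samples for this epoch.
        self.buffer = [epoch] * 4

    def start_new_epoch_sampling(self, epoch: int) -> None:
        # Monotonic guard: only the first GPU requesting a strictly newer
        # epoch triggers production; everyone else is silently ignored.
        if self.cur_epoch < epoch:
            self.cur_epoch = epoch
            self.produce_all(epoch)


server = PreFixDistServerSketch()
server.start_new_epoch_sampling(1)  # fast GPU: guard passes, production runs
server.start_new_epoch_sampling(0)  # slow GPU with a stale epoch: skipped with no
                                    # signal back, so its loader eventually drains
                                    # an empty buffer and raises StopIteration
```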
Fix (2 files)
`gigl/distributed/graph_store/dist_server.py` — Changed `start_new_epoch_sampling` to return `tuple[int, bool]` (the server's current epoch and whether production was triggered) instead of `None`, giving the client the information it needs to detect and recover from epoch skew (sketched below).
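A hedged sketch of what the changed method might look like, assuming only the `tuple[int, bool]` layout described above; the surrounding class is simplified and not the real implementation.

```python
# Sketch of the post-fix server method; only the (server epoch, produced)
# return layout comes from the PR description, the rest is simplified.
class PostFixDistServerSketch:
    def __init__(self) -> None:
        self.cur_epoch = -1

    def produce_all(self, epoch: int) -> None:
        ...  # fill the shared sample buffer for this epoch

    def start_new_epoch_sampling(self, epoch: int) -> tuple[int, bool]:
        if self.cur_epoch < epoch:
            self.cur_epoch = epoch
            self.produce_all(epoch)
            return self.cur_epoch, True
        # Stale or duplicate request: nothing was produced. Returning the
        # server's epoch lets the client tell "a sibling already produced
        # for my epoch" apart from "I have fallen behind".
        return self.cur_epoch, False
```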
`gigl/distributed/base_dist_loader.py` — Extracted the graph store branch of `__iter__` into `_request_new_epoch_production()`, which handles two cases (see the sketch after this list):
- The server has already produced for the requested epoch (a sibling GPU on the node triggered it first), so the loader consumes the shared buffer without requesting extra production.
- The client's epoch has fallen behind the server's (epoch skew), so the loader retries with an epoch ahead of the server's until production actually fires.
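A rough sketch of that recovery loop, written as a free function; the retry limit, logging setup, and the way the server is passed in are assumptions, and only the skew log wording mirrors the test output quoted below.

```python
import logging

logger = logging.getLogger(__name__)


def request_new_epoch_production(server, epoch: int, max_retries: int = 3) -> int:
    """Ask the server to produce for `epoch`, recovering from epoch skew."""
    for attempt in range(1, max_retries + 1):
        server_epoch, produced = server.start_new_epoch_sampling(epoch)
        if produced or server_epoch == epoch:
            # Case 1: production was triggered by this rank, or a sibling GPU
            # already produced for this epoch; no extra production is needed.
            return epoch
        # Case 2: epoch skew. This rank fell behind the server and its stale
        # buffer is gone, so jump past the server's epoch and retry until
        # production actually fires.
        logger.warning(
            "Epoch skew detected: client epoch %d behind server epoch %d. "
            "Retrying with epoch %d (attempt %d).",
            epoch, server_epoch, server_epoch + 1, attempt,
        )
        epoch = server_epoch + 1
    raise RuntimeError("Could not trigger epoch production after retries")
```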
Test (1 file)
`tests/integration/distributed/graph_store/graph_store_integration_test.py` — Added `test_epoch_skew_recovery`, which exercises the epoch-skew detection and recovery path.
The test log confirms the fix: "Epoch skew detected: client epoch 0 behind server epoch 1. Retrying with epoch 2 (attempt 1)." followed by 2708 batches loaded successfully.