Fix server/client epoch skew by kmontemayor2-sc · Pull Request #522 · Snapchat/GiGL

kmontemayor2-sc · 2026-02-27T21:22:29Z

Problem

In graph store mode, multiple GPUs on the same compute node share a single producer per storage server. Because of the server's monotonic epoch
guard (if cur_epoch < epoch), only the first GPU to call start_new_epoch_sampling for a given epoch triggers produce_all(). When a slow GPU falls
behind, its stale epoch is silently skipped and it finds an empty buffer. StopIteration then escapes InfiniteIterator's single retry, and one
rank exits training early, causing an NCCL deadlock.
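The guard behaviour described above can be illustrated with a toy model. This is only a sketch: ToyStorageServer, produce_calls, and the starting epoch of -1 are assumptions made for illustration, not the actual GiGL server code. Only the comparison cur_epoch < epoch is taken from the description.

```python
class ToyStorageServer:
    """Hypothetical stand-in for a storage server shared by several GPUs."""

    def __init__(self) -> None:
        self.cur_epoch = -1      # assumption: nothing has been produced yet
        self.produce_calls = 0   # counts how often production actually ran

    def start_new_epoch_sampling(self, epoch: int) -> None:
        # Monotonic guard from the description: only a strictly newer epoch produces.
        if self.cur_epoch < epoch:
            self.cur_epoch = epoch
            self.produce_calls += 1  # stands in for produce_all()
        # A stale or repeated epoch falls through silently; the caller learns nothing.


server = ToyStorageServer()
server.start_new_epoch_sampling(0)  # fast GPU starts epoch 0: production runs
server.start_new_epoch_sampling(1)  # fast GPU starts epoch 1: production runs
calls_before = server.produce_calls
server.start_new_epoch_sampling(0)  # slow GPU still on epoch 0: request is skipped
stale_request_was_skipped = server.produce_calls == calls_before
print(stale_request_was_skipped)
```

Because the method returns None, the slow caller has no way to tell that its request was ignored, which is the gap the fix addresses.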

Fix (2 files)

gigl/distributed/graph_store/dist_server.py — Changed start_new_epoch_sampling to return tuple[int, bool] (server epoch + whether production
was triggered) instead of None. This gives the client the information it needs to detect and recover from epoch skew.
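A minimal sketch of the new return contract, assuming the guard keeps its original comparison. ToyStorageServer and its starting epoch are hypothetical; only the tuple[int, bool] shape (server epoch, whether production was triggered) comes from the description.

```python
class ToyStorageServer:
    """Hypothetical server that reports its epoch and whether it produced."""

    def __init__(self) -> None:
        self.cur_epoch = -1  # assumption: nothing has been produced yet

    def start_new_epoch_sampling(self, epoch: int) -> tuple[int, bool]:
        triggered = self.cur_epoch < epoch
        if triggered:
            self.cur_epoch = epoch
            # produce_all() would run here in the real server
        # The caller now sees the server's epoch and whether this call produced.
        return self.cur_epoch, triggered


server = ToyStorageServer()
first = server.start_new_epoch_sampling(1)   # newer epoch: production triggered
repeat = server.start_new_epoch_sampling(1)  # same epoch: another caller got there first
stale = server.start_new_epoch_sampling(0)   # stale epoch: skipped, but now visible
print(first, repeat, stale)
```

With this shape, a caller can compare its own epoch with the returned server epoch instead of discovering the skew through an empty buffer.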

In gigl/distributed/base_dist_loader.py, the graph store branch of __iter__ was extracted into _request_new_epoch_production(), which handles
two cases:

Same epoch (self._epoch >= max_server_epoch): Another GPU already triggered production and the data is in the shared buffer. Return immediately
with no extra production.
Behind (self._epoch < max_server_epoch): The client is genuinely stale. Fast-forward to max_server_epoch + 1 and retry, which guarantees that
produce_all() fires.
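The two cases can be sketched as a small client-side loop. This is an illustration under stated assumptions, not the code in base_dist_loader.py: the free function request_new_epoch_production, the max_attempts limit, the list of servers, and ToyStorageServer are all hypothetical.

```python
from typing import Sequence


class ToyStorageServer:
    """Hypothetical server returning (server epoch, production triggered)."""

    def __init__(self) -> None:
        self.cur_epoch = -1

    def start_new_epoch_sampling(self, epoch: int) -> tuple[int, bool]:
        triggered = self.cur_epoch < epoch
        if triggered:
            self.cur_epoch = epoch  # produce_all() would run here
        return self.cur_epoch, triggered


def request_new_epoch_production(
    client_epoch: int,
    servers: Sequence[ToyStorageServer],
    max_attempts: int = 3,  # assumption: the real retry policy is not stated
) -> int:
    """Return the epoch the client should use after reconciling with the servers."""
    epoch = client_epoch
    for _ in range(max_attempts):
        results = [s.start_new_epoch_sampling(epoch) for s in servers]
        max_server_epoch = max(server_epoch for server_epoch, _ in results)
        if epoch >= max_server_epoch:
            # Same epoch: data is (or will be) in the shared buffer.
            return epoch
        # Behind: jump past the newest server epoch so the retry triggers production.
        epoch = max_server_epoch + 1
    raise RuntimeError("could not reconcile client and server epochs")


server = ToyStorageServer()
server.start_new_epoch_sampling(0)
server.start_new_epoch_sampling(1)                            # server is now at epoch 1
recovered_epoch = request_new_epoch_production(0, [server])   # stale client at epoch 0
same_epoch = request_new_epoch_production(2, [server])        # second GPU, already in sync
print(recovered_epoch, same_epoch, server.cur_epoch)
```

In this single-threaded toy the retry always succeeds on the second attempt; the attempt limit is there only because concurrent callers could, in principle, advance the server again between the two calls.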

Test (1 file)

tests/integration/distributed/graph_store/graph_store_integration_test.py — Added test_epoch_skew_recovery that:

Iterates a loader through 2 normal epochs (server epoch advances to 1)
Resets loader._epoch = 0 to simulate a slow GPU
Iterates again and asserts all 2708 batches are produced

The test log confirms the fix: "Epoch skew detected: client epoch 0 behind server epoch 1. Retrying with epoch 2 (attempt 1)." → 2708 batches
loaded successfully.

kmontemayor2-sc · 2026-02-27T21:22:36Z

/all_test

github-actions · 2026-02-27T21:22:48Z

GiGL Automation

@ 21:22:47UTC : 🔄 Python Unit Test started.

@ 22:31:59UTC : ✅ Workflow completed successfully.

github-actions · 2026-02-27T21:22:48Z

GiGL Automation

@ 21:22:48UTC : 🔄 Scala Unit Test started.

@ 21:32:13UTC : ✅ Workflow completed successfully.

github-actions · 2026-02-27T21:22:49Z

GiGL Automation

@ 21:22:49UTC : 🔄 Lint Test started.

@ 21:30:23UTC : ✅ Workflow completed successfully.

github-actions · 2026-02-27T21:22:50Z

GiGL Automation

@ 21:22:50UTC : 🔄 Integration Test started.

@ 22:40:57UTC : ✅ Workflow completed successfully.

github-actions · 2026-02-27T21:22:52Z

GiGL Automation

@ 21:22:51UTC : 🔄 E2E Test started.

@ 22:44:30UTC : ✅ Workflow completed successfully.

kmonte added 2 commits February 27, 2026 18:27

try to sync epochs

1f25f37

maybe fix

c8a04d0

kmontemayor2-sc changed the title ~~Kmonte/sync epochs~~ Fix server/client epoch skew Feb 27, 2026

kmontemayor2-sc wants to merge 2 commits into main from kmonte/sync-epochs
