Fix resolve token conflict logic under race condition#7554
Conversation
Signed-off-by: Alex Le <leqiyue@amazon.com>
…tegy" This reverts commit aff3b18. Signed-off-by: Alex Le <leqiyue@amazon.com>
Signed-off-by: Alex Le <leqiyue@amazon.com>
|
Nice catch. Thanks. We do some other code that helps to avoid conflict tokens at would this fix the issue? I wonder if we should use something similar to this config to check tokens everytime. |
Signed-off-by: Alex Le <leqiyue@amazon.com>
I updated the code to only do constantly conflic check when instance is still in JOINING state. JOINING state would last for ObservePeriod of time. And during ObservePeriod, instance would do heartbeat and call CAS and eventually would check token conflicts. So the heavy computation should only last for ObservePeriod of time. After instance becomes ACTIVE, the logic would be same as before this change. |
What this PR does:
Inside ring module, there is logic to check conflict/duplicated tokens and resolve the conflict. However, during race condition that two instances joined the ring at same time, this logic will not detect conflicts during the first CAS cycle because none of new instances appeared in the ring. On the following CAS calls, none of those instance would do conflict check again because it did not detect any token changed.
Also, there was another issue in the original code. The issue was that even resolve token function made the correct operation. Token from instances that were not detected as newly added or updated will not be updated with new tokens even they have conflicts.
This change fixs both issues in the code. With the fix, each CAS call would do conflict check while instance is still in JOINING state. It takes O(n*m) where n is number of instances in ring and m is number of tokens per instance. So the check should not add too much overhead for each CAS call and it only keeps constantly check during observe period. Also, add logic to make sure all instances got updated token during resolving conflict process would be properly updated. And this update should be idempotent to avoid updates around the same time produce the same result.
Which issue(s) this PR fixes:
NA
Checklist
CHANGELOG.mdupdated - the order of entries should be[CHANGE],[FEATURE],[ENHANCEMENT],[BUGFIX]docs/configuration/v1-guarantees.mdupdated if this PR introduces experimental flags