Fix eventual consistency bug #21

Merged
nerdsane merged 1 commit into nerdsane:main from carlsverre:fix-stale-overwrite
Apr 2, 2026

@carlsverre
Contributor

This is an AI-generated change audited by me (a human). The bug causes a replica to permanently fall out of sync and not converge. I found this while playing with this code after reading this fascinating blog post: https://www.datadoghq.com/blog/ai/harness-first-agents/

AI bug description below:


When ReplicatedShardActor applied a remote delta, it first merged the delta into replica_state and then updated the executor from the incoming delta payload. That meant a stale delta could lose the merge in replica_state but still overwrite the executor, so subsequent reads returned the wrong value.

Fix this by reloading the merged value from replica_state after apply_remote_delta and projecting that merged value into the executor. This preserves the existing CRDT and LWW merge semantics, keeps read behavior aligned with replicated state, and is safe because it only changes which already-computed value is mirrored into the executor; it does not change message ordering, conflict resolution, or local write generation.
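
A minimal sketch of the failure and the fix described above, assuming a simplified last-writer-wins register. `Shard`, `Lww`, `merge`, and the two `apply_remote_delta_*` methods are hypothetical stand-ins rather than the project's actual types; only the field names `replica_state` and `executor` are taken from the description, and the tie-breaking rule is an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical last-writer-wins register: a value tagged with a timestamp.
#[derive(Clone, Debug, PartialEq)]
struct Lww {
    value: String,
    ts: u64,
}

/// Hypothetical stand-in for the shard actor. `replica_state` holds the
/// merged CRDT state; `executor` holds the values that reads are served from.
#[derive(Default)]
struct Shard {
    replica_state: HashMap<String, Lww>,
    executor: HashMap<String, String>,
}

impl Shard {
    /// LWW merge: the delta wins only if its timestamp is strictly newer.
    /// (Assumption: ties keep the current value; a real merge would likely
    /// break ties deterministically, for example by node id.)
    fn merge(&mut self, key: &str, delta: &Lww) {
        let stale = self
            .replica_state
            .get(key)
            .map_or(false, |current| current.ts >= delta.ts);
        if !stale {
            self.replica_state.insert(key.to_string(), delta.clone());
        }
    }

    /// Buggy shape: the executor mirrors the incoming payload, so a stale
    /// delta that lost the merge still overwrites what reads return.
    fn apply_remote_delta_buggy(&mut self, key: &str, delta: Lww) {
        self.merge(key, &delta);
        self.executor.insert(key.to_string(), delta.value);
    }

    /// Fixed shape: reload the merged value from replica_state and project
    /// that value into the executor.
    fn apply_remote_delta_fixed(&mut self, key: &str, delta: Lww) {
        self.merge(key, &delta);
        if let Some(merged) = self.replica_state.get(key) {
            self.executor.insert(key.to_string(), merged.value.clone());
        }
    }
}
```

In this sketch, applying a delta with `ts: 2` followed by one with `ts: 1` leaves `replica_state` at the newer value in both variants, but only the fixed variant keeps `executor` in agreement with it.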

Owner

@nerdsane nerdsane left a comment

Thanks for this catch, @carlsverre! The bug is confirmed — apply_remote_delta_impl uses the incoming delta value instead of the CRDT-merged result from replica_state, causing reads to return stale data when a lower-timestamp delta arrives.

I'll address the remaining items and merge this myself:

  • Fix all branches (scalar, hash fields + tombstones, expiry) to read from merged state
  • Fix the same pattern in multi_node.rs simulation code
  • Add TigerStyle postcondition (executor-replica_state consistency assertion)
  • Add regression test for stale delta scenario
  • Rebuild and re-run Maelstrom linearizability tests
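
The postcondition and regression-test items in the list above could look roughly like the following. This is a sketch under assumptions, not the code that was merged: `executor_matches_replica` and this free-function `apply_remote_delta` are hypothetical, state is modeled as plain maps of scalar values, and hash fields, tombstones, and expiry are left out.

```rust
use std::collections::HashMap;

/// Hypothetical postcondition check: for every key in the replicated state,
/// the executor must hold exactly the merged value.
fn executor_matches_replica(
    replica_state: &HashMap<String, (String, u64)>,
    executor: &HashMap<String, String>,
) -> bool {
    replica_state
        .iter()
        .all(|(key, (value, _ts))| executor.get(key) == Some(value))
}

/// Hypothetical apply step for a scalar LWW value: merge first, then project
/// the merged value (not the incoming payload) into the executor, then assert
/// the postcondition.
fn apply_remote_delta(
    replica_state: &mut HashMap<String, (String, u64)>,
    executor: &mut HashMap<String, String>,
    key: &str,
    value: &str,
    ts: u64,
) {
    // Assumption: on equal timestamps the current value is kept.
    let stale = replica_state
        .get(key)
        .map_or(false, |(_, current_ts)| *current_ts >= ts);
    if !stale {
        replica_state.insert(key.to_string(), (value.to_string(), ts));
    }
    if let Some((merged, _)) = replica_state.get(key) {
        executor.insert(key.to_string(), merged.clone());
    }
    assert!(
        executor_matches_replica(replica_state, executor),
        "executor diverged from replica_state for key {key}"
    );
}
```

A regression test for the stale-delta scenario would apply a newer delta, then an older one for the same key, and check that a read from the executor still returns the newer value.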

Thanks for reading the blog, and doing a deep-dive. The self-heal loop continues. 🔧

@nerdsane nerdsane merged commit 651f1f1 into nerdsane:main Apr 2, 2026