fix(fuse): close OpsNode bypass gap in FUSE_INTERRUPT decoupling #38
Merged
Three OpsNode methods (Lookup, Getattr, Readlink) bypass FSAdapter and call Operations directly. The previous fix (4c7b4c1) wrapped FSAdapter.{Stat, ReadDir, ReadFile} with decoupleFromRequestCancel, but the bypass paths kept propagating the kernel's per-request context all the way into pgx.

Symptom: docker-FUSE stress runs hit a transient ENOENT for a recently-created/renamed file, recovering only after the 2s statCacheTTL.

Pattern: kernel sends FUSE_INTERRUPT during a LOOKUP (typically from Go's SIGURG goroutine preemption under GC pressure), go-fuse cancels the request ctx, the cancellation reaches db.ResolvePath via the OpsNode bypass, pgx returns context.Canceled, resolveSynthPath translates it to "", false, statSynthFile turns the resulting ErrNotExist into a setNegative entry, and subsequent reads (including the runner's retries) hit the stale negative for the full 2s TTL.

Fix: same one-line wrap at the entry of each bypass site -- ctx = decoupleFromRequestCancel(ctx). Three lines plus doc-comment blocks explaining each method's bypass.

Validation:
* Before: ~20% of docker-FUSE 1500-iter stress runs hit the cache-bug signature (probe: "DB has row; FS recovered visibility at offset 2000ms").
* After: 0/8 attempts hit it across ~12000 total iterations. Two attempts hit DIFFERENT failure shapes (close-time EIO, rename mid-flight ENOENT) -- pre-existing different bugs, out of scope.