fix(fuse): close OpsNode bypass gap in FUSE_INTERRUPT decoupling #38

Merged
mfreed merged 1 commit into main from fix/fuse-stale-negative-cache, May 22, 2026

Conversation

mfreed (Member) commented May 22, 2026

Three OpsNode methods (Lookup, Getattr, Readlink) bypass FSAdapter and call Operations directly. The previous fix (4c7b4c1) wrapped FSAdapter.{Stat, ReadDir, ReadFile} with decoupleFromRequestCancel, but the bypass paths kept propagating the kernel's per-request context all the way into pgx.

Symptom: docker-FUSE stress runs hit a transient ENOENT for a recently created or renamed file, recovering only after the 2s statCacheTTL expires. Pattern: the kernel sends FUSE_INTERRUPT during a LOOKUP (typically triggered by Go's SIGURG goroutine preemption under GC pressure) and go-fuse cancels the request ctx. The cancellation reaches db.ResolvePath via the OpsNode bypass, so pgx returns context.Canceled. resolveSynthPath translates that error to ("", false), statSynthFile turns the resulting ErrNotExist into a setNegative entry, and subsequent reads (including the runner's retries) hit the stale negative entry for the full 2s TTL.

Fix: apply the same one-line wrap at the entry of each bypass site: ctx = decoupleFromRequestCancel(ctx). That is three lines of code, plus doc-comment blocks explaining each method's bypass.

Validation:

  • Before: ~20% of docker-FUSE 1500-iter stress runs hit the cache-bug signature (probe: "DB has row; FS recovered visibility at offset 2000ms").
  • After: 0/8 attempts hit it across ~12000 total iterations. Two attempts hit DIFFERENT failure shapes (close-time EIO, rename mid-flight ENOENT); these are separate, pre-existing bugs and out of scope here.

mfreed merged commit b5cf146 into main May 22, 2026
2 checks passed
mfreed deleted the fix/fuse-stale-negative-cache branch May 22, 2026 21:16