Skip to content

Comments

fix(redis): tighten stale TCP connection detection and add fast lease deadline#3311

Merged
waleedlatif1 merged 4 commits intostagingfrom
fix/ivm
Feb 23, 2026
Merged

fix(redis): tighten stale TCP connection detection and add fast lease deadline#3311
waleedlatif1 merged 4 commits intostagingfrom
fix/ivm

Conversation

@waleedlatif1
Copy link
Collaborator

Summary

  • Tighten Redis PING health check from 30s/3-failure to 15s/2-failure, shrinking dead connection detection window from ≤90s to ≤30s
  • Add 200ms hard deadline on distributed lease Redis eval via Promise.race so stale connections during the detection window fail fast and fall back to local execution instead of hanging 5s and racing the HTTP abort timeout
  • Fix retryStrategy to only log at attempt 11 and every 20 attempts after, preventing unbounded log noise during extended Redis outages

Root cause

ECS tasks route to Redis Cloud via NAT gateway. AWS NAT has a fixed 350s TCP idle timeout — connections idle longer than that get their NAT entry silently dropped. The next command on the dead socket hung until commandTimeout (5000ms), which raced exactly with the HTTP abort timeout on condition evaluation, killing requests instead of falling back.

Type of Change

  • Bug fix

Testing

Tested manually — all redis.test.ts and isolated-vm tests pass

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel
Copy link

vercel bot commented Feb 23, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped Feb 23, 2026 9:14pm

Request Review

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 23, 2026

Greptile Summary

This PR addresses Redis connection reliability issues in ECS deployments where NAT gateway idle timeouts (350s) cause silent TCP connection failures. The fix reduces stale connection detection from ≤90s to ≤30s and adds a 200ms hard deadline on Redis eval operations to prevent request hangs.

Key changes:

  • Tightened PING health check interval from 30s to 15s and max failures from 3 to 2
  • Added Promise.race with 200ms deadline on distributed lease Redis eval to fail fast on stale connections
  • Fixed memory leak by properly clearing timeout in finally block (addressed in commit 5003c1d)
  • Updated all tests to reflect new timing parameters

The implementation correctly prevents the race condition between Redis command timeout (5000ms) and HTTP abort timeout by forcing lease acquisition to fail within 200ms and fall back to local execution.

Confidence Score: 5/5

  • Safe to merge with high confidence - addresses critical production issue with proper error handling
  • All changes are well-tested with comprehensive unit tests, the memory leak fix is properly implemented with finally block cleanup, and the logic correctly addresses the NAT timeout race condition without introducing new risks
  • No files require special attention

Important Files Changed

Filename Overview
apps/sim/lib/core/config/redis.ts Tightened PING health check from 30s/3-failure to 15s/2-failure to detect stale connections faster (≤30s vs ≤90s)
apps/sim/lib/execution/isolated-vm.ts Added 200ms deadline on Redis eval with Promise.race and proper timeout cleanup in finally block to prevent hangs and memory leaks
apps/sim/lib/core/config/redis.test.ts Updated test timings to match new 15s ping interval and 2-failure threshold

Last reviewed commit: 5003c1d

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 23, 2026

Additional Comments (1)

apps/sim/lib/execution/isolated-vm.ts
need to clear timeout after Promise.race completes

  try {
    const result = await Promise.race([
      redis.eval(
        script,
        1,
        key,
        now.toString(),
        DISTRIBUTED_MAX_INFLIGHT_PER_OWNER.toString(),
        expiresAt.toString(),
        leaseId,
        leaseTtlMs.toString()
      ),
      deadline,
    ])
    if (timeoutId.current) clearTimeout(timeoutId.current)
    return Number(result) === 1 ? 'acquired' : 'limit_exceeded'
  } catch (error) {
    if (timeoutId.current) clearTimeout(timeoutId.current)
    logger.error('Failed to acquire distributed owner lease', { ownerKey, error })
    return 'unavailable'
  }

@waleedlatif1
Copy link
Collaborator Author

@greptile

@waleedlatif1
Copy link
Collaborator Author

@cursor review

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

@waleedlatif1
Copy link
Collaborator Author

@cursor review

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

@waleedlatif1 waleedlatif1 merged commit 132fef0 into staging Feb 23, 2026
12 checks passed
@waleedlatif1 waleedlatif1 deleted the fix/ivm branch February 23, 2026 21:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant