
fix: zero-initialize KV cache to prevent NaN in fallback attention#1628

Open
Gunther-Schulz wants to merge 1 commit into deepbeepmeep:main from Gunther-Schulz:fix/kvcache-nan-uninitialized

Conversation


@Gunther-Schulz Gunther-Schulz commented Mar 21, 2026

🤖 Generated with Claude Code

Summary

  • torch.empty() leaves uninitialized GPU memory in KV cache slots that are never written (e.g. the tail of the last block when a sequence doesn't fill it completely)
  • _flash_attention_fallback_decode reads all allocated blocks and masks invalid positions with -inf in the attention bias
  • Bug: NaN + (-inf) = NaN, not -inf. When uninitialized slots contain NaN/Inf, the masking fails and NaN propagates through softmax, corrupting the entire attention output
  • Switching to torch.zeros() ensures unwritten slots contain 0: 0 + (-inf) = -inf → softmax weight = 0 → no contamination
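The failure mode above follows directly from IEEE 754 float semantics, and can be reproduced without torch. A minimal sketch (plain Python, using `math.exp` in place of the real softmax):

```python
import math

# The attention bias masks invalid KV positions by adding -inf to the score.
# With a zero-initialized slot the mask works; with a NaN slot it does not:
print(0.0 + float("-inf"))           # -inf  (masked correctly)
print(float("nan") + float("-inf"))  # nan   (mask defeated)

# A single NaN score then poisons the entire softmax row, because the
# normalizing sum becomes NaN and every weight divides by it:
scores = [1.0, 2.0, float("nan") + float("-inf")]
exps = [math.exp(s) for s in scores]   # exp(nan) is nan
total = sum(exps)                      # nan
weights = [e / total for e in exps]    # every weight is nan
print(weights)
```

This is why the corruption is total rather than local: one bad slot contaminates the attention weights for every valid position in the row.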

This fixes garbage output (e.g. "A!!!...") from Qwen TI mode image captioning via the vLLM embedded prefill/decode path. The bug was non-deterministic: it only manifested when torch.empty() happened to place NaN/Inf in the uninitialized tail of a specific layer's cache (in practice, always model layer 23 / attention layer_idx 5 in the 9B hybrid model).
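The shape of the fix can be sketched as follows, using NumPy as a stand-in for torch (`np.empty`/`np.zeros` mirror `torch.empty`/`torch.zeros`); `alloc_kv_cache` and its shape parameters are hypothetical, not the repository's actual allocation code:

```python
import numpy as np

def alloc_kv_cache(num_blocks, block_size, num_heads, head_dim):
    # Before: np.empty(...) could hand back NaN/Inf garbage in the
    # unwritten tail of the last block.
    # After: zero-fill, so unwritten slots combine cleanly with the
    # -inf attention bias (0 + -inf = -inf).
    return np.zeros((num_blocks, block_size, num_heads, head_dim),
                    dtype=np.float32)

cache = alloc_kv_cache(4, 16, 8, 64)
print(np.isfinite(cache).all())  # True: no NaN/Inf anywhere in the cache
```

The cost is one memset per allocation, which is negligible next to the non-deterministic correctness bug it removes.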

Test plan

  • Run prompt enhancer in TI (text + image) mode with an image input and verify captions are coherent instead of garbage
  • Verify T-only mode prompt enhancement still works normally

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
