fix: zero-initialize KV cache to prevent NaN in fallback attention #1628
Open
Gunther-Schulz wants to merge 1 commit into deepbeepmeep:main from
Conversation
torch.empty() leaves uninitialized GPU memory in cache slots that are
never written during a sequence (e.g. the tail of the last block).
_flash_attention_fallback_decode reads all allocated blocks and masks
out invalid positions with -inf, but NaN + (-inf) = NaN rather than
-inf, causing NaN to propagate through softmax and corrupt the output.
Switching to torch.zeros() ensures unwritten slots contain 0, so the
masking works correctly (0 + -inf = -inf → softmax weight = 0).
This fixes garbage output ("A!!!...") from Qwen TI mode image
captioning via the vLLM embedded prefill path.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
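The failure mode above can be reproduced in a few lines. This is a minimal standalone sketch (not the repository's actual decode code) showing why a -inf mask cannot neutralize a NaN-filled slot, while a zero-filled slot is masked correctly:

```python
import torch

# Scores against 4 cache slots; pretend the last slot was never written
# and torch.empty() left NaN there.
scores_bad = torch.tensor([1.0, 2.0, 3.0, float("nan")])

# Mask invalid positions with -inf, as the fallback decode does.
mask = torch.tensor([0.0, 0.0, 0.0, float("-inf")])

# NaN + (-inf) = NaN, so the mask fails to neutralize the bad slot,
# and softmax then propagates NaN to every position.
weights_bad = torch.softmax(scores_bad + mask, dim=-1)
print(weights_bad)  # all NaN: one bad slot poisons every weight

# With a zero-initialized slot, masking works: 0 + (-inf) = -inf.
scores_ok = torch.tensor([1.0, 2.0, 3.0, 0.0])
weights_ok = torch.softmax(scores_ok + mask, dim=-1)
print(weights_ok)  # finite, with weight exactly 0 at the masked slot
```

The key point is that masking by addition assumes the masked-out entries are finite; any initialization that can produce NaN or Inf breaks that assumption.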
Summary
- torch.empty() leaves uninitialized GPU memory in KV cache slots that are never written (e.g. the tail of the last block when a sequence doesn't fill it completely)
- _flash_attention_fallback_decode reads all allocated blocks and masks invalid positions with -inf in the attention bias
- NaN + (-inf) = NaN, not -inf: when uninitialized slots contain NaN/Inf, the masking fails and NaN propagates through softmax, corrupting the entire attention output
- torch.zeros() ensures unwritten slots contain 0: 0 + (-inf) = -inf, so the softmax weight is 0 and there is no contamination

This fixes garbage output (e.g. "A!!!...") from Qwen TI mode image captioning via the vLLM embedded prefill/decode path. The bug was non-deterministic: it only manifested when torch.empty() happened to place NaN/Inf in the uninitialized tail of a specific layer's cache (in practice, always model layer 23 / attention layer_idx 5 in the 9B hybrid model).

Test plan
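As a hedged illustration of the fix (hypothetical shapes and variable names, not the project's actual cache layout): with a zero-initialized cache, the fallback decode can read every allocated slot safely, because the -inf bias drives the unwritten tail's attention weight to exactly zero:

```python
import torch

# A block-allocated KV cache where the last block's tail is never written.
block_size, seq_len, dim = 4, 6, 8     # 2 blocks allocated, 2 tail slots unused
cache_len = 2 * block_size             # fallback reads all allocated slots

k_cache = torch.zeros(cache_len, dim)  # the fix: zeros, not empty
v_cache = torch.zeros(cache_len, dim)
k_cache[:seq_len] = torch.randn(seq_len, dim)  # only real positions written
v_cache[:seq_len] = torch.randn(seq_len, dim)

q = torch.randn(1, dim)                # single decode-step query
bias = torch.full((cache_len,), float("-inf"))
bias[:seq_len] = 0.0                   # valid positions unmasked

scores = (q @ k_cache.T) / dim**0.5 + bias  # zero-filled tail + (-inf) = -inf
weights = torch.softmax(scores, dim=-1)     # tail weights are exactly 0
out = weights @ v_cache                     # finite output, no contamination
```

Had k_cache come from torch.empty() and contained NaN in the tail, the same bias addition would have yielded NaN scores and a NaN output, matching the bug described above.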
🤖 Generated with Claude Code