
Fix FP16 overflow in GQA attention and concat_past_present buffer overflow#4677

Open
aditya-dl wants to merge 2 commits into ROCm:develop from aditya-dl:fix-fp16-overflow-gqa-attention

Conversation

@aditya-dl
Contributor

@aditya-dl aditya-dl commented Mar 16, 2026

Motivation

Qwen1.5-architecture models produce garbage output when running FP16 inference through MIGraphX. Two root causes were identified:

  1. The dot-to-softmax attention chain overflows in FP16 precision.
  2. concat_past_present overflows its output buffer during prompt processing when the sequence length exceeds the past cache size.

Technical Details

FP16 overflow fix: Extends find_softmax_base_ops to walk backwards through the attention chain (mul, where, broadcast, convert) to find the feeding dot instruction. The entire dot-to-softmax range is upcast to FP32, preventing overflow in attention score computation. Bool-type inputs (where conditions) are excluded from conversion.
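The backward walk described above can be sketched in simplified form. This is a hedged illustration only: the `instruction` struct and `find_feeding_dot` helper below are hypothetical stand-ins, not the actual MIGraphX `instruction_ref` API or the real `find_softmax_base_ops` implementation.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical, simplified instruction node for illustration; the real
// MIGraphX instruction type carries shapes, operators, and module context.
struct instruction
{
    std::string name;
    std::vector<instruction*> inputs;
};

// Walk backwards from the softmax input through the pass-through ops the
// PR names (mul, where, broadcast, convert) until a dot instruction is
// found. Returns nullptr if no feeding dot exists, in which case no
// FP32 upcast range is extended.
inline instruction* find_feeding_dot(instruction* start)
{
    static const std::unordered_set<std::string> pass_through = {
        "mul", "where", "broadcast", "multibroadcast", "convert"};
    instruction* ins = start;
    while(ins != nullptr)
    {
        if(ins->name == "dot")
            return ins;
        if(pass_through.count(ins->name) == 0 or ins->inputs.empty())
            return nullptr;
        // Follow the first input upward; in the real pass, bool-typed
        // `where` conditions are excluded from the FP32 conversion.
        ins = ins->inputs.front();
    }
    return nullptr;
}
```

Once the feeding dot is found, the whole dot-to-softmax range is converted to FP32, so the large pre-softmax attention scores never exceed FP16's maximum representable value (65504).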

Changelog Category

Add a CHANGELOG.md entry for any option other than Not Applicable

    • Added: New functionality.
    • Changed: Changes to existing functionality.
    • Removed: Functionality or support that has been removed. (Compared to a previous release)
    • Optimized: Component performance that has been optimized or improved.
    • Resolved Issues: Known issues from a previous version that have been resolved.
    • Not Applicable: This PR is not to be included in the changelog.

@aditya-dl aditya-dl requested a review from causten as a code owner March 16, 2026 19:27
@pfultz2
Collaborator

pfultz2 commented Mar 17, 2026

Unit tests need to be added and the CI failure fixed.

Contributor

Copilot AI left a comment


Pull request overview

This PR fixes two root causes of garbage output in Qwen1.5-architecture models during FP16 inference: (1) FP16 overflow in the dot→softmax attention chain, and (2) a buffer overflow in concat_past_present during prompt processing when the sequence length exceeds the past cache size.

Changes:

  • Extends find_softmax_base_ops in rewrite_reduce.cpp to walk backward from softmax through mul/where/broadcast/convert to find a feeding dot instruction, upcasting the entire range to FP32 (with bool inputs excluded).
  • Fixes concat_past_present buffer sizing across the operator definition, GPU lowering, JIT compiler, and GPU kernel so that the output buffer is properly sized when sequence_length > past_cache_sequence_length.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Summary per file:

  • src/rewrite_reduce.cpp: Adds backward walk from softmax to dot for FP32 upcast range extension; skips bool inputs in conversion.
  • src/include/migraphx/op/concat_past_present.hpp: Updates compute_shape to return a larger shape when needed; uses std::max for present_buffer_sequence_length.
  • src/targets/gpu/lowering.cpp: Allocates a properly sized GPU buffer when the output shape exceeds the past cache shape.
  • src/targets/gpu/jit/concat_past_present.cpp: Adjusts the JIT compiler output shape to match the larger buffer when needed.
  • src/targets/gpu/kernels/include/migraphx/kernels/concat_past_present.hpp: GPU kernel uses max(past_seq, seq_len) for the present buffer sequence length.
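The buffer-sizing change running through these files reduces to one idea: the present KV buffer's sequence dimension must cover whichever is larger, the past cache length or the incoming sequence length. A minimal sketch, assuming a `{batch, num_kv_heads, seq, head_size}` dimension layout; `present_buffer_dims` is a hypothetical helper, not the actual `concat_past_present::compute_shape` signature.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the present-buffer sizing described above.
// dims layout assumed: {batch, num_kv_heads, sequence_length, head_size}.
inline std::vector<std::size_t>
present_buffer_dims(std::vector<std::size_t> past_dims, std::size_t seq_len)
{
    // During prompt processing seq_len can exceed the past cache length,
    // so size the output to the larger of the two; writing seq_len rows
    // into a buffer sized only for past_dims[2] is the overflow the PR
    // originally fixed.
    past_dims[2] = std::max(past_dims[2], seq_len);
    return past_dims;
}
```

For example, a 256-token prompt against a 64-entry past cache yields a 256-length present buffer, while single-token decode against the same cache keeps the 64-length buffer.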


@pfultz2
Collaborator

pfultz2 commented Mar 18, 2026

I think you should put the softmax change in a separate PR, as I don't think we will merge the concat_past_present change.

@aditya-dl aditya-dl force-pushed the fix-fp16-overflow-gqa-attention branch from 6f03700 to 9da5d9f on March 19, 2026 at 17:39
@aditya-dl
Contributor Author

@pfultz2 Updated this PR with only the softmax change.

