
[AMD][gfx950] Add gfx950 support for DeepGeem example #2028

Merged
LeiWang1999 merged 6 commits into tile-ai:main from zhangnju:deepgemm_gfx950
Apr 12, 2026

Conversation

@zhangnju
Collaborator

@zhangnju zhangnju commented Apr 10, 2026

Hi @LeiWang1999,

This PR adds gfx950 support for the DeepGeem example. It extends HIP code generation to support the wider MFMA accumulator register types required by gfx950, and includes the following changes:

  1. Add a float32x32 GCC vector extension type alias (32 × float)
  2. Emit correct type names for float32x16, float32x32, and bfloat16x16 in PrintType, PrintVecElemLoad/Store, and PrintVecElemLoadExpr in codegen_hip.cc

Thanks

Summary by CodeRabbit

  • New Features

    • Added support for AMD ROCm GPU architecture gfx950.
    • Expanded HIP vector accumulator support to enable higher-performance MFMA code generation.
  • Improvements

    • Buffer loads now correctly handle additional leading dimensions, improving correctness for higher-rank / multi-stage buffers.

@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run `pre-commit run --all-files` in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai
Contributor

coderabbitai bot commented Apr 10, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7f72d555-876e-4137-b99e-13129d8fbcbc

📥 Commits

Reviewing files that changed from the base of the PR and between c63b81d and 5bab730.

📒 Files selected for processing (1)
  • src/target/codegen_hip.cc
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/target/codegen_hip.cc

📝 Walkthrough

Walkthrough

Adds support for new HIP accumulator vector types (float32x16, float32x32, bfloat16x16) in HIP codegen, defines a float32x32 alias, adjusts MFMA macro buffer indexing to handle extra leading dimensions, and updates ROCm Dockerfile to include gfx950 in PYTORCH_ROCM_ARCH.

Changes

Cohort / File(s) — Summary

  • ROCm Architecture Configuration — docker/Dockerfile.rocm
    Added gfx950 to PYTORCH_ROCM_ARCH (now gfx90a;gfx942;gfx950;gfx1201;gfx1100).
  • HIP Vector Types — src/tl_templates/hip/common.h
    Added public alias float32x32 as a 32-element float GCC vector type.
  • HIP Code Generation (accumulators & accessors) — src/target/codegen_hip.cc
    Emits the new accumulator vector types (float32x16, float32x32, bfloat16x16); lane-aware element load/store selection (vec[i] vs vec.data[i]); expanded MFMA dtype mapping; updated BroadcastNode emission and threadblock_swizzle_pattern call syntax.
  • MFMA Macro Generator (buffer indexing) — tilelang/intrinsics/mfma_macro_generator.py
    ldmatrix_a/ldmatrix_b now compute min offsets for leading (non-2-D) region dims and incorporate them into multi-dimensional buffer indexing to support higher-rank region axes.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • Gongen-Ali

"🐇 I hopped through code with glee tonight,
New vectors tucked in, ready for flight.
gfx950 joins the GPU crew,
Multi-dim buffers now index true,
Cheers from a rabbit—compile alright!"

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed — Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title '[AMD][gfx950] Add gfx950 support for DeepGeem example' directly summarizes the main change: adding gfx950 GPU architecture support. It is concise, specific, and clearly indicates the primary objective of the changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/target/codegen_hip.cc`:
- Around line 1226-1227: The generated template instantiation incorrectly passes
panel_size to tl::<func_name> when the helper templates expect template
parameter panel_width; update the code in src/target/codegen_hip.cc where you
emit "tl::" << func_name << "<" << panel_size << ">()" to instead pass the
panel_width template argument (e.g., use an existing panel_width variable or
compute panel_width = panel_size / gridDim.x before emission) so the emitted
instantiation matches the tl_templates/hip/threadblock_swizzle.h signature
(template<int panel_width>).

In `@tilelang/intrinsics/mfma_macro_generator.py`:
- Around line 331-332: The code currently collapses leading BufferRegion axes by
taking only r.min (e.g. A_other = [r.min for r in A_region.region[:-2]]), which
silently slices away higher-rank data; change these sites to either assert that
every leading axis has extent == 1 or propagate full indexing information:
replace the direct r.min extraction with a validation step (for each r in
A_region.region[:-2] and B_region.region[:-2] assert r.extent == 1 and
raise/abort if not) or build and pass through per-axis index expressions (keep
r.min and r.extent or construct index vectors) so higher-rank regions are
correctly addressed; apply the same fix to all similar variables and spots
(A_other, B_other and the blocks around the noted regions where r.min is used
for prefix axes).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f10421f2-536d-4c0d-a39f-5c79c5d8cdd3

📥 Commits

Reviewing files that changed from the base of the PR and between b1a88bf and 60cc1be.

📒 Files selected for processing (4)
  • docker/Dockerfile.rocm
  • src/target/codegen_hip.cc
  • src/tl_templates/hip/common.h
  • tilelang/intrinsics/mfma_macro_generator.py

Comment thread src/target/codegen_hip.cc
Comment on lines +1226 to +1227
this->stream << "const dim3 blockIdx = tl::" << func_name << "<"
             << panel_size << ">();\n";
Contributor


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

hdr="$(fd 'threadblock_swizzle\.h$' | head -n1)"
echo "Inspecting: ${hdr}"
sed -n '1,240p' "${hdr}"
echo
echo "--- template / panel_size matches ---"
rg -n -C2 'template\s*<[^>]*panel_size[^>]*>|panel_size' "${hdr}"

Repository: tile-ai/tilelang

Length of output: 8112


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Searching for all threadblock_swizzle.h files ==="
fd 'threadblock_swizzle\.h$'

echo
echo "=== Checking HIP-specific version if exists ==="
if [ -f "src/tl_templates/hip/threadblock_swizzle.h" ]; then
  echo "Found HIP version:"
  sed -n '1,150p' "src/tl_templates/hip/threadblock_swizzle.h"
  echo
  echo "--- template / panel matches in HIP ---"
  rg -n 'template\s*<[^>]*>|panel' "src/tl_templates/hip/threadblock_swizzle.h" | head -20
else
  echo "No HIP-specific version found"
fi

echo
echo "=== Checking what func_name could be in codegen_hip.cc ==="
rg -B5 -A5 'threadblock_swizzle|func_name.*swizzle|rasterization2D' src/target/codegen_hip.cc | head -50

Repository: tile-ai/tilelang

Length of output: 5040


Template parameter mismatch: panel_size passed but panel_width expected.

Line 1226–1227 generates tl::<func_name><panel_size>(), but the HIP threadblock swizzle helpers in src/tl_templates/hip/threadblock_swizzle.h declare templates with template <int panel_width>. Inside these templates, panel_size is computed as a local variable (e.g., panel_size = panel_width * gridDim.x), not a template parameter. The generated code passes the computed panel_size directly as a template argument, which does not match the declared panel_width parameter. This causes a compile-time error.


Comment on lines +331 to +332
# Leading dimensions (e.g. pipeline stage axis) – empty for 2-D buffers
A_other = [r.min for r in A_region.region[:-2]]
Contributor


⚠️ Potential issue | 🟡 Minor

Make the singleton-prefix assumption explicit.

Line 332 and Line 376 only keep r.min for the leading BufferRegion axes. If any of those extents is > 1, these loads silently collapse the region to its first slice. Please either assert that all prefix extents are singleton here, or plumb the extra indices through so higher-rank regions are actually addressed.

Also applies to: 348-354, 375-376, 395-405
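The suggested guard could look like the following sketch. Range here is a minimal stand-in for TVM's Range (with .min and .extent attributes), and leading_offsets is a hypothetical helper, not code from the PR:

```python
from dataclasses import dataclass


@dataclass
class Range:
    """Minimal stand-in for tvm.ir.Range with .min and .extent."""
    min: int
    extent: int


def leading_offsets(region):
    """Min offsets of all axes except the trailing 2-D tile, with the
    singleton-prefix assumption made explicit instead of silent."""
    prefix = region[:-2]
    for i, r in enumerate(prefix):
        if r.extent != 1:
            raise ValueError(
                f"leading axis {i} has extent {r.extent}; only singleton "
                f"prefix axes are supported")
    return [r.min for r in prefix]
```

With this guard, a pipeline-stage axis such as [Range(2, 1), Range(0, 64), Range(0, 64)] yields [2], while a genuinely higher-rank region fails loudly instead of being collapsed to its first slice.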

🧰 Tools
🪛 Ruff (0.15.9)

[warning] 331-331: Comment contains ambiguous (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF003)


@LeiWang1999
Member

Approved. Some ROCm CI-related issues were not introduced by this PR and will be fixed in #2033.

@LeiWang1999 LeiWang1999 merged commit 7a515b5 into tile-ai:main Apr 12, 2026
5 of 6 checks passed
kurisu6912 pushed a commit that referenced this pull request Apr 13, 2026
* add MI355 support and fix some rocm releated issues

* add MI355 support and fix some rocm releated issues

* update

* add gfx950 support for deepgemm example

* lint fix

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
