
[AMD][gfx950] Add gfx950 support for DeepGeem example #2028

Merged
LeiWang1999 merged 6 commits into tile-ai:main from zhangnju:deepgemm_gfx950
Apr 12, 2026

Conversation

@zhangnju
Collaborator

@zhangnju zhangnju commented Apr 10, 2026

Hi @LeiWang1999,

This PR adds gfx950 support for the DeepGeem example. It extends HIP code generation to support the wider MFMA accumulator register types required by gfx950, and includes the following changes:

  1. Add a float32x32 GCC vector extension type alias (32 × float)
  2. Emit correct type names for float32x16, float32x32, and bfloat16x16 in PrintType, PrintVecElemLoad/Store, and PrintVecElemLoadExpr in codegen_hip.cc

Thanks

Summary by CodeRabbit

  • New Features

    • Added support for AMD ROCm GPU architecture gfx950.
    • Expanded HIP vector accumulator support to enable higher-performance MFMA code generation.
  • Improvements

    • Buffer loads now correctly handle additional leading dimensions, improving correctness for higher-rank / multi-stage buffers.

@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run `pre-commit run --all-files` in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai
Contributor

coderabbitai bot commented Apr 10, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7f72d555-876e-4137-b99e-13129d8fbcbc

📥 Commits

Reviewing files that changed from the base of the PR and between c63b81d and 5bab730.

📒 Files selected for processing (1)
  • src/target/codegen_hip.cc
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/target/codegen_hip.cc

📝 Walkthrough

Walkthrough

Adds support for new HIP accumulator vector types (float32x16, float32x32, bfloat16x16) in HIP codegen, defines a float32x32 alias, adjusts MFMA macro buffer indexing to handle extra leading dimensions, and updates ROCm Dockerfile to include gfx950 in PYTORCH_ROCM_ARCH.

Changes

Cohort / File(s) — Summary

  • ROCm Architecture Configuration — docker/Dockerfile.rocm
    Added gfx950 to PYTORCH_ROCM_ARCH (now gfx90a;gfx942;gfx950;gfx1201;gfx1100).
  • HIP Vector Types — src/tl_templates/hip/common.h
    Added public alias float32x32 as a 32-element float GCC vector type.
  • HIP Code Generation (accumulators & accessors) — src/target/codegen_hip.cc
    Emits the new accumulator vector types (float32x16, float32x32, bfloat16x16); lane-aware element load/store selection (vec[i] vs vec.data[i]); expanded MFMA dtype mapping; updated BroadcastNode emission and threadblock_swizzle_pattern call syntax.
  • MFMA Macro Generator (buffer indexing) — tilelang/intrinsics/mfma_macro_generator.py
    ldmatrix_a/ldmatrix_b now compute min offsets for leading (non-2-D) region dims and incorporate them into multi-dimensional buffer indexing to support higher-rank region axes.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • Gongen-Ali

"🐇 I hopped through code with glee tonight,
New vectors tucked in, ready for flight.
gfx950 joins the GPU crew,
Multi-dim buffers now index true,
Cheers from a rabbit—compile alright!"

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed — Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title '[AMD][gfx950] Add gfx950 support for DeepGeem example' directly summarizes the main change: adding gfx950 GPU architecture support. It is concise, specific, and clearly indicates the primary objective of the changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/target/codegen_hip.cc`:
- Around line 1226-1227: The generated template instantiation incorrectly passes
panel_size to tl::<func_name> when the helper templates expect template
parameter panel_width; update the code in src/target/codegen_hip.cc where you
emit "tl::" << func_name << "<" << panel_size << ">()" to instead pass the
panel_width template argument (e.g., use an existing panel_width variable or
compute panel_width = panel_size / gridDim.x before emission) so the emitted
instantiation matches the tl_templates/hip/threadblock_swizzle.h signature
(template<int panel_width>).

In `@tilelang/intrinsics/mfma_macro_generator.py`:
- Around line 331-332: The code currently collapses leading BufferRegion axes by
taking only r.min (e.g. A_other = [r.min for r in A_region.region[:-2]]), which
silently slices away higher-rank data; change these sites to either assert that
every leading axis has extent == 1 or propagate full indexing information:
replace the direct r.min extraction with a validation step (for each r in
A_region.region[:-2] and B_region.region[:-2] assert r.extent == 1 and
raise/abort if not) or build and pass through per-axis index expressions (keep
r.min and r.extent or construct index vectors) so higher-rank regions are
correctly addressed; apply the same fix to all similar variables and spots
(A_other, B_other and the blocks around the noted regions where r.min is used
for prefix axes).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f10421f2-536d-4c0d-a39f-5c79c5d8cdd3

📥 Commits

Reviewing files that changed from the base of the PR and between b1a88bf and 60cc1be.

📒 Files selected for processing (4)
  • docker/Dockerfile.rocm
  • src/target/codegen_hip.cc
  • src/tl_templates/hip/common.h
  • tilelang/intrinsics/mfma_macro_generator.py

Comment thread src/target/codegen_hip.cc
Comment on lines +1226 to +1227
this->stream << "const dim3 blockIdx = tl::" << func_name << "<"
             << panel_size << ">();\n";
Contributor


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

hdr="$(fd 'threadblock_swizzle\.h$' | head -n1)"
echo "Inspecting: ${hdr}"
sed -n '1,240p' "${hdr}"
echo
echo "--- template / panel_size matches ---"
rg -n -C2 'template\s*<[^>]*panel_size[^>]*>|panel_size' "${hdr}"

Repository: tile-ai/tilelang

Length of output: 8112


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Searching for all threadblock_swizzle.h files ==="
fd 'threadblock_swizzle\.h$'

echo
echo "=== Checking HIP-specific version if exists ==="
if [ -f "src/tl_templates/hip/threadblock_swizzle.h" ]; then
  echo "Found HIP version:"
  sed -n '1,150p' "src/tl_templates/hip/threadblock_swizzle.h"
  echo
  echo "--- template / panel matches in HIP ---"
  rg -n 'template\s*<[^>]*>|panel' "src/tl_templates/hip/threadblock_swizzle.h" | head -20
else
  echo "No HIP-specific version found"
fi

echo
echo "=== Checking what func_name could be in codegen_hip.cc ==="
rg -B5 -A5 'threadblock_swizzle|func_name.*swizzle|rasterization2D' src/target/codegen_hip.cc | head -50

Repository: tile-ai/tilelang

Length of output: 5040


Template parameter mismatch: panel_size passed but panel_width expected.

Line 1226–1227 generates tl::<func_name><panel_size>(), but the HIP threadblock swizzle helpers in src/tl_templates/hip/threadblock_swizzle.h declare templates with template <int panel_width>. Inside these templates, panel_size is computed as a local variable (e.g., panel_size = panel_width * gridDim.x), not a template parameter. The generated code passes the computed panel_size directly as a template argument, which does not match the declared panel_width parameter. This causes a compile-time error.


Comment on lines +331 to +332
# Leading dimensions (e.g. pipeline stage axis) – empty for 2-D buffers
A_other = [r.min for r in A_region.region[:-2]]
Contributor


⚠️ Potential issue | 🟡 Minor

Make the singleton-prefix assumption explicit.

Line 332 and Line 376 only keep r.min for the leading BufferRegion axes. If any of those extents is > 1, these loads silently collapse the region to its first slice. Please either assert that all prefix extents are singleton here, or plumb the extra indices through so higher-rank regions are actually addressed.

Also applies to: 348-354, 375-376, 395-405
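The suggested guard could look like the following sketch. Range here is a minimal stand-in for TVM's Range (with .min and .extent attributes), and leading_offsets is a hypothetical helper, not code from the PR:

```python
from dataclasses import dataclass


@dataclass
class Range:
    """Minimal stand-in for tvm.ir.Range with .min and .extent."""
    min: int
    extent: int


def leading_offsets(region):
    """Min offsets of all axes except the trailing 2-D tile, with the
    singleton-prefix assumption made explicit instead of silent."""
    prefix = region[:-2]
    for i, r in enumerate(prefix):
        if r.extent != 1:
            raise ValueError(
                f"leading axis {i} has extent {r.extent}; only singleton "
                f"prefix axes are supported")
    return [r.min for r in prefix]
```

With this guard, a pipeline-stage axis such as [Range(2, 1), Range(0, 64), Range(0, 64)] yields [2], while a genuinely higher-rank region fails loudly instead of being collapsed to its first slice.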

🧰 Tools
🪛 Ruff (0.15.9)

[warning] 331-331: Comment contains ambiguous (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF003)


@LeiWang1999
Member

Approved. Some ROCm CI-related issues were not introduced by this PR and will be fixed in #2033.

@LeiWang1999 LeiWang1999 merged commit 7a515b5 into tile-ai:main Apr 12, 2026
5 of 6 checks passed
kurisu6912 pushed a commit that referenced this pull request Apr 13, 2026
* add MI355 support and fix some rocm releated issues

* add MI355 support and fix some rocm releated issues

* update

* add gfx950 support for deepgemm example

* lint fix

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
