diff --git a/src/prompts/prompt_for_generation.py b/src/prompts/prompt_for_generation.py index 6d07461..292021e 100644 --- a/src/prompts/prompt_for_generation.py +++ b/src/prompts/prompt_for_generation.py @@ -1,131 +1,108 @@ - prompt = """ -You are an expert Python programmer specializing in NVIDIA Triton kernels, specifically targeting **AMD GPUs using the ROCm environment**. -Your task is to generate a Python code snippet containing a Triton kernel based on the following request: +You are an expert Python programmer specializing in Triton kernels for **AMD GPUs (ROCm)**. +Generate a **single, complete, syntactically-correct Python code block** that implements the requested kernel. **Target Platform:** AMD GPU (ROCm) **Request:** {instruction} -**CRITICAL FUNCTION INFORMATION:** -Based on analysis, the implementation requires these EXACT function signatures: +**CRITICAL FUNCTION INFORMATION (do NOT change):** +Use EXACTLY the following function signatures: {function_signatures} -**Output Requirements:** -1. **AMD Compatibility:** Generate code compatible with AMD GPUs and ROCm. **DO NOT use CUDA-specific features or functions (e.g., `tl.libdevice`).** -2. **Complete Code:** Generate a single, complete, and syntactically correct Python code block. -3. **Triton Kernel:** The core logic must be implemented within a Triton kernel function decorated with `@triton.jit`. -4. **Imports:** ALWAYS include necessary imports at the beginning: - ```python - import torch - import triton - import triton.language as tl - # import math # Only if standard math functions are truly needed outside the kernel - ``` - Include other imports *only if absolutely necessary*. -5. **Function Signature (CRITICAL):** - * Define EACH function with EXACTLY the signature shown above. - * DO NOT change parameter names, counts, or order. - * Ensure all parameters in function calls match their function definitions. - * **Type Hints:** Use PyTorch tensor type hints (e.g., `x: torch.Tensor`) for tensor arguments. **DO NOT use `tl.pointer`**. Use standard Python types (e.g., `int`, `float`) or `tl.constexpr` for others. - * **`constexpr`:** Use `tl.constexpr` **ONLY** for arguments that *must* be known at compile time, typically block sizes (like `BLOCK_SIZE`, `BLOCK_M`) or flags that change the kernel's structure (like `IS_EVEN_K`). Simple numerical values like `eps` or `dropout_p` are usually *not* `constexpr`. -6. **Data Types:** Be precise with data types inside the kernel (e.g., `tl.float16`, `tl.float32`, `tl.int32`). Ensure type compatibility. Assume input tensors might be `torch.float16` or `torch.float32` unless specified otherwise. Pay attention to potential type promotion/conversion needs (e.g., using `.to(tl.float32)` for accumulations). -7. **Triton Operations:** - * Use Triton language functions correctly (`tl.load`, `tl.store`, `tl.dot`, `tl.arange`, `tl.program_id`, `tl.where`, `tl.atomic_cas`, etc.). - * **Pointers & Masks:** Be extremely careful when constructing pointers using offsets and strides. Ensure masks in `tl.load`/`tl.store` are correctly computed and match pointer dimensions. Avoid `ValueError: Mask argument cannot be block type...` or `ValueError: Unsupported ptr type...`. - * **`tl.dot`:** Ensure inputs are 2D blocks and have compatible types (e.g., float16, bfloat16). Int32 is generally not supported directly as input. - * **`tl.arange`:** Arguments `start` and `end` **must be `tl.constexpr`**. - * **Math:** Use functions from `tl.math` where available (e.g., `tl.math.exp`, `tl.math.sqrt`). Check function existence; avoid assuming functions like `tanh` or `log1p` exist if they don't in `tl.math`. -8. **Triton Version:** Assume Triton version 3.1.0 or later. +**Hard Requirements (MUST follow):** +1) **ROCm-only:** Do NOT use CUDA-only features (e.g., `tl.libdevice`, CUDA streams/APIs). +2) **Single full code block:** The output must be one fenced Python block containing: + - Required imports only: + ```python + import torch + import triton + import triton.language as tl + # import math # only if math.* is used outside kernels + ``` + - The public Python function(s) with EXACT signatures from above. + - One or more `@triton.jit` kernels that the public function(s) launch. +3) **Signatures locked:** Do not change function names, parameter names, counts, or order. + - Calls must match definitions 1:1 (no missing or extra args). + - Use `torch.Tensor` type hints for tensor params; use Python scalars or `tl.constexpr` for compile-time meta-params only. +4) **Types & numerics:** + - Be explicit with Triton dtypes (`tl.float16`, `tl.float32`, `tl.int32`, etc.). + - For reductions / accumulations, prefer `tl.float32` and cast back to output dtype when storing. + - Avoid unsupported math; if needed, use `tl.math` (e.g., `tl.math.exp`, `tl.math.sqrt`). +5) **Triton ops & shapes:** + - Use `tl.load` / `tl.store` with correct pointer arithmetic and **matching mask shapes**; avoid OOB. + - `tl.arange` bounds must be `tl.constexpr`. + - `tl.dot` inputs must be 2D tiles with supported dtypes (fp16/bf16 to fp32 accumulate). +6) **Grid & program IDs:** + - Make grid dimensionality consistent with how you read `tl.program_id(n)`. + - If grid is 1D, do not read a second program dimension. +7) **Triton version:** Assume Triton >= 3.1.0. -**FINAL VERIFICATION:** -Before completing, verify: -1. ALL functions defined in the code have EXACT signatures matching the required function signatures above. -2. ALL function calls exactly match their definitions in terms of parameter counts and names. -3. No functions are called without being defined. -4. No parameters are missing from your implementations. +**Final Self-Check (before you finish):** +- [ ] All public functions exist and match EXACT signatures listed above. +- [ ] Every call matches its callee’s params (names/order/count). +- [ ] No undefined names; no missing imports; no placeholder variables left. +- [ ] All pointer masks are correctly shaped; no rank mismatches. +- [ ] Code compiles for ROCm (no CUDA-only APIs). **Generated AMD ROCm Compatible Triton Kernel Code:** """ - prompt_rocm = """ -You are an expert Python programmer specializing in NVIDIA Triton kernels, specifically targeting **AMD GPUs using the ROCm environment**. -Your task is to generate a Python code snippet containing a Triton kernel based on the following request: +You are an expert Python programmer specializing in Triton kernels for **AMD GPUs (ROCm)**. +Generate a **single, complete, syntactically-correct Python code block** that implements the requested kernel with attention to performance. **Target Platform:** AMD GPU (ROCm) **Request:** {instruction} -**CRITICAL FUNCTION INFORMATION:** -Based on analysis, the implementation requires these EXACT function signatures: +**CRITICAL FUNCTION INFORMATION (do NOT change):** +Use EXACTLY the following function signatures: {function_signatures} -**Output Requirements:** -1. **AMD Compatibility:** Generate code compatible with AMD GPUs and ROCm. **DO NOT use CUDA-specific features or functions (e.g., `tl.libdevice`).** -2. **Complete Code:** Generate a single, complete, and syntactically correct Python code block. -3. **Triton Kernel:** The core logic must be implemented within a Triton kernel function decorated with `@triton.jit`. -4. **Imports:** ALWAYS include necessary imports at the beginning: - ```python - import torch - import triton - import triton.language as tl - # import math # Only if standard math functions are truly needed outside the kernel - ``` - Include other imports *only if absolutely necessary*. -5. **Function Signature (CRITICAL):** - * Define EACH function with EXACTLY the signature shown above. - * DO NOT change parameter names, counts, or order. - * Ensure all parameters in function calls match their function definitions. - * **Type Hints:** Use PyTorch tensor type hints (e.g., `x: torch.Tensor`) for tensor arguments. **DO NOT use `tl.pointer`**. Use standard Python types (e.g., `int`, `float`) or `tl.constexpr` for others. - * **`constexpr`:** Use `tl.constexpr` **ONLY** for arguments that *must* be known at compile time, typically block sizes (like `BLOCK_SIZE`, `BLOCK_M`) or flags that change the kernel's structure (like `IS_EVEN_K`). Simple numerical values like `eps` or `dropout_p` are usually *not* `constexpr`. -6. **Data Types:** Be precise with data types inside the kernel (e.g., `tl.float16`, `tl.float32`, `tl.int32`). Ensure type compatibility. Assume input tensors might be `torch.float16` or `torch.float32` unless specified otherwise. Pay attention to potential type promotion/conversion needs (e.g., using `.to(tl.float32)` for accumulations). -7. **Triton Operations:** - * Use Triton language functions correctly (`tl.load`, `tl.store`, `tl.dot`, `tl.arange`, `tl.program_id`, `tl.where`, `tl.atomic_cas`, etc.). - * **Pointers & Masks:** Be extremely careful when constructing pointers using offsets and strides. Ensure masks in `tl.load`/`tl.store` are correctly computed and match pointer dimensions. Avoid `ValueError: Mask argument cannot be block type...` or `ValueError: Unsupported ptr type...`. - * **`tl.dot`:** Ensure inputs are 2D blocks and have compatible types (e.g., float16, bfloat16). Int32 is generally not supported directly as input. - * **`tl.arange`:** Arguments `start` and `end` **must be `tl.constexpr`**. - * **Math:** Use functions from `tl.math` where available (e.g., `tl.math.exp`, `tl.math.sqrt`). Check function existence; avoid assuming functions like `tanh` or `log1p` exist if they don't in `tl.math`. -8. **Triton Version:** Assume Triton version 3.2.0 or later. -9. Maximize performance by exploring the following: -i. Autotuning key parameters BLOCK_SIZE, num_stages, num_warps. -ii. Better algorithmic implementation (e.g., naive softmax vs online softmax vs fused softmax), better memory access patterns and numerical stability. -iii. exploring all possible operator fusion strategies within the kernel while adhering to resource constraints. -Primary Autotuning Fields (Mandatory) -1. BLOCK_M, BLOCK_N, BLOCK_K - * Tile sizes for GEMM or other tensor contractions. - * Larger blocks improve compute density, but reduce grid-level parallelism. - * Explore wide range of values like: - * BLOCK: [32, ..., 128, ..., 2048, ...] - * Adjust based on memory reuse and L2 cache locality. -2. num_stages=n - * Controls pipeline depth for kernel execution. - * Rules for setting this: - * 1 if no GEMM. - * 2 if a single GEMM (e.g., GEMM + ReLU). - * 1 if two GEMMs are fused (e.g., Flash Attention). - * Optimize for latency and execution overlap. -3. num_warps - * Controls number of warps (groups of 64 threads) to launch per block. - * If it is too low then underutilization -> kernel runs slow. - * If it is too high then register spill happens and shared memory is overused -> kernel runs slow. - * You must choose a sweet spot by trying out integer range of 1 to 16. - * You MUST NOT try the range beyond 16, it is NOT VALID. -Examples of Autotuning Setup -Here's how Triton kernels should be decorated to allow autotuning: - * key argument indicates the variables that change and trigger autotune to re-run. This is a must argument and you must not miss this. - * BLOCK_M refers to the chunk of variable M that will be used for compute by a thread at a time. - * You must ensure that variables used in the triton.Config should not be passed as arguments to the triton kernel. -For example: the following autotune config receives BLOCK_SIZE_M, BLOCK_SIZE_N, BLOCK_SIZE_K, GROUP_SIZE_M, num_warps, and num_stages as input arguments. Hence the triton kernel must not receive these arguments as inputs in the wrapper function. You must comment/delete any such instances. - -NOTE: If you face kernel timeout issues, check if Grid and Program ID Mismatch exists or not for example The kernel is launched with a 1-dimensional (1D) grid, but inside the kernel, it attempts to read program IDs from a 2-dimensional (2D) grid etc. +**Hard Requirements (MUST follow):** +1) **ROCm-only:** Do NOT use CUDA-only features (e.g., `tl.libdevice`, CUDA streams/APIs). +2) **Single full code block:** The output must be one fenced Python block containing: + - Required imports only: + ```python + import torch + import triton + import triton.language as tl + # import math # only if math.* is used outside kernels + ``` + - The public Python function(s) with EXACT signatures from above. + - One or more `@triton.jit` kernels that the public function(s) launch. +3) **Signatures locked:** Do not change function names, parameter names, counts, or order. + - Calls must match definitions 1:1 (no missing or extra args). + - Use `torch.Tensor` type hints for tensor params; use Python scalars or `tl.constexpr` for compile-time meta-params only. +4) **Types & numerics:** + - Be explicit with Triton dtypes (`tl.float16`, `tl.float32`, `tl.int32`, etc.). + - Prefer `tl.float32` accumulation for numerical stability; cast to output dtype on store. + - Use `tl.math` where appropriate (e.g., `tl.math.exp`, `tl.math.sqrt`). +5) **Triton ops & memory:** + - Use `tl.load`/`tl.store` with correct pointer arithmetic and **proper masks**; avoid OOB and mask–ptr rank mismatch. + - `tl.arange` bounds must be `tl.constexpr`. + - Ensure coalesced accesses and avoid bank conflicts; use tiling/blocking for reuse. + - `tl.dot` only with supported dtypes (e.g., fp16/bf16), accumulate in fp32. +6) **Grid & program IDs:** + - Keep grid dimensionality consistent with accesses to `tl.program_id(n)`. Do not read a second dimension if the launch grid is 1D. +7) **Autotuning for performance (where meaningful):** + - Provide Triton autotune configs exploring: + - BLOCK sizes (e.g., `BLOCK_M`, `BLOCK_N`, `BLOCK_K`) across a reasonable range (e.g., 32..256 or problem-appropriate). + - `num_warps` in [1..16] (do not exceed 16). + - `num_stages` with typical values {1, 2, 3}, chosen by fusion depth and pipeline overlap. + - Ensure meta-params used in `triton.Config` are NOT passed as kernel runtime arguments. + - Favor coalescing, occupancy, and cache locality; reduce register pressure to avoid spills. +8) **Triton version:** Assume Triton >= 3.2.0. -def grid(args: dict[str, Any]) -> tuple[int]: - # This creates a 1D grid of size (C * D, ) - return (triton.cdiv(M, args["BLOCK_SIZE_M"]) * triton.cdiv(N, args["BLOCK_SIZE_N"]), ) +**Final Self-Check (before you finish):** +- [ ] All public functions exist and match EXACT signatures listed above. +- [ ] Every call matches its callee’s params (names/order/count). +- [ ] No undefined names; no missing imports; no placeholder variables left. +- [ ] Pointer masks correctly shaped; grid dims match `tl.program_id` usage. +- [ ] No CUDA-only APIs; valid on ROCm. +- [ ] If autotune is provided, configs are sensible and compile-time meta-params are `tl.constexpr`. -The grid is calculated as a single integer, creating a 1D grid, however the kernel might try to get two separate program IDs, pid_m and pid_n, as if it were a 2D grid: -pid_m = tl.program_id(0) # Gets the ID for the first dimension -pid_n = tl.program_id(1) # Tries to get ID for a non-existent second dimension -""" \ No newline at end of file +**Generated AMD ROCm Compatible Triton Kernel Code:** +""" diff --git a/src/prompts/prompt_for_reflection.py b/src/prompts/prompt_for_reflection.py index fe3f936..852838a 100644 --- a/src/prompts/prompt_for_reflection.py +++ b/src/prompts/prompt_for_reflection.py @@ -1,285 +1,166 @@ +# Create a clean, ASCII-only Python file containing the 6 prompt strings. + + prompt = """ -You are an expert in writing Triton operators for efficient GPU programming. Analyze the failed test cases and provide insights -on why the solution failed and how it could be improved. Be specific about the issues found. +You are an expert in writing Triton operators for efficient GPU programming. Analyze the failed test cases and explain +why the solution failed and how to improve it. Be concrete and specific. **Original problem:** - {problem} **Attempted solution:** - {solution} -**Test results:** - +**Test results / error logs:** {test_result} -**Important Instructions:** -- Think before writing the reflection and no more explanation is required after the reflection. -- You should not suggest changes to the name of the function. -- generate the reflection wrapped in a code block with the tag `reflection`, e.g. -"```markdown```" - +**Reflection requirements (MUST address succinctly):** +- Root causes: shapes/strides/dtypes/masks, grid dimension vs tl.program_id, tl.arange constexpr bounds, pointer-mask rank match, out-of-bounds. +- Triton specifics: correct use of tl.load/tl.store, reduction correctness (associativity, fp32 accumulation), correct tile shapes for tl.dot. +- Numerical stability: overflow/underflow, eps handling, NaN/Inf creation paths. +- Concrete fixes: point to exact lines/sections and describe minimal code changes; do NOT change function names. +- Quick checklist: signatures exact; calls match definitions; no undefined names; grid dims consistent; no CUDA-only APIs. + +**Output format:** +Wrap your reflection ONLY in a fenced block with the tag reflection, for example: +three backticks + reflection + +three backticks +No extra commentary outside the block. """ prompt_exe = """ -You are an expert in writing Triton operators for efficient GPU programming. Analyze the failed test cases and provide insights -on why the solution failed and how it could be improved. Be specific about the issues found. -Runnable test is used to test if the code can be successfully executed. -Correctness test is used to test if the output of the code is correct, i.e. if the code does implement the functionality required in the original problem. +You are an expert in writing Triton operators for efficient GPU programming. Analyze the failed tests and explain +why the solution failed and how to fix it. Distinguish between runnability and correctness issues. -**Original problem:** +Runnable test checks whether the code executes without crashing. +Correctness test checks whether outputs match the required functionality. +**Original problem:** {problem} **Attempted solution:** - {solution} **Results for runnable test:** - {call_test_result} **Results for correctness test:** - {exe_test_result} -**Important Instructions:** -- Think before writing the reflection and no more explanation is required after the reflection. -- You should not suggest changes to the name of the function. -- generate the reflection wrapped in a code block with the tag `reflection`, e.g. -"```markdown```" +**Reflection requirements (MUST address):** +- Runnable failures: compilation errors, undefined names, illegal masks, grid/program_id mismatch, invalid tl.arange bounds, CUDA-only usage. +- Correctness failures: wrong indexing/strides, mask shape mismatch, reduction order/precision, dtype cast issues, boundary conditions. +- Triton check: tl.load/tl.store masks match pointer shapes; tile sizes align with problem sizes; 1D vs 2D grid used consistently. +- Minimal, surgical fixes while preserving function names and signatures. +- Final checklist: signatures exact; call sites 1:1; no placeholders left; numerically stable. +**Output format:** +Return ONLY a fenced reflection block with your analysis and proposed fixes. """ prompt_ga = """ -You are an expert in writing Triton operators for efficient GPU programming. -Analyze this Triton code and its performance(latency in ms and efficiency in TFLOPS or GB/s), and give a summary about the optimization strategy that the code uses. -Provide insights on how to generate a new code with better performance. -You can use optimization strategies such as Memory access efficiency, Hardware resource utilization, IR analysis, Assembly analysis, Kernel occupancy, -TorchInductor with Triton tuning knobs and Auto-tunable kernel configurations and environment variables. +You are an expert in writing Triton operators for efficient GPU programming. +Analyze this Triton code and its performance (latency in ms and efficiency in TFLOPS or GB/s). Summarize the current optimization strategy, +identify bottlenecks, and propose concrete steps to achieve better performance. **Original problem:** - {problem} - -**Triton code:** +**Triton code:** {code} **Test results:** - -latency: {latency}" - -efficiency(TFLOPS, GB/s): {efficiency} - -**Important Instructions:** -- Think before writing the optimization and no more explanation is required after the reflection. -- You should not suggest changes to the name of the function and parameter names, counts, or order. -- generate the reflection wrapped in a code block with the tag `reflection`, e.g. -"```markdown```" - +latency (ms): {latency} +efficiency (TFLOPS / GB/s): {efficiency} + +**Reflection requirements (MUST address):** +- Current strategy: tiling/blocking, vectorization, memory access pattern (coalescing, reuse), use of shared/LDS, reduction scheme. +- Bottlenecks: occupancy (num_warps/num_stages), register pressure/spills, bank conflicts, uncoalesced loads/stores, synchronization overhead. +- Math precision: fp16/bf16 inputs with fp32 accumulation where needed; stability considerations. +- Concrete tuning plan: propose 6-12 autotune configs (BLOCK_* / num_warps 1..16 / num_stages 1..3) and expected trade-offs. +- Actionable changes: 3-5 prioritized edits (e.g., tile sizes, prefetching, pointer arithmetic, mask shaping, software pipelining). + +**Constraints:** +- Do NOT suggest changing function names or parameter lists/order. +- Output ONLY a fenced reflection block. """ prompt_rocm = """ -You are an expert in writing Triton operators for efficient GPU programming. Analyze the failed test cases and provide insights -on why the solution failed and how it could be improved. Be specific about the issues found. +You are an expert in writing Triton operators for efficient GPU programming on AMD ROCm. Analyze the failed test cases and explain +why the solution failed and how to improve it, focusing on ROCm compatibility and Triton best practices. **Original problem:** - {problem} **Attempted solution:** - {solution} -**Test results:** - +**Test results / error logs:** {test_result} -**Important Instructions:** -- Think before writing the reflection and no more explanation is required after the reflection. -- You should not suggest changes to the name of the function. -- generate the reflection wrapped in a code block with the tag `reflection`, e.g. -"```markdown```" - -Maximize performance by exploring the following: -i. Autotuning key parameters BLOCK_SIZE, num_stages, num_warps. -ii. Better algorithmic implementation (e.g., naive softmax vs online softmax vs fused softmax), better memory access patterns and numerical stability. -iii. exploring all possible operator fusion strategies within the kernel while adhering to resource constraints. -Primary Autotuning Fields (Mandatory) -1. BLOCK_M, BLOCK_N, BLOCK_K - * Tile sizes for GEMM or other tensor contractions. - * Larger blocks improve compute density, but reduce grid-level parallelism. - * Explore wide range of values like: - * BLOCK: [32, ..., 128, ..., 2048, ...] - * Adjust based on memory reuse and L2 cache locality. -2. num_stages=n - * Controls pipeline depth for kernel execution. - * Rules for setting this: - * 1 if no GEMM. - * 2 if a single GEMM (e.g., GEMM + ReLU). - * 1 if two GEMMs are fused (e.g., Flash Attention). - * Optimize for latency and execution overlap. -3. num_warps - * Controls number of warps (groups of 64 threads) to launch per block. - * If it is too low then underutilization -> kernel runs slow. - * If it is too high then register spill happens and shared memory is overused -> kernel runs slow. - * You must choose a sweet spot by trying out integer range of 1 to 16. - * You MUST NOT try the range beyond 16, it is NOT VALID. -Examples of Autotuning Setup -Here's how Triton kernels should be decorated to allow autotuning: - * key argument indicates the variables that change and trigger autotune to re-run. This is a must argument and you must not miss this. - * BLOCK_M refers to the chunk of variable M that will be used for compute by a thread at a time. - * You must ensure that variables used in the triton.Config should not be passed as arguments to the triton kernel. -For example: the following autotune config receives BLOCK_SIZE_M, BLOCK_SIZE_N, BLOCK_SIZE_K, GROUP_SIZE_M, num_warps, and num_stages as input arguments. Hence the triton kernel must not receive these arguments as inputs in the wrapper function. You must comment/delete any such instances. - -NOTE: If you face kernel timeout issues, check if Grid and Program ID Mismatch exists or not for example The kernel is launched with a 1-dimensional (1D) grid, but inside the kernel, it attempts to read program IDs from a 2-dimensional (2D) grid etc. - -def grid(args: dict[str, Any]) -> tuple[int]: - # This creates a 1D grid of size (C * D, ) - return (triton.cdiv(M, args["BLOCK_SIZE_M"]) * triton.cdiv(N, args["BLOCK_SIZE_N"]), ) - -The grid is calculated as a single integer, creating a 1D grid, however the kernel might try to get two separate program IDs, pid_m and pid_n, as if it were a 2D grid: -pid_m = tl.program_id(0) # Gets the ID for the first dimension -pid_n = tl.program_id(1) # Tries to get ID for a non-existent second dimension +**Reflection requirements (MUST address):** +- ROCm-specific pitfalls: any CUDA-only intrinsics/APIs, unsupported libdevice calls, wavefront-size assumptions. +- Grid/ID: ensure launch grid dimensionality matches tl.program_id usage; avoid reading non-existent dimensions. +- Triton semantics: tl.arange bounds as tl.constexpr; pointer arithmetic and masks rank alignment; out-of-bounds prevention. +- Numerics: fp32 accumulation for reductions; stability (log-sum-exp, eps). +- Concrete code fixes (no function renames), with brief line/section references. +- Final checklist: signatures exact; calls match defs; no undefined names; ROCm-compatible; masks correct. + +**Output format:** +Only return a fenced reflection block with your analysis and fixes. """ prompt_exe_rocm = """ -You are an expert in writing Triton operators for efficient GPU programming. Analyze the failed test cases and provide insights -on why the solution failed and how it could be improved. Be specific about the issues found. -Runnable test is used to test if the code can be successfully executed. -Correctness test is used to test if the output of the code is correct, i.e. if the code does implement the functionality required in the original problem. +You are an expert in writing Triton operators for efficient GPU programming on AMD ROCm. Analyze the failed tests and clearly separate +runnability issues from correctness issues. Provide precise fixes without renaming functions. -**Original problem:** +Runnable test verifies successful execution. Correctness test verifies functional equivalence. +**Original problem:** {problem} **Attempted solution:** - {solution} **Results for runnable test:** - {call_test_result} **Results for correctness test:** - {exe_test_result} -**Important Instructions:** -- Think before writing the reflection and no more explanation is required after the reflection. -- You should not suggest changes to the name of the function. -- generate the reflection wrapped in a code block with the tag `reflection`, e.g. -"```markdown```" - -Maximize performance by exploring the following: -i. Autotuning key parameters BLOCK_SIZE, num_stages, num_warps. -ii. Better algorithmic implementation (e.g., naive softmax vs online softmax vs fused softmax), better memory access patterns and numerical stability. -iii. exploring all possible operator fusion strategies within the kernel while adhering to resource constraints. -Primary Autotuning Fields (Mandatory) -1. BLOCK_M, BLOCK_N, BLOCK_K - * Tile sizes for GEMM or other tensor contractions. - * Larger blocks improve compute density, but reduce grid-level parallelism. - * Explore wide range of values like: - * BLOCK: [32, ..., 128, ..., 2048, ...] - * Adjust based on memory reuse and L2 cache locality. -2. num_stages=n - * Controls pipeline depth for kernel execution. - * Rules for setting this: - * 1 if no GEMM. - * 2 if a single GEMM (e.g., GEMM + ReLU). - * 1 if two GEMMs are fused (e.g., Flash Attention). - * Optimize for latency and execution overlap. -3. num_warps - * Controls number of warps (groups of 64 threads) to launch per block. - * If it is too low then underutilization -> kernel runs slow. - * If it is too high then register spill happens and shared memory is overused -> kernel runs slow. - * You must choose a sweet spot by trying out integer range of 1 to 16. - * You MUST NOT try the range beyond 16, it is NOT VALID. -Examples of Autotuning Setup -Here's how Triton kernels should be decorated to allow autotuning: - * key argument indicates the variables that change and trigger autotune to re-run. This is a must argument and you must not miss this. - * BLOCK_M refers to the chunk of variable M that will be used for compute by a thread at a time. - * You must ensure that variables used in the triton.Config should not be passed as arguments to the triton kernel. -For example: the following autotune config receives BLOCK_SIZE_M, BLOCK_SIZE_N, BLOCK_SIZE_K, GROUP_SIZE_M, num_warps, and num_stages as input arguments. Hence the triton kernel must not receive these arguments as inputs in the wrapper function. You must comment/delete any such instances. - -NOTE: If you face kernel timeout issues, check if Grid and Program ID Mismatch exists or not for example The kernel is launched with a 1-dimensional (1D) grid, but inside the kernel, it attempts to read program IDs from a 2-dimensional (2D) grid etc. - -def grid(args: dict[str, Any]) -> tuple[int]: - # This creates a 1D grid of size (C * D, ) - return (triton.cdiv(M, args["BLOCK_SIZE_M"]) * triton.cdiv(N, args["BLOCK_SIZE_N"]), ) - -The grid is calculated as a single integer, creating a 1D grid, however the kernel might try to get two separate program IDs, pid_m and pid_n, as if it were a 2D grid: -pid_m = tl.program_id(0) # Gets the ID for the first dimension -pid_n = tl.program_id(1) # Tries to get ID for a non-existent second dimension +**Reflection requirements (MUST address):** +- ROCm runnability: remove CUDA-only features; ensure tl.arange constexpr; fix grid vs program_id dimensionality; valid masks/pointers. +- Correctness: tile sizes vs shapes, stride math, dtype casts, fp32 accumulation, boundary handling, reduction associativity. +- Autotuning readiness: meta-params used in triton.Config must NOT be runtime kernel args; ensure compile-time tl.constexpr. +- Provide minimal edits (no function renames) and a quick validation checklist. + +**Output format:** +Respond ONLY with a fenced reflection block. """ prompt_ga_rocm = """ -You are an expert in writing Triton operators for efficient GPU programming. -Analyze this Triton code and its performance(speedup[vs reference kernel] for e.g. (1.6x) and efficiency in TFLOPS or GB/s), and give a summary about the optimization strategy that the code uses. -Provide insights on how to generate a new code with better performance. -You can use optimization strategies such as Memory access efficiency, Hardware resource utilization, IR analysis, Assembly analysis, Kernel occupancy, -TorchInductor with Triton tuning knobs and Auto-tunable kernel configurations and environment variables. +You are an expert in writing Triton operators for efficient GPU programming on AMD ROCm. +Explain the code performance (speedup vs reference and TFLOPS/GB/s), the optimization strategy used, and how to improve it further on ROCm. **Original problem:** - {problem} - -**Triton code:** +**Triton code:** {code} **Test results:** +Speedup (x): {latency} +efficiency (TFLOPS / GB/s): {efficiency} + +**Reflection requirements (MUST address):** +- Current strategy and ROCm fit: tiling/block sizes vs wavefront 64, LDS usage and padding, vectorized IO, prefetching. +- Bottlenecks: occupancy (num_warps/num_stages), register pressure/spills, memory divergence, bank conflicts, synchronization cost. +- Autotune proposal: 6-12 configs over BLOCK_M/N/K (or BLOCK_SIZE), num_warps in [1..16], num_stages in {1,2,3}; justify ranges for ROCm. +- Concrete steps: 3-5 prioritized edits with expected impact (e.g., better coalescing, tile re-shaping, software pipelining, on-the-fly dequant). +- Keep function signatures and parameter lists/order unchanged. + +**Output format:** +ONLY return a fenced reflection block. +""" -Speedup: {latency}" - -efficiency(TFLOPS, GB/s): {efficiency} - -**Important Instructions:** -- Think before writing the optimization and no more explanation is required after the reflection. -- You should not suggest changes to the name of the function and parameter names, counts, or order. -- generate the reflection wrapped in a code block with the tag `reflection`, e.g. -"```markdown```" - -Maximize performance by exploring the following: -i. Autotuning key parameters BLOCK_SIZE, num_stages, num_warps. -ii. Better algorithmic implementation (e.g., naive softmax vs online softmax vs fused softmax), better memory access patterns and numerical stability. -iii. exploring all possible operator fusion strategies within the kernel while adhering to resource constraints. -Primary Autotuning Fields (Mandatory) -1. BLOCK_M, BLOCK_N, BLOCK_K - * Tile sizes for GEMM or other tensor contractions. - * Larger blocks improve compute density, but reduce grid-level parallelism. - * Explore wide range of values like: - * BLOCK: [32, ..., 128, ..., 2048, ...] - * Adjust based on memory reuse and L2 cache locality. -2. num_stages=n - * Controls pipeline depth for kernel execution. - * Rules for setting this: - * 1 if no GEMM. - * 2 if a single GEMM (e.g., GEMM + ReLU). - * 1 if two GEMMs are fused (e.g., Flash Attention). - * Optimize for latency and execution overlap. -3. num_warps - * Controls number of warps (groups of 64 threads) to launch per block. - * If it is too low then underutilization -> kernel runs slow. - * If it is too high then register spill happens and shared memory is overused -> kernel runs slow. - * You must choose a sweet spot by trying out integer range of 1 to 16. - * You MUST NOT try the range beyond 16, it is NOT VALID. -Examples of Autotuning Setup -Here's how Triton kernels should be decorated to allow autotuning: - * key argument indicates the variables that change and trigger autotune to re-run. This is a must argument and you must not miss this. - * BLOCK_M refers to the chunk of variable M that will be used for compute by a thread at a time. - * You must ensure that variables used in the triton.Config should not be passed as arguments to the triton kernel. -For example: the following autotune config receives BLOCK_SIZE_M, BLOCK_SIZE_N, BLOCK_SIZE_K, GROUP_SIZE_M, num_warps, and num_stages as input arguments. Hence the triton kernel must not receive these arguments as inputs in the wrapper function. You must comment/delete any such instances. - -NOTE: If you face kernel timeout issues, check if Grid and Program ID Mismatch exists or not for example The kernel is launched with a 1-dimensional (1D) grid, but inside the kernel, it attempts to read program IDs from a 2-dimensional (2D) grid etc. - -def grid(args: dict[str, Any]) -> tuple[int]: - # This creates a 1D grid of size (C * D, ) - return (triton.cdiv(M, args["BLOCK_SIZE_M"]) * triton.cdiv(N, args["BLOCK_SIZE_N"]), ) - -The grid is calculated as a single integer, creating a 1D grid, however the kernel might try to get two separate program IDs, pid_m and pid_n, as if it were a 2D grid: -pid_m = tl.program_id(0) # Gets the ID for the first dimension -pid_n = tl.program_id(1) # Tries to get ID for a non-existent second dimension -""" \ No newline at end of file