diff --git a/src/prompts/prompt_for_generation.py b/src/prompts/prompt_for_generation.py
index 6d07461..292021e 100644
--- a/src/prompts/prompt_for_generation.py
+++ b/src/prompts/prompt_for_generation.py
@@ -1,131 +1,108 @@
-
 prompt = """
-You are an expert Python programmer specializing in NVIDIA Triton kernels, specifically targeting **AMD GPUs using the ROCm environment**.
-Your task is to generate a Python code snippet containing a Triton kernel based on the following request:
+You are an expert Python programmer specializing in Triton kernels for **AMD GPUs (ROCm)**.
+Generate a **single, complete, syntactically-correct Python code block** that implements the requested kernel.
 
 **Target Platform:** AMD GPU (ROCm)
 
 **Request:**
 {instruction}
 
-**CRITICAL FUNCTION INFORMATION:**
-Based on analysis, the implementation requires these EXACT function signatures:
+**CRITICAL FUNCTION INFORMATION (do NOT change):**
+Use EXACTLY the following function signatures:
 {function_signatures}
 
-**Output Requirements:**
-1.  **AMD Compatibility:** Generate code compatible with AMD GPUs and ROCm. **DO NOT use CUDA-specific features or functions (e.g., `tl.libdevice`).**
-2.  **Complete Code:** Generate a single, complete, and syntactically correct Python code block.
-3.  **Triton Kernel:** The core logic must be implemented within a Triton kernel function decorated with `@triton.jit`.
-4.  **Imports:** ALWAYS include necessary imports at the beginning:
-    ```python
-    import torch
-    import triton
-    import triton.language as tl
-    # import math # Only if standard math functions are truly needed outside the kernel
-    ```
-    Include other imports *only if absolutely necessary*.
-5.  **Function Signature (CRITICAL):**
-    *   Define EACH function with EXACTLY the signature shown above.
-    *   DO NOT change parameter names, counts, or order.
-    *   Ensure all parameters in function calls match their function definitions.
-    *   **Type Hints:** Use PyTorch tensor type hints (e.g., `x: torch.Tensor`) for tensor arguments. **DO NOT use `tl.pointer`**. Use standard Python types (e.g., `int`, `float`) or `tl.constexpr` for others.
-    *   **`constexpr`:** Use `tl.constexpr` **ONLY** for arguments that *must* be known at compile time, typically block sizes (like `BLOCK_SIZE`, `BLOCK_M`) or flags that change the kernel's structure (like `IS_EVEN_K`). Simple numerical values like `eps` or `dropout_p` are usually *not* `constexpr`.
-6.  **Data Types:** Be precise with data types inside the kernel (e.g., `tl.float16`, `tl.float32`, `tl.int32`). Ensure type compatibility. Assume input tensors might be `torch.float16` or `torch.float32` unless specified otherwise. Pay attention to potential type promotion/conversion needs (e.g., using `.to(tl.float32)` for accumulations).
-7.  **Triton Operations:**
-    *   Use Triton language functions correctly (`tl.load`, `tl.store`, `tl.dot`, `tl.arange`, `tl.program_id`, `tl.where`, `tl.atomic_cas`, etc.).
-    *   **Pointers & Masks:** Be extremely careful when constructing pointers using offsets and strides. Ensure masks in `tl.load`/`tl.store` are correctly computed and match pointer dimensions. Avoid `ValueError: Mask argument cannot be block type...` or `ValueError: Unsupported ptr type...`.
-    *   **`tl.dot`:** Ensure inputs are 2D blocks and have compatible types (e.g., float16, bfloat16). Int32 is generally not supported directly as input.
-    *   **`tl.arange`:** Arguments `start` and `end` **must be `tl.constexpr`**.
-    *   **Math:** Use functions from `tl.math` where available (e.g., `tl.math.exp`, `tl.math.sqrt`). Check function existence; avoid assuming functions like `tanh` or `log1p` exist if they don't in `tl.math`.
-8.  **Triton Version:** Assume Triton version 3.1.0 or later.
+**Hard Requirements (MUST follow):**
+1) **ROCm-only:** Do NOT use CUDA-only features (e.g., `tl.libdevice`, CUDA streams/APIs).
+2) **Single full code block:** The output must be one fenced Python block containing:
+   - Required imports only:
+     ```python
+     import torch
+     import triton
+     import triton.language as tl
+     # import math  # only if math.* is used outside kernels
+     ```
+   - The public Python function(s) with EXACT signatures from above.
+   - One or more `@triton.jit` kernels that the public function(s) launch.
+3) **Signatures locked:** Do not change function names, parameter names, counts, or order.
+   - Calls must match definitions 1:1 (no missing or extra args).
+   - Use `torch.Tensor` type hints for tensor params; use Python scalars or `tl.constexpr` for compile-time meta-params only.
+4) **Types & numerics:**
+   - Be explicit with Triton dtypes (`tl.float16`, `tl.float32`, `tl.int32`, etc.).
+   - For reductions / accumulations, prefer `tl.float32` and cast back to output dtype when storing.
+   - Avoid unsupported math; if needed, use `tl.math` (e.g., `tl.math.exp`, `tl.math.sqrt`).
+5) **Triton ops & shapes:**
+   - Use `tl.load` / `tl.store` with correct pointer arithmetic and **matching mask shapes**; avoid OOB.
+   - `tl.arange` bounds must be `tl.constexpr`.
+   - `tl.dot` inputs must be 2D tiles with supported dtypes (fp16/bf16 to fp32 accumulate).
+6) **Grid & program IDs:**
+   - Make grid dimensionality consistent with how you read `tl.program_id(n)`.
+   - If grid is 1D, do not read a second program dimension.
+7) **Triton version:** Assume Triton >= 3.1.0.
 
-**FINAL VERIFICATION:**
-Before completing, verify:
-1. ALL functions defined in the code have EXACT signatures matching the required function signatures above.
-2. ALL function calls exactly match their definitions in terms of parameter counts and names.
-3. No functions are called without being defined.
-4. No parameters are missing from your implementations.
+**Final Self-Check (before you finish):**
+- [ ] All public functions exist and match EXACT signatures listed above.
+- [ ] Every call matches its callee’s params (names/order/count).
+- [ ] No undefined names; no missing imports; no placeholder variables left.
+- [ ] All pointer masks are correctly shaped; no rank mismatches.
+- [ ] Code compiles for ROCm (no CUDA-only APIs).
 
 **Generated AMD ROCm Compatible Triton Kernel Code:**
 """
-
 prompt_rocm = """
-You are an expert Python programmer specializing in NVIDIA Triton kernels, specifically targeting **AMD GPUs using the ROCm environment**.
-Your task is to generate a Python code snippet containing a Triton kernel based on the following request:
+You are an expert Python programmer specializing in Triton kernels for **AMD GPUs (ROCm)**.
+Generate a **single, complete, syntactically-correct Python code block** that implements the requested kernel with attention to performance.
 
 **Target Platform:** AMD GPU (ROCm)
 
 **Request:**
 {instruction}
 
-**CRITICAL FUNCTION INFORMATION:**
-Based on analysis, the implementation requires these EXACT function signatures:
+**CRITICAL FUNCTION INFORMATION (do NOT change):**
+Use EXACTLY the following function signatures:
 {function_signatures}
 
-**Output Requirements:**
-1.  **AMD Compatibility:** Generate code compatible with AMD GPUs and ROCm. **DO NOT use CUDA-specific features or functions (e.g., `tl.libdevice`).**
-2.  **Complete Code:** Generate a single, complete, and syntactically correct Python code block.
-3.  **Triton Kernel:** The core logic must be implemented within a Triton kernel function decorated with `@triton.jit`.
-4.  **Imports:** ALWAYS include necessary imports at the beginning:
-    ```python
-    import torch
-    import triton
-    import triton.language as tl
-    # import math # Only if standard math functions are truly needed outside the kernel
-    ```
-    Include other imports *only if absolutely necessary*.
-5.  **Function Signature (CRITICAL):**
-    *   Define EACH function with EXACTLY the signature shown above.
-    *   DO NOT change parameter names, counts, or order.
-    *   Ensure all parameters in function calls match their function definitions.
-    *   **Type Hints:** Use PyTorch tensor type hints (e.g., `x: torch.Tensor`) for tensor arguments. **DO NOT use `tl.pointer`**. Use standard Python types (e.g., `int`, `float`) or `tl.constexpr` for others.
-    *   **`constexpr`:** Use `tl.constexpr` **ONLY** for arguments that *must* be known at compile time, typically block sizes (like `BLOCK_SIZE`, `BLOCK_M`) or flags that change the kernel's structure (like `IS_EVEN_K`). Simple numerical values like `eps` or `dropout_p` are usually *not* `constexpr`.
-6.  **Data Types:** Be precise with data types inside the kernel (e.g., `tl.float16`, `tl.float32`, `tl.int32`). Ensure type compatibility. Assume input tensors might be `torch.float16` or `torch.float32` unless specified otherwise. Pay attention to potential type promotion/conversion needs (e.g., using `.to(tl.float32)` for accumulations).
-7.  **Triton Operations:**
-    *   Use Triton language functions correctly (`tl.load`, `tl.store`, `tl.dot`, `tl.arange`, `tl.program_id`, `tl.where`, `tl.atomic_cas`, etc.).
-    *   **Pointers & Masks:** Be extremely careful when constructing pointers using offsets and strides. Ensure masks in `tl.load`/`tl.store` are correctly computed and match pointer dimensions. Avoid `ValueError: Mask argument cannot be block type...` or `ValueError: Unsupported ptr type...`.
-    *   **`tl.dot`:** Ensure inputs are 2D blocks and have compatible types (e.g., float16, bfloat16). Int32 is generally not supported directly as input.
-    *   **`tl.arange`:** Arguments `start` and `end` **must be `tl.constexpr`**.
-    *   **Math:** Use functions from `tl.math` where available (e.g., `tl.math.exp`, `tl.math.sqrt`). Check function existence; avoid assuming functions like `tanh` or `log1p` exist if they don't in `tl.math`.
-8.  **Triton Version:** Assume Triton version 3.2.0 or later.
-9.  Maximize performance by exploring the following:
-i. Autotuning key parameters BLOCK_SIZE, num_stages, num_warps. 
-ii. Better algorithmic implementation (e.g., naive softmax vs online softmax vs fused softmax), better memory access patterns and numerical stability. 
-iii. exploring all possible operator fusion strategies within the kernel while adhering to resource constraints.
-Primary Autotuning Fields (Mandatory)
-1. BLOCK_M, BLOCK_N, BLOCK_K
-   * Tile sizes for GEMM or other tensor contractions.
-   * Larger blocks improve compute density, but reduce grid-level parallelism.
-   * Explore wide range of values like:
-     * BLOCK: [32, ..., 128, ..., 2048, ...] 
-   * Adjust based on memory reuse and L2 cache locality.
-2. num_stages=n
-   * Controls pipeline depth for kernel execution.
-   * Rules for setting this:
-     * 1 if no GEMM.
-     * 2 if a single GEMM (e.g., GEMM + ReLU).
-     * 1 if two GEMMs are fused (e.g., Flash Attention).
-   * Optimize for latency and execution overlap.
-3. num_warps
-    * Controls number of warps (groups of 64 threads) to launch per block.
-    * If it is too low then underutilization -> kernel runs slow.
-    * If it is too high then register spill happens and shared memory is overused -> kernel runs slow.
-    * You must choose a sweet spot by trying out integer range of 1 to 16.
-    * You MUST NOT try the range beyond 16, it is NOT VALID. 
-Examples of Autotuning Setup
-Here's how Triton kernels should be decorated to allow autotuning:
-    * key argument indicates the variables that change and trigger autotune to re-run. This is a must argument and you must not miss this.
-    * BLOCK_M refers to the chunk of variable M that will be used for compute by a thread at a time.
-    * You must ensure that variables used in the triton.Config should not be passed as arguments to the triton kernel.
-For example: the following autotune config receives BLOCK_SIZE_M, BLOCK_SIZE_N, BLOCK_SIZE_K, GROUP_SIZE_M, num_warps, and num_stages as input arguments. Hence the triton kernel must not receive these arguments as inputs in the wrapper function. You must comment/delete any such instances.
-
-NOTE: If you face kernel timeout issues, check if Grid and Program ID Mismatch exists or not for example The kernel is launched with a 1-dimensional (1D) grid, but inside the kernel, it attempts to read program IDs from a 2-dimensional (2D) grid etc.
+**Hard Requirements (MUST follow):**
+1) **ROCm-only:** Do NOT use CUDA-only features (e.g., `tl.libdevice`, CUDA streams/APIs).
+2) **Single full code block:** The output must be one fenced Python block containing:
+   - Required imports only:
+     ```python
+     import torch
+     import triton
+     import triton.language as tl
+     # import math  # only if math.* is used outside kernels
+     ```
+   - The public Python function(s) with EXACT signatures from above.
+   - One or more `@triton.jit` kernels that the public function(s) launch.
+3) **Signatures locked:** Do not change function names, parameter names, counts, or order.
+   - Calls must match definitions 1:1 (no missing or extra args).
+   - Use `torch.Tensor` type hints for tensor params; use Python scalars or `tl.constexpr` for compile-time meta-params only.
+4) **Types & numerics:**
+   - Be explicit with Triton dtypes (`tl.float16`, `tl.float32`, `tl.int32`, etc.).
+   - Prefer `tl.float32` accumulation for numerical stability; cast to output dtype on store.
+   - Use `tl.math` where appropriate (e.g., `tl.math.exp`, `tl.math.sqrt`).
+5) **Triton ops & memory:**
+   - Use `tl.load`/`tl.store` with correct pointer arithmetic and **proper masks**; avoid OOB and mask–ptr rank mismatch.
+   - `tl.arange` bounds must be `tl.constexpr`.
+   - Ensure coalesced accesses and avoid bank conflicts; use tiling/blocking for reuse.
+   - `tl.dot` only with supported dtypes (e.g., fp16/bf16), accumulate in fp32.
+6) **Grid & program IDs:**
+   - Keep grid dimensionality consistent with accesses to `tl.program_id(n)`. Do not read a second dimension if the launch grid is 1D.
+7) **Autotuning for performance (where meaningful):**
+   - Provide Triton autotune configs exploring:
+     - BLOCK sizes (e.g., `BLOCK_M`, `BLOCK_N`, `BLOCK_K`) across a reasonable range (e.g., 32..256 or problem-appropriate).
+     - `num_warps` in [1..16] (do not exceed 16).
+     - `num_stages` with typical values {1, 2, 3}, chosen by fusion depth and pipeline overlap.
+   - Ensure meta-params used in `triton.Config` are NOT passed as kernel runtime arguments.
+   - Favor coalescing, occupancy, and cache locality; reduce register pressure to avoid spills.
+8) **Triton version:** Assume Triton >= 3.2.0.
 
-def grid(args: dict[str, Any]) -> tuple[int]:
-    # This creates a 1D grid of size (C * D, )
-    return (triton.cdiv(M, args["BLOCK_SIZE_M"]) * triton.cdiv(N, args["BLOCK_SIZE_N"]), )
+**Final Self-Check (before you finish):**
+- [ ] All public functions exist and match EXACT signatures listed above.
+- [ ] Every call matches its callee’s params (names/order/count).
+- [ ] No undefined names; no missing imports; no placeholder variables left.
+- [ ] Pointer masks correctly shaped; grid dims match `tl.program_id` usage.
+- [ ] No CUDA-only APIs; valid on ROCm.
+- [ ] If autotune is provided, configs are sensible and compile-time meta-params are `tl.constexpr`.
 
-The grid is calculated as a single integer, creating a 1D grid, however the kernel might try to get two separate program IDs, pid_m and pid_n, as if it were a 2D grid:
-pid_m = tl.program_id(0)  # Gets the ID for the first dimension
-pid_n = tl.program_id(1)  # Tries to get ID for a non-existent second dimension
-"""
\ No newline at end of file
+**Generated AMD ROCm Compatible Triton Kernel Code:**
+"""
diff --git a/src/prompts/prompt_for_reflection.py b/src/prompts/prompt_for_reflection.py
index fe3f936..852838a 100644
--- a/src/prompts/prompt_for_reflection.py
+++ b/src/prompts/prompt_for_reflection.py
@@ -1,285 +1,166 @@
+# Create a clean, ASCII-only Python file containing the 6 prompt strings.
+
+
 prompt = """
-You are an expert in writing Triton operators for efficient GPU programming. Analyze the failed test cases and provide insights 
-on why the solution failed and how it could be improved. Be specific about the issues found.
+You are an expert in writing Triton operators for efficient GPU programming. Analyze the failed test cases and explain
+why the solution failed and how to improve it. Be concrete and specific.
 
 **Original problem:**
-
 {problem}
 
 **Attempted solution:**
-
 {solution}
 
-**Test results:**
-
+**Test results / error logs:**
 {test_result}
 
-**Important Instructions:**
-- Think before writing the reflection and no more explanation is required after the reflection.
-- You should not suggest changes to the name of the function.
-- generate the reflection wrapped in a code block with the tag `reflection`, e.g.
-"```markdown<your reflections>```"
-
+**Reflection requirements (MUST address succinctly):**
+- Root causes: shapes/strides/dtypes/masks, grid dimension vs tl.program_id, tl.arange constexpr bounds, pointer-mask rank match, out-of-bounds.
+- Triton specifics: correct use of tl.load/tl.store, reduction correctness (associativity, fp32 accumulation), correct tile shapes for tl.dot.
+- Numerical stability: overflow/underflow, eps handling, NaN/Inf creation paths.
+- Concrete fixes: point to exact lines/sections and describe minimal code changes; do NOT change function names.
+- Quick checklist: signatures exact; calls match definitions; no undefined names; grid dims consistent; no CUDA-only APIs.
+
+**Output format:**
+Wrap your reflection ONLY in a fenced block with the tag reflection, for example:
+three backticks + reflection
+<your analysis and fixes>
+three backticks
+No extra commentary outside the block.
 """
 
 prompt_exe = """
-You are an expert in writing Triton operators for efficient GPU programming. Analyze the failed test cases and provide insights 
-on why the solution failed and how it could be improved. Be specific about the issues found.
-Runnable test is used to test if the code can be successfully executed.
-Correctness test is used to test if the output of the code is correct, i.e. if the code does implement the functionality required in the original problem.
+You are an expert in writing Triton operators for efficient GPU programming. Analyze the failed tests and explain
+why the solution failed and how to fix it. Distinguish between runnability and correctness issues.
 
-**Original problem:**
+Runnable test checks whether the code executes without crashing.
+Correctness test checks whether outputs match the required functionality.
 
+**Original problem:**
 {problem}
 
 **Attempted solution:**
-
 {solution}
 
 **Results for runnable test:**
-
 {call_test_result}
 
 **Results for correctness test:**
-
 {exe_test_result}
 
-**Important Instructions:**
-- Think before writing the reflection and no more explanation is required after the reflection.
-- You should not suggest changes to the name of the function.
-- generate the reflection wrapped in a code block with the tag `reflection`, e.g.
-"```markdown<your reflections>```"
+**Reflection requirements (MUST address):**
+- Runnable failures: compilation errors, undefined names, illegal masks, grid/program_id mismatch, invalid tl.arange bounds, CUDA-only usage.
+- Correctness failures: wrong indexing/strides, mask shape mismatch, reduction order/precision, dtype cast issues, boundary conditions.
+- Triton check: tl.load/tl.store masks match pointer shapes; tile sizes align with problem sizes; 1D vs 2D grid used consistently.
+- Minimal, surgical fixes while preserving function names and signatures.
+- Final checklist: signatures exact; call sites 1:1; no placeholders left; numerically stable.
 
+**Output format:**
+Return ONLY a fenced reflection block with your analysis and proposed fixes.
 """
 
 prompt_ga = """
-You are an expert in writing Triton operators for efficient GPU programming. 
-Analyze this Triton code and its performance(latency in ms and efficiency in TFLOPS or GB/s), and give a summary about the optimization strategy that the code uses.
-Provide insights on how to generate a new code with better performance. 
-You can use optimization strategies such as Memory access efficiency, Hardware resource utilization, IR analysis, Assembly analysis, Kernel occupancy, 
-TorchInductor with Triton tuning knobs and Auto-tunable kernel configurations and environment variables.
+You are an expert in writing Triton operators for efficient GPU programming.
+Analyze this Triton code and its performance (latency in ms and efficiency in TFLOPS or GB/s). Summarize the current optimization strategy,
+identify bottlenecks, and propose concrete steps to achieve better performance.
 
 **Original problem:**
-
 {problem}
-   
-**Triton code:**
 
+**Triton code:**
 {code}
 
 **Test results:**
-
-latency: {latency}"
-
-efficiency(TFLOPS, GB/s): {efficiency}
-
-**Important Instructions:**
-- Think before writing the optimization and no more explanation is required after the reflection.
-- You should not suggest changes to the name of the function and parameter names, counts, or order.
-- generate the reflection wrapped in a code block with the tag `reflection`, e.g.
-"```markdown<your reflections>```"
-
+latency (ms): {latency}
+efficiency (TFLOPS / GB/s): {efficiency}
+
+**Reflection requirements (MUST address):**
+- Current strategy: tiling/blocking, vectorization, memory access pattern (coalescing, reuse), use of shared/LDS, reduction scheme.
+- Bottlenecks: occupancy (num_warps/num_stages), register pressure/spills, bank conflicts, uncoalesced loads/stores, synchronization overhead.
+- Math precision: fp16/bf16 inputs with fp32 accumulation where needed; stability considerations.
+- Concrete tuning plan: propose 6-12 autotune configs (BLOCK_* / num_warps 1..16 / num_stages 1..3) and expected trade-offs.
+- Actionable changes: 3-5 prioritized edits (e.g., tile sizes, prefetching, pointer arithmetic, mask shaping, software pipelining).
+
+**Constraints:**
+- Do NOT suggest changing function names or parameter lists/order.
+- Output ONLY a fenced reflection block.
 """
 
 prompt_rocm = """
-You are an expert in writing Triton operators for efficient GPU programming. Analyze the failed test cases and provide insights 
-on why the solution failed and how it could be improved. Be specific about the issues found.
+You are an expert in writing Triton operators for efficient GPU programming on AMD ROCm. Analyze the failed test cases and explain
+why the solution failed and how to improve it, focusing on ROCm compatibility and Triton best practices.
 
 **Original problem:**
-
 {problem}
 
 **Attempted solution:**
-
 {solution}
 
-**Test results:**
-
+**Test results / error logs:**
 {test_result}
 
-**Important Instructions:**
-- Think before writing the reflection and no more explanation is required after the reflection.
-- You should not suggest changes to the name of the function.
-- generate the reflection wrapped in a code block with the tag `reflection`, e.g.
-"```markdown<your reflections>```"
-
-Maximize performance by exploring the following:
-i. Autotuning key parameters BLOCK_SIZE, num_stages, num_warps. 
-ii. Better algorithmic implementation (e.g., naive softmax vs online softmax vs fused softmax), better memory access patterns and numerical stability. 
-iii. exploring all possible operator fusion strategies within the kernel while adhering to resource constraints.
-Primary Autotuning Fields (Mandatory)
-1. BLOCK_M, BLOCK_N, BLOCK_K
-   * Tile sizes for GEMM or other tensor contractions.
-   * Larger blocks improve compute density, but reduce grid-level parallelism.
-   * Explore wide range of values like:
-     * BLOCK: [32, ..., 128, ..., 2048, ...] 
-   * Adjust based on memory reuse and L2 cache locality.
-2. num_stages=n
-   * Controls pipeline depth for kernel execution.
-   * Rules for setting this:
-     * 1 if no GEMM.
-     * 2 if a single GEMM (e.g., GEMM + ReLU).
-     * 1 if two GEMMs are fused (e.g., Flash Attention).
-   * Optimize for latency and execution overlap.
-3. num_warps
-    * Controls number of warps (groups of 64 threads) to launch per block.
-    * If it is too low then underutilization -> kernel runs slow.
-    * If it is too high then register spill happens and shared memory is overused -> kernel runs slow.
-    * You must choose a sweet spot by trying out integer range of 1 to 16.
-    * You MUST NOT try the range beyond 16, it is NOT VALID. 
-Examples of Autotuning Setup
-Here's how Triton kernels should be decorated to allow autotuning:
-    * key argument indicates the variables that change and trigger autotune to re-run. This is a must argument and you must not miss this.
-    * BLOCK_M refers to the chunk of variable M that will be used for compute by a thread at a time.
-    * You must ensure that variables used in the triton.Config should not be passed as arguments to the triton kernel.
-For example: the following autotune config receives BLOCK_SIZE_M, BLOCK_SIZE_N, BLOCK_SIZE_K, GROUP_SIZE_M, num_warps, and num_stages as input arguments. Hence the triton kernel must not receive these arguments as inputs in the wrapper function. You must comment/delete any such instances.
-
-NOTE: If you face kernel timeout issues, check if Grid and Program ID Mismatch exists or not for example The kernel is launched with a 1-dimensional (1D) grid, but inside the kernel, it attempts to read program IDs from a 2-dimensional (2D) grid etc.
-
-def grid(args: dict[str, Any]) -> tuple[int]:
-    # This creates a 1D grid of size (C * D, )
-    return (triton.cdiv(M, args["BLOCK_SIZE_M"]) * triton.cdiv(N, args["BLOCK_SIZE_N"]), )
-
-The grid is calculated as a single integer, creating a 1D grid, however the kernel might try to get two separate program IDs, pid_m and pid_n, as if it were a 2D grid:
-pid_m = tl.program_id(0)  # Gets the ID for the first dimension
-pid_n = tl.program_id(1)  # Tries to get ID for a non-existent second dimension
+**Reflection requirements (MUST address):**
+- ROCm-specific pitfalls: any CUDA-only intrinsics/APIs, unsupported libdevice calls, wavefront-size assumptions.
+- Grid/ID: ensure launch grid dimensionality matches tl.program_id usage; avoid reading non-existent dimensions.
+- Triton semantics: tl.arange bounds as tl.constexpr; pointer arithmetic and masks rank alignment; out-of-bounds prevention.
+- Numerics: fp32 accumulation for reductions; stability (log-sum-exp, eps).
+- Concrete code fixes (no function renames), with brief line/section references.
+- Final checklist: signatures exact; calls match defs; no undefined names; ROCm-compatible; masks correct.
+
+**Output format:**
+Only return a fenced reflection block with your analysis and fixes.
 """
 
 prompt_exe_rocm = """
-You are an expert in writing Triton operators for efficient GPU programming. Analyze the failed test cases and provide insights 
-on why the solution failed and how it could be improved. Be specific about the issues found.
-Runnable test is used to test if the code can be successfully executed.
-Correctness test is used to test if the output of the code is correct, i.e. if the code does implement the functionality required in the original problem.
+You are an expert in writing Triton operators for efficient GPU programming on AMD ROCm. Analyze the failed tests and clearly separate
+runnability issues from correctness issues. Provide precise fixes without renaming functions.
 
-**Original problem:**
+Runnable test verifies successful execution. Correctness test verifies functional equivalence.
 
+**Original problem:**
 {problem}
 
 **Attempted solution:**
-
 {solution}
 
 **Results for runnable test:**
-
 {call_test_result}
 
 **Results for correctness test:**
-
 {exe_test_result}
 
-**Important Instructions:**
-- Think before writing the reflection and no more explanation is required after the reflection.
-- You should not suggest changes to the name of the function.
-- generate the reflection wrapped in a code block with the tag `reflection`, e.g.
-"```markdown<your reflections>```"
-
-Maximize performance by exploring the following:
-i. Autotuning key parameters BLOCK_SIZE, num_stages, num_warps. 
-ii. Better algorithmic implementation (e.g., naive softmax vs online softmax vs fused softmax), better memory access patterns and numerical stability. 
-iii. exploring all possible operator fusion strategies within the kernel while adhering to resource constraints.
-Primary Autotuning Fields (Mandatory)
-1. BLOCK_M, BLOCK_N, BLOCK_K
-   * Tile sizes for GEMM or other tensor contractions.
-   * Larger blocks improve compute density, but reduce grid-level parallelism.
-   * Explore wide range of values like:
-     * BLOCK: [32, ..., 128, ..., 2048, ...] 
-   * Adjust based on memory reuse and L2 cache locality.
-2. num_stages=n
-   * Controls pipeline depth for kernel execution.
-   * Rules for setting this:
-     * 1 if no GEMM.
-     * 2 if a single GEMM (e.g., GEMM + ReLU).
-     * 1 if two GEMMs are fused (e.g., Flash Attention).
-   * Optimize for latency and execution overlap.
-3. num_warps
-    * Controls number of warps (groups of 64 threads) to launch per block.
-    * If it is too low then underutilization -> kernel runs slow.
-    * If it is too high then register spill happens and shared memory is overused -> kernel runs slow.
-    * You must choose a sweet spot by trying out integer range of 1 to 16.
-    * You MUST NOT try the range beyond 16, it is NOT VALID. 
-Examples of Autotuning Setup
-Here's how Triton kernels should be decorated to allow autotuning:
-    * key argument indicates the variables that change and trigger autotune to re-run. This is a must argument and you must not miss this.
-    * BLOCK_M refers to the chunk of variable M that will be used for compute by a thread at a time.
-    * You must ensure that variables used in the triton.Config should not be passed as arguments to the triton kernel.
-For example: the following autotune config receives BLOCK_SIZE_M, BLOCK_SIZE_N, BLOCK_SIZE_K, GROUP_SIZE_M, num_warps, and num_stages as input arguments. Hence the triton kernel must not receive these arguments as inputs in the wrapper function. You must comment/delete any such instances.
-
-NOTE: If you face kernel timeout issues, check if Grid and Program ID Mismatch exists or not for example The kernel is launched with a 1-dimensional (1D) grid, but inside the kernel, it attempts to read program IDs from a 2-dimensional (2D) grid etc.
-
-def grid(args: dict[str, Any]) -> tuple[int]:
-    # This creates a 1D grid of size (C * D, )
-    return (triton.cdiv(M, args["BLOCK_SIZE_M"]) * triton.cdiv(N, args["BLOCK_SIZE_N"]), )
-
-The grid is calculated as a single integer, creating a 1D grid, however the kernel might try to get two separate program IDs, pid_m and pid_n, as if it were a 2D grid:
-pid_m = tl.program_id(0)  # Gets the ID for the first dimension
-pid_n = tl.program_id(1)  # Tries to get ID for a non-existent second dimension
+**Reflection requirements (MUST address):**
+- ROCm runnability: remove CUDA-only features; ensure tl.arange constexpr; fix grid vs program_id dimensionality; valid masks/pointers.
+- Correctness: tile sizes vs shapes, stride math, dtype casts, fp32 accumulation, boundary handling, reduction associativity.
+- Autotuning readiness: meta-params used in triton.Config must NOT be runtime kernel args; ensure compile-time tl.constexpr.
+- Provide minimal edits (no function renames) and a quick validation checklist.
+
+**Output format:**
+Respond ONLY with a fenced reflection block.
 """
 
 prompt_ga_rocm = """
-You are an expert in writing Triton operators for efficient GPU programming. 
-Analyze this Triton code and its performance(speedup[vs reference kernel] for e.g. (1.6x) and efficiency in TFLOPS or GB/s), and give a summary about the optimization strategy that the code uses.
-Provide insights on how to generate a new code with better performance. 
-You can use optimization strategies such as Memory access efficiency, Hardware resource utilization, IR analysis, Assembly analysis, Kernel occupancy, 
-TorchInductor with Triton tuning knobs and Auto-tunable kernel configurations and environment variables.
+You are an expert in writing Triton operators for efficient GPU programming on AMD ROCm.
+Explain the code performance (speedup vs reference and TFLOPS/GB/s), the optimization strategy used, and how to improve it further on ROCm.
 
 **Original problem:**
-
 {problem}
-   
-**Triton code:**
 
+**Triton code:**
 {code}
 
 **Test results:**
+Speedup (x): {latency}
+efficiency (TFLOPS / GB/s): {efficiency}
+
+**Reflection requirements (MUST address):**
+- Current strategy and ROCm fit: tiling/block sizes vs wavefront 64, LDS usage and padding, vectorized IO, prefetching.
+- Bottlenecks: occupancy (num_warps/num_stages), register pressure/spills, memory divergence, bank conflicts, synchronization cost.
+- Autotune proposal: 6-12 configs over BLOCK_M/N/K (or BLOCK_SIZE), num_warps in [1..16], num_stages in {1,2,3}; justify ranges for ROCm.
+- Concrete steps: 3-5 prioritized edits with expected impact (e.g., better coalescing, tile re-shaping, software pipelining, on-the-fly dequant).
+- Keep function signatures and parameter lists/order unchanged.
+
+**Output format:**
+ONLY return a fenced reflection block.
+"""
 
-Speedup: {latency}"
-
-efficiency(TFLOPS, GB/s): {efficiency}
-
-**Important Instructions:**
-- Think before writing the optimization and no more explanation is required after the reflection.
-- You should not suggest changes to the name of the function and parameter names, counts, or order.
-- generate the reflection wrapped in a code block with the tag `reflection`, e.g.
-"```markdown<your reflections>```"
-
-Maximize performance by exploring the following:
-i. Autotuning key parameters BLOCK_SIZE, num_stages, num_warps. 
-ii. Better algorithmic implementation (e.g., naive softmax vs online softmax vs fused softmax), better memory access patterns and numerical stability. 
-iii. exploring all possible operator fusion strategies within the kernel while adhering to resource constraints.
-Primary Autotuning Fields (Mandatory)
-1. BLOCK_M, BLOCK_N, BLOCK_K
-   * Tile sizes for GEMM or other tensor contractions.
-   * Larger blocks improve compute density, but reduce grid-level parallelism.
-   * Explore wide range of values like:
-     * BLOCK: [32, ..., 128, ..., 2048, ...] 
-   * Adjust based on memory reuse and L2 cache locality.
-2. num_stages=n
-   * Controls pipeline depth for kernel execution.
-   * Rules for setting this:
-     * 1 if no GEMM.
-     * 2 if a single GEMM (e.g., GEMM + ReLU).
-     * 1 if two GEMMs are fused (e.g., Flash Attention).
-   * Optimize for latency and execution overlap.
-3. num_warps
-    * Controls number of warps (groups of 64 threads) to launch per block.
-    * If it is too low then underutilization -> kernel runs slow.
-    * If it is too high then register spill happens and shared memory is overused -> kernel runs slow.
-    * You must choose a sweet spot by trying out integer range of 1 to 16.
-    * You MUST NOT try the range beyond 16, it is NOT VALID. 
-Examples of Autotuning Setup
-Here's how Triton kernels should be decorated to allow autotuning:
-    * key argument indicates the variables that change and trigger autotune to re-run. This is a must argument and you must not miss this.
-    * BLOCK_M refers to the chunk of variable M that will be used for compute by a thread at a time.
-    * You must ensure that variables used in the triton.Config should not be passed as arguments to the triton kernel.
-For example: the following autotune config receives BLOCK_SIZE_M, BLOCK_SIZE_N, BLOCK_SIZE_K, GROUP_SIZE_M, num_warps, and num_stages as input arguments. Hence the triton kernel must not receive these arguments as inputs in the wrapper function. You must comment/delete any such instances.
-
-NOTE: If you face kernel timeout issues, check if Grid and Program ID Mismatch exists or not for example The kernel is launched with a 1-dimensional (1D) grid, but inside the kernel, it attempts to read program IDs from a 2-dimensional (2D) grid etc.
-
-def grid(args: dict[str, Any]) -> tuple[int]:
-    # This creates a 1D grid of size (C * D, )
-    return (triton.cdiv(M, args["BLOCK_SIZE_M"]) * triton.cdiv(N, args["BLOCK_SIZE_N"]), )
-
-The grid is calculated as a single integer, creating a 1D grid, however the kernel might try to get two separate program IDs, pid_m and pid_n, as if it were a 2D grid:
-pid_m = tl.program_id(0)  # Gets the ID for the first dimension
-pid_n = tl.program_id(1)  # Tries to get ID for a non-existent second dimension
-"""
\ No newline at end of file