From 697e7addcf4c1ff5679b84f65daf7e8b3fd16a7a Mon Sep 17 00:00:00 2001
From: Farnaz Kohankhaki
Date: Fri, 6 Jun 2025 00:24:33 -0700
Subject: [PATCH 1/5] added trimat_forward cuda kernel.

---
 .../figures/trimat_forward/kernel1.svg        |   4 +
 .../figures/trimat_forward/kernel2.svg        |   4 +
 .../figures/trimat_forward/kernel4.svg        |   4 +
 .../src/cuda/kernels/trimat_forward.md        | 146 +++++++++++++++++-
 4 files changed, 156 insertions(+), 2 deletions(-)
 create mode 100644 books/compute/src/cuda/kernels/figures/trimat_forward/kernel1.svg
 create mode 100644 books/compute/src/cuda/kernels/figures/trimat_forward/kernel2.svg
 create mode 100644 books/compute/src/cuda/kernels/figures/trimat_forward/kernel4.svg

diff --git a/books/compute/src/cuda/kernels/figures/trimat_forward/kernel1.svg b/books/compute/src/cuda/kernels/figures/trimat_forward/kernel1.svg
new file mode 100644
index 0000000..1138b3b
--- /dev/null
+++ b/books/compute/src/cuda/kernels/figures/trimat_forward/kernel1.svg
@@ -0,0 +1,4 @@
+[SVG text: blocks 0,0,0 and 0,0,1 of the output, each tiled by threads
+t0,0 t0,1 ... t0,15 and t1,0 ... t15,0, with axes labelled T, NH, and B]
diff --git a/books/compute/src/cuda/kernels/figures/trimat_forward/kernel2.svg b/books/compute/src/cuda/kernels/figures/trimat_forward/kernel2.svg
new file mode 100644
index 0000000..ec49bd9
--- /dev/null
+++ b/books/compute/src/cuda/kernels/figures/trimat_forward/kernel2.svg
@@ -0,0 +1,4 @@
+[SVG text: same block/thread layout as kernel1.svg, plus per-thread register
+labels lhs[8], rhs[8], val[8][8]]
diff --git a/books/compute/src/cuda/kernels/figures/trimat_forward/kernel4.svg b/books/compute/src/cuda/kernels/figures/trimat_forward/kernel4.svg
new file mode 100644
index 0000000..81acfe6
--- /dev/null
+++ b/books/compute/src/cuda/kernels/figures/trimat_forward/kernel4.svg
@@ -0,0 +1,4 @@
+[SVG text: block/thread layout plus shared-memory Q and K tiles (rows
+0,0 ... 127,0, dims 0,0 ... 0,31) with load and compute arrows]
diff --git a/books/compute/src/cuda/kernels/trimat_forward.md b/books/compute/src/cuda/kernels/trimat_forward.md
index 0fbc7aa..55a0b14 100644
--- a/books/compute/src/cuda/kernels/trimat_forward.md
+++ b/books/compute/src/cuda/kernels/trimat_forward.md
@@ -1,11 +1,153 @@
-# Kernels for triangular matmul forward pass
+
+
+# Kernels for Triangular Matrix Multiplication (Trimat) Forward Pass
+
 {{ #aipr_header }}
 
+## Introduction
+
+This guide demonstrates efficient GPU implementations of **triangular matrix multiplication**, as used in **causal self-attention** in autoregressive transformer models.
+For **causal (autoregressive) attention**, we only need the **lower triangle** of the attention matrix; that is, each token should only attend to current and previous tokens.
+
+Computing the full matrix is wasteful when only the lower triangle is needed.
+Triangular matrix multiplication is a specialized form of matrix multiplication where, instead of computing the full output matrix, only the **lower triangle** is computed. This leads to substantial computational savings.
+ +This guide explains a series of CUDA kernel implementations for the **Trimat forward pass**, based on the [llm.c](https://github.com/karpathy/llm.c/tree/master/dev/cuda) GitHub repository. +These kernels avoid unnecessary computation and offer potential speedups over cuBLAS. These kernels are introduced in increasing order of optimization: + +- [Kernel 1: `matmul_tri_naive`](#kernel-1-naive-implementation-matmul_tri_naive): a simple nested loop implementation with no memory optimization. +- [Kernel 2: `matmul_tri_registers`](#kernel-2-register-tiling): uses **register tiling** to reduce redundant memory loads. +- [Kernel 3: `matmul_tri3`](#kernel-3-vectorized-loads): adds **vectorized memory access** using `float4` to improve memory coalescing. +- [Kernel 4: `matmul_tri4`](#kernel-4-shared-memory-tiling-matmul_tri4): leverages **shared memory** tiling for inter-thread data reuse and further performance gains. + +The next section, [Input, Output, and Computation](#input-output-and-computation), describes the tensor shapes, the configuration used for benchmarking, and the exact computation performed during the Trimat forward pass. + +## Input, Output, and Computation + +This section describes the structure of the input/output tensors and the computation performed by the Trimat kernels. + +### Input Tensor + +The input tensor packs queries and keys (and values, though unused here) in the shape: + +(B, T, 3, NH, HS) + +Where: + +- B: Batch size +- T: Sequence length +- 3: Stacked Query, Key, and Value vectors +- NH: Number of attention heads +- HS: Head size (HS = C / NH, where C is the total channel size) + +Only the Q and K portions of the input are used in this computation. + +### Output Tensor + +The output tensor has shape: + +(B, NH, T, T) + +Each output slice [b, nh] contains the attention scores for batch b and head nh. +Values above the diagonal (i.e., when a token would attend to a future token) are ignored or masked (e.g., set to NaN). + +### Configuration Used + +For benchmarking and validation, the implementation uses the following fixed configuration: + +- B = 8 (batch size) +- T = 1024 (sequence length) +- C = 768 (total channels) +- NH = 12 (number of heads) +- HS = 64 (head size = C / NH) + +### Computation Goal + +The goal is to compute the scaled dot-product attention score between queries and keys: + +$$\text{out}[b][h][i][j] = \frac{Q[b][i][h] \cdot K[b][j][h]}{\sqrt{\text{HS}}} \quad \text{for } j \leq i$$ + +That is, for each batch $b$, head $h$, and timestep pair $(i, j)$ such that $j <= i$, we compute the dot product between query vector $Q[b][i][h]$ and key vector $K[b][j][h]$. +The upper triangle $(j > i)$ is skipped or masked due to the causal attention constraint. + +## Kernel 1: Naive Implementation (`matmul_tri_naive`) + +This is the baseline GPU kernel, designed for clarity and correctness rather than performance. Each thread is responsible for computing an **8×8 tile** of the output attention matrix using a straightforward triple-nested loop. There are **no memory optimizations**; all reads are done directly from global memory. It is intentionally simple and mirrors a CPU-style nested loop structure to show what an unoptimized CUDA implementation looks like. 
+ +### Key Characteristics of Kernel 1 + +- **No shared memory** or caching +- **Each thread loads Q[i] and K[j]** directly from global memory +- Computes **64 dot products** per thread (8 queries × 8 keys) +- Causal masking is enforced by skipping blocks where j > i +- **Upper triangle is ignored**, though some redundant work may occur inside diagonal blocks + +Below is a visualization of how threads compute 8×8 blocks in the output matrix: + +
+  <img src="figures/trimat_forward/kernel1.svg" alt="Kernel 1 Diagram" />
+
+ +## Kernel 2: Register Tiling + +This kernel improves performance by leveraging **register tiling**. Each thread still computes an **8×8 tile** of the output, but instead of reading query and key vectors from global memory multiple times, each thread loads its Q and K vectors into registers for reuse. + +### Key Characteristics of Kernel 2 + +- One thread per **8×8 tile**, same as Kernel 1 +- Q and K values are loaded into **`float lhs[8]` and `float rhs[8]`** arrays in registers +- Loops over the head size (`HS`) to compute 64 dot products per thread +- **No shared memory**, but much better memory locality than Kernel 1 +- Still performs some redundant computation above the diagonal (ignored due to masking) +- Faster than Kernel 1 due to fewer global loads + +See **Figure 2** for a visualization of how registers are used to tile the data within a thread: + +
+  <img src="figures/trimat_forward/kernel2.svg" alt="Kernel 2 Diagram" />
+
+ +## Kernel 3: Vectorized Loads + +This kernel builds on Kernel 2 by introducing **vectorized and coalesced memory access** using `float4` loads. The goal is to improve global memory bandwidth utilization by aligning reads and writes to 16-byte boundaries. + +### Key Characteristics of Kernel 3 + +- Each thread still computes an 8×8 tile (64 dot products) +- Q and K values are loaded using `float4` for better memory coalescing +- Improves memory access patterns by reducing the number of memory transactions +- No shared memory; only register reuse + vectorized reads and writes +- Uses `ld_vec()` and `st_vec()` helper functions to safely cast pointers to `float4` +- Faster than Kernel 2 due to reduced memory traffic + +## Kernel 4: Shared Memory Tiling (`matmul_tri4`) + +This kernel introduces **shared memory tiling** to improve memory reuse across threads in a thread block. Threads collaborate to load tiles of the Q and K matrices into shared memory, significantly reducing global memory accesses. + +### Key Characteristics of Kernel 4 + +- Uses shared memory arrays: `lhs_s[128][32]`, `rhs_s[128][32]` +- 16×16 threads cooperatively load **128 rows × 32 dimensions** from Q and K into shared memory +- Computes 8×8 tiles per thread, iterating over `HS / 32` slices to accumulate dot products +- Final results are written with **vectorized `float4` stores** for efficient global memory writes + +See **Figure 4** for an illustration of shared memory tiling and accumulation: + +
+  <img src="figures/trimat_forward/kernel4.svg" alt="Kernel 4 Diagram" />
+
+ +## References + +1. [llm.c CUDA kernels](https://github.com/karpathy/llm.c/tree/master/dev/cuda) +2. [Scaled Dot-Product Attention (Vaswani et al., 2017)](https://arxiv.org/abs/1706.03762) +3. [CUDA Programming Guide: Memory Coalescing](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-coalescing) + -{{#author VectorInstitute}} +{{#author kohankhaki}} From ce4567e38f28acc8e1ff51e4636590365432baff Mon Sep 17 00:00:00 2001 From: Farnaz Kohankhaki Date: Sun, 22 Jun 2025 20:18:52 -0700 Subject: [PATCH 2/5] Fixed issues. --- .../src/cuda/kernels/trimat_forward.md | 257 +++++++++++++----- 1 file changed, 183 insertions(+), 74 deletions(-) diff --git a/books/compute/src/cuda/kernels/trimat_forward.md b/books/compute/src/cuda/kernels/trimat_forward.md index 55a0b14..a6c5e49 100644 --- a/books/compute/src/cuda/kernels/trimat_forward.md +++ b/books/compute/src/cuda/kernels/trimat_forward.md @@ -1,7 +1,4 @@ - - - - + # Kernels for Triangular Matrix Multiplication (Trimat) Forward Pass @@ -11,142 +8,254 @@ ## Introduction -This guide demonstrates efficient GPU implementations of **triangular matrix multiplication**, as used in **causal self-attention** in autoregressive transformer models. -For **causal (autoregressive) attention**, we only need the **lower triangle** of the attention matrix; that is, each token should only attend to current and previous tokens. +This pocket reference provides efficient GPU implementations of **triangular +matrix multiplication**, as used in **causal self-attention** in autoregressive +transformer models. +For **causal (autoregressive) attention**, we only need the **lower triangle** +of the attention matrix. That is, each token should only attend to current and +previous tokens. Computing the full matrix is wasteful when only the lower triangle is needed. -Triangular matrix multiplication is a specialized form of matrix multiplication where, instead of computing the full output matrix, only the **lower triangle** is computed. This leads to substantial computational savings. - -This guide explains a series of CUDA kernel implementations for the **Trimat forward pass**, based on the [llm.c](https://github.com/karpathy/llm.c/tree/master/dev/cuda) GitHub repository. -These kernels avoid unnecessary computation and offer potential speedups over cuBLAS. These kernels are introduced in increasing order of optimization: - -- [Kernel 1: `matmul_tri_naive`](#kernel-1-naive-implementation-matmul_tri_naive): a simple nested loop implementation with no memory optimization. -- [Kernel 2: `matmul_tri_registers`](#kernel-2-register-tiling): uses **register tiling** to reduce redundant memory loads. -- [Kernel 3: `matmul_tri3`](#kernel-3-vectorized-loads): adds **vectorized memory access** using `float4` to improve memory coalescing. -- [Kernel 4: `matmul_tri4`](#kernel-4-shared-memory-tiling-matmul_tri4): leverages **shared memory** tiling for inter-thread data reuse and further performance gains. - -The next section, [Input, Output, and Computation](#input-output-and-computation), describes the tensor shapes, the configuration used for benchmarking, and the exact computation performed during the Trimat forward pass. +Triangular matrix multiplication is a specialized form of matrix multiplication, +where instead of computing the full output matrix, only the **lower triangle** +is computed. This leads to substantial computational savings. 
+ +This guide explains a series of CUDA kernel implementations for the **trimat +forward pass**, based on the [llm.c](https://github.com/karpathy/llm.c/tree/master/dev/cuda) +GitHub repository. +These kernels avoid unnecessary computation and offer potential speedups over +cuBLAS. They are introduced in increasing order of optimization: + +- [Kernel 1: `matmul_tri_naive`](#kernel-1-naive-implementation-matmul_tri_naive): + A simple nested loop implementation with no memory optimization. +- [Kernel 2: `matmul_tri_registers`](#kernel-2-register-tiling-matmul_tri_registers): Uses + **register tiling** to reduce redundant memory loads. +- [Kernel 3: `matmul_tri3`](#kernel-3-vectorized-loads-matmul_tri3): Adds **vectorized + memory access** using `float4` to improve memory coalescing. +- [Kernel 4: `matmul_tri4`](#kernel-4-shared-memory-tiling-matmul_tri4): + Leverages **shared memory** tiling for inter-thread data reuse and further + performance gains. + +The next section, [Input, Output, and +Computation](#input-output-and-computation), describes the tensor shapes, the +configuration used in the examples, and the exact computation performed during +the trimat forward pass. ## Input, Output, and Computation -This section describes the structure of the input/output tensors and the computation performed by the Trimat kernels. +This section describes the structure of the input/output tensors and the +computation performed by the trimat kernels. ### Input Tensor -The input tensor packs queries and keys (and values, though unused here) in the shape: +The input tensor packs queries and keys (and values, though unused here) in +the shape: +$$ (B, T, 3, NH, HS) +$$ -Where: +where: -- B: Batch size -- T: Sequence length -- 3: Stacked Query, Key, and Value vectors -- NH: Number of attention heads -- HS: Head size (HS = C / NH, where C is the total channel size) +- \\(B\\): Batch size +- \\(T\\): Sequence length +- \\(3\\): Stacked Query, Key, and Value vectors +- \\(NH\\): Number of attention heads +- \\(HS\\): Head size, where \\(HS = C / NH\\) and \\(C\\) is the total + channel size -Only the Q and K portions of the input are used in this computation. +Only the \\(Q\\) and \\(K\\) portions of the input are used in this +computation. ### Output Tensor The output tensor has shape: +$$ (B, NH, T, T) +$$ -Each output slice [b, nh] contains the attention scores for batch b and head nh. -Values above the diagonal (i.e., when a token would attend to a future token) are ignored or masked (e.g., set to NaN). +where: -### Configuration Used +- \\(B\\): Batch size +- \\(NH\\): Number of attention heads +- \\(T\\): Sequence length (used for both dimensions of the attention + matrix) -For benchmarking and validation, the implementation uses the following fixed configuration: +Each output slice \\([b, nh]\\) contains the attention scores for batch \\(b\\) +and head \\(nh\\). +Values above the diagonal (i.e., when a token would attend to a future token) +are ignored or masked (e.g., set to NaN). 
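+
+To make these layouts concrete, here is a minimal host-side sketch of the
+row-major offset arithmetic implied by the two shapes (the helper names
+`inp_offset` and `out_offset` are illustrative, not from llm.c):
+
+```cuda
+#include <stddef.h>
+
+// Offset of inp[b][t][qkv][nh][hs] in the packed (B, T, 3, NH, HS) input,
+// where qkv = 0 selects Q, 1 selects K, and 2 selects V.
+size_t inp_offset(size_t b, size_t t, size_t qkv, size_t nh, size_t hs,
+                  size_t T, size_t NH, size_t HS) {
+    return (((b * T + t) * 3 + qkv) * NH + nh) * HS + hs;
+}
+
+// Offset of out[b][nh][i][j] in the (B, NH, T, T) output.
+size_t out_offset(size_t b, size_t nh, size_t i, size_t j,
+                  size_t NH, size_t T) {
+    return ((b * NH + nh) * T + i) * T + j;
+}
+```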
-- B = 8 (batch size) -- T = 1024 (sequence length) -- C = 768 (total channels) -- NH = 12 (number of heads) -- HS = 64 (head size = C / NH) +### Configuration Used -### Computation Goal +The configurations used in the examples are: -The goal is to compute the scaled dot-product attention score between queries and keys: +- \\(B = 8\\): Batch size +- \\(T = 1024\\): Sequence length +- \\(C = 768\\): Total channels +- \\(NH = 12\\): Number of heads +- \\(HS = 64\\): Head size, where \\(HS = C / NH\\) -$$\text{out}[b][h][i][j] = \frac{Q[b][i][h] \cdot K[b][j][h]}{\sqrt{\text{HS}}} \quad \text{for } j \leq i$$ +### Computation Goal -That is, for each batch $b$, head $h$, and timestep pair $(i, j)$ such that $j <= i$, we compute the dot product between query vector $Q[b][i][h]$ and key vector $K[b][j][h]$. -The upper triangle $(j > i)$ is skipped or masked due to the causal attention constraint. +The goal is to compute the scaled dot-product attention score between queries +and keys: + +$$ +\text{out}[b][h][i][j] = \frac{Q[b][i][h] \cdot K[b][j][h]}{\sqrt{\text{HS}}} +\quad \text{for } j \leq i +$$ + +That is, for each batch \\((b)\\), head \\((h)\\), and timestep pair \\((i, j)\\) +such that \\(j \leq i\\), we compute the dot product between query vector +\\(Q\[b\]\[i\]\[h\]\\) and key vector \\(K\[b\]\[j\]\[h\]\\). +The upper triangle \\((j > i)\\) is skipped or masked due to the causal +attention constraint. + +### Mathematical Illustration + +To illustrate what this computation is accomplishing mathematically, consider +the following example: + +Let \\(X\\) and \\(Y\\) be two 3×3 matrices. In a full matrix multiplication, +we would compute: + +$$ +Z = X \cdot Y = +\begin{bmatrix} +\sum_{i=1}^3 x_{1,i} y_{i,1} & \sum_{i=1}^3 x_{1,i} y_{i,2} & +\sum_{i=1}^3 x_{1,i} y_{i,3} \\\\ +\sum_{i=1}^3 x_{2,i} y_{i,1} & \sum_{i=1}^3 x_{2,i} y_{i,2} & +\sum_{i=1}^3 x_{2,i} y_{i,3} \\\\ +\sum_{i=1}^3 x_{3,i} y_{i,1} & \sum_{i=1}^3 x_{3,i} y_{i,2} & +\sum_{i=1}^3 x_{3,i} y_{i,3} +\end{bmatrix} +$$ + +However, in **triangular (causal) matrix multiplication**, we only compute the +**lower triangle**: + +$$ +Z_{\text{causal}} = +\begin{bmatrix} +\sum_{i=1}^3 x_{1,i} y_{i,1} & 0 & 0 \\\\ +\sum_{i=1}^3 x_{2,i} y_{i,1} & \sum_{i=1}^3 x_{2,i} y_{i,2} & 0 \\\\ +\sum_{i=1}^3 x_{3,i} y_{i,1} & \sum_{i=1}^3 x_{3,i} y_{i,2} & +\sum_{i=1}^3 x_{3,i} y_{i,3} +\end{bmatrix} +$$ + +This ensures that each row \\(i\\) only attends to columns \\(j \leq i\\), +enforcing the causal constraint. ## Kernel 1: Naive Implementation (`matmul_tri_naive`) -This is the baseline GPU kernel, designed for clarity and correctness rather than performance. Each thread is responsible for computing an **8×8 tile** of the output attention matrix using a straightforward triple-nested loop. There are **no memory optimizations**; all reads are done directly from global memory. It is intentionally simple and mirrors a CPU-style nested loop structure to show what an unoptimized CUDA implementation looks like. +This is the baseline GPU kernel, designed for clarity and correctness rather +than performance. +Each thread is responsible for computing an **8×8 tile** of the output +attention matrix using a straightforward triple-nested loop. +There are **no memory optimizations**; all reads are done directly from global +memory. +It is intentionally simple and mirrors a CPU-style nested loop structure to +show what an unoptimized CUDA implementation looks like. 
### Key Characteristics of Kernel 1 -- **No shared memory** or caching -- **Each thread loads Q[i] and K[j]** directly from global memory -- Computes **64 dot products** per thread (8 queries × 8 keys) -- Causal masking is enforced by skipping blocks where j > i -- **Upper triangle is ignored**, though some redundant work may occur inside diagonal blocks +- **No shared memory** or caching. +- **Each thread loads \\(Q[i]\\) and \\(K[j]\\)** directly from global memory. +- Computes **64 dot products** per thread (8 queries × 8 keys). +- Causal masking is enforced by skipping blocks where \\(j > i\\). +- **Upper triangle is ignored**, though some redundant work may occur inside + diagonal blocks. -Below is a visualization of how threads compute 8×8 blocks in the output matrix: +Below is a visualization of how threads compute 8×8 blocks in the output +matrix: +
-  <img src="figures/trimat_forward/kernel1.svg" alt="Kernel 1 Diagram" />
+<img src="figures/trimat_forward/kernel1.svg" alt="Kernel 1 Diagram" />
 
+
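+
+The following is a minimal sketch of this strategy, not the exact llm.c
+kernel; the kernel name, launch configuration, and the `qk_ptr` index helper
+are illustrative assumptions:
+
+```cuda
+// Pointer to the head-size vector inp[b][t][qkv][nh][.] in the packed
+// (B, T, 3, NH, HS) input; qkv = 0 selects Q, 1 selects K.
+__device__ const float* qk_ptr(const float* inp, int b, int t, int qkv,
+                               int nh, int T, int NH, int HS) {
+    return inp + ((((size_t)b * T + t) * 3 + qkv) * NH + nh) * HS;
+}
+
+// One thread computes one 8x8 tile of out[b][nh]; launch with, e.g.,
+// blockDim = (16, 16) and gridDim = (T/128, T/128, B*NH).
+__global__ void matmul_tri_naive_sketch(float* out, const float* inp,
+                                        int T, int NH, int HS) {
+    int i0 = 8 * (blockIdx.y * blockDim.y + threadIdx.y); // first query row
+    int j0 = 8 * (blockIdx.x * blockDim.x + threadIdx.x); // first key row
+    if (j0 > i0) return;         // 8x8 tile lies entirely above the diagonal
+    int b = blockIdx.z / NH, nh = blockIdx.z % NH;
+    float scale = rsqrtf((float)HS);                      // 1 / sqrt(HS)
+
+    for (int i = i0; i < i0 + 8; i++)
+        for (int j = j0; j < j0 + 8; j++) {
+            // Diagonal tiles also compute a few j > i entries; these are the
+            // redundant values that downstream code simply ignores.
+            const float* q = qk_ptr(inp, b, i, 0, nh, T, NH, HS);
+            const float* k = qk_ptr(inp, b, j, 1, nh, T, NH, HS);
+            float acc = 0.0f;
+            for (int hs = 0; hs < HS; hs++)  // every read from global memory
+                acc += q[hs] * k[hs];
+            out[(((size_t)b * NH + nh) * T + i) * T + j] = acc * scale;
+        }
+}
+```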
-## Kernel 2: Register Tiling +## Kernel 2: Register Tiling (`matmul_tri_registers`) -This kernel improves performance by leveraging **register tiling**. Each thread still computes an **8×8 tile** of the output, but instead of reading query and key vectors from global memory multiple times, each thread loads its Q and K vectors into registers for reuse. +This kernel improves performance by leveraging **register tiling**. +Each thread still computes an **8×8 tile** of the output, but instead of +reading query and key vectors from global memory multiple times, each thread +loads its \\(Q\\) and \\(K\\) vectors into registers for reuse. ### Key Characteristics of Kernel 2 -- One thread per **8×8 tile**, same as Kernel 1 -- Q and K values are loaded into **`float lhs[8]` and `float rhs[8]`** arrays in registers -- Loops over the head size (`HS`) to compute 64 dot products per thread -- **No shared memory**, but much better memory locality than Kernel 1 -- Still performs some redundant computation above the diagonal (ignored due to masking) -- Faster than Kernel 1 due to fewer global loads +- One thread per **8×8 tile**, same as Kernel 1. +- \\(Q\\) and \\(K\\) values are loaded into **`float lhs[8]` and `float rhs[8]`** + arrays in registers. +- Loops over the head size \\((HS)\\) to compute 64 dot products per thread. +- **No shared memory**, but much better memory locality than Kernel 1. +- Still performs some redundant computation above the diagonal (ignored due to + masking). +- Faster than Kernel 1 due to fewer global loads. -See **Figure 2** for a visualization of how registers are used to tile the data within a thread: +See **Figure 2** for a visualization of how registers are used to tile the data +within a thread: +
-  <img src="figures/trimat_forward/kernel2.svg" alt="Kernel 2 Diagram" />
+<img src="figures/trimat_forward/kernel2.svg" alt="Kernel 2 Diagram" />
 
+
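+
+A minimal sketch of the register-tiling idea, again illustrative rather than
+the exact llm.c code (it reuses the `qk_ptr` helper from the Kernel 1 sketch):
+
+```cuda
+__global__ void matmul_tri_registers_sketch(float* out, const float* inp,
+                                            int T, int NH, int HS) {
+    int i0 = 8 * (blockIdx.y * blockDim.y + threadIdx.y);
+    int j0 = 8 * (blockIdx.x * blockDim.x + threadIdx.x);
+    if (j0 > i0) return;
+    int b = blockIdx.z / NH, nh = blockIdx.z % NH;
+
+    float vals[8][8] = {};             // 64 accumulators held in registers
+    for (int hs = 0; hs < HS; hs++) {
+        float lhs[8], rhs[8];          // one Q and one K element per row
+        for (int u = 0; u < 8; u++) {
+            lhs[u] = qk_ptr(inp, b, i0 + u, 0, nh, T, NH, HS)[hs];
+            rhs[u] = qk_ptr(inp, b, j0 + u, 1, nh, T, NH, HS)[hs];
+        }
+        // 16 global loads now feed all 64 multiply-accumulates.
+        for (int ii = 0; ii < 8; ii++)
+            for (int jj = 0; jj < 8; jj++)
+                vals[ii][jj] += lhs[ii] * rhs[jj];
+    }
+
+    float scale = rsqrtf((float)HS);
+    for (int ii = 0; ii < 8; ii++)
+        for (int jj = 0; jj < 8; jj++)
+            out[(((size_t)b * NH + nh) * T + i0 + ii) * T + j0 + jj] =
+                vals[ii][jj] * scale;
+}
+```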
-## Kernel 3: Vectorized Loads +## Kernel 3: Vectorized Loads (`matmul_tri3`) -This kernel builds on Kernel 2 by introducing **vectorized and coalesced memory access** using `float4` loads. The goal is to improve global memory bandwidth utilization by aligning reads and writes to 16-byte boundaries. +This kernel builds on Kernel 2 by introducing **vectorized and coalesced +memory access** using `float4` loads. +The goal is to improve global memory bandwidth utilization by aligning reads +and writes to 16-byte boundaries. ### Key Characteristics of Kernel 3 -- Each thread still computes an 8×8 tile (64 dot products) -- Q and K values are loaded using `float4` for better memory coalescing -- Improves memory access patterns by reducing the number of memory transactions -- No shared memory; only register reuse + vectorized reads and writes -- Uses `ld_vec()` and `st_vec()` helper functions to safely cast pointers to `float4` -- Faster than Kernel 2 due to reduced memory traffic +- Each thread still computes an **8×8 tile** (64 dot products). +- \\(Q\\) and \\(K\\) values are loaded using `float4` for better memory coalescing. +- Improves memory access patterns by reducing the number of memory + transactions. +- No shared memory; only register reuse + vectorized reads and writes. +- Uses `ld_vec()` and `st_vec()` helper functions to safely cast pointers to + `float4`. +- Faster than Kernel 2 due to reduced memory traffic. ## Kernel 4: Shared Memory Tiling (`matmul_tri4`) -This kernel introduces **shared memory tiling** to improve memory reuse across threads in a thread block. Threads collaborate to load tiles of the Q and K matrices into shared memory, significantly reducing global memory accesses. +This kernel introduces **shared memory tiling** to improve memory reuse across +threads in a thread block. +Threads collaborate to load tiles of the \\(Q\\) and \\(K\\) matrices into +shared memory, +significantly reducing global memory accesses. ### Key Characteristics of Kernel 4 -- Uses shared memory arrays: `lhs_s[128][32]`, `rhs_s[128][32]` -- 16×16 threads cooperatively load **128 rows × 32 dimensions** from Q and K into shared memory -- Computes 8×8 tiles per thread, iterating over `HS / 32` slices to accumulate dot products -- Final results are written with **vectorized `float4` stores** for efficient global memory writes +- Uses shared memory arrays: `lhs_s[128][32]`, `rhs_s[128][32]`. +- 16×16 threads cooperatively load **128 rows × 32 dimensions** from \\(Q\\) + and \\(K\\) into shared memory. +- Computes **8×8 tiles** per thread, iterating over \\(HS / 32\\) slices to + accumulate dot products. +- Final results are written with **vectorized `float4` stores** for efficient + global memory writes. See **Figure 4** for an illustration of shared memory tiling and accumulation: +
-  <img src="figures/trimat_forward/kernel4.svg" alt="Kernel 4 Diagram" />
+<img src="figures/trimat_forward/kernel4.svg" alt="Kernel 4 Diagram" />
 
+
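+
+Returning to Kernel 3 for a moment: the `ld_vec()` and `st_vec()` helpers can
+be as small as the sketch below (our rendering of the idea; the alignment
+requirement holds because HS = 64, so every row start is a multiple of
+16 bytes):
+
+```cuda
+__device__ float4 ld_vec(const float* address) {
+    return *reinterpret_cast<const float4*>(address);  // one 16-byte load
+}
+
+__device__ void st_vec(float* address, float4 val) {
+    *reinterpret_cast<float4*>(address) = val;         // one 16-byte store
+}
+```
+
+With these, the inner loop of the Kernel 2 sketch can step over the head
+dimension four floats at a time, issuing a quarter as many load instructions.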
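+
+And a condensed, illustrative sketch of Kernel 4's shared-memory scheme (not
+the exact llm.c code; it reuses `qk_ptr` from the Kernel 1 sketch and
+`st_vec` from above, and assumes blockDim = (16, 16) with
+gridDim = (T/128, T/128, B*NH)):
+
+```cuda
+__global__ void matmul_tri4_sketch(float* out, const float* inp,
+                                   int T, int NH, int HS) {
+    __shared__ float lhs_s[128][32];       // Q tile: 128 rows x 32 head dims
+    __shared__ float rhs_s[128][32];       // K tile: 128 rows x 32 head dims
+    if (blockIdx.x > blockIdx.y) return;   // 128x128 tile above the diagonal
+    int b = blockIdx.z / NH, nh = blockIdx.z % NH;
+    int i0 = 128 * blockIdx.y + 8 * threadIdx.y;  // this thread's query rows
+    int j0 = 128 * blockIdx.x + 8 * threadIdx.x;  // this thread's key rows
+    int tid = threadIdx.y * 16 + threadIdx.x;
+
+    float vals[8][8] = {};
+    for (int so = 0; so < HS; so += 32) {  // HS / 32 slices (2 for HS = 64)
+        __syncthreads();
+        // 256 threads cooperatively stage 128 rows x 32 dims of Q and of K.
+        for (int e = tid; e < 128 * 32; e += 256) {
+            int row = e / 32, dim = e % 32;
+            lhs_s[row][dim] = qk_ptr(inp, b, 128 * blockIdx.y + row, 0,
+                                     nh, T, NH, HS)[so + dim];
+            rhs_s[row][dim] = qk_ptr(inp, b, 128 * blockIdx.x + row, 1,
+                                     nh, T, NH, HS)[so + dim];
+        }
+        __syncthreads();
+        for (int d = 0; d < 32; d++)       // accumulate from shared memory
+            for (int ii = 0; ii < 8; ii++)
+                for (int jj = 0; jj < 8; jj++)
+                    vals[ii][jj] += lhs_s[8 * threadIdx.y + ii][d] *
+                                    rhs_s[8 * threadIdx.x + jj][d];
+    }
+
+    float scale = rsqrtf((float)HS);
+    for (int ii = 0; ii < 8; ii++)
+        for (int jj = 0; jj < 8; jj += 4) {  // vectorized float4 stores
+            float4 v = make_float4(vals[ii][jj] * scale,
+                                   vals[ii][jj + 1] * scale,
+                                   vals[ii][jj + 2] * scale,
+                                   vals[ii][jj + 3] * scale);
+            st_vec(&out[(((size_t)b * NH + nh) * T + i0 + ii) * T + j0 + jj],
+                   v);
+        }
+}
+```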
## References 1. [llm.c CUDA kernels](https://github.com/karpathy/llm.c/tree/master/dev/cuda) -2. [Scaled Dot-Product Attention (Vaswani et al., 2017)](https://arxiv.org/abs/1706.03762) -3. [CUDA Programming Guide: Memory Coalescing](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-coalescing) +2. [Scaled Dot-Product Attention (Vaswani et al., + 2017)](https://arxiv.org/abs/1706.03762) +3. [CUDA Programming Guide: Memory + Coalescing](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-coalescing) From 88534e1521b229266b4d0f535f064c2e53099d52 Mon Sep 17 00:00:00 2001 From: Farnaz Kohankhaki Date: Sun, 22 Jun 2025 20:33:39 -0700 Subject: [PATCH 3/5] enabled MD013 --- books/compute/src/cuda/kernels/trimat_forward.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/books/compute/src/cuda/kernels/trimat_forward.md b/books/compute/src/cuda/kernels/trimat_forward.md index a6c5e49..5713f9f 100644 --- a/books/compute/src/cuda/kernels/trimat_forward.md +++ b/books/compute/src/cuda/kernels/trimat_forward.md @@ -1,4 +1,4 @@ - + # Kernels for Triangular Matrix Multiplication (Trimat) Forward Pass @@ -28,8 +28,8 @@ cuBLAS. They are introduced in increasing order of optimization: - [Kernel 1: `matmul_tri_naive`](#kernel-1-naive-implementation-matmul_tri_naive): A simple nested loop implementation with no memory optimization. -- [Kernel 2: `matmul_tri_registers`](#kernel-2-register-tiling-matmul_tri_registers): Uses - **register tiling** to reduce redundant memory loads. +- [Kernel 2: `matmul_tri_registers`](#kernel-2-register-tiling-matmul_tri_registers): + Uses **register tiling** to reduce redundant memory loads. - [Kernel 3: `matmul_tri3`](#kernel-3-vectorized-loads-matmul_tri3): Adds **vectorized memory access** using `float4` to improve memory coalescing. - [Kernel 4: `matmul_tri4`](#kernel-4-shared-memory-tiling-matmul_tri4): From 9d6891b01432cb765fbd395bc2e681f79a2a53a0 Mon Sep 17 00:00:00 2001 From: Farnaz Kohankhaki Date: Sun, 22 Jun 2025 21:06:18 -0700 Subject: [PATCH 4/5] fixed trimat spelling. --- books/compute/src/cuda/kernels/trimat_forward.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/books/compute/src/cuda/kernels/trimat_forward.md b/books/compute/src/cuda/kernels/trimat_forward.md index 5713f9f..3bdf5cf 100644 --- a/books/compute/src/cuda/kernels/trimat_forward.md +++ b/books/compute/src/cuda/kernels/trimat_forward.md @@ -1,6 +1,6 @@ -# Kernels for Triangular Matrix Multiplication (Trimat) Forward Pass +# Kernels for Triangular Matrix Multiplication (Trimat Forward Pass) {{ #aipr_header }} @@ -20,8 +20,8 @@ Triangular matrix multiplication is a specialized form of matrix multiplication, where instead of computing the full output matrix, only the **lower triangle** is computed. This leads to substantial computational savings. -This guide explains a series of CUDA kernel implementations for the **trimat -forward pass**, based on the [llm.c](https://github.com/karpathy/llm.c/tree/master/dev/cuda) +This guide explains a series of CUDA kernel implementations for the **Trimat +Forward Pass**, based on the [llm.c](https://github.com/karpathy/llm.c/tree/master/dev/cuda) GitHub repository. These kernels avoid unnecessary computation and offer potential speedups over cuBLAS. They are introduced in increasing order of optimization: @@ -39,7 +39,7 @@ cuBLAS. 
They are introduced in increasing order of optimization:
 
 The next section, [Input, Output, and
 Computation](#input-output-and-computation), describes the tensor shapes, the
 configuration used in the examples, and the exact computation performed during
-the trimat forward pass.
+the Trimat Forward Pass.
 
 ## Input, Output, and Computation
 

From 6310ff53f5409172aea81d997954919649c55145 Mon Sep 17 00:00:00 2001
From: Val Andrei Fajardo
Date: Mon, 23 Jun 2025 12:07:54 -0400
Subject: [PATCH 5/5] hosted images

---
 .../src/cuda/kernels/figures/trimat_forward/kernel1.svg  | 4 ----
 .../src/cuda/kernels/figures/trimat_forward/kernel2.svg  | 4 ----
 .../src/cuda/kernels/figures/trimat_forward/kernel4.svg  | 4 ----
 books/compute/src/cuda/kernels/trimat_forward.md         | 9 ++++-----
 4 files changed, 4 insertions(+), 17 deletions(-)
 delete mode 100644 books/compute/src/cuda/kernels/figures/trimat_forward/kernel1.svg
 delete mode 100644 books/compute/src/cuda/kernels/figures/trimat_forward/kernel2.svg
 delete mode 100644 books/compute/src/cuda/kernels/figures/trimat_forward/kernel4.svg

diff --git a/books/compute/src/cuda/kernels/figures/trimat_forward/kernel1.svg b/books/compute/src/cuda/kernels/figures/trimat_forward/kernel1.svg
deleted file mode 100644
index 1138b3b..0000000
--- a/books/compute/src/cuda/kernels/figures/trimat_forward/kernel1.svg
+++ /dev/null
@@ -1,4 +0,0 @@
-[SVG text: blocks 0,0,0 and 0,0,1 of the output, each tiled by threads
-t0,0 t0,1 ... t0,15 and t1,0 ... t15,0, with axes labelled T, NH, and B]
diff --git a/books/compute/src/cuda/kernels/figures/trimat_forward/kernel2.svg b/books/compute/src/cuda/kernels/figures/trimat_forward/kernel2.svg
deleted file mode 100644
index ec49bd9..0000000
--- a/books/compute/src/cuda/kernels/figures/trimat_forward/kernel2.svg
+++ /dev/null
@@ -1,4 +0,0 @@
-[SVG text: same block/thread layout as kernel1.svg, plus per-thread register
-labels lhs[8], rhs[8], val[8][8]]
diff --git a/books/compute/src/cuda/kernels/figures/trimat_forward/kernel4.svg b/books/compute/src/cuda/kernels/figures/trimat_forward/kernel4.svg
deleted file mode 100644
index 81acfe6..0000000
--- a/books/compute/src/cuda/kernels/figures/trimat_forward/kernel4.svg
+++ /dev/null
@@ -1,4 +0,0 @@
-[SVG text: block/thread layout plus shared-memory Q and K tiles (rows
-0,0 ... 127,0, dims 0,0 ... 0,31) with load and compute arrows]

diff --git a/books/compute/src/cuda/kernels/trimat_forward.md b/books/compute/src/cuda/kernels/trimat_forward.md
index 3bdf5cf..e2c1d80 100644
--- a/books/compute/src/cuda/kernels/trimat_forward.md
+++ b/books/compute/src/cuda/kernels/trimat_forward.md
@@ -174,7 +174,7 @@ matrix:
 
-<img src="figures/trimat_forward/kernel1.svg" alt="Kernel 1 Diagram" />
+<img src="[hosted image URL]" alt="Kernel 1 Diagram" />
 
@@ -201,7 +201,7 @@ within a thread:
-<img src="figures/trimat_forward/kernel2.svg" alt="Kernel 2 Diagram" />
+<img src="[hosted image URL]" alt="Kernel 2 Diagram" />
 
@@ -245,15 +245,14 @@ See **Figure 4** for an illustration of shared memory tiling and accumulation:
-<img src="figures/trimat_forward/kernel4.svg" alt="Kernel 4 Diagram" />
+<img src="[hosted image URL]" alt="Kernel 4 Diagram" />
 
## References 1. [llm.c CUDA kernels](https://github.com/karpathy/llm.c/tree/master/dev/cuda) -2. [Scaled Dot-Product Attention (Vaswani et al., - 2017)](https://arxiv.org/abs/1706.03762) +2. [Scaled Dot-Product Attention (Vaswani et al., 2017)](https://arxiv.org/abs/1706.03762) 3. [CUDA Programming Guide: Memory Coalescing](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-coalescing)