Week 1: Intro to GPUs and writing your first kernel!
Can you guess which architecture more closely resembles a CPU? What about a GPU?
Motivation for GPUs in Deep Learning
A gentle introduction to CUDA
Further resources/references to use:
PMPP Book Access
NVIDIA GPU Glossary
Week 2: Learning to optimize your kernels!
From the image, how many FLOPs (floating-point operations) are in a matrix multiplication?
Aalto University's Course on GPU Programming
Simon's Blog on SGEMM (Kernels 1-5 are the most relevant for the assignment)
How to use the NCU profiler
Roofline Models
Further references to use:
NCU Documentation
Weeks 3 and 4: Learning to optimize with Tensor Cores!
How much faster are Tensor Core operations compared to FP32 CUDA cores?
A Sequel to Simon's Blog on HGEMM
Bruce's Blog on HGEMM
Spatter's Blog on HGEMM
NVIDIA's Presentation on A100 Tensor Cores
Further references to use:
Primer on Inline PTX Assembly
CUTLASS GEMM Documentation
NVIDIA PTX ISA Documentation (Chapter 9.7 is most relevant)
Week 6: Exploring other parallel optimization techniques!
How could we compute the sum of all the elements of a one-million-element vector?
Primer on Parallel Reduction
Warp-Level Primitives
Vectorization
Efficient Softmax Kernel
Online Softmax Paper
Weeks 7 and 8: Putting it all together in Flash Attention!
Is the self-attention layer in LLMs compute-bound or memory-bound?
Flash Attention V1 Paper
Aleksa Gordic's Flash Attention Blog