Matrix multiplication in CUDA, with two kernel versions: naive and tiled (shared memory).
This program computes C = A × B on the GPU and includes two implementations:
- Naive kernel: Basic implementation where each thread computes one element of the result matrix
- Tiled kernel: Optimized version using shared memory to reduce global memory accesses
This demonstrates key CUDA concepts like shared memory, tiling, and memory coalescing.
cd matrix_mult
nvcc main.cpp matrix_mult.cu -o matrix_mult
./matrix_mult

- CPU reference implementation for verification
- Both naive and tiled CUDA kernels
- Automatic correctness checking
- Configurable matrix dimensions
Naive kernel:
- Each thread computes one element of the result matrix
- Every operand is read directly from global memory
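
As a sketch, the naive kernel might look like this (the kernel name and the square-matrix assumption are mine, not from the actual source):

```cuda
// Hypothetical naive kernel: one thread per output element of an N×N product.
// Each of the N multiply-adds reads both operands from global memory.
__global__ void matMulNaive(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];  // global loads on every iteration
        C[row * N + col] = sum;
    }
}
```

Each element of A and B is re-read from global memory N times across the grid, which is what the tiled version below avoids.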
Tiled kernel:
- Uses 16×16 shared-memory tiles to cache sub-blocks of A and B
- Significantly reduces redundant global memory accesses
- Demonstrates the importance of memory optimization in CUDA
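
A minimal sketch of the tiling idea, assuming square N×N matrices (kernel and variable names are illustrative, not from the actual source):

```cuda
#define TILE 16

// Hypothetical tiled kernel: each block stages one 16×16 tile of A and of B in
// shared memory, so each global element is loaded once per tile rather than
// once per multiply-add.
__global__ void matMulTiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        // Cooperative, coalesced loads of the current tiles (zero-padded at edges).
        As[threadIdx.y][threadIdx.x] = (row < N && t * TILE + threadIdx.x < N)
            ? A[row * N + t * TILE + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (t * TILE + threadIdx.y < N && col < N)
            ? B[(t * TILE + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();  // wait until both tiles are fully loaded

        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];  // shared-memory reads only
        __syncthreads();  // finish using the tiles before the next load overwrites them
    }
    if (row < N && col < N)
        C[row * N + col] = sum;
}
```

The two `__syncthreads()` barriers are essential: the first prevents reading a partially loaded tile, and the second prevents the next iteration from overwriting a tile that other threads are still using.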