Matrix multiplication in CUDA, with two kernel versions: naive and tiled (shared memory).
This program computes C = A × B on the GPU and includes two implementations:
- Naive kernel: Basic implementation where each thread computes one element of the result matrix
- Tiled kernel: Optimized version using shared memory to reduce global memory accesses
This demonstrates key CUDA concepts like shared memory, tiling, and memory coalescing.
cd matrix_mult
nvcc main.cpp matrix_mult.cu -o matrix_mult
./matrix_mult

- CPU reference implementation for verification
- Both naive and tiled CUDA kernels
- Automatic correctness checking
- Configurable matrix dimensions
Naive kernel:
- Each thread computes one element of the result matrix
- Every operand is read directly from global memory
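
As a sketch, the naive kernel might look like this (the kernel name and the square-matrix assumption are mine, not from the actual source):

```cuda
// Hypothetical naive kernel: one thread per output element of an N×N product.
// Each of the N multiply-adds reads both operands from global memory.
__global__ void matMulNaive(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];  // global loads on every iteration
        C[row * N + col] = sum;
    }
}
```

Each element of A and B is re-read from global memory N times across the grid, which is what the tiled version below avoids.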
Tiled kernel:
- Uses 16×16 shared-memory tiles to cache sub-blocks of A and B
- Significantly reduces redundant global memory accesses
- Demonstrates the importance of memory optimization in CUDA
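
A minimal sketch of the tiling idea, assuming square N×N matrices (kernel and variable names are illustrative, not from the actual source):

```cuda
#define TILE 16

// Hypothetical tiled kernel: each block stages one 16×16 tile of A and of B in
// shared memory, so each global element is loaded once per tile rather than
// once per multiply-add.
__global__ void matMulTiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        // Cooperative, coalesced loads of the current tiles (zero-padded at edges).
        As[threadIdx.y][threadIdx.x] = (row < N && t * TILE + threadIdx.x < N)
            ? A[row * N + t * TILE + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (t * TILE + threadIdx.y < N && col < N)
            ? B[(t * TILE + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();  // wait until both tiles are fully loaded

        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];  // shared-memory reads only
        __syncthreads();  // finish using the tiles before the next load overwrites them
    }
    if (row < N && col < N)
        C[row * N + col] = sum;
}
```

The two `__syncthreads()` barriers are essential: the first prevents reading a partially loaded tile, and the second prevents the next iteration from overwriting a tile that other threads are still using.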