
Matrix Multiplication

A CUDA implementation of matrix multiplication in two versions: a naive kernel and a tiled kernel that uses shared memory.

Description

This program implements matrix multiplication C = A × B using CUDA. It includes two implementations:

  • Naive kernel: Basic implementation where each thread computes one element of the result matrix
  • Tiled kernel: Optimized version using shared memory to reduce global memory accesses

This demonstrates key CUDA concepts like shared memory, tiling, and memory coalescing.
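A typical host-side driver for C = A × B might look like the sketch below. The kernel name `matmul_naive` and the helper `multiply` are illustrative assumptions; the actual names live in matrix_mult.cu and may differ.

```cuda
#include <cuda_runtime.h>

// Assumed kernel signature; the real kernel is defined in matrix_mult.cu.
__global__ void matmul_naive(const float* A, const float* B, float* C, int N);

// Hypothetical driver: copy inputs to the device, launch, copy the result back.
void multiply(const float* hA, const float* hB, float* hC, int N) {
    size_t bytes = (size_t)N * N * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);  // one thread per output element
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    matmul_naive<<<grid, block>>>(dA, dB, dC, N);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```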

Building and Running

cd matrix_mult
nvcc main.cpp matrix_mult.cu -o matrix_mult
./matrix_mult

Features

  • CPU reference implementation for verification
  • Both naive and tiled CUDA kernels
  • Automatic correctness checking
  • Configurable matrix dimensions

How it Works

Naive Kernel

  • Each thread computes one element of the result matrix
  • Direct access to global memory for all data
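A minimal sketch of such a kernel (the name and exact signature are assumptions, not necessarily what matrix_mult.cu uses):

```cuda
// Naive kernel: one thread computes one element of C = A × B (N×N, row-major).
__global__ void matmul_naive(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];  // every read goes to global memory
        C[row * N + col] = sum;
    }
}
```

Because every thread re-reads full rows of A and columns of B from global memory, this version is memory-bound.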

Tiled Kernel

  • Uses shared memory tiles (16×16) to cache sub-blocks of A and B
  • Each element is loaded from global memory once per tile rather than once per output element, cutting global memory traffic by roughly the tile width (16×)
  • Demonstrates the importance of memory optimization in CUDA
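The tiled version can be sketched as follows. This is an illustrative implementation under the assumptions above (16×16 tiles, square row-major matrices); the kernel in matrix_mult.cu may differ in details such as naming and bounds handling.

```cuda
#define TILE 16  // matches the 16×16 tile size described above

__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];  // cached sub-block of A
    __shared__ float Bs[TILE][TILE];  // cached sub-block of B

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    // Walk across the shared dimension one tile at a time; each iteration the
    // block cooperatively loads one tile of A and one tile of B.
    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();  // wait until the whole tile is loaded

        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];  // shared-memory reads only
        __syncthreads();  // wait before the next iteration overwrites the tile
    }

    if (row < N && col < N)
        C[row * N + col] = sum;
}
```

The two `__syncthreads()` barriers are essential: the first prevents threads from reading a partially loaded tile, and the second prevents fast threads from overwriting a tile that slower threads are still reading.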