GPU kernels for N:M structured sparse matrix multiplication (50% sparsity, 2:4 pattern) on NVIDIA RTX 4070. Report is located in root as GPU.pdf
course_project/
├── include/
│ └── sparse_kernels.h # Function declarations
├── src/
│ ├── nmsparse_baseline.cu # Single-buffered baseline
│ ├── nmsparse_prefetch.cu # Register-based prefetching
│ ├── nmsparse_double_buffer.cu # Async double-buffer + Split-K
│ ├── nmspmm_kernel.cu # Personal Project, not related
│ ├── cublas_gemm.cu # cuBLAS dense baseline
│ └── helpers.cu # Data initialization and verification
├── tests/
│ └── test_kernels.cu # Benchmark harness
├── Makefile
├── profile_ncu.sh # Nsight Compute profiling
└── analyze_ncu.py # Profile analysis
Single-buffered sparse SpMM with synchronous loads.
- 32×32 block tiles, 4×4 thread tiles
- 64 sparse elements per K-tile
- Straightforward index-based gathering
Register-based prefetching to overlap memory and computation.
- Loads next tile into registers during computation
- Same tiling as baseline
- Tests latency hiding without hardware async
PTX async copies with ping-pong buffers and Split-K.
cp.asyncinstructions for hardware acceleration- 32 sparse elements per K-tile (smaller to fit double buffers)
- Split-K=2 for additional parallelism
- Requires Ampere+ (sm_80+)
Optimized dense GEMM for baseline comparison.
make clean all# Default (1024×1024×1024)
make run
# All sizes (64 to 8192)
make run-all
# Custom
./build/test_sparse_kernels <M> <N> <K> [warmup] [iterations]
./build/test_sparse_kernels 2048 2048 2048 10 100make debug-test # 64×64×64 with all variants# Profile with Nsight Compute
./profile_ncu.sh 1024 1024 1024
# Generates: ncu_profile_1024x1024x1024.csv
# ncu_profile_1024x1024x1024_summary.csv# Analyze profiling data
python3 analyze_ncu.py <input.csv> [output.csv]
## Scripts
- `profile_ncu.sh`: Collects 20+ hardware metrics (DRAM throughput, SM utilization, bank conflicts, warp stalls)
- `analyze_ncu.py`: Processes raw NCU CSV into statistical summaries
## Prerequisites
Tested on NYU Courant Institute cuda5 server (RTX 4070, CUDA 12.4).