Welcome to the vast universe of NVIDIA's CUDA ecosystem. When starting out, it's easy to get overwhelmed by the alphabet soup of libraries (cuBLAS, cuDNN, NCCL, Thrust...).
The Golden Rule: Don't write a kernel if a library already does it faster.
This guide explains What each library is, When to use it, and How it fits into your journey.
## CUDA Runtime API (`cudart`)

- What is it?: The standard way to program CUDA. Functions like `cudaMalloc`, `cudaMemcpy`, and kernel launches with `<<<...>>>`.
- Difficulty: ⭐ Beginner
- When to use: ALWAYS. This is your default starting point for 99% of projects.
- How: Include `<cuda_runtime.h>`. (Already verified in this repo's examples!)
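To make the workflow concrete, here is a minimal vector-add sketch using only `cudart` calls (compile with `nvcc`; error checking omitted for brevity):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = new float[n], *h_b = new float[n], *h_c = new float[n];
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);  // kernel launch

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);  // 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    delete[] h_a; delete[] h_b; delete[] h_c;
}
```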
## CUDA Driver API

- What is it?: A lower-level, C-style API that talks directly to the driver (`cuCtxCreate`, `cuMemAlloc`).
- Difficulty: ⭐⭐⭐ Hard
- When to use: RARELY. Only if you are writing a language binding (like a Python library that talks to the GPU) or need manual context management.
- Advice: Stick to `cudart`.
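For a sense of why this is rated Hard, here is a sketch of the boilerplate a Driver API program needs before it can even allocate memory, all of which `cudart` does for you lazily (error checking omitted):

```cuda
#include <cuda.h>

int main() {
    cuInit(0);                  // must precede every other Driver API call
    CUdevice dev;
    cuDeviceGet(&dev, 0);       // pick GPU 0 explicitly
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);  // manual context management

    CUdeviceptr d_buf;
    cuMemAlloc(&d_buf, 1024);   // Driver API equivalent of cudaMalloc
    cuMemFree(d_buf);
    cuCtxDestroy(ctx);
}
```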
## NVRTC (Runtime Compilation)

- What is it?: A compiler library that lets you compile CUDA C++ source strings into PTX (GPU assembly) while your program is running.
- Difficulty: ⭐⭐⭐ Advanced
- When to use: If you are building an app where the user defines the math logic at runtime (like a database query engine, or Python's Numba/CuPy).
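A sketch of the round trip: hand NVRTC a source string, get PTX back. Loading the PTX (via the Driver API's `cuModuleLoadData`) is omitted here; link with `-lnvrtc`:

```cuda
#include <nvrtc.h>
#include <cstdio>
#include <vector>

int main() {
    const char* src =
        "extern \"C\" __global__ void scale(float* x, float s) {"
        "  x[threadIdx.x] *= s;"
        "}";

    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "scale.cu", 0, nullptr, nullptr);
    nvrtcCompileProgram(prog, 0, nullptr);  // pass options like "-arch=sm_80" here

    size_t ptxSize;
    nvrtcGetPTXSize(prog, &ptxSize);
    std::vector<char> ptx(ptxSize);
    nvrtcGetPTX(prog, ptx.data());          // PTX ready for cuModuleLoadData
    printf("%s", ptx.data());

    nvrtcDestroyProgram(&prog);
}
```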
## CUDA Dynamic Parallelism

- What is it?: Enables Dynamic Parallelism (launching new kernels from inside a kernel).
- Difficulty: ⭐⭐ Intermediate
- When to use: When your algorithm is recursive (like graph traversal or quadtrees) and you don't know how much work is needed until you are already on the GPU.
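A minimal sketch of the idea: a parent kernel decides, on the GPU, how much child work to launch. Compile with relocatable device code (`nvcc -rdc=true`):

```cuda
__global__ void child(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

__global__ void parent(int* data, int n) {
    if (threadIdx.x == 0) {
        // The grid size for the child can be computed here, on the device,
        // without a round trip to the CPU.
        child<<<(n + 127) / 128, 128>>>(data, n);
    }
}
```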
## cuBLAS

- What is it?: The GPU standard for matrix multiplication (`GEMM`) and vector math.
- Difficulty: ⭐ Beginner/Intermediate
- When to use: Anytime you need `Matrix x Matrix` or `Matrix x Vector`. Do not write your own MatMul kernel unless it's for learning; cuBLAS is hand-tuned by wizards.
- How: `cublasCreate()`, `cublasSgemm()`.
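A sketch of those two calls computing C = alpha·A·B + beta·C (link with `-lcublas`; the `gemm` wrapper name is ours, and `d_A`/`d_B`/`d_C` are assumed to be device pointers already filled with data):

```cuda
#include <cublas_v2.h>

void gemm(const float* d_A, const float* d_B, float* d_C, int m, int n, int k) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // cuBLAS expects column-major storage (BLAS heritage);
    // leading dimensions here are the matrices' row counts.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha, d_A, m, d_B, k,
                &beta, d_C, m);

    cublasDestroy(handle);
}
```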
## cuBLASLt

- What is it?: A more flexible, lower-level version of cuBLAS for advanced matrix math (INT8, FP8, mixed precision).
- Difficulty: ⭐⭐⭐ Advanced
- When to use: If you are optimizing large LLMs or need very specific fusion/quantization support that standard cuBLAS lacks.
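A hedged sketch of the extra machinery cuBLASLt adds over `cublasSgemm`: explicit descriptors for the operation and for each matrix layout, which is what unlocks the mixed-precision and fusion options. Link with `-lcublasLt`; the wrapper name is ours and device pointers are assumed:

```cuda
#include <cublasLt.h>

void lt_gemm(const float* d_A, const float* d_B, float* d_C, int m, int n, int k) {
    cublasLtHandle_t handle;
    cublasLtCreate(&handle);

    cublasLtMatmulDesc_t op;
    cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_32F, CUDA_R_32F);

    cublasLtMatrixLayout_t A, B, C;  // column-major layouts
    cublasLtMatrixLayoutCreate(&A, CUDA_R_32F, m, k, m);
    cublasLtMatrixLayoutCreate(&B, CUDA_R_32F, k, n, k);
    cublasLtMatrixLayoutCreate(&C, CUDA_R_32F, m, n, m);

    const float alpha = 1.0f, beta = 0.0f;
    cublasLtMatmul(handle, op, &alpha, d_A, A, d_B, B,
                   &beta, d_C, C, d_C, C,
                   nullptr, nullptr, 0, 0);  // default algo, no workspace, default stream

    cublasLtMatrixLayoutDestroy(A); cublasLtMatrixLayoutDestroy(B);
    cublasLtMatrixLayoutDestroy(C); cublasLtMatmulDescDestroy(op);
    cublasLtDestroy(handle);
}
```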
## cuSPARSE

- What is it?: Linear algebra for sparse matrices (matrices where most values are zero).
- Difficulty: ⭐⭐ Intermediate
- When to use: Physics simulations, graph algorithms, or recommender systems where storing the zeros is a waste of RAM.
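A hedged sketch of a sparse matrix-vector product y = A·x with cuSPARSE's generic SpMV API, with A stored in CSR format. Link with `-lcusparse`; the wrapper name is ours and the device arrays are assumed to be populated:

```cuda
#include <cusparse.h>
#include <cuda_runtime.h>

void spmv(int rows, int cols, int nnz,
          int* d_rowPtr, int* d_colInd, float* d_vals,
          float* d_x, float* d_y) {
    cusparseHandle_t handle;
    cusparseCreate(&handle);

    cusparseSpMatDescr_t matA;  // describe A in CSR form
    cusparseCreateCsr(&matA, rows, cols, nnz, d_rowPtr, d_colInd, d_vals,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
    cusparseDnVecDescr_t vecX, vecY;
    cusparseCreateDnVec(&vecX, cols, d_x, CUDA_R_32F);
    cusparseCreateDnVec(&vecY, rows, d_y, CUDA_R_32F);

    const float alpha = 1.0f, beta = 0.0f;
    size_t bufSize; void* buf;
    cusparseSpMV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                            &alpha, matA, vecX, &beta, vecY, CUDA_R_32F,
                            CUSPARSE_SPMV_ALG_DEFAULT, &bufSize);
    cudaMalloc(&buf, bufSize);
    cusparseSpMV(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                 &alpha, matA, vecX, &beta, vecY, CUDA_R_32F,
                 CUSPARSE_SPMV_ALG_DEFAULT, buf);

    cudaFree(buf);
    cusparseDestroySpMat(matA);
    cusparseDestroyDnVec(vecX); cusparseDestroyDnVec(vecY);
    cusparseDestroy(handle);
}
```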
## cuSOLVER

- What is it?: High-level linear algebra solvers (eigenvalues, SVD, Cholesky decomposition).
- Difficulty: ⭐⭐ Intermediate
- When to use: Scientific computing applications needing complex matrix factorizations, not just multiplication.
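A hedged sketch of one of those factorizations: Cholesky (`potrf` in LAPACK naming) of an n x n symmetric positive-definite matrix via cuSOLVER's dense API. Link with `-lcusolver`; the wrapper name is ours and `d_A` is a column-major device matrix:

```cuda
#include <cusolverDn.h>
#include <cuda_runtime.h>

void cholesky(float* d_A, int n) {
    cusolverDnHandle_t handle;
    cusolverDnCreate(&handle);

    int lwork;  // query workspace size first
    cusolverDnSpotrf_bufferSize(handle, CUBLAS_FILL_MODE_LOWER, n, d_A, n, &lwork);

    float* d_work; int* d_info;
    cudaMalloc(&d_work, lwork * sizeof(float));
    cudaMalloc(&d_info, sizeof(int));

    // Overwrites the lower triangle of d_A with the Cholesky factor L.
    cusolverDnSpotrf(handle, CUBLAS_FILL_MODE_LOWER, n, d_A, n, d_work, lwork, d_info);

    cudaFree(d_work); cudaFree(d_info);
    cusolverDnDestroy(handle);
}
```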
## cuFFT

- What is it?: Fast Fourier Transforms (signal processing).
- Difficulty: ⭐ Beginner/Intermediate
- When to use: Audio processing, fluid dynamics, image filtering. It is incredibly fast.
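The plan/execute pattern is the whole API in miniature; here is a sketch of an in-place 1D complex-to-complex transform (link with `-lcufft`; the wrapper name is ours):

```cuda
#include <cufft.h>

void fft_forward(cufftComplex* d_signal, int N) {
    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);                    // 1 batch of size N
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);  // in-place forward FFT
    cufftDestroy(plan);
}
```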
## cuRAND

- What is it?: Random number generation on the GPU.
- Difficulty: ⭐ Beginner
- When to use: Monte Carlo simulations or initializing neural network weights. Generates billions of random numbers in parallel.
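A sketch of the host-side API filling a device buffer with uniform floats (link with `-lcurand`; the wrapper name is ours):

```cuda
#include <curand.h>

void fill_uniform(float* d_data, size_t n) {
    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);  // fixed seed for reproducibility
    curandGenerateUniform(gen, d_data, n);             // n floats in (0, 1]
    curandDestroyGenerator(gen);
}
```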
## cuTENSOR

- What is it?: High-performance tensor contractions (multi-dimensional matrix math).
- Difficulty: ⭐⭐ Intermediate
- When to use: If you are doing math on 4D or 5D arrays (common in physics and deep learning) and standard matrix math isn't enough.
## cuDNN

- What is it?: The engine room of Deep Learning. Provides primitives like Convolutions, Pooling, Softmax, and Attention.
- Difficulty: ⭐⭐⭐ Hard
- When to use: If you are writing your own Deep Learning framework (like PyTorch/TensorFlow) from scratch.
- Note: Most users use PyTorch/TF, which call cuDNN for you.
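A hedged sketch of a single primitive, a ReLU activation over an NCHW tensor, the kind of call PyTorch/TF issue on your behalf. Link with `-lcudnn`; the wrapper name is ours and `d_in`/`d_out` are device buffers of n*c*h*w floats:

```cuda
#include <cudnn.h>

void relu(float* d_in, float* d_out, int n, int c, int h, int w) {
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    cudnnTensorDescriptor_t desc;  // describe the tensor shape and layout
    cudnnCreateTensorDescriptor(&desc);
    cudnnSetTensor4dDescriptor(desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, c, h, w);

    cudnnActivationDescriptor_t act;
    cudnnCreateActivationDescriptor(&act);
    cudnnSetActivationDescriptor(act, CUDNN_ACTIVATION_RELU, CUDNN_NOT_PROPAGATE_NAN, 0.0);

    const float alpha = 1.0f, beta = 0.0f;
    cudnnActivationForward(handle, act, &alpha, desc, d_in, &beta, desc, d_out);

    cudnnDestroyActivationDescriptor(act);
    cudnnDestroyTensorDescriptor(desc);
    cudnnDestroy(handle);
}
```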
## TensorRT

- What is it?: An inference optimizer. It takes a trained model and optimizes it to run as fast as possible on specific hardware.
- Difficulty: ⭐⭐ Intermediate
- When to use: Deploying a model to production (robotics, cloud services) where latency (ms) matters.
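A hedged sketch of the deployment side: load a pre-built engine file and deserialize it for inference. Building the engine (from ONNX, say) is a separate offline step; link with `-lnvinfer`, and note the helper names here are ours:

```cuda
#include <NvInfer.h>
#include <cstdio>
#include <fstream>
#include <vector>

// TensorRT requires the caller to supply a logger.
class Logger : public nvinfer1::ILogger {
    void log(Severity sev, const char* msg) noexcept override {
        if (sev <= Severity::kWARNING) printf("%s\n", msg);
    }
} gLogger;

nvinfer1::ICudaEngine* load_engine(const char* path) {
    std::ifstream f(path, std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(f)),
                            std::istreambuf_iterator<char>());

    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(gLogger);
    return runtime->deserializeCudaEngine(blob.data(), blob.size());
}
```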
- What is it?: The "C++ STL for CUDA".
- Difficulty: ⭐ Beginner (Easiest!)
- When to use: Start here! If you need to
Sort,Reduce(sum),Scan, orTransformvectors. - Why: You can write a blazing fast GPU sort in 5 lines of code.
thrust::sort(d_vec.begin(), d_vec.end());
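Here is that one-liner in context, a small complete sketch that sorts and sums a device vector:

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdio>

int main() {
    thrust::device_vector<int> d_vec(4);  // lives in GPU memory
    d_vec[0] = 3; d_vec[1] = 1; d_vec[2] = 4; d_vec[3] = 1;

    thrust::sort(d_vec.begin(), d_vec.end());              // GPU sort
    int sum = thrust::reduce(d_vec.begin(), d_vec.end());  // parallel sum = 9

    printf("min=%d sum=%d\n", (int)d_vec[0], sum);
}
```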
## CUB

- What is it?: Building blocks for writing your own kernels. Provides efficient implementations of block- and warp-level operations (`WarpReduce`, `BlockScan`).
- Difficulty: ⭐⭐⭐ Advanced
- When to use: When writing custom kernels but you want robust, reusable components for parallel primitives inside the block.
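A sketch of what that looks like inside a custom kernel: a block-wide sum via CUB's `BlockReduce` (the kernel name is ours):

```cuda
#include <cub/cub.cuh>

__global__ void block_sum(const int* in, int* out) {
    // Specialize for 128 threads per block; CUB picks an efficient algorithm.
    typedef cub::BlockReduce<int, 128> BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp;

    int thread_val  = in[blockIdx.x * 128 + threadIdx.x];
    int block_total = BlockReduce(temp).Sum(thread_val);  // valid in thread 0 only

    if (threadIdx.x == 0) out[blockIdx.x] = block_total;
}
```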
## CUTLASS

- What is it?: Templates for high-performance matrix multiplication.
- Difficulty: ⭐⭐⭐⭐ Expert
- When to use: If you need to write a custom matrix multiplication layer that does something weird (like a fused activation) inside the inner loop and cuBLAS doesn't support it.
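A hedged sketch (CUTLASS 2.x style) of instantiating the device-level GEMM template with default settings; in real use, the point of CUTLASS is that you would swap in a custom epilogue here. The wrapper name is ours:

```cuda
#include <cutlass/gemm/device/gemm.h>

using Gemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,   // A
    float, cutlass::layout::ColumnMajor,   // B
    float, cutlass::layout::ColumnMajor>;  // C

void cutlass_gemm(const float* d_A, const float* d_B, float* d_C,
                  int M, int N, int K) {
    Gemm gemm_op;
    Gemm::Arguments args({M, N, K},
                         {d_A, M}, {d_B, K},  // pointer + leading dimension
                         {d_C, M}, {d_C, M},
                         {1.0f, 0.0f});       // alpha, beta
    gemm_op(args);                            // launches the GEMM kernel
}
```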
- What is it?: "Nickel". Library for multi-GPU communication.
- Difficulty: ⭐⭐ Intermediate
- When to use: Training heavy AI models across 8 GPUs or multiple nodes. It handles
AllReduce(averaging gradients) efficiently over NVLink/Infiniband.
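A hedged sketch of a single-process AllReduce across several GPUs (link with `-lnccl`; the wrapper name is ours and `sendbuffs`/`recvbuffs` are assumed per-GPU device pointers):

```cuda
#include <nccl.h>
#include <cuda_runtime.h>

void all_reduce(float** sendbuffs, float** recvbuffs, size_t count, int ngpus) {
    ncclComm_t comms[8];
    int devs[8];
    for (int i = 0; i < ngpus; ++i) devs[i] = i;
    ncclCommInitAll(comms, ngpus, devs);  // one communicator per GPU

    // Group the calls so NCCL can schedule them as one collective.
    ncclGroupStart();
    for (int i = 0; i < ngpus; ++i) {
        cudaSetDevice(i);
        ncclAllReduce(sendbuffs[i], recvbuffs[i], count,
                      ncclFloat, ncclSum, comms[i], 0);  // default stream
    }
    ncclGroupEnd();

    for (int i = 0; i < ngpus; ++i) ncclCommDestroy(comms[i]);
}
```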
## NVTX

- What is it?: Profiling markers.
- Difficulty: ⭐ Beginner
- When to use: When profiling your app with Nsight Systems. You can "name" parts of your code (e.g., "Image Preprocessing") so they show up as named colored bars in the timeline.
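A sketch of the push/pop pattern that produces those named bars (link with `-lnvToolsExt`; the function name and range labels are illustrative):

```cuda
#include <nvToolsExt.h>

void pipeline_step() {
    nvtxRangePushA("Image Preprocessing");  // opens a named range
    // ... resize / normalize kernels ...
    nvtxRangePop();                         // closes it

    nvtxRangePushA("Inference");
    // ... model execution ...
    nvtxRangePop();
}
```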
## Quick Reference

| Task | Recommendation |
|---|---|
| "I just want to add arrays." | Use cudart (write a simple kernel) or Thrust. |
| "I need to sort a list." | Use Thrust. Do not write a sort kernel. |
| "I need to multiply matrices." | Use cuBLAS. |
| "I'm training a Neural Network." | Use PyTorch/TF (which use cuDNN). |
| "I'm deploying a Neural Network." | Use TensorRT. |
| "I'm profiling my code." | Add NVTX markers. |
| "I need random numbers." | Use cuRAND. |
Happy Computing! 🚀