🌌 The CUDA Ecosystem: A Hitchhiker's Guide for Beginners

Welcome to the vast universe of NVIDIA's CUDA ecosystem. When starting out, it's easy to get overwhelmed by the alphabet soup of libraries (cuBLAS, cuDNN, NCCL, Thrust...).

The Golden Rule: Don't write a kernel if a library already does it faster.

This guide explains What each library is, When to use it, and How it fits into your journey.


🏗️ Core APIs: The Foundation

1. cudart (CUDA Runtime API)

  • What is it?: The standard way to program CUDA. Functions like cudaMalloc, cudaMemcpy, and kernel launches <<<...>>>.
  • Difficulty: ⭐ Beginner
  • When to use: ALWAYS. This is your default starting point for 99% of projects.
  • How: Include <cuda_runtime.h>. (Already verified in this repo's examples!).
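
A minimal sketch of the standard Runtime API pattern (allocate, copy, launch, copy back); the `add` kernel is a hypothetical example, and error checking is omitted for brevity:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical example kernel: element-wise vector addition.
__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = new float[n], *h_b = new float[n], *h_c = new float[n];
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);      // the <<<...>>> launch syntax
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);  // blocks until the kernel finishes

    printf("c[0] = %f\n", h_c[0]);  // 1.0 + 2.0 = 3.0
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    delete[] h_a; delete[] h_b; delete[] h_c;
    return 0;
}
```

Compile with `nvcc`. This allocate / copy / launch / copy-back loop is the skeleton of nearly every Runtime API program.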

2. cuda (CUDA Driver API)

  • What is it?: A lower-level, C-style API that talks directly to the driver (cuCtxCreate, cuMemAlloc).
  • Difficulty: ⭐⭐⭐ Hard
  • When to use: RARELY. Only if you are writing a language binding (like a Python library that talks to the GPU) or need manual context management.
  • Advice: Stick to cudart.
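
For contrast, here is a sketch of the same "allocate some memory" task in the Driver API; everything the Runtime API does implicitly (initialization, device selection, context creation) becomes an explicit call:

```cuda
#include <cuda.h>   // Driver API header (note: not cuda_runtime.h)
#include <cstdio>

int main() {
    // Explicit setup that cudart would do for you automatically.
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);

    // Allocations use opaque CUdeviceptr handles instead of typed pointers.
    CUdeviceptr dptr;
    cuMemAlloc(&dptr, 1024);

    cuMemFree(dptr);
    cuCtxDestroy(ctx);
    return 0;
}
```

Link against the driver library (`-lcuda`). Compare the ceremony here with the cudart example above: this is why "Stick to cudart" is the advice.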

3. nvrtc (NVRTC - Runtime Compilation)

  • What is it?: A compiler library that lets you compile CUDA C++ code strings into PTX (assembly) while your program is running.
  • Difficulty: ⭐⭐⭐ Advanced
  • When to use: If you are building an app where the user defines the math logic at runtime (like a database query engine or Python's Numba/CuPy).
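
A sketch of the NVRTC flow: compile a CUDA C++ source string to PTX at runtime. The `scale` kernel string is a hypothetical example:

```cuda
#include <nvrtc.h>
#include <cstdio>
#include <vector>

int main() {
    // A kernel defined as a plain string -- it could come from user input.
    const char *src =
        "extern \"C\" __global__ void scale(float *x, float s) {"
        "  x[threadIdx.x] *= s;"
        "}";

    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "scale.cu", 0, nullptr, nullptr);
    nvrtcResult res = nvrtcCompileProgram(prog, 0, nullptr);

    if (res == NVRTC_SUCCESS) {
        size_t ptxSize;
        nvrtcGetPTXSize(prog, &ptxSize);
        std::vector<char> ptx(ptxSize);
        nvrtcGetPTX(prog, ptx.data());
        // The PTX can now be loaded with the Driver API (cuModuleLoadData)
        // and launched with cuLaunchKernel.
        printf("Compiled %zu bytes of PTX\n", ptxSize);
    }
    nvrtcDestroyProgram(&prog);
    return 0;
}
```

Link with `-lnvrtc`. Note that launching the compiled PTX requires the Driver API, so NVRTC and `cuda` usually travel together.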

4. cudadevrt (CUDA Device Runtime)

  • What is it?: Enables Dynamic Parallelism (launching new kernels from inside a kernel).
  • Difficulty: ⭐⭐ Intermediate
  • When to use: When your algorithm is recursive (like Graph traversal or Quadtrees) and you don't know how much work is needed until you are already on the GPU.
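
A minimal sketch of Dynamic Parallelism, where the decision to launch more work is made on the GPU itself:

```cuda
#include <cstdio>

// Child kernel: handles one chunk of newly discovered work.
__global__ void child(int depth) {
    printf("child at depth %d, thread %d\n", depth, threadIdx.x);
}

// Parent kernel: decides ON THE GPU how much more work to launch.
__global__ void parent(int depth) {
    if (depth < 2) {
        // A kernel launching another kernel -- this is Dynamic Parallelism.
        child<<<1, 4>>>(depth + 1);
    }
}

int main() {
    parent<<<1, 1>>>(0);
    cudaDeviceSynchronize();
    return 0;
}
```

This must be compiled with relocatable device code enabled, e.g. `nvcc -rdc=true`, which links in cudadevrt.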

🧮 Math & Linear Algebra (The "cu" Suite)

5. cuBLAS (Basic Linear Algebra Subprograms)

  • What is it?: The GPU standard for Matrix Multiplication (GEMM) and vector math.
  • Difficulty: ⭐ Beginner/Intermediate
  • When to use: Anytime you need Matrix x Matrix or Matrix x Vector. Do not write your own MatMul kernel unless it's for learning; cuBLAS is hand-tuned by wizards.
  • How: cublasCreate(), cublasSgemm().
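
A sketch of those two calls in action, multiplying a 2x2 matrix by the identity. One gotcha worth internalizing early: cuBLAS assumes column-major storage (a Fortran heritage), so row-major C/C++ users often swap or transpose operands:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int n = 2;                    // tiny 2x2 matrices for brevity
    float hA[] = {1, 2, 3, 4};          // column-major!
    float hB[] = {1, 0, 0, 1};          // identity matrix
    float hC[4] = {0};

    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(hA)); cudaMalloc(&dB, sizeof(hB)); cudaMalloc(&dC, sizeof(hC));
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // C = alpha * A * B + beta * C
    float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

    cudaMemcpy(hC, dC, sizeof(hC), cudaMemcpyDeviceToHost);
    printf("C[0] = %f\n", hC[0]);       // A * I = A, so expect 1.0

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

Link with `-lcublas`.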

6. cuBLASLt (Lightweight)

  • What is it?: A more flexible, lower-level version of cuBLAS for advanced matrix math (INT8, FP8, mixed precision).
  • Difficulty: ⭐⭐⭐ Advanced
  • When to use: If you are optimizing large LLMs or need very specific fusion/quantization support that standard cuBLAS lacks.

7. cuSPARSE

  • What is it?: Linear algebra for Sparse Matrices (matrices where most values are zero).
  • Difficulty: ⭐⭐ Intermediate
  • When to use: Physics simulations, Graph algorithms, or recommender systems where storing 0s is a waste of RAM.
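
A sketch of a sparse matrix-vector multiply (y = A·x) using the generic SpMV API found in recent CUDA toolkits; the matrix is stored in CSR format so the zeros cost nothing. Setup is verbose and cleanup is omitted here for brevity:

```cuda
#include <cusparse.h>
#include <cuda_runtime.h>

int main() {
    // A tiny 2x2 sparse matrix in CSR format: [[1, 0], [0, 2]].
    int   hRowPtr[] = {0, 1, 2};    // row i's entries live in [rowPtr[i], rowPtr[i+1])
    int   hColInd[] = {0, 1};       // column index of each stored value
    float hVals[]   = {1.0f, 2.0f}; // only the non-zeros are stored
    float hX[] = {1.0f, 1.0f}, hY[2] = {0};

    int *dRowPtr, *dColInd; float *dVals, *dX, *dY;
    cudaMalloc(&dRowPtr, sizeof(hRowPtr)); cudaMalloc(&dColInd, sizeof(hColInd));
    cudaMalloc(&dVals, sizeof(hVals)); cudaMalloc(&dX, sizeof(hX)); cudaMalloc(&dY, sizeof(hY));
    cudaMemcpy(dRowPtr, hRowPtr, sizeof(hRowPtr), cudaMemcpyHostToDevice);
    cudaMemcpy(dColInd, hColInd, sizeof(hColInd), cudaMemcpyHostToDevice);
    cudaMemcpy(dVals, hVals, sizeof(hVals), cudaMemcpyHostToDevice);
    cudaMemcpy(dX, hX, sizeof(hX), cudaMemcpyHostToDevice);

    cusparseHandle_t handle; cusparseCreate(&handle);
    cusparseSpMatDescr_t matA; cusparseDnVecDescr_t vecX, vecY;
    cusparseCreateCsr(&matA, 2, 2, 2, dRowPtr, dColInd, dVals,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
    cusparseCreateDnVec(&vecX, 2, dX, CUDA_R_32F);
    cusparseCreateDnVec(&vecY, 2, dY, CUDA_R_32F);

    // y = alpha * A * x + beta * y. Query the scratch size, then run.
    float alpha = 1.0f, beta = 0.0f;
    size_t bufSize; void *dBuf;
    cusparseSpMV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                            &alpha, matA, vecX, &beta, vecY, CUDA_R_32F,
                            CUSPARSE_SPMV_ALG_DEFAULT, &bufSize);
    cudaMalloc(&dBuf, bufSize);
    cusparseSpMV(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                 &alpha, matA, vecX, &beta, vecY, CUDA_R_32F,
                 CUSPARSE_SPMV_ALG_DEFAULT, dBuf);
    // dY now holds {1.0, 2.0}. (Descriptor/memory cleanup omitted.)
    return 0;
}
```

Link with `-lcusparse`. The descriptor-plus-buffer pattern (describe operands, query workspace, execute) recurs throughout the library.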

8. cuSOLVER

  • What is it?: High-level linear algebra solvers (Eigenvalues, SVD, Cholesky decomposition).
  • Difficulty: ⭐⭐ Intermediate
  • When to use: Scientific computing applications needing complex matrix factorizations, not just multiplication.
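
A sketch of a Cholesky factorization with the dense cuSOLVER API; like much of the "cu" suite, it uses a query-workspace-then-execute pattern:

```cuda
#include <cusolverDn.h>
#include <cuda_runtime.h>

int main() {
    // Factor a 2x2 symmetric positive-definite matrix (column-major).
    const int n = 2;
    float hA[] = {4.0f, 2.0f, 2.0f, 3.0f};
    float *dA; int *dInfo;
    cudaMalloc(&dA, sizeof(hA));
    cudaMalloc(&dInfo, sizeof(int));
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);

    cusolverDnHandle_t handle;
    cusolverDnCreate(&handle);

    // Query workspace size, allocate it, then factor in-place: A = L * L^T.
    int lwork;
    cusolverDnSpotrf_bufferSize(handle, CUBLAS_FILL_MODE_LOWER, n, dA, n, &lwork);
    float *dWork; cudaMalloc(&dWork, lwork * sizeof(float));
    cusolverDnSpotrf(handle, CUBLAS_FILL_MODE_LOWER, n, dA, n, dWork, lwork, dInfo);

    int info; cudaMemcpy(&info, dInfo, sizeof(int), cudaMemcpyDeviceToHost);
    // info == 0 means success; dA now holds L in its lower triangle.
    cusolverDnDestroy(handle);
    cudaFree(dA); cudaFree(dInfo); cudaFree(dWork);
    return info;
}
```

Link with `-lcusolver`.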

9. cuFFT

  • What is it?: Fast Fourier Transform (Signal Processing).
  • Difficulty: ⭐ Beginner/Intermediate
  • When to use: Audio processing, fluid dynamics, image filtering. It is incredibly fast.
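
A sketch of a 1D complex-to-complex FFT. The key habit: plan once, execute many times, because planning is the expensive part:

```cuda
#include <cufft.h>
#include <cuda_runtime.h>

int main() {
    const int N = 1024;
    cufftComplex *d_signal;
    cudaMalloc(&d_signal, N * sizeof(cufftComplex));
    // (fill d_signal with your samples here)

    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);                    // 1 batch of N complex points
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);  // in-place forward FFT

    cufftDestroy(plan);
    cudaFree(d_signal);
    return 0;
}
```

Link with `-lcufft`. Batched plans (the last argument) let you transform thousands of signals in one call.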

10. cuRAND

  • What is it?: Random Number Generation on GPU.
  • Difficulty: ⭐ Beginner
  • When to use: Monte Carlo simulations or initializing Neural Network weights. Generates billions of random numbers in parallel.
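
A sketch using the host-side cuRAND API, where a single call fills an entire device buffer with random values in parallel:

```cuda
#include <curand.h>
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 20;
    float *d_vals;
    cudaMalloc(&d_vals, n * sizeof(float));

    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);  // fixed seed -> reproducible runs
    curandGenerateUniform(gen, d_vals, n);             // a million uniforms, one call

    curandDestroyGenerator(gen);
    cudaFree(d_vals);
    return 0;
}
```

Link with `-lcurand`. There is also a device-side API (`curand_kernel.h`) for generating numbers inside your own kernels.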

11. cuTENSOR

  • What is it?: High-performance tensor contractions (multi-dimensional matrix math).
  • Difficulty: ⭐⭐ Intermediate
  • When to use: If you are doing math on 4D, 5D arrays (common in physics and deep learning) and standard matrix math isn't enough.

🧠 Deep Learning & AI

12. cuDNN (CUDA Deep Neural Network library)

  • What is it?: The engine room of Deep Learning. Provides primitives like Convolutions, Pooling, Softmax, Attention.
  • Difficulty: ⭐⭐⭐ Hard
  • When to use: If you are writing your own Deep Learning framework (like PyTorch/TensorFlow) from scratch.
  • Note: Most users use PyTorch/TF, which call cuDNN for you.
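
To show the flavor of the library, here is a sketch of a ReLU activation using cuDNN's descriptor-based style (based on the long-standing activation API; newer cuDNN versions also offer a graph API). Notice how much of the code just *describes* the tensors and the op:

```cuda
#include <cudnn.h>
#include <cuda_runtime.h>

int main() {
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    // Describe the tensor: 1 image, 1 channel, 1x4 spatial dims (NCHW layout).
    cudnnTensorDescriptor_t desc;
    cudnnCreateTensorDescriptor(&desc);
    cudnnSetTensor4dDescriptor(desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               1, 1, 1, 4);

    // Describe the op: ReLU activation.
    cudnnActivationDescriptor_t act;
    cudnnCreateActivationDescriptor(&act);
    cudnnSetActivationDescriptor(act, CUDNN_ACTIVATION_RELU,
                                 CUDNN_PROPAGATE_NAN, 0.0);

    float hX[] = {-1.0f, 2.0f, -3.0f, 4.0f};
    float *dX, *dY;
    cudaMalloc(&dX, sizeof(hX)); cudaMalloc(&dY, sizeof(hX));
    cudaMemcpy(dX, hX, sizeof(hX), cudaMemcpyHostToDevice);

    float alpha = 1.0f, beta = 0.0f;
    cudnnActivationForward(handle, act, &alpha, desc, dX, &beta, desc, dY);
    // dY now holds {0, 2, 0, 4}.

    cudnnDestroyActivationDescriptor(act);
    cudnnDestroyTensorDescriptor(desc);
    cudnnDestroy(handle);
    cudaFree(dX); cudaFree(dY);
    return 0;
}
```

Link with `-lcudnn`. PyTorch and TensorFlow wrap exactly this kind of boilerplate for you.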

13. TensorRT

  • What is it?: An Inference Optimizer. It takes a trained model and optimizes it to run as fast as possible on specific hardware.
  • Difficulty: ⭐⭐ Intermediate
  • When to use: Deploying a model to production (robotics, cloud service) where latency (ms) matters.

⚡ C++ Productivity & Abstractions

14. Thrust

  • What is it?: The "C++ STL for CUDA".
  • Difficulty: ⭐ Beginner (Easiest!)
  • When to use: Start here! If you need to Sort, Reduce (sum), Scan, or Transform vectors.
  • Why: You can write a blazing fast GPU sort in 5 lines of code.
    thrust::sort(d_vec.begin(), d_vec.end());
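
That one-liner expands into a complete program like this sketch; `device_vector` owns GPU memory, and the algorithms run on the device automatically:

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdio>

int main() {
    // device_vector allocates on the GPU; element assignment copies host <-> device.
    thrust::device_vector<int> d_vec(4);
    d_vec[0] = 3; d_vec[1] = 1; d_vec[2] = 4; d_vec[3] = 2;

    thrust::sort(d_vec.begin(), d_vec.end());             // GPU sort, no kernel written
    int sum = thrust::reduce(d_vec.begin(), d_vec.end()); // parallel sum

    printf("min = %d, sum = %d\n", (int)d_vec[0], sum);   // min = 1, sum = 10
    return 0;
}
```

Thrust ships with the CUDA toolkit, so `nvcc` compiles this with no extra flags.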

15. CUB (CUDA Unbound)

  • What is it?: Building blocks for writing your own kernels. Provides efficient implementation of Block/Warp level operations (WarpReduce, BlockScan).
  • Difficulty: ⭐⭐⭐ Advanced
  • When to use: When writing custom kernels but you want robust reusable components for parallel primitives inside the block.
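
A sketch of the idea inside a custom kernel: each block of 128 threads cooperatively sums its values with `cub::BlockReduce` instead of a hand-rolled shared-memory tree:

```cuda
#include <cub/cub.cuh>

// Each 128-thread block reduces its threads' values to a single sum.
__global__ void block_sum(const float *in, float *out) {
    using BlockReduce = cub::BlockReduce<float, 128>;
    __shared__ typename BlockReduce::TempStorage temp;  // CUB's shared scratch space

    float val = in[blockIdx.x * blockDim.x + threadIdx.x];
    float sum = BlockReduce(temp).Sum(val);             // block-wide reduction

    if (threadIdx.x == 0) out[blockIdx.x] = sum;        // thread 0 holds the result
}
```

You would launch this with 128 threads per block; CUB picks a near-optimal reduction strategy for the target architecture so you don't have to.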

16. CUTLASS

  • What is it?: Templates for high-performance matrix multiplication.
  • Difficulty: ⭐⭐⭐⭐ Expert
  • When to use: If you need to write a custom Matrix Multiplication layer that does something weird (like fused activation) inside the inner loop and cuBLAS doesn't support it.

🌍 Scale & Optimization

17. NCCL (NVIDIA Collective Communications Library)

  • What is it?: "Nickel". Library for multi-GPU communication.
  • Difficulty: ⭐⭐ Intermediate
  • When to use: Training heavy AI models across 8 GPUs or multiple nodes. It handles AllReduce (averaging gradients) efficiently over NVLink/InfiniBand.
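
A sketch of single-process multi-GPU setup; the AllReduce call itself is shown as a comment because the send/receive buffers would need real per-GPU allocations:

```cuda
#include <nccl.h>
#include <cuda_runtime.h>

int main() {
    // One communicator per GPU in a single process (the simplest topology).
    const int nDev = 2;
    int devs[nDev] = {0, 1};
    ncclComm_t comms[nDev];
    ncclCommInitAll(comms, nDev, devs);

    // In a real trainer, each GPU's sendbuff holds its local gradients, and
    // AllReduce sums them across all GPUs (divide by nDev to average):
    //   ncclAllReduce(sendbuff, recvbuff, count, ncclFloat, ncclSum,
    //                 comms[i], stream);

    for (int i = 0; i < nDev; ++i) ncclCommDestroy(comms[i]);
    return 0;
}
```

Link with `-lnccl`. Multi-node jobs replace `ncclCommInitAll` with `ncclCommInitRank` plus an out-of-band exchange of a `ncclUniqueId`.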

18. NVTX (NVIDIA Tools Extension)

  • What is it?: Profiling markers.
  • Difficulty: ⭐ Beginner
  • When to use: When profiling your app with Nsight Systems. You can "name" parts of your code (e.g., "Image Preprocessing") so they show up as named colored bars in the timeline.
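
A sketch of the push/pop pattern; `preprocess` and `infer` are hypothetical stand-ins for your own code. Each push/pop pair becomes a named, colored bar in the Nsight Systems timeline:

```cuda
#include <nvtx3/nvToolsExt.h>   // NVTX header shipped with recent CUDA toolkits

void preprocess() { /* your CPU or GPU work here */ }
void infer()      { /* your CPU or GPU work here */ }

int main() {
    nvtxRangePushA("Image Preprocessing");
    preprocess();
    nvtxRangePop();

    nvtxRangePushA("Inference");
    infer();
    nvtxRangePop();
    return 0;
}
```

NVTX is header-based and essentially free when no profiler is attached, so the markers can stay in production code.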

🧭 Summary: Which one do I choose?

| Task | Recommendation |
| --- | --- |
| "I just want to add arrays." | Use cudart (write a simple kernel) or Thrust. |
| "I need to sort a list." | Use Thrust. Do not write a sort kernel. |
| "I need to multiply matrices." | Use cuBLAS. |
| "I'm training a Neural Network." | Use PyTorch/TF (which use cuDNN). |
| "I'm deploying a Neural Network." | Use TensorRT. |
| "I'm profiling my code." | Add NVTX markers. |
| "I need random numbers." | Use cuRAND. |

Happy Computing! 🚀