Welcome to the vast universe of NVIDIA's CUDA ecosystem. When starting out, it's easy to get overwhelmed by the alphabet soup of libraries (cuBLAS, cuDNN, NCCL, Thrust...).
The Golden Rule: Don't write a kernel if a library already does it faster.
This guide explains What each library is, When to use it, and How it fits into your journey.
## CUDA Runtime API (`cudart`)

- What is it?: The standard way to program CUDA. Functions like `cudaMalloc`, `cudaMemcpy`, and kernel launches with `<<<...>>>`.
- Difficulty: ⭐ Beginner
- When to use: ALWAYS. This is your default starting point for 99% of projects.
- How: Include `<cuda_runtime.h>`. (Already verified in this repo's examples!)
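To make the workflow concrete, here is a minimal vector-add sketch using only `cudart` calls (compile with `nvcc`; error checking omitted for brevity):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = new float[n], *h_b = new float[n], *h_c = new float[n];
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);  // kernel launch

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);  // 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    delete[] h_a; delete[] h_b; delete[] h_c;
}
```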
## CUDA Driver API

- What is it?: A lower-level, C-style API that talks directly to the driver (`cuCtxCreate`, `cuMemAlloc`).
- Difficulty: ⭐⭐⭐ Hard
- When to use: RARELY. Only if you are writing a language binding (like a Python library that talks to the GPU) or need manual context management.
- Advice: Stick to `cudart`.
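For a sense of why this is rated Hard, here is a sketch of the boilerplate a Driver API program needs before it can even allocate memory, all of which `cudart` does for you lazily (error checking omitted):

```cuda
#include <cuda.h>

int main() {
    cuInit(0);                  // must precede every other Driver API call
    CUdevice dev;
    cuDeviceGet(&dev, 0);       // pick GPU 0 explicitly
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);  // manual context management

    CUdeviceptr d_buf;
    cuMemAlloc(&d_buf, 1024);   // Driver API equivalent of cudaMalloc
    cuMemFree(d_buf);
    cuCtxDestroy(ctx);
}
```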
## NVRTC (Runtime Compilation)

- What is it?: A compiler library that lets you compile CUDA C++ source strings into PTX (GPU assembly) while your program is running.
- Difficulty: ⭐⭐⭐ Advanced
- When to use: If you are building an app where the user defines the math logic at runtime (like a database query engine, or Python's Numba/CuPy).
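A sketch of the round trip: hand NVRTC a source string, get PTX back. Loading the PTX (via the Driver API's `cuModuleLoadData`) is omitted here; link with `-lnvrtc`:

```cuda
#include <nvrtc.h>
#include <cstdio>
#include <vector>

int main() {
    const char* src =
        "extern \"C\" __global__ void scale(float* x, float s) {"
        "  x[threadIdx.x] *= s;"
        "}";

    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "scale.cu", 0, nullptr, nullptr);
    nvrtcCompileProgram(prog, 0, nullptr);  // pass options like "-arch=sm_80" here

    size_t ptxSize;
    nvrtcGetPTXSize(prog, &ptxSize);
    std::vector<char> ptx(ptxSize);
    nvrtcGetPTX(prog, ptx.data());          // PTX ready for cuModuleLoadData
    printf("%s", ptx.data());

    nvrtcDestroyProgram(&prog);
}
```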
## CUDA Dynamic Parallelism

- What is it?: Enables Dynamic Parallelism (launching new kernels from inside a kernel).
- Difficulty: ⭐⭐ Intermediate
- When to use: When your algorithm is recursive (like graph traversal or quadtrees) and you don't know how much work is needed until you are already on the GPU.
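A minimal sketch of the idea: a parent kernel decides, on the GPU, how much child work to launch. Compile with relocatable device code (`nvcc -rdc=true`):

```cuda
__global__ void child(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

__global__ void parent(int* data, int n) {
    if (threadIdx.x == 0) {
        // The grid size for the child can be computed here, on the device,
        // without a round trip to the CPU.
        child<<<(n + 127) / 128, 128>>>(data, n);
    }
}
```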
## cuBLAS

- What is it?: The GPU standard for matrix multiplication (`GEMM`) and vector math.
- Difficulty: ⭐ Beginner/Intermediate
- When to use: Anytime you need `Matrix x Matrix` or `Matrix x Vector`. Do not write your own MatMul kernel unless it's for learning; cuBLAS is hand-tuned by wizards.
- How: `cublasCreate()`, `cublasSgemm()`.
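A sketch of those two calls computing C = alpha·A·B + beta·C (link with `-lcublas`; the `gemm` wrapper name is ours, and `d_A`/`d_B`/`d_C` are assumed to be device pointers already filled with data):

```cuda
#include <cublas_v2.h>

void gemm(const float* d_A, const float* d_B, float* d_C, int m, int n, int k) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // cuBLAS expects column-major storage (BLAS heritage);
    // leading dimensions here are the matrices' row counts.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha, d_A, m, d_B, k,
                &beta, d_C, m);

    cublasDestroy(handle);
}
```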
## cuBLASLt

- What is it?: A more flexible, lower-level version of cuBLAS for advanced matrix math (INT8, FP8, mixed precision).
- Difficulty: ⭐⭐⭐ Advanced
- When to use: If you are optimizing large LLMs or need very specific fusion/quantization support that standard cuBLAS lacks.
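A hedged sketch of the extra machinery cuBLASLt adds over `cublasSgemm`: explicit descriptors for the operation and for each matrix layout, which is what unlocks the mixed-precision and fusion options. Link with `-lcublasLt`; the wrapper name is ours and device pointers are assumed:

```cuda
#include <cublasLt.h>

void lt_gemm(const float* d_A, const float* d_B, float* d_C, int m, int n, int k) {
    cublasLtHandle_t handle;
    cublasLtCreate(&handle);

    cublasLtMatmulDesc_t op;
    cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_32F, CUDA_R_32F);

    cublasLtMatrixLayout_t A, B, C;  // column-major layouts
    cublasLtMatrixLayoutCreate(&A, CUDA_R_32F, m, k, m);
    cublasLtMatrixLayoutCreate(&B, CUDA_R_32F, k, n, k);
    cublasLtMatrixLayoutCreate(&C, CUDA_R_32F, m, n, m);

    const float alpha = 1.0f, beta = 0.0f;
    cublasLtMatmul(handle, op, &alpha, d_A, A, d_B, B,
                   &beta, d_C, C, d_C, C,
                   nullptr, nullptr, 0, 0);  // default algo, no workspace, default stream

    cublasLtMatrixLayoutDestroy(A); cublasLtMatrixLayoutDestroy(B);
    cublasLtMatrixLayoutDestroy(C); cublasLtMatmulDescDestroy(op);
    cublasLtDestroy(handle);
}
```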
## cuSPARSE

- What is it?: Linear algebra for sparse matrices (matrices where most values are zero).
- Difficulty: ⭐⭐ Intermediate
- When to use: Physics simulations, graph algorithms, or recommender systems where storing the zeros is a waste of RAM.
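A hedged sketch of a sparse matrix-vector product y = A·x with cuSPARSE's generic SpMV API, with A stored in CSR format. Link with `-lcusparse`; the wrapper name is ours and the device arrays are assumed to be populated:

```cuda
#include <cusparse.h>
#include <cuda_runtime.h>

void spmv(int rows, int cols, int nnz,
          int* d_rowPtr, int* d_colInd, float* d_vals,
          float* d_x, float* d_y) {
    cusparseHandle_t handle;
    cusparseCreate(&handle);

    cusparseSpMatDescr_t matA;  // describe A in CSR form
    cusparseCreateCsr(&matA, rows, cols, nnz, d_rowPtr, d_colInd, d_vals,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
    cusparseDnVecDescr_t vecX, vecY;
    cusparseCreateDnVec(&vecX, cols, d_x, CUDA_R_32F);
    cusparseCreateDnVec(&vecY, rows, d_y, CUDA_R_32F);

    const float alpha = 1.0f, beta = 0.0f;
    size_t bufSize; void* buf;
    cusparseSpMV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                            &alpha, matA, vecX, &beta, vecY, CUDA_R_32F,
                            CUSPARSE_SPMV_ALG_DEFAULT, &bufSize);
    cudaMalloc(&buf, bufSize);
    cusparseSpMV(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                 &alpha, matA, vecX, &beta, vecY, CUDA_R_32F,
                 CUSPARSE_SPMV_ALG_DEFAULT, buf);

    cudaFree(buf);
    cusparseDestroySpMat(matA);
    cusparseDestroyDnVec(vecX); cusparseDestroyDnVec(vecY);
    cusparseDestroy(handle);
}
```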
## cuSOLVER

- What is it?: High-level linear algebra solvers (eigenvalues, SVD, Cholesky decomposition).
- Difficulty: ⭐⭐ Intermediate
- When to use: Scientific computing applications needing complex matrix factorizations, not just multiplication.
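A hedged sketch of one of those factorizations: Cholesky (`potrf` in LAPACK naming) of an n x n symmetric positive-definite matrix via cuSOLVER's dense API. Link with `-lcusolver`; the wrapper name is ours and `d_A` is a column-major device matrix:

```cuda
#include <cusolverDn.h>
#include <cuda_runtime.h>

void cholesky(float* d_A, int n) {
    cusolverDnHandle_t handle;
    cusolverDnCreate(&handle);

    int lwork;  // query workspace size first
    cusolverDnSpotrf_bufferSize(handle, CUBLAS_FILL_MODE_LOWER, n, d_A, n, &lwork);

    float* d_work; int* d_info;
    cudaMalloc(&d_work, lwork * sizeof(float));
    cudaMalloc(&d_info, sizeof(int));

    // Overwrites the lower triangle of d_A with the Cholesky factor L.
    cusolverDnSpotrf(handle, CUBLAS_FILL_MODE_LOWER, n, d_A, n, d_work, lwork, d_info);

    cudaFree(d_work); cudaFree(d_info);
    cusolverDnDestroy(handle);
}
```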
## cuFFT

- What is it?: Fast Fourier Transforms (signal processing).
- Difficulty: ⭐ Beginner/Intermediate
- When to use: Audio processing, fluid dynamics, image filtering. It is incredibly fast.
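The plan/execute pattern is the whole API in miniature; here is a sketch of an in-place 1D complex-to-complex transform (link with `-lcufft`; the wrapper name is ours):

```cuda
#include <cufft.h>

void fft_forward(cufftComplex* d_signal, int N) {
    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);                    // 1 batch of size N
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);  // in-place forward FFT
    cufftDestroy(plan);
}
```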
## cuRAND

- What is it?: Random number generation on the GPU.
- Difficulty: ⭐ Beginner
- When to use: Monte Carlo simulations or initializing neural network weights. Generates billions of random numbers in parallel.
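A sketch of the host-side API filling a device buffer with uniform floats (link with `-lcurand`; the wrapper name is ours):

```cuda
#include <curand.h>

void fill_uniform(float* d_data, size_t n) {
    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);  // fixed seed for reproducibility
    curandGenerateUniform(gen, d_data, n);             // n floats in (0, 1]
    curandDestroyGenerator(gen);
}
```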
## cuTENSOR

- What is it?: High-performance tensor contractions (multi-dimensional matrix math).
- Difficulty: ⭐⭐ Intermediate
- When to use: If you are doing math on 4D or 5D arrays (common in physics and deep learning) and standard matrix math isn't enough.
## cuDNN

- What is it?: The engine room of Deep Learning. Provides primitives like Convolutions, Pooling, Softmax, and Attention.
- Difficulty: ⭐⭐⭐ Hard
- When to use: If you are writing your own Deep Learning framework (like PyTorch/TensorFlow) from scratch.
- Note: Most users use PyTorch/TF, which call cuDNN for you.
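A hedged sketch of a single primitive, a ReLU activation over an NCHW tensor, the kind of call PyTorch/TF issue on your behalf. Link with `-lcudnn`; the wrapper name is ours and `d_in`/`d_out` are device buffers of n*c*h*w floats:

```cuda
#include <cudnn.h>

void relu(float* d_in, float* d_out, int n, int c, int h, int w) {
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    cudnnTensorDescriptor_t desc;  // describe the tensor shape and layout
    cudnnCreateTensorDescriptor(&desc);
    cudnnSetTensor4dDescriptor(desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, c, h, w);

    cudnnActivationDescriptor_t act;
    cudnnCreateActivationDescriptor(&act);
    cudnnSetActivationDescriptor(act, CUDNN_ACTIVATION_RELU, CUDNN_NOT_PROPAGATE_NAN, 0.0);

    const float alpha = 1.0f, beta = 0.0f;
    cudnnActivationForward(handle, act, &alpha, desc, d_in, &beta, desc, d_out);

    cudnnDestroyActivationDescriptor(act);
    cudnnDestroyTensorDescriptor(desc);
    cudnnDestroy(handle);
}
```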
## TensorRT

- What is it?: An inference optimizer. It takes a trained model and optimizes it to run as fast as possible on specific hardware.
- Difficulty: ⭐⭐ Intermediate
- When to use: Deploying a model to production (robotics, cloud services) where latency (ms) matters.
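A hedged sketch of the deployment side: load a pre-built engine file and deserialize it for inference. Building the engine (from ONNX, say) is a separate offline step; link with `-lnvinfer`, and note the helper names here are ours:

```cuda
#include <NvInfer.h>
#include <cstdio>
#include <fstream>
#include <vector>

// TensorRT requires the caller to supply a logger.
class Logger : public nvinfer1::ILogger {
    void log(Severity sev, const char* msg) noexcept override {
        if (sev <= Severity::kWARNING) printf("%s\n", msg);
    }
} gLogger;

nvinfer1::ICudaEngine* load_engine(const char* path) {
    std::ifstream f(path, std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(f)),
                            std::istreambuf_iterator<char>());

    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(gLogger);
    return runtime->deserializeCudaEngine(blob.data(), blob.size());
}
```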
- What is it?: The "C++ STL for CUDA".
- Difficulty: ⭐ Beginner (Easiest!)
- When to use: Start here! If you need to
Sort,Reduce(sum),Scan, orTransformvectors. - Why: You can write a blazing fast GPU sort in 5 lines of code.
thrust::sort(d_vec.begin(), d_vec.end());
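Here is that one-liner in context, a small complete sketch that sorts and sums a device vector:

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdio>

int main() {
    thrust::device_vector<int> d_vec(4);  // lives in GPU memory
    d_vec[0] = 3; d_vec[1] = 1; d_vec[2] = 4; d_vec[3] = 1;

    thrust::sort(d_vec.begin(), d_vec.end());              // GPU sort
    int sum = thrust::reduce(d_vec.begin(), d_vec.end());  // parallel sum = 9

    printf("min=%d sum=%d\n", (int)d_vec[0], sum);
}
```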
## CUB

- What is it?: Building blocks for writing your own kernels. Provides efficient implementations of block- and warp-level operations (`WarpReduce`, `BlockScan`).
- Difficulty: ⭐⭐⭐ Advanced
- When to use: When writing custom kernels but you want robust, reusable components for parallel primitives inside the block.
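A sketch of what that looks like inside a custom kernel: a block-wide sum via CUB's `BlockReduce` (the kernel name is ours):

```cuda
#include <cub/cub.cuh>

__global__ void block_sum(const int* in, int* out) {
    // Specialize for 128 threads per block; CUB picks an efficient algorithm.
    typedef cub::BlockReduce<int, 128> BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp;

    int thread_val  = in[blockIdx.x * 128 + threadIdx.x];
    int block_total = BlockReduce(temp).Sum(thread_val);  // valid in thread 0 only

    if (threadIdx.x == 0) out[blockIdx.x] = block_total;
}
```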
## CUTLASS

- What is it?: Templates for high-performance matrix multiplication.
- Difficulty: ⭐⭐⭐⭐ Expert
- When to use: If you need to write a custom matrix multiplication layer that does something weird (like a fused activation) inside the inner loop and cuBLAS doesn't support it.
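A hedged sketch (CUTLASS 2.x style) of instantiating the device-level GEMM template with default settings; in real use, the point of CUTLASS is that you would swap in a custom epilogue here. The wrapper name is ours:

```cuda
#include <cutlass/gemm/device/gemm.h>

using Gemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,   // A
    float, cutlass::layout::ColumnMajor,   // B
    float, cutlass::layout::ColumnMajor>;  // C

void cutlass_gemm(const float* d_A, const float* d_B, float* d_C,
                  int M, int N, int K) {
    Gemm gemm_op;
    Gemm::Arguments args({M, N, K},
                         {d_A, M}, {d_B, K},  // pointer + leading dimension
                         {d_C, M}, {d_C, M},
                         {1.0f, 0.0f});       // alpha, beta
    gemm_op(args);                            // launches the GEMM kernel
}
```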
- What is it?: "Nickel". Library for multi-GPU communication.
- Difficulty: ⭐⭐ Intermediate
- When to use: Training heavy AI models across 8 GPUs or multiple nodes. It handles
AllReduce(averaging gradients) efficiently over NVLink/Infiniband.
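A hedged sketch of a single-process AllReduce across several GPUs (link with `-lnccl`; the wrapper name is ours and `sendbuffs`/`recvbuffs` are assumed per-GPU device pointers):

```cuda
#include <nccl.h>
#include <cuda_runtime.h>

void all_reduce(float** sendbuffs, float** recvbuffs, size_t count, int ngpus) {
    ncclComm_t comms[8];
    int devs[8];
    for (int i = 0; i < ngpus; ++i) devs[i] = i;
    ncclCommInitAll(comms, ngpus, devs);  // one communicator per GPU

    // Group the calls so NCCL can schedule them as one collective.
    ncclGroupStart();
    for (int i = 0; i < ngpus; ++i) {
        cudaSetDevice(i);
        ncclAllReduce(sendbuffs[i], recvbuffs[i], count,
                      ncclFloat, ncclSum, comms[i], 0);  // default stream
    }
    ncclGroupEnd();

    for (int i = 0; i < ngpus; ++i) ncclCommDestroy(comms[i]);
}
```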
## NVTX

- What is it?: Profiling markers.
- Difficulty: ⭐ Beginner
- When to use: When profiling your app with Nsight Systems. You can "name" parts of your code (e.g., "Image Preprocessing") so they show up as named colored bars in the timeline.
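A sketch of the push/pop pattern that produces those named bars (link with `-lnvToolsExt`; the function name and range labels are illustrative):

```cuda
#include <nvToolsExt.h>

void pipeline_step() {
    nvtxRangePushA("Image Preprocessing");  // opens a named range
    // ... resize / normalize kernels ...
    nvtxRangePop();                         // closes it

    nvtxRangePushA("Inference");
    // ... model execution ...
    nvtxRangePop();
}
```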
## Quick Reference

| Task | Recommendation |
|---|---|
| "I just want to add arrays." | Use cudart (write a simple kernel) or Thrust. |
| "I need to sort a list." | Use Thrust. Do not write a sort kernel. |
| "I need to multiply matrices." | Use cuBLAS. |
| "I'm training a Neural Network." | Use PyTorch/TF (which use cuDNN). |
| "I'm deploying a Neural Network." | Use TensorRT. |
| "I'm profiling my code." | Add NVTX markers. |
| "I need random numbers." | Use cuRAND. |
Happy Computing! 🚀