Note: CUDA build/run requires a Linux/Windows machine with an NVIDIA GPU + CUDA Toolkit. macOS is not supported as a CUDA target environment.
Minimal CUDA micro-bench utilities:
cuda_kernel_benchmark: simple kernel timing -> CSV
memcpy_benchmark: HtoD/DtoH bandwidth sweep -> CSV
latency_profiler: tiny latency (kernel launch + tiny memcpy)
mkdir -p build
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
Binaries: build/bin/
./build/bin/cuda_kernel_benchmark --out results/kernel_benchmark.csv
./build/bin/memcpy_benchmark --out results/memcpy_benchmark.csv
./build/bin/latency_profiler
Notes
If the build is slow, set the CUDA architecture to match your GPU (for example, 86 for compute capability 8.6):
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=86
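The architecture value in the note above can usually be derived from the GPU's compute capability by dropping the dot (8.6 becomes 86). The sketch below is one possible way to do that, not part of this repository: `cap_to_arch` is a hypothetical helper name, and the `compute_cap` query field is assumed to be supported by your installed `nvidia-smi` (older drivers may not provide it).

```shell
# Hypothetical helper (not shipped with this repo): turn a compute
# capability such as "8.6" into the form CMAKE_CUDA_ARCHITECTURES
# expects ("86") by removing the dot.
cap_to_arch() {
  printf '%s\n' "$1" | tr -d '.'
}

# Query the first GPU only when nvidia-smi is present. The compute_cap
# field is an assumption about your driver version.
if command -v nvidia-smi >/dev/null 2>&1; then
  cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
  arch=$(cap_to_arch "$cap")
  # Print the configure command instead of running it, so it can be reviewed first.
  echo "cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=$arch"
fi
```

If the query is unavailable, the compute capability can be looked up in NVIDIA's documentation for your GPU model and passed to `cap_to_arch` by hand.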