Personal CUDA learning repo for neural-network primitives.
The layout now follows the same broad idea as cudaJourney: keep each operation area self-contained instead of splitting the repo by artifact type.
| Path | Focus |
|---|---|
| `linear_algebra/gemm/` | batched GEMM kernel, PyTorch wrapper, and runnable Python driver |
| `activations/elementwise/` | elementwise activation kernels and saved result snapshots |
| `activations/with_bias/` | activation kernels that include a bias input |
Each operation folder keeps related files together:
- `kernel.cu` for the CUDA implementation
- `wrapper.cpp` for the PyTorch extension binding
- `test.py` for the local runner / validation path
- optional `build/` and `results/` folders for checked-in artifacts
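For example, under this convention a folder such as `activations/elementwise/` would look roughly like the sketch below (whether `build/` or `results/` is present depends on what artifacts are checked in for that operation):

```
activations/elementwise/
├── kernel.cu     # CUDA implementation
├── wrapper.cpp   # PyTorch extension binding
├── test.py       # local runner / validation path
├── build/        # optional checked-in build artifacts
└── results/      # optional checked-in result snapshots
```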
- NVIDIA GPU with a working CUDA runtime
- CUDA Toolkit with `nvcc`
- Python with a CUDA-enabled PyTorch install
There is still no single build system. Each operation is run from its local `test.py`, which compiles `wrapper.cpp` and `kernel.cu` on demand through `torch.utils.cpp_extension.load()`.
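For reference, the compile-on-demand step boils down to something like the sketch below. The extension name and the `batched_gemm` entry point are illustrative assumptions, not names taken from the repo's actual `wrapper.cpp`; the real drivers live in each folder's `test.py`.

```python
# Minimal sketch of the compile-on-demand pattern; names are illustrative.
# Assumes it is run from inside an operation folder, next to the sources.
import torch
from torch.utils.cpp_extension import load

# JIT-compile the local wrapper.cpp and kernel.cu into an importable extension.
ext = load(
    name="gemm_ext",                       # hypothetical extension name
    sources=["wrapper.cpp", "kernel.cu"],  # per-folder sources described above
    verbose=True,
)

# Hypothetical call into whatever binding wrapper.cpp exposes.
a = torch.randn(1, 512, 512, device="cuda")
b = torch.randn(1, 512, 512, device="cuda")
out = ext.batched_gemm(a, b)
```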
Examples:
```bash
python linear_algebra/gemm/test.py 1 512 512 512
python activations/elementwise/test.py
python activations/with_bias/test.py
```

The old structure separated `kernels/`, `wrappers/`, and `py_tests/`, which made each primitive span multiple folders. Grouping files by operation makes it easier to inspect, run, and extend one kernel family at a time while still keeping the repo lightweight.