Local CUDA workbench for solving and benchmarking Tensara problems before submitting to the platform. The goal of this repo is to iterate on kernel correctness, compare alternative implementations, sweep launch configurations, and document what performs well locally before pushing toward Tensara leaderboards.
Note: This repository was developed with assistance from Codex agent ChatGPT 5.4. It generated the test suite and authored all repository code outside the kernel implementations and their underlying logic.
Tensara describes itself as a platform for GPU programming challenges: write efficient GPU kernels, benchmark them, and compare against other developers on standardized hardware. Its homepage centers on the loop:
- optimize
- benchmark
- repeat
The site emphasizes:
- real hardware benchmarking
- per-problem leaderboards
- competitive GPU optimization workflows
- community discussion and iteration
Problem catalog context from the public problems page:
- The public problems page currently shows 84 problems.
- The catalog spans practical GPU tasks such as convolution, pooling, reduction, normalization, activation functions, matrix multiplication, graphics, cryptography, sorting, and quantization.
- `1D Convolution` appears in the public catalog as an easy convolution task.
What this repo is trying to do:
- build local CUDA solutions for Tensara-style problems
- test correctness against CPU references
- compare multiple kernel strategies for the same problem
- profile launch-shape choices such as block size and grid size
- keep concise notes on which approaches are worth submitting
Implements 1D same-padding convolution / cross-correlation with three kernels:
- `basic`: direct global-memory implementation
- `tiled`: shared-memory tiled version with halo loads
- `bstride`: shared-memory tiled version with block-stride loading
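Before any of the three kernels are timed, their outputs are checked against a CPU reference. A minimal sketch of such a reference for 1D same-padding cross-correlation is below; the function name, odd-`K` assumption, and zero-padding convention are illustrative choices here, not necessarily the repo's exact code:

```cpp
#include <cstddef>
#include <vector>

// Sketch of a CPU reference for 1D same-padding cross-correlation:
// output[i] = sum over j of input[i + j - K/2] * kernel[j],
// with zero padding at the edges. K is assumed odd ("same" output size).
std::vector<float> conv1d_reference(const std::vector<float>& input,
                                    const std::vector<float>& kernel) {
    const std::size_t n = input.size();
    const std::size_t K = kernel.size();
    const long half = static_cast<long>(K / 2);
    std::vector<float> output(n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        float acc = 0.0f;
        for (std::size_t j = 0; j < K; ++j) {
            // Source index relative to the centered filter tap.
            const long src = static_cast<long>(i) + static_cast<long>(j) - half;
            if (src >= 0 && src < static_cast<long>(n)) {
                acc += input[static_cast<std::size_t>(src)] * kernel[j];
            }
        }
        output[i] = acc;
    }
    return output;
}
```

Each GPU kernel's output can then be compared element-wise against this reference within a small tolerance.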
Short summary:
- GPU: NVIDIA GeForce RTX 3050 Laptop GPU, 4 GB VRAM
- Default launch used by the main benchmark rows: `block_x=256`, `grid_x=32`
- All kernels pass correctness checks on the current small, large, tile, and `K=8191` web-style cases.
- `tiled` is correct, but it is usually slower than `basic` or `bstride` for larger filters.
- For this problem on the local RTX 3050, `bstride` is the best shared-memory design so far; `basic` remains very competitive, especially at large `K`.
- Larger blocks (`256` or `512`) with moderate-to-high grid counts perform best for the `K=8191` scaling cases.
- Full benchmark dump, scaling heatmaps, and best-launch notes: P1_1D_CONVOLUTIONS_RESULTS.md
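A fixed launch shape like `block_x=256`, `grid_x=32` covers inputs of any size only if each thread processes multiple elements (e.g. via a grid-stride loop, which is assumed here). The per-thread workload for a given launch can be sketched as plain host-side arithmetic; the helper name is illustrative:

```cpp
#include <cstddef>

// How many elements each thread handles when n elements are covered by a
// grid-stride loop over block_x * grid_x total threads. This is the quantity
// that grows as n scales while the launch shape stays fixed.
std::size_t elements_per_thread(std::size_t n, std::size_t block_x, std::size_t grid_x) {
    const std::size_t stride = block_x * grid_x;  // total launched threads
    return (n + stride - 1) / stride;             // ceiling division
}
```

With the default `256 x 32` launch (8192 threads), small correctness cases need one pass per thread, while the large scaling cases make each thread loop many times, which is where launch-shape sweeps start to matter.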
Local ReLU harness for the Tensara problem:
- Matches the Tensara signature `extern "C" void solution(const float* input, float* output, size_t n, size_t m)`
- Treats the input/output as row-major `m x n` matrices and applies `C[i][j] = max(0, A[i][j])`
- Includes a CPU reference and a baseline GPU kernel implementation
- Default runs focus on small and medium correctness cases with CPU checking
- `--skip-cpu` enables the heavier large/shape/scaling benchmark sweep
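Because ReLU is purely elementwise, the `m x n` row-major layout collapses into a single flat loop. A minimal CPU-side sketch of the signature is below; this is an illustrative reference, not the repo's actual kernel or harness code:

```cpp
#include <algorithm>
#include <cstddef>

// CPU sketch of the Tensara ReLU signature: n columns, m rows, row-major.
// ReLU is elementwise, so a flat pass over all m * n values is sufficient:
// output[idx] = max(0, input[idx]).
extern "C" void solution(const float* input, float* output,
                         std::size_t n, std::size_t m) {
    for (std::size_t idx = 0; idx < m * n; ++idx) {
        output[idx] = std::max(0.0f, input[idx]);
    }
}
```

The GPU baseline can follow the same flat indexing, with one thread (or one grid-stride step) per element.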