Kernel Forge automatically generates and optimizes GPU kernels for PyTorch models with no kernel programming expertise required. It profiles your model at the operator level, uses an LLM to write a correct kernel, then searches for performance improvements using Monte Carlo Tree Search until the kernel beats PyTorch's baseline.
- ML engineers running models in production who want lower inference latency on specific hardware without writing CUDA or Triton by hand.
- AI infrastructure teams targeting specific GPU hardware (NVIDIA CUDA or AMD ROCm) who need kernels tuned to that exact device.
- Teams with remote GPU access who run optimization on a separate GPU server while managing projects locally.
- Researchers benchmarking operator-level speedups across different LLM backends or optimization strategies.
- Teams packaging models for deployment who want a self-contained inference artifact with kernels baked in and no runtime dependency on KernelForge.
- Automated kernel generation via LLM with compile-error feedback loop
- MCTS-driven optimization - explores tiling, loop unrolling, vectorized memory access, and more
- CUDA and Triton backends (NVIDIA and AMD ROCm)
- Remote execution over SSH - no local GPU required
- Multi-LLM support: Anthropic, OpenAI, Google
- Web dashboard with live progress, speed charts, and MCTS tree inspector
- Portable `.anvil` snapshots and self-contained `.cast` inference packages
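The MCTS-driven search can be pictured as a UCT loop over candidate kernel transformations: expand a new variant, benchmark it, and feed the measured speedup back up the tree. Below is a minimal toy sketch of that idea, not KernelForge's actual code; `toy_speedup` stands in for a real compile-and-benchmark step, and the transformation names are invented:

```python
import math
import random

# Hypothetical transformation menu; a real system would derive these
# from the profiled operator (tile sizes, unroll factors, etc.).
TRANSFORMS = ["tile_32", "tile_64", "unroll_4", "vectorize_float4"]

def toy_speedup(applied):
    """Stand-in for compiling the kernel and benchmarking it."""
    bonus = {"tile_64": 0.3, "unroll_4": 0.1, "vectorize_float4": 0.2}
    return 1.0 + sum(bonus.get(t, 0.0) for t in set(applied))

class Node:
    def __init__(self, applied=()):
        self.applied = tuple(applied)  # transformations chosen so far
        self.children = {}             # transform name -> Node
        self.visits = 0
        self.reward = 0.0

def uct_select(node, c=1.4):
    # Exploitation (mean reward) plus exploration bonus.
    return max(
        node.children.values(),
        key=lambda n: n.reward / n.visits
        + c * math.sqrt(math.log(node.visits) / n.visits),
    )

def search(iterations=500, seed=0, max_depth=3):
    rng = random.Random(seed)
    root = Node()
    best = (1.0, ())
    for _ in range(iterations):
        node, path = root, [root]
        # Selection: descend while the node is fully expanded.
        while node.children and len(node.children) == len(TRANSFORMS):
            node = uct_select(node)
            path.append(node)
        # Expansion: try one untried transformation (depth-limited).
        untried = [t for t in TRANSFORMS if t not in node.children]
        if untried and len(node.applied) < max_depth:
            t = rng.choice(untried)
            node.children[t] = Node(node.applied + (t,))
            node = node.children[t]
            path.append(node)
        # Simulation: "benchmark" the candidate kernel.
        speedup = toy_speedup(node.applied)
        if speedup > best[0]:
            best = (speedup, node.applied)
        # Backpropagation.
        for n in path:
            n.visits += 1
            n.reward += speedup
    return best

best_speedup, best_transforms = search()
print(best_speedup, best_transforms)
```

In the real system the "simulation" step is an actual kernel build and timed run, so each tree node is far more expensive than in this toy; that cost is exactly why a search policy like UCT matters.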
On this mixed-workload run, Kernel Forge delivered the best overall result against both PyTorch eager and `torch.compile`.

- Total latency: 3693.6 ms vs 4193.3 ms for PyTorch eager and 4546.5 ms for `torch.compile`
- Relative to eager throughput: 1.09x prefill tok/s, 1.14x decode tok/s, and 1.13x total tok/s
- In this run, `torch.compile` slightly improved prefill (1.02x) but regressed decode (0.92x) and total throughput (0.92x) relative to eager
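The headline numbers are internally consistent; a quick sanity check from the latency totals above (the reported tok/s ratios are measured directly, so rounding can differ by a hundredth):

```python
# Reported total latencies (ms) from the benchmark run above.
kernel_forge = 3693.6
eager = 4193.3
compiled = 4546.5

# For a fixed workload, the latency ratio approximates the throughput speedup.
speedup_vs_eager = eager / kernel_forge      # ~1.14x
speedup_vs_compile = compiled / kernel_forge # ~1.23x
compile_vs_eager = eager / compiled          # ~0.92x: torch.compile regressed

print(round(speedup_vs_eager, 2),
      round(speedup_vs_compile, 2),
      round(compile_vs_eager, 2))
```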
See system requirements before installing.
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
```bash
cd frontend
jac install
```

Configure your LLM key in the settings panel after starting, or set `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, or `GOOGLE_API_KEY` before launch.
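For example, to provide a key via the environment before launching (placeholder values; set only the provider you intend to use):

```bash
# Pick one provider; the key values here are placeholders.
export ANTHROPIC_API_KEY="sk-ant-..."
# export OPENAI_API_KEY="sk-..."
# export GOOGLE_API_KEY="..."
```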
```bash
jac start main.jac
```

Open http://localhost:8000. Create a project, upload your model weights, and click Start Forge.
For headless or scripted runs, see docs/cli.md.

