Kernel Forge

Drop-in GPU kernel optimizer for PyTorch models.

Backends: CUDA and Triton (more coming soon).



Kernel Forge automatically generates and optimizes GPU kernels for PyTorch models, with no kernel-programming expertise required. It profiles your model at the operator level, uses an LLM to write a correct kernel, and then searches for performance improvements with Monte Carlo Tree Search (MCTS) until the kernel beats PyTorch's baseline.
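The generate-and-search loop described above can be sketched in plain Python (a minimal sketch with hypothetical names; `benchmark`, `optimize`, and the candidate generator stand in for the real profiler, LLM, and MCTS machinery):

```python
import time

def benchmark(fn, *args, iters=100):
    """Average wall-clock latency of fn over iters runs."""
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters

def optimize(baseline, generate_candidate, args, budget=10):
    """Propose candidate kernels until one beats the baseline.

    generate_candidate stands in for the LLM + MCTS step; it returns
    a callable with the same signature as baseline. Candidates that
    fail to run are skipped, mirroring the compile-error feedback loop.
    """
    best, best_t = baseline, benchmark(baseline, *args)
    for _ in range(budget):
        cand = generate_candidate()
        try:
            t = benchmark(cand, *args)
        except Exception:
            continue  # broken candidate: feed the error back and retry
        if t < best_t:
            best, best_t = cand, t
    return best, best_t
```

In the real tool the candidates are CUDA or Triton kernels benchmarked on the target GPU, not Python callables.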


Who is this for?

  • ML engineers running models in production who want lower inference latency on specific hardware without writing CUDA or Triton by hand.
  • AI infrastructure teams targeting specific GPU hardware (NVIDIA CUDA or AMD ROCm) who need kernels tuned to that exact device.
  • Teams with remote GPU access who run optimization on a separate GPU server while managing projects locally.
  • Researchers benchmarking operator-level speedups across different LLM backends or optimization strategies.
  • Teams packaging models for deployment who want a self-contained inference artifact with kernels baked in and no runtime dependency on KernelForge.

Features

  • Automated kernel generation via LLM with compile-error feedback loop
  • MCTS-driven optimization - explores tiling, loop unrolling, vectorized memory access, and more
  • CUDA and Triton backends (NVIDIA and AMD ROCm)
  • Remote execution over SSH - no local GPU required
  • Multi-LLM support: Anthropic, OpenAI, Google
  • Web dashboard with live progress, speed charts, and MCTS tree inspector
  • Portable .anvil snapshots and self-contained .cast inference packages

Full feature details
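The MCTS-driven exploration of transformations like tiling and loop unrolling can be pictured with standard UCB1 selection (a generic sketch, not KernelForge's actual code; the action names and speedup reward are illustrative):

```python
import math

ACTIONS = ["tile", "unroll", "vectorize-loads", "swap-loops"]

class Node:
    """One candidate kernel state in the search tree."""
    def __init__(self, action=None):
        self.action = action
        self.visits = 0
        self.total_speedup = 0.0
        self.children = []

def ucb1(child, parent_visits, c=1.4):
    """Balance exploiting good transformations vs exploring new ones."""
    if child.visits == 0:
        return float("inf")  # always try untested transformations first
    exploit = child.total_speedup / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore

def select(parent):
    """Descend to the child transformation with the best UCB1 score."""
    return max(parent.children, key=lambda ch: ucb1(ch, parent.visits))

def backprop(node, speedup):
    """Record a measured speedup for a node after benchmarking."""
    node.visits += 1
    node.total_speedup += speedup
```

In the real search, the reward for each node would come from benchmarking the transformed kernel against the PyTorch baseline.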


Benchmark Snapshot

Qwen 3.5 35B-A3B

On this mixed-workload run, the latest mixed-backend Kernel Forge build delivered the best overall result against both PyTorch eager and torch.compile.

  • Total latency: 3693.6 ms vs 4193.3 ms for PyTorch eager and 4546.5 ms for torch.compile
  • Relative to eager throughput: 1.09x prefill tok/s, 1.14x decode tok/s, and 1.13x total tok/s
  • In this run, torch.compile slightly improved prefill (1.02x) but regressed decode (0.92x) and total throughput (0.92x) relative to eager
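As a quick sanity check, the speedups implied by the latency numbers above can be recomputed directly; the latency ratio against eager lines up with the reported ~1.13-1.14x throughput gains:

```python
kf_ms, eager_ms, compile_ms = 3693.6, 4193.3, 4546.5

# Lower latency on the same workload means a proportional speedup.
speedup_vs_eager = eager_ms / kf_ms
speedup_vs_compile = compile_ms / kf_ms

print(f"{speedup_vs_eager:.2f}x vs PyTorch eager")    # 1.14x
print(f"{speedup_vs_compile:.2f}x vs torch.compile")  # 1.23x
```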
(Charts: Qwen 3.5 35B-A3B latency breakdown; Qwen 3.5 35B-A3B throughput vs PyTorch eager.)

Quick start

See system requirements before installing.

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

cd frontend
jac install

Configure your LLM key in the settings panel after starting, or set ANTHROPIC_API_KEY, OPENAI_API_KEY, or GOOGLE_API_KEY before launch.
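The three provider variables are read by KernelForge itself; for scripted launches, a small helper like this (illustrative, not part of the tool) can confirm one of them is set before starting:

```python
import os

# Environment variables named above for each LLM provider.
PROVIDER_KEYS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "google": "GOOGLE_API_KEY",
}

def detect_provider(env=None):
    """Return the first provider whose API key is set, else None."""
    env = os.environ if env is None else env
    for provider, var in PROVIDER_KEYS.items():
        if env.get(var):
            return provider
    return None
```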

jac start main.jac

Open http://localhost:8000. Create a project, upload your model weights, and click Start Forge.


CLI

For headless or scripted runs, see docs/cli.md.


Further reading
