Kernel Forge

Drop-in GPU kernel optimizer for PyTorch models.

Backends: CUDA and Triton (more coming soon).



Kernel Forge automatically generates and optimizes GPU kernels for PyTorch models, with no kernel-programming expertise required. It profiles your model at the operator level, uses an LLM to write a correct kernel, and then searches for performance improvements with Monte Carlo Tree Search (MCTS) until the kernel beats PyTorch's baseline.
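The generate-and-search loop described above can be sketched in plain Python (a minimal sketch with hypothetical names; `benchmark`, `optimize`, and the candidate generator stand in for the real profiler, LLM, and MCTS machinery):

```python
import time

def benchmark(fn, *args, iters=100):
    """Average wall-clock latency of fn over iters runs."""
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters

def optimize(baseline, generate_candidate, args, budget=10):
    """Propose candidate kernels until one beats the baseline.

    generate_candidate stands in for the LLM + MCTS step; it returns
    a callable with the same signature as baseline. Candidates that
    fail to run are skipped, mirroring the compile-error feedback loop.
    """
    best, best_t = baseline, benchmark(baseline, *args)
    for _ in range(budget):
        cand = generate_candidate()
        try:
            t = benchmark(cand, *args)
        except Exception:
            continue  # broken candidate: feed the error back and retry
        if t < best_t:
            best, best_t = cand, t
    return best, best_t
```

In the real tool the candidates are CUDA or Triton kernels benchmarked on the target GPU, not Python callables.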


Who is this for?

  • ML engineers running models in production who want lower inference latency on specific hardware without writing CUDA or Triton by hand.
  • AI infrastructure teams targeting specific GPU hardware (NVIDIA CUDA or AMD ROCm) who need kernels tuned to that exact device.
  • Teams with remote GPU access who run optimization on a separate GPU server while managing projects locally.
  • Researchers benchmarking operator-level speedups across different LLM backends or optimization strategies.
  • Teams packaging models for deployment who want a self-contained inference artifact with kernels baked in and no runtime dependency on KernelForge.

Features

  • Automated kernel generation via LLM with compile-error feedback loop
  • MCTS-driven optimization - explores tiling, loop unrolling, vectorized memory access, and more
  • CUDA and Triton backends (NVIDIA and AMD ROCm)
  • Remote execution over SSH - no local GPU required
  • Multi-LLM support: Anthropic, OpenAI, Google
  • Web dashboard with live progress, speed charts, and MCTS tree inspector
  • Portable .anvil snapshots and self-contained .cast inference packages

Full feature details
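The MCTS-driven exploration of transformations like tiling and loop unrolling can be pictured with standard UCB1 selection (a generic sketch, not KernelForge's actual code; the action names and speedup reward are illustrative):

```python
import math

ACTIONS = ["tile", "unroll", "vectorize-loads", "swap-loops"]

class Node:
    """One candidate kernel state in the search tree."""
    def __init__(self, action=None):
        self.action = action
        self.visits = 0
        self.total_speedup = 0.0
        self.children = []

def ucb1(child, parent_visits, c=1.4):
    """Balance exploiting good transformations vs exploring new ones."""
    if child.visits == 0:
        return float("inf")  # always try untested transformations first
    exploit = child.total_speedup / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore

def select(parent):
    """Descend to the child transformation with the best UCB1 score."""
    return max(parent.children, key=lambda ch: ucb1(ch, parent.visits))

def backprop(node, speedup):
    """Record a measured speedup for a node after benchmarking."""
    node.visits += 1
    node.total_speedup += speedup
```

In the real search, the reward for each node would come from benchmarking the transformed kernel against the PyTorch baseline.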


Benchmark Snapshot

Qwen 3.5 35B-A3B

On this mixed-workload run, the latest mixed-backend Kernel Forge build delivered the best overall result against both PyTorch eager and torch.compile.

  • Total latency: 3693.6 ms vs 4193.3 ms for PyTorch eager and 4546.5 ms for torch.compile
  • Relative to eager throughput: 1.09x prefill tok/s, 1.14x decode tok/s, and 1.13x total tok/s
  • In this run, torch.compile slightly improved prefill (1.02x) but regressed decode (0.92x) and total throughput (0.92x) relative to eager
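As a quick sanity check, the speedups implied by the latency numbers above can be recomputed directly; the latency ratio against eager lines up with the reported ~1.13-1.14x throughput gains:

```python
kf_ms, eager_ms, compile_ms = 3693.6, 4193.3, 4546.5

# Lower latency on the same workload means a proportional speedup.
speedup_vs_eager = eager_ms / kf_ms
speedup_vs_compile = compile_ms / kf_ms

print(f"{speedup_vs_eager:.2f}x vs PyTorch eager")    # 1.14x
print(f"{speedup_vs_compile:.2f}x vs torch.compile")  # 1.23x
```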
(Charts: Qwen 3.5 35B-A3B latency breakdown; Qwen 3.5 35B-A3B throughput vs PyTorch eager.)

Quick start

See system requirements before installing.

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

cd frontend
jac install

Configure your LLM key in the settings panel after starting, or set ANTHROPIC_API_KEY, OPENAI_API_KEY, or GOOGLE_API_KEY before launch.
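The three provider variables are read by KernelForge itself; for scripted launches, a small helper like this (illustrative, not part of the tool) can confirm one of them is set before starting:

```python
import os

# Environment variables named above for each LLM provider.
PROVIDER_KEYS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "google": "GOOGLE_API_KEY",
}

def detect_provider(env=None):
    """Return the first provider whose API key is set, else None."""
    env = os.environ if env is None else env
    for provider, var in PROVIDER_KEYS.items():
        if env.get(var):
            return provider
    return None
```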

jac start main.jac

Open http://localhost:8000. Create a project, upload your model weights, and click Start Forge.


CLI

For headless or scripted runs, see docs/cli.md.


Further reading
