`kernelforge/run_cast.py` is the standalone runtime loader for KernelForge `.cast` inference packages. It depends only on `torch`; no KernelForge installation is required.
```
python3 kernelforge/run_cast.py <file>.cast [options]
```

Options:

- `--device cuda|cpu`: target device (default: `cuda` if available)
- `--runs N`: number of timed inference passes (default: 5)
- `--no-kernels`: skip kernel loading; run with native PyTorch ops
- `--opt-level -O0..-O3`: NVCC optimisation level for the JIT fallback (default: `-O0`)
- `--model-args JSON`: JSON config string for model instantiation, e.g. `'{"model_type":"resnet","num_labels":1000}'`; used when the `.cast` archive has no `model_config.json`
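For scripted benchmarking, the invocation above can be assembled programmatically. A minimal sketch; the archive name `model.cast` and the flag values are illustrative placeholders, not defaults:

```python
import json
import shlex

# Build the --model-args payload; run_cast.py expects a JSON string.
model_args = json.dumps({"model_type": "resnet", "num_labels": 1000})

# model.cast and the option values below are illustrative.
cmd = [
    "python3", "kernelforge/run_cast.py", "model.cast",
    "--device", "cuda",
    "--runs", "10",
    "--model-args", model_args,
]
print(shlex.join(cmd))
```

Building the argument list this way avoids shell-quoting mistakes around the embedded JSON.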
| Priority | Condition | Action |
|---|---|---|
| 1 | `compiled/sm_XX/<op>.so` in archive for current GPU | `importlib` dlopen, no NVCC |
| 2 | `kernel.cu` present | JIT compile via `load_inline` (NVCC, cached in `build/`) |
| 3 | Neither | Warning, op skipped, native PyTorch used |
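The fallback chain above can be summarized as a small selection function. This is a minimal sketch, assuming the archive's file list is available as a set of path strings; the per-op `kernel.cu` path is illustrative:

```python
def select_kernel_path(archive_files, op, sm_arch):
    """Pick how to realise `op` on the current GPU (illustrative sketch).

    archive_files: set of path strings inside the .cast archive
    sm_arch:       compute capability string such as "80" for sm_80
    """
    prebuilt = f"compiled/sm_{sm_arch}/{op}.so"
    cu = f"{op}/kernel.cu"                 # exact layout is illustrative
    if prebuilt in archive_files:          # priority 1: dlopen, no NVCC
        return ("dlopen", prebuilt)
    if cu in archive_files:                # priority 2: JIT via load_inline
        return ("jit", cu)
    return ("native", None)                # priority 3: warn, use PyTorch op
```

Note that tier 1 keys on the *current* GPU's architecture, which is why a prebuilt `.so` for a different SM silently drops through to the JIT tier.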
JIT compilation uses `load_inline` (not `load()`): it splits the kernel into a tiny C++ host declaration and the full CUDA device source. This keeps peak memory during compilation significantly lower than building the whole `.cu` as one translation unit, which matters in memory-constrained environments.
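One way to produce the two pieces `load_inline` wants is to extract the launcher's signature from the `.cu` source and use it as the C++ declaration. A hedged sketch; the regex and the `relu_forward` entry point are illustrative, not KernelForge's actual parser:

```python
import re

def split_for_load_inline(cu_source, entry_fn):
    """Build the tiny C++ host declaration for load_inline's cpp_sources,
    keeping the full .cu text as the CUDA source (illustrative sketch)."""
    # Find the host-visible launcher signature in the CUDA source.
    m = re.search(rf"torch::Tensor\s+{entry_fn}\s*\([^)]*\)", cu_source)
    if m is None:
        raise ValueError(f"launcher {entry_fn!r} not found in kernel source")
    cpp_decl = m.group(0) + ";"  # declaration only, nothing compiled twice
    return cpp_decl, cu_source
```

The pair would then go to `torch.utils.cpp_extension.load_inline(name=op, cpp_sources=[cpp_decl], cuda_sources=[cu_source], functions=[entry_fn])`, so only the small declaration passes through the C++ compiler while NVCC sees the full device source once.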
- Op patching is hardcoded per op name. A generic dispatch mechanism via `torch.library` under the `cast::` namespace is the intended next step.
- Precompiled binaries are SM-specific. A `.cast` exported on sm_75 falls back to JIT on sm_80. Bundle multiple SM targets by exporting from different GPUs.
- `loader.py` inside the archive is a stub (reserved for `zipimport`-based loading without installing `kernelforge/run_cast.py`).
- `wrapper.py` inside the archive is a stub (reserved for a future `torch.library`-based dispatch wrapper).
See docs/FileFormat.md for the full .cast archive layout and schema.