Benchmark idea: tinygrad as a "compiler inner loop" workload #739

@kimjune01

Hi all,

I've been investigating CPython performance on tinygrad's compiler, and I think it might be a useful addition to pyperformance as a benchmark — it exercises a code shape that the current suite doesn't cover.

tinygrad is a small (~10K lines) deep learning framework that compiles neural network graphs to GPU kernels at runtime. The compiler is pure Python, with no C extensions. The hot function is unified_rewrite, a ~100-line while loop that performs dict.get, deque.pop, set.__contains__, tuple() construction, and callback dispatch over an in-memory graph. It accounts for about 68% of compilation time.
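To make the code shape concrete, here is a minimal sketch of a worklist-driven bottom-up graph rewrite. It is not tinygrad's unified_rewrite; the node layout `(op, srcs, arg)`, the function `rewrite_graph`, and the two rules are hypothetical and exist only to show the same mix of dict lookups, deque pops, set membership tests, tuple construction, and callback dispatch.

```python
from collections import deque

def rewrite_graph(root, rules):
    """Single-pass bottom-up rewrite of a DAG of (op, srcs, arg) tuples.

    `rules` maps an op name to a callback that returns a replacement
    node or None. Illustrative only; not tinygrad's algorithm.
    """
    replaced = {}        # original node -> rewritten node
    expanded = set()     # nodes whose children have been scheduled
    stack = deque([root])
    while stack:
        node = stack.pop()
        if node in replaced:
            continue
        op, srcs, arg = node
        if node not in expanded:
            # First visit: revisit this node after its children.
            expanded.add(node)
            stack.append(node)
            stack.extend(s for s in srcs if s not in replaced)
            continue
        # Second visit: every child already has a rewritten form.
        new_node = (op, tuple(replaced[s] for s in srcs), arg)
        rule = rules.get(op)
        if rule is not None:
            out = rule(new_node)
            if out is not None:
                new_node = out
        replaced[node] = new_node
    return replaced[root]

def fold_add(node):
    # Constant-fold add(const, const).
    _, (a, b), _ = node
    if a[0] == "const" and b[0] == "const":
        return ("const", (), a[2] + b[2])
    return None

def mul_one(node):
    # Simplify mul(x, 1) to x.
    _, (a, b), _ = node
    return a if b == ("const", (), 1) else None

RULES = {"add": fold_add, "mul": mul_one}

x = ("var", (), "x")
one = ("const", (), 1)
two = ("const", (), 2)
# (x + (1 + 2)) * 1
expr = ("mul", (("add", (x, ("add", (one, two), None)), None), one), None)
result = rewrite_graph(expr, RULES)
print(result)
```

Nearly every step inside the while loop is a container operation or a dynamic call on small heterogeneous objects, which is the pattern described above.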

This "tight loop over dict/deque/set" pattern shows up in a lot of Python tools — type checkers, linters, code generators — but nothing in pyperformance currently stresses it. The existing benchmarks are mostly string processing (2to3, html5lib), I/O (dulwich, json), or numerical (nbody).

One thing that makes tinygrad interesting as a benchmark target: a Cython transpile of just unified_rewrite (same algorithm, no type annotations) cuts end-to-end time by 7.3%. The tier 2 JIT on 3.16 creates substantial traces for it (up to 393 uops, 15+ executors) but shows 0% improvement. So there's a measurable gap between "same code compiled to C" and "JIT-compiled" that doesn't surface on the current benchmarks.

Easy to reproduce, no GPU needed:

pip install tinygrad
python -c "
import time
from tinygrad import Tensor

def forward():
    # Two conv+relu layers, then two matmuls, all on fresh random tensors;
    # realize() triggers the actual computation.
    x = Tensor.randn(1,3,32,32).conv2d(Tensor.randn(16,3,3,3)).relu()
    x = x.conv2d(Tensor.randn(32,16,3,3)).relu().reshape(1,-1)
    x.matmul(Tensor.randn(32*28*28,128)).relu().matmul(Tensor.randn(128,10)).realize()

# Warm-up iterations (not timed)
for _ in range(5): forward()

# Timed iterations
N=50; t0=time.perf_counter()
for _ in range(N): forward()
print(f'{(time.perf_counter()-t0)/N*1000:.2f}ms/iter')
"
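The one-liner above reports only a mean. When comparing builds (for example with and without the JIT), per-iteration samples can help separate a real difference from noise. Below is a minimal stdlib-only sketch; `time_per_iter` is a hypothetical helper, and the workload passed to it here is a stand-in so the snippet runs without tinygrad. For the repro, the callable would be the tinygrad forward pass instead.

```python
import statistics
import time

def time_per_iter(fn, warmup=5, iters=50):
    """Time `fn` per call and return summary statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000.0)
    return {
        "mean_ms": statistics.fmean(samples),
        "median_ms": statistics.median(samples),
        "stdev_ms": statistics.stdev(samples),
        "min_ms": min(samples),
    }

# Stand-in workload (an assumption for illustration only).
stats = time_per_iter(lambda: sum(range(10_000)))
print({k: round(v, 4) for k, v in stats.items()})
```

A dedicated harness such as pyperf would be the more rigorous choice for a pyperformance submission; this sketch only shows the shape of the measurement.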

Full investigation (28 hypotheses, including the Cython/JIT comparison): kimjune01/tinygrad-experiments

Tested on macOS 15.5, Apple M4 Max, CPython 3.16.0a0 (d36e5b8), --enable-experimental-jit.
