
Commit fa08d5f

perf: native fcmp, string append, short-circuit AND/OR, benchmark overhaul (v0.4.14)
Codegen optimizations:
- Native fcmp for numeric comparisons: known-numeric operands emit Cranelift fcmp instead of js_jsvalue_compare runtime call (mandelbrot 30% faster)
- compile_condition_to_bool fast path: numeric Compare produces I8 boolean directly, skipping NaN-box round-trip
- Short-circuit && and || in compile_condition_to_bool: proper branching instead of always-evaluate-both with band/bor
- In-place string append with capacity tracking: js_string_append reuses allocation when refcount=1 and capacity allows (string_concat 125x faster)
- Deferred module-var write-back in loops: skip global stores inside simple loops, flush at exit
- Method inlining for small class methods

Benchmark overhaul:
- Rerun all benchmarks with Node v25 + Bun 1.3
- Full README table with context for wins AND losses
- Added matrix_multiply to suite runner
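To illustrate the short-circuit item in the commit message, here is a minimal TypeScript sketch of the difference between evaluating both operands and then combining them bitwise, and branching so the right operand runs only when needed. The helper names (`expensiveCheck`, `eagerAnd`, `shortCircuitAnd`, `countRhsCalls`) are hypothetical and only model the two lowerings as the message describes them; they are not Perry's codegen.

```typescript
// Counts how often the right-hand operand is evaluated.
let calls = 0;

// Hypothetical right-hand operand with an observable cost.
function expensiveCheck(n: number): boolean {
  calls += 1;
  return n % 2 === 0;
}

// Models "always-evaluate-both with band": both sides run, then a bitwise AND.
function eagerAnd(i: number, limit: number): boolean {
  const lhs = i < limit;
  const rhs = expensiveCheck(i);
  return (Number(lhs) & Number(rhs)) === 1;
}

// Models proper branching: the right side runs only if the left side is true.
function shortCircuitAnd(i: number, limit: number): boolean {
  return i < limit && expensiveCheck(i);
}

// Runs a condition for i = 0..9 with limit 3 and reports right-hand evaluations.
function countRhsCalls(cond: (i: number, limit: number) => boolean): number {
  calls = 0;
  for (let i = 0; i < 10; i++) cond(i, 3);
  return calls;
}

const eagerCalls = countRhsCalls(eagerAnd);
const lazyCalls = countRhsCalls(shortCircuitAnd);
console.log(eagerCalls, lazyCalls); // prints "10 3"
```

Both functions return the same boolean for every input; the branching form simply skips work (and any side effects) once the left operand decides the result.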
1 parent 8a35622 commit fa08d5f

26 files changed

Lines changed: 1148 additions & 115 deletions


CLAUDE.md

Lines changed: 6 additions & 0 deletions
@@ -143,6 +143,12 @@ Projects can list npm packages to compile natively instead of routing to V8. Con
 ### v0.4.14
 - fix: Linux linker no longer requires PulseAudio for non-UI programs — `-lpulse-simple -lpulse` moved behind `needs_ui` guard (GH-8)
 - fix: `perry run .` now works — positional args parsed flexibly so non-platform values are treated as input path instead of erroring
+- perf: native `fcmp` for numeric comparisons — known-numeric operands emit Cranelift `fcmp` instead of `js_jsvalue_compare` runtime call; mandelbrot 30% faster
+- perf: `compile_condition_to_bool` fast path — numeric `Compare` in loop/if conditions produces I8 boolean directly, skipping NaN-box round-trip
+- perf: in-place string append with capacity tracking — `js_string_append` reuses allocation when refcount=1 and capacity allows; string_concat 125x faster
+- perf: deferred module-var write-back in loops — skip global stores inside simple loops, flush at exit
+- perf: short-circuit `&&`/`||` in `compile_condition_to_bool` — proper branching instead of always-evaluate-both with `band`/`bor`
+- chore: rerun all benchmarks with Node v25 + Bun 1.3, add Bun to all entries, full README with context for wins AND losses

 ### v0.4.13
 - fix: VStack/HStack use GravityAreas distribution + top/leading gravity — children pack from top-left instead of stretching or centering
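The in-place string append entry can be pictured with a small model. The sketch below is an assumption-laden illustration, not the `js_string_append` implementation: the `StrBuf` shape, the 16-byte minimum capacity, and the doubling growth policy are all hypothetical. It only shows the rule stated in the changelog: reuse the allocation when the buffer is uniquely referenced and has room, otherwise copy into a larger one.

```typescript
// Hypothetical model of a refcounted string buffer (ASCII only, for brevity).
interface StrBuf {
  bytes: Uint8Array; // backing allocation; bytes.length is the capacity
  len: number;       // bytes currently in use
  refcount: number;  // number of live references to this buffer
}

let allocations = 0; // counts backing-store allocations for the demo

function fromAscii(s: string): StrBuf {
  allocations += 1;
  const bytes = new Uint8Array(Math.max(16, s.length)); // assumed minimum capacity
  for (let i = 0; i < s.length; i++) bytes[i] = s.charCodeAt(i);
  return { bytes, len: s.length, refcount: 1 };
}

function append(dst: StrBuf, src: StrBuf): StrBuf {
  const needed = dst.len + src.len;
  // Fast path: sole owner and enough spare capacity, so write in place.
  if (dst.refcount === 1 && needed <= dst.bytes.length) {
    dst.bytes.set(src.bytes.subarray(0, src.len), dst.len);
    dst.len = needed;
    return dst;
  }
  // Slow path: shared or full, so copy into a fresh buffer (assumed doubling).
  allocations += 1;
  const bytes = new Uint8Array(Math.max(needed, dst.bytes.length * 2));
  bytes.set(dst.bytes.subarray(0, dst.len), 0);
  bytes.set(src.bytes.subarray(0, src.len), dst.len);
  dst.refcount -= 1; // the caller's reference moves to the new buffer
  return { bytes, len: needed, refcount: 1 };
}

// Demo: 1000 one-byte appends trigger only a handful of reallocations.
const piece = fromAscii("x");
let acc = fromAscii("");
for (let i = 0; i < 1000; i++) acc = append(acc, piece);
console.log(acc.len, allocations); // prints "1000 8"
```

With doubling, the buffer grows at lengths 17, 33, 65, 129, 257, and 513, so the loop performs 6 growth allocations on top of the 2 initial ones. That amortized behaviour is what the "capacity tracking" wording suggests; the real growth factor may differ.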

Cargo.lock

Lines changed: 24 additions & 24 deletions
Some generated files are not rendered by default.

README.md

Lines changed: 31 additions & 14 deletions
@@ -4,7 +4,7 @@

 Perry is a native TypeScript compiler written in Rust. It takes your TypeScript and compiles it straight to native executables — no Node.js, no Electron, no browser engine. Just fast, small binaries that run anywhere.

-**Current Version:** 0.4.8 | [Website](https://perryts.com) | [Documentation](https://perryts.github.io/perry/) | [Showcase](https://perryts.com/showcase)
+**Current Version:** 0.4.14 | [Website](https://perryts.com) | [Documentation](https://perryts.github.io/perry/) | [Showcase](https://perryts.com/showcase)

 ```bash
 perry compile src/main.ts -o myapp
@@ -33,23 +33,40 @@ People are building real apps with Perry today. Here are some highlights:

 ## Performance

-*Median of 5 runs on macOS ARM64 (Apple Silicon)*
+*Median of 3 runs on macOS ARM64 (Apple Silicon). Node.js v25, Bun 1.3.*

-| Benchmark | Perry | Node.js v24 | Bun 1.3 | Perry vs Node | Perry vs Bun |
-|-----------|-------|-------------|---------|---------------|--------------|
-| fibonacci | 4,848ms | 10,077ms | 5,188ms | **2.1x** | **1.1x** |
-| string_ops | 31ms | 56ms | 38ms | **1.8x** | **1.2x** |
-| array_read | 4ms | 12ms || **3.0x** ||
-| math_intensive | 22ms | 66ms || **3.0x** ||
-| object_create | 2ms | 7ms || **3.5x** ||
-| closure | 14ms | 63ms || **4.5x** ||
-| binary_trees | 3ms | 8ms || **2.7x** ||
+**Perry wins — function calls, recursion, array access:**

-Perry compiles to native machine code via Cranelift — no JIT warmup, no interpreter overhead. Performance is competitive with Bun and significantly faster than Node.js on compute-heavy workloads.
+| Benchmark | Perry | Node.js | Bun | vs Node | vs Bun | What it tests |
+|-----------|-------|---------|-----|---------|--------|---------------|
+| fibonacci(40) | 505ms | 1,025ms | 538ms | **2.0x** | **1.1x** | Recursive function calls |
+| array_read | 4ms | 14ms | 18ms | **3.5x** | **4.5x** | Sequential memory access (10M elements) |
+| object_create | 5ms | 9ms | 7ms | **1.8x** | **1.4x** | Object allocation + field access (1M objects) |

-> **Note:** Perry is under active development, so benchmarks are subject to change — but they usually only get better. We're continuously optimizing the codegen pipeline, and each release tends to improve performance across the board.
+Perry compiles to native machine code — no JIT warmup, no interpreter overhead. Function calls, recursion, and sequential memory access patterns are direct native instructions.

-Run benchmarks yourself: `cd benchmarks && ./run_benchmarks.sh` (requires node, bun, cargo)
+**Competitive — within 2x of JIT runtimes:**
+
+| Benchmark | Perry | Node.js | Bun | vs Node | vs Bun | What it tests |
+|-----------|-------|---------|-----|---------|--------|---------------|
+| method_calls | 16ms | 11ms | 9ms | 0.7x | 0.6x | Class method dispatch (10M calls) |
+| prime_sieve | 11ms | 8ms | 7ms | 0.7x | 0.6x | Sieve of Eratosthenes (boolean array + branches) |
+| string_concat | 7ms | 2ms | 1ms | 0.3x | 0.1x | 100K string appends (in-place with capacity) |
+
+Method dispatch uses direct function calls (no vtable). String concatenation uses amortized O(1) in-place appending. V8/JSC have inline caches and rope strings that push these faster.
+
+**V8/Bun lead — f64 math, SIMD-vectorizable loops:**
+
+| Benchmark | Perry | Node.js | Bun | vs Node | vs Bun | Why they're faster |
+|-----------|-------|---------|-----|---------|--------|-------------------|
+| mandelbrot | 71ms | 25ms | 31ms | 0.3x | 0.4x | V8 TurboFan schedules f64 ops across 2 FPUs more aggressively than Cranelift |
+| matrix_multiply | 61ms | 36ms | 36ms | 0.6x | 0.6x | V8 auto-vectorizes nested loops with SIMD (NEON on ARM) |
+| math_intensive | 370ms | 52ms | 53ms | 0.1x | 0.1x | Harmonic series: V8 vectorizes `result += 1.0/i` across SIMD lanes |
+| nested_loops | 32ms | 18ms | 20ms | 0.6x | 0.6x | V8's loop optimization + SIMD for array access in nested loops |
+
+V8's TurboFan JIT has decades of optimization for tight f64 loops — SIMD auto-vectorization (NEON/SSE), speculative type specialization, and aggressive instruction scheduling. Perry's Cranelift backend generates correct scalar code but doesn't yet vectorize. This is the main performance frontier for Perry's codegen.
+
+Run benchmarks yourself: `cd benchmarks/suite && ./run_benchmarks.sh` (requires node, bun, cargo)

 ## Binary Size
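
For reading the "vs Node" and "vs Bun" columns in the new README tables, the sketch below assumes the ratio is the other runtime's time divided by Perry's time, formatted to one decimal place, so values above 1 favor Perry. That convention is inferred from the published numbers rather than stated in the commit, and `speedup` is a hypothetical helper, not part of the benchmark suite.

```typescript
// Assumed convention: ratio = other runtime's time / Perry's time.
function speedup(perryMs: number, otherMs: number): string {
  return (otherMs / perryMs).toFixed(1) + "x";
}

// Times taken from the tables in the README diff.
const fibVsNode = speedup(505, 1025);  // fibonacci(40), Perry vs Node.js
const arrayVsBun = speedup(4, 18);     // array_read, Perry vs Bun
const mathVsNode = speedup(370, 52);   // math_intensive, Perry vs Node.js

console.log(fibVsNode, arrayVsBun, mathVsNode); // prints "2.0x 4.5x 0.1x"
```

Because the published times are themselves rounded to whole milliseconds, a ratio recomputed this way can differ from a table cell in the last digit.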

benchmarks/suite/run_benchmarks.sh

Lines changed: 2 additions & 1 deletion
@@ -72,7 +72,8 @@ BENCHMARKS="02_loop_overhead.ts
 12_binary_trees.ts
 13_factorial.ts
 14_closure.ts
-15_mandelbrot.ts"
+15_mandelbrot.ts
+16_matrix_multiply.ts"

 # Compile all benchmarks first
 echo -e "${BOLD}Compiling benchmarks with Perry...${NC}"
