PerryTS
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 6 additions & 0 deletions
diff --git a/Cargo.lock b/Cargo.lock
Lines changed: 24 additions & 24 deletions
diff --git a/README.md b/README.md
Lines changed: 31 additions & 14 deletions
diff --git a/benchmarks/suite/run_benchmarks.sh b/benchmarks/suite/run_benchmarks.sh
Lines changed: 2 additions & 1 deletion
@@ -143,6 +143,12 @@ Projects can list npm packages to compile natively instead of routing to V8. Con
 ### v0.4.14
 - fix: Linux linker no longer requires PulseAudio for non-UI programs — `-lpulse-simple -lpulse` moved behind `needs_ui` guard (GH-8)
 - fix: `perry run .` now works — positional args parsed flexibly so non-platform values are treated as input path instead of erroring
+- perf: native `fcmp` for numeric comparisons — known-numeric operands emit Cranelift `fcmp` instead of `js_jsvalue_compare` runtime call; mandelbrot 30% faster
+- perf: `compile_condition_to_bool` fast path — numeric `Compare` in loop/if conditions produces I8 boolean directly, skipping NaN-box round-trip
+- perf: in-place string append with capacity tracking — `js_string_append` reuses allocation when refcount=1 and capacity allows; string_concat 125x faster
+- perf: deferred module-var write-back in loops — skip global stores inside simple loops, flush at exit
+- perf: short-circuit `&&`/`||` in `compile_condition_to_bool` — proper branching instead of always-evaluate-both with `band`/`bor`
+- chore: rerun all benchmarks with Node v25 + Bun 1.3, add Bun to all entries, full README with context for wins AND losses
 
 ### v0.4.13
 - fix: VStack/HStack use GravityAreas distribution + top/leading gravity — children pack from top-left instead of stretching or centering
 
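The in-place string append noted in the v0.4.14 entry (reuse the allocation when refcount is 1 and capacity allows) can be illustrated with a small model. This is a minimal sketch, not Perry's runtime code: `StrBuf`, `append`, the `allocations` counter, and the capacity-doubling growth policy are all hypothetical names and assumptions made for illustration; the diff does not show how `js_string_append` is implemented.

```typescript
// Hypothetical model of a refcounted string buffer with spare capacity.
// Not Perry's actual runtime layout; names and growth policy are assumptions.
class StrBuf {
  units: Uint16Array; // backing storage (UTF-16 code units), may exceed `len`
  len: number;        // code units currently in use
  refcount = 1;       // number of live references to this buffer

  constructor(initial: string, capacity = 0) {
    this.units = new Uint16Array(Math.max(capacity, initial.length));
    for (let i = 0; i < initial.length; i++) this.units[i] = initial.charCodeAt(i);
    this.len = initial.length;
  }

  toString(): string {
    let out = "";
    for (let i = 0; i < this.len; i++) out += String.fromCharCode(this.units[i]);
    return out;
  }
}

let allocations = 0; // counts slow-path reallocations only

// Append `suffix` to `s` and return the buffer that holds the result.
function append(s: StrBuf, suffix: string): StrBuf {
  const needed = s.len + suffix.length;
  let dst = s;

  // Slow path: the buffer is shared or too small, so copy into a new one.
  // Doubling the capacity is what makes repeated appends amortized O(1).
  if (s.refcount !== 1 || needed > s.units.length) {
    allocations++;
    dst = new StrBuf("", Math.max(needed, s.units.length * 2));
    dst.units.set(s.units.subarray(0, s.len));
    s.refcount--; // the caller's reference moves to the new buffer
  }

  // Fast path falls through to here with dst === s: write in place.
  for (let i = 0; i < suffix.length; i++) dst.units[s.len + i] = suffix.charCodeAt(i);
  dst.len = needed;
  return dst;
}

// Usage: 1,000 single-character appends to a solely owned string.
let s = new StrBuf("a");
for (let i = 0; i < 1000; i++) s = append(s, "b");
console.log(s.len, allocations);
```

Under the doubling assumption, the number of reallocations grows logarithmically with the final length, while a shared buffer (refcount above 1) is never mutated in place.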
@@ -4,7 +4,7 @@
 
 Perry is a native TypeScript compiler written in Rust. It takes your TypeScript and compiles it straight to native executables — no Node.js, no Electron, no browser engine. Just fast, small binaries that run anywhere.
 
-**Current Version:** 0.4.8 | [Website](https://perryts.com) | [Documentation](https://perryts.github.io/perry/) | [Showcase](https://perryts.com/showcase)
+**Current Version:** 0.4.14 | [Website](https://perryts.com) | [Documentation](https://perryts.github.io/perry/) | [Showcase](https://perryts.com/showcase)
 
 ```bash
 perry compile src/main.ts -o myapp
@@ -33,23 +33,40 @@ People are building real apps with Perry today. Here are some highlights:
 
 ## Performance
 
-*Median of 5 runs on macOS ARM64 (Apple Silicon)*
+*Median of 3 runs on macOS ARM64 (Apple Silicon). Node.js v25, Bun 1.3.*
 
-| Benchmark | Perry | Node.js v24 | Bun 1.3 | Perry vs Node | Perry vs Bun |
-|-----------|-------|-------------|---------|---------------|--------------|
-| fibonacci | 4,848ms | 10,077ms | 5,188ms | **2.1x** | **1.1x** |
-| string_ops | 31ms | 56ms | 38ms | **1.8x** | **1.2x** |
-| array_read | 4ms | 12ms | — | **3.0x** | — |
-| math_intensive | 22ms | 66ms | — | **3.0x** | — |
-| object_create | 2ms | 7ms | — | **3.5x** | — |
-| closure | 14ms | 63ms | — | **4.5x** | — |
-| binary_trees | 3ms | 8ms | — | **2.7x** | — |
+**Perry wins — function calls, recursion, array access:**
 
-Perry compiles to native machine code via Cranelift — no JIT warmup, no interpreter overhead. Performance is competitive with Bun and significantly faster than Node.js on compute-heavy workloads.
+| Benchmark | Perry | Node.js | Bun | vs Node | vs Bun | What it tests |
+|-----------|-------|---------|-----|---------|--------|---------------|
+| fibonacci(40) | 505ms | 1,025ms | 538ms | **2.0x** | **1.1x** | Recursive function calls |
+| array_read | 4ms | 14ms | 18ms | **3.5x** | **4.5x** | Sequential memory access (10M elements) |
+| object_create | 5ms | 9ms | 7ms | **1.8x** | **1.4x** | Object allocation + field access (1M objects) |
 
-> **Note:** Perry is under active development, so benchmarks are subject to change — but they usually only get better. We're continuously optimizing the codegen pipeline, and each release tends to improve performance across the board.
+Perry compiles to native machine code, so there is no JIT warmup and no interpreter overhead. Function calls, recursion, and sequential memory access compile to direct native instructions.
 
-Run benchmarks yourself: `cd benchmarks && ./run_benchmarks.sh` (requires node, bun, cargo)
+**Competitive: within 2x of JIT runtimes, except string_concat:**
+
+| Benchmark | Perry | Node.js | Bun | vs Node | vs Bun | What it tests |
+|-----------|-------|---------|-----|---------|--------|---------------|
+| method_calls | 16ms | 11ms | 9ms | 0.7x | 0.6x | Class method dispatch (10M calls) |
+| prime_sieve | 11ms | 8ms | 7ms | 0.7x | 0.6x | Sieve of Eratosthenes (boolean array + branches) |
+| string_concat | 7ms | 2ms | 1ms | 0.3x | 0.1x | 100K string appends (in-place with capacity) |
+
+Method dispatch uses direct function calls (no vtable). String concatenation uses amortized O(1) in-place appending. V8 and JSC have inline caches and rope strings that make these workloads faster still.
+
+**Node.js and Bun lead on f64 math and SIMD-vectorizable loops:**
+
+| Benchmark | Perry | Node.js | Bun | vs Node | vs Bun | Why they're faster |
+|-----------|-------|---------|-----|---------|--------|-------------------|
+| mandelbrot | 71ms | 25ms | 31ms | 0.3x | 0.4x | V8 TurboFan schedules f64 ops across 2 FPUs more aggressively than Cranelift |
+| matrix_multiply | 61ms | 36ms | 36ms | 0.6x | 0.6x | V8 auto-vectorizes nested loops with SIMD (NEON on ARM) |
+| math_intensive | 370ms | 52ms | 53ms | 0.1x | 0.1x | Harmonic series: V8 vectorizes `result += 1.0/i` across SIMD lanes |
+| nested_loops | 32ms | 18ms | 20ms | 0.6x | 0.6x | V8's loop optimization + SIMD for array access in nested loops |
+
+V8's TurboFan JIT has more than a decade of optimization work behind it for tight f64 loops: SIMD auto-vectorization (NEON/SSE), speculative type specialization, and aggressive instruction scheduling. Bun, which runs on JavaScriptCore, posts similar numbers on these benchmarks. Perry's Cranelift backend generates correct scalar code but doesn't yet vectorize. This is the main performance frontier for Perry's codegen.
+
+Run benchmarks yourself: `cd benchmarks/suite && ./run_benchmarks.sh` (requires node, bun, cargo)
 
 ## Binary Size
 
 
@@ -72,7 +72,8 @@ BENCHMARKS="02_loop_overhead.ts
 12_binary_trees.ts
 13_factorial.ts
 14_closure.ts
-15_mandelbrot.ts"
+15_mandelbrot.ts
+16_matrix_multiply.ts"
 
 # Compile all benchmarks first
 echo -e "${BOLD}Compiling benchmarks with Perry...${NC}"