cc: byte-out narrowing + skip dead pinned-register saves#449

Open
bboe wants to merge 2 commits into main from bboe/cc-narrow-eax-and-dead-edx-push

@bboe bboe commented May 20, 2026

Summary

Two cc.py optimizations, one commit each.

Commit 1: narrow mov eax, imm → mov al, imm for byte out.
After peephole_fold_byte_immediate_through_local rewrites the byte-load idiom into a full-width immediate load, the only consumer is out dx, al which reads AL only. New peephole proves EAX upper bits are dead between the load and the next full {acc} clobber, then narrows. Saves 3 bytes per site in 32-bit (5 → 2), 1 byte in 16-bit. Phase 2 bails conservatively at labels, ret, and any jump, so loop-tail and trailing-port sites stay unnarrowed.
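
For reviewers who want the shape of the two-phase walk, here is a minimal sketch over a list of normalized instruction strings. It is not the code in cc.py: the function name, the regexes, the string-based instruction form, and the extra bail at `call` are all assumptions made for illustration, and it only handles the 32-bit `eax` spelling of `{acc}`.

```python
import re

# Hypothetical helpers; the real peephole in cc.py may look quite different.
ACC_TOKEN = re.compile(r"\b(eax|ax|ah|al)\b")
MOV_EAX_IMM = re.compile(r"^mov eax, (0x[0-9a-fA-F]+|0|[1-9]\d*)$")
# A full clobber overwrites all of EAX without reading any part of it.
FULL_CLOBBER = re.compile(r"^(mov eax, (?!.*\b(eax|ax|ah|al)\b).+|xor eax, eax)$")
BYTE_OUT = "out dx, al"


def _is_barrier(ins):
    """Labels, ret, call, and any jump end the walk conservatively."""
    op = ins.split()[0]
    return ins.endswith(":") or op in {"ret", "call"} or op.startswith(("j", "loop"))


def narrow_acc_immediate_for_byte_out(lines):
    out = list(lines)
    for i, ins in enumerate(out):
        m = MOV_EAX_IMM.match(ins)
        if not m or int(m.group(1), 0) > 0xFF:
            continue
        # Phase 1: the next accumulator-touching instruction must be the byte out.
        j = i + 1
        while j < len(out) and not _is_barrier(out[j]) and not ACC_TOKEN.search(out[j]):
            j += 1
        if j >= len(out) or out[j] != BYTE_OUT:
            continue
        # Phase 2: EAX upper bits must stay unread until a full clobber.
        k, dead = j + 1, False
        while k < len(out):
            nxt = out[k]
            if FULL_CLOBBER.match(nxt):
                dead = True
                break
            if _is_barrier(nxt) or (ACC_TOKEN.search(nxt) and nxt != BYTE_OUT):
                break
            k += 1
        if dead:
            out[i] = f"mov al, {m.group(1)}"
    return out


narrowed = narrow_acc_immediate_for_byte_out(
    ["mov dx, 0x3f8", "mov eax, 0x41", "out dx, al", "mov eax, [ebp-4]", "ret"]
)
trailing = narrow_acc_immediate_for_byte_out(["mov eax, 0x20", "out dx, al", "ret"])
```

In the first call the load is narrowed because `mov eax, [ebp-4]` clobbers all of EAX before anything else reads it. In the second the walk hits `ret` first, so the site stays wide, which matches the trailing-port case described above.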

Commit 2: skip dead pinned-register saves around builtin calls.
_pinned_registers_to_save was saving every pinned register in the clobber set unconditionally, even when the pinned local hadn't been written yet — preserving garbage. New per-function pre-pass over the IR computes the may-defined set of pinned-register values at each builtin call site (loops pre-merge body stores into the entry set so the back-edge sees in-loop writes). Block-wrapped VarDecl init / Assign are recognised so the IR escape hatch doesn't leak. Scope intentionally limited to builtin calls — user-function paths stay on the conservative save-everything path.

Combined: 42286 → 42158 bytes (-128 bytes kernel-wide). ping user program shrinks 1544 → 1522 bytes.

Test plan

  • tests/unit/test_cc_codegen.py — 358 PASS (added 4 new tests: narrow / kept-wider / liveness-skipped / liveness-VarDecl).
  • tests/test_cc_casts.py — 6/6.
  • tests/test_cc_bitfields.py — 10/10.
  • tests/test_cc_local_structs.py — 12/12.
  • tests/test_cc_bits.py — 110/110.
  • tests/test_cc_compatibility.py — 57/57.
  • tests/test_asm.py — 42/42 (previously failed macro_sm.asm; fixed by scoping commit 2 to builtin calls only).
  • tests/test_archive.py — 12/12 (updated ping row 1544 → 1522 to reflect the new size).
  • tests/test_kernel_archive.py — 12/12.
  • tests/test_pipeline_basic.py — PASS.
  • make_os.sh succeeds; os.bin 42286 → 42158.

🤖 Generated with Claude Code

@bboe bboe changed the title from "cc: narrow mov eax, imm to mov al, imm for out dx, al" to "cc: byte-out narrowing + skip dead pinned-register saves" May 20, 2026
@bboe bboe force-pushed the bboe/cc-narrow-eax-and-dead-edx-push branch 2 times, most recently from 5bc600b to 857e618 Compare May 20, 2026 17:40
bboe and others added 2 commits May 20, 2026 10:53
After peephole_fold_byte_immediate_through_local rewrites the
*(uint8_t *)&local byte-load idiom into a full-width mov eax, <imm>,
the only AX-touching consumer is out dx, al — which reads AL only.
Add peephole_narrow_acc_immediate_for_byte_out that walks forward to
confirm out dx, al is the consumer and that EAX upper bits are dead
until the next full {acc} clobber, then narrows the load.

Saves: 3 bytes per site in 32-bit (mov eax, imm32 is 5 bytes; mov al,
imm8 is 2 bytes), 1 byte in 16-bit.  Phase 2 bails conservatively at
labels, ret, and jumps, so loop-tail and trailing-port sites where the
function may return without clobbering EAX stay unnarrowed.
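
The byte counts can be checked with a small worked calculation. This sketch assumes the standard x86 encodings (`B8+r` plus a full-width immediate for the wide load, `B0 ib` for `mov al, imm8`) and assumes that in 16-bit code `{acc}` is `ax` with no operand-size prefix; the helper name is hypothetical.

```python
def mov_acc_imm_size(bits, narrow):
    """Encoded size in bytes of loading an immediate into the accumulator."""
    if narrow:
        return 2              # B0 ib: mov al, imm8
    opcode = 1                # B8: mov ax/eax, imm
    immediate = bits // 8     # imm16 in 16-bit code, imm32 in 32-bit code
    return opcode + immediate


# Bytes saved per narrowed site, keyed by code width.
savings = {
    bits: mov_acc_imm_size(bits, narrow=False) - mov_acc_imm_size(bits, narrow=True)
    for bits in (16, 32)
}
```

Under those assumptions the wide load is 5 bytes in 32-bit code and 3 bytes in 16-bit code, which gives the 3-byte and 1-byte savings quoted above.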

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
_pinned_registers_to_save was saving every pinned register in the
builtin's clobber set unconditionally, even when the pinned local
hadn't been written yet — preserving garbage from caller-supplied
state.  Add a per-function pre-pass over the IR that computes the
may-defined set of pinned-register values at each builtin call site:

- Initial set: registers held by parameters (loaded by the prologue).
- On each ir.Copy / ir.BinaryOperation / ir.Index / ir.Call(dest=...)
  whose destination is a pinned local, add that local's register.
- On each ir.Block, peek at the wrapped AST node — VarDecl with init
  and Assign are stores; MemberAssign / IndexAssign / opaque AST go
  through pointers or are escape-hatched, skip.
- For each loop region (Label..backward-Jump), pre-merge body stores
  into the loop's entry set so the back-edge sees in-loop stores on
  every iteration past the first.
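
The rules above can be sketched as a single forward scan with a loop pre-merge. This is an illustration only: the `Store`, `BuiltinCall`, `Label`, and `Jump` classes are hypothetical stand-ins for the real `ir` nodes (Block-wrapped AST stores are folded into `Store` here), and `may_defined_at_calls` and `registers_to_save` are made-up names, not the functions in cc.py.

```python
from dataclasses import dataclass


@dataclass
class Store:          # stand-in for any IR node that writes a destination local
    dest: str


@dataclass
class BuiltinCall:
    name: str


@dataclass
class Label:
    name: str


@dataclass
class Jump:
    target: str


def may_defined_at_calls(body, pinned, parameters):
    """Map each builtin call index to the pinned registers that may hold a value."""
    label_index = {ins.name: i for i, ins in enumerate(body) if isinstance(ins, Label)}
    # Pre-merge: stores inside a Label..backward-Jump region count at loop entry.
    loop_stores = {}
    for i, ins in enumerate(body):
        if isinstance(ins, Jump) and label_index.get(ins.target, i) < i:
            start = label_index[ins.target]
            regs = {pinned[s.dest] for s in body[start:i]
                    if isinstance(s, Store) and s.dest in pinned}
            loop_stores.setdefault(start, set()).update(regs)
    # Initial set: registers held by parameters.
    defined = {pinned[p] for p in parameters if p in pinned}
    result = {}
    for i, ins in enumerate(body):
        defined |= loop_stores.get(i, set())
        if isinstance(ins, Store) and ins.dest in pinned:
            defined.add(pinned[ins.dest])
        elif isinstance(ins, BuiltinCall):
            result[i] = set(defined)
    return result


def registers_to_save(clobbers, may_defined):
    """Only save pinned registers that are both clobbered and possibly defined."""
    return sorted(clobbers & may_defined)


pinned = {"port": "edx", "count": "ecx"}
body = [
    BuiltinCall("kernel_outb"),   # before any store: nothing worth saving
    Store("port"),
    Label("top"),
    BuiltinCall("kernel_inb"),    # inside the loop: sees the later store to count
    Store("count"),
    Jump("top"),
]
sets = may_defined_at_calls(body, pinned, parameters=[])
```

Because the set only grows during the scan, a store skipped by a forward jump is still counted, which errs toward saving too much rather than too little.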

Scope is intentionally limited to builtin calls.  User function calls
go through a separate save-set path that the IR-only analysis can't
fully model — Block-wrapped statements that fall back to AST codegen,
pointer-aliased pinned locals, and ir.CarryBranch wrapping carry-return
callees all stay on the conservative save-everything path.

Saves: -120 byte kasm reduction kernel-wide.  Pre-loop kernel_outb /
kernel_inb call sites in drivers (ata_init, etc.) no longer wrap with
push/pop edx when the EDX-pinned local hasn't been initialised yet.
ping user program shrinks 1544 -> 1522 bytes (archive table updated).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bboe bboe force-pushed the bboe/cc-narrow-eax-and-dead-edx-push branch from 857e618 to 5f0215e Compare May 20, 2026 17:55