SPEC.md - M2Sim Project Specification

Project Goal

Build a cycle-accurate Apple M2 CPU simulator using the Akita simulation framework that can execute ARM64 user-space programs and predict execution time with high accuracy.

Success Criteria

  • Execute ARM64 user-space programs correctly (functional emulation)
  • Predict execution time with <20% average error across benchmarks (16.9% claimed but unverified — CI has never succeeded, see Issue #492)
  • Modular design: functional and timing simulation are separate
  • Support benchmarks in μs to ms range

Design Philosophy

Independence from MGPUSim

While M2Sim uses Akita (like MGPUSim) and draws inspiration from MGPUSim's architecture, M2Sim is not bound to follow MGPUSim's structure. Make design decisions that best fit an ARM64 CPU simulator.

Guidelines:

  1. Choose meaningful names: If a different name is more appropriate, use it
  2. Adapt to CPU semantics: GPU and CPU have different abstractions (no wavefronts, warps, or GPU-specific concepts)
  3. Keep it simple: M2Sim targets single-core initially
  4. Diverge when it makes sense: Document why you're doing it differently

What to Keep from MGPUSim:

  • Akita component/port patterns (they work well)
  • Separation of concerns (functional vs timing)
  • Testing practices (Ginkgo/Gomega)

When in Doubt: Ask "What would make this clearest for a CPU simulator?" — not "What does MGPUSim do?"

Milestones

High-Level Milestones

| # | Milestone | Status |
|---|-----------|--------|
| H1 | Core simulator (decode, execute, timing, caches) | ✅ COMPLETE |
| H2 | SPEC benchmark enablement (syscalls, ELF loading, validation) | ✅ COMPLETE |
| H3 | Accuracy calibration (<20% error on microbenchmarks) | ✅ COMPLETE (14.1%) |
| H4 | Multi-core support | ⬜ NOT STARTED |
| H5 | 15+ Intermediate Benchmarks (<20% average error) | 18 benchmarks with error data; 61.71% avg error (all CI-verified) |

H1: Core Simulator ✅ COMPLETE

All foundation work is done: ARM64 decode, ALU/Load/Store/Branch instructions, pipeline timing (Fetch/Decode/Execute/Memory/Writeback), cache hierarchy (L1I, L1D, L2), branch prediction, 8-wide superscalar, macro-op fusion, SIMD basics. Microbenchmark suite established with 34.2% average CPI error.

Completed sub-milestones (M1–M5, C1)
  • M1: Foundation — project scaffold, decoder, register file, ALU, load/store, branches
  • M2: Memory & control flow — syscall emulation (exit, write), flat memory, end-to-end C programs
  • M3: Timing model — pipeline stages, instruction timing
  • M4: Cache hierarchy — L1I, L1D, L2 caches with timing
  • M5: Advanced features — branch prediction, 8-wide superscalar, macro-op fusion, SIMD
  • C1: Baseline — microbenchmarks created, M2 data collected, initial error 39.8% → 34.2%

H2: SPEC Benchmark Enablement ✅ COMPLETE

Goal: Run SPEC CPU 2017 integer benchmarks end-to-end in M2Sim.

Status: All core infrastructure complete. PR #300 merged (syscall coverage), PR #315 needs merge (medium benchmarks). Ready for H3 calibration phase.

H2.1: Syscall Coverage (medium-level) ✅ COMPLETE

Complete the set of Linux syscalls needed by SPEC benchmarks.

H2.1.1: Core file I/O syscalls ✅ COMPLETE
  • read (63), write (64), close (57), openat (56) — all merged
  • FD table infrastructure — merged
  • fstat (80) — merged
  • File I/O acceptance tests — merged (PR #283)
H2.1.2: Memory management syscalls ✅ COMPLETE
  • brk (214) — merged
  • mmap (222) — merged
H2.1.3: Remaining syscalls ✅ COMPLETE
  • lseek (62) — merged (PR #282)
  • exit_group (94) — merged (PR #299)
  • mprotect (226) — merged (PR #300)
H2.1.4: Lower-priority syscalls ⬜ NOT STARTED (~10-20 cycles)
  • munmap (215) — issue #271
  • clock_gettime (113) — issue #274
  • getpid/getuid/gettid — issue #273
  • newfstatat (79) — may be needed by some benchmarks
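The syscall numbers listed above can be collected into a lookup table. The following is a minimal sketch, not M2Sim's actual dispatch code: the names `syscallNames`, `implemented`, and `Lookup` are hypothetical, and the implemented set simply mirrors the items marked as merged in H2.1.1 through H2.1.3. On AArch64 Linux the syscall number arrives in register X8, so an emulator would typically pass that value to a function like this one.

```go
package main

import "fmt"

// syscallNames maps the AArch64 Linux syscall numbers listed in H2.1 to names.
var syscallNames = map[uint64]string{
	56: "openat", 57: "close", 62: "lseek", 63: "read", 64: "write",
	79: "newfstatat", 80: "fstat", 94: "exit_group", 113: "clock_gettime",
	214: "brk", 215: "munmap", 222: "mmap", 226: "mprotect",
}

// implemented marks the syscalls that H2.1 reports as merged.
var implemented = map[uint64]bool{
	56: true, 57: true, 62: true, 63: true, 64: true,
	80: true, 94: true, 214: true, 222: true, 226: true,
}

// Lookup returns the syscall name, plus an error when the number is unknown
// or the syscall is listed in H2.1.4 but not yet implemented.
func Lookup(num uint64) (string, error) {
	name, known := syscallNames[num]
	if !known {
		return "", fmt.Errorf("syscall %d: unknown", num)
	}
	if !implemented[num] {
		return name, fmt.Errorf("syscall %d (%s): not implemented", num, name)
	}
	return name, nil
}

func main() {
	for _, num := range []uint64{63, 215, 999} {
		name, err := Lookup(num)
		fmt.Println(num, name, err)
	}
}
```

Returning an error that names the missing syscall makes it easy to see which H2.1.4 item a failing benchmark actually needs.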

H2.2: Micro & Medium Benchmarks (medium-level) 🚧 IN PROGRESS

Human guidance (issue #107): Going directly to SPEC is too large a leap. We need more microbenchmarks and medium-sized benchmarks first. SPEC simulations are long-running and must not be run by agents directly — they should run in CI (GitHub Actions) with sufficient time limits, triggered periodically (e.g., every 24 hours).

H2.2.1: Expand microbenchmark suite 🚧 NEARLY COMPLETE
  • Add microbenchmarks for memory access patterns (strided) — merged (PR #302)
  • Add microbenchmarks for instruction mix (load-heavy, store-heavy, branch-heavy) — merged (PR #302)
  • Add microbenchmarks for cache behavior (L1 hit, L2 hit, cache miss)
  • Native assembly implementations created — Diana completed all 4 benchmarks (issue #309)
  • Collect M2 hardware CPI data for new microbenchmarks — ready for measurement (issue #309)
H2.2.2: Medium-sized benchmarks ✅ FIRST BENCHMARK READY
  • Matrix multiply benchmark created — Leo completed 100x100 integer matrix multiply (PR #315, merge pending)
  • Create additional medium benchmarks: linked list traversal, sorting algorithms, simple parsers (future H2 extensions)
  • Issue #291 tracks additional medium benchmark work

H2.3: SPEC Binary Preparation (medium-level) ✅ COMPLETE

Issue #285 resolved: Workers successfully compiled ARM64 Linux ELF binaries using cross-compilation toolchain.

H2.3.1: Cross-compilation setup ✅ COMPLETE
  • Workers install/use ARM64 Linux cross-compiler (aarch64-linux-musl-gcc) — merged (PR #306)
  • Create build scripts for ARM64 Linux static ELF — merged (PR #306)
  • Rebuild SPEC benchmarks as ELF — merged (PR #306)
H2.3.2: Benchmark validation 🚧 IN PROGRESS (~10-20 cycles per benchmark)
  • 548.exchange2_r — Sudoku solver, compiled as ARM64 ELF, ready for validation (issue #277)
  • 505.mcf_r — vehicle scheduling, compiled as ARM64 ELF
  • 541.leela_r — Go AI, minimal I/O
  • 531.deepsjeng_r — chess engine, compiled as ARM64 ELF

Important: SPEC simulation runs must go through CI/GitHub Actions, not be run by agents directly.

H2.4: Instruction Coverage Gaps 🚧 IN PROGRESS

SPEC benchmarks will likely exercise ARM64 instructions not yet implemented. Expect to discover and fix gaps during validation (H2.3.2).

H2.4.1: SIMD/FP dispatch wiring ✅ COMPLETE
  • Wire FormatSIMDReg and FormatSIMDLoadStore in emulator — merged (PR #301)
  • VFADD, VFSUB, VFMUL now reachable through emulator dispatch
H2.4.2: Scalar floating-point instructions ⬜ NOT STARTED (~20-40 cycles)
  • Basic scalar FP arithmetic: FADD, FSUB, FMUL, FDIV
  • FP load/store: LDR/STR for S and D registers
  • FP moves and comparisons: FMOV, FCMP
  • Int↔FP conversions: SCVTF, FCVTZS
  • Update SUPPORTED.md with all FP instructions — blocked (issue #305, QA responsibility)

Strategy: Don't implement proactively. Attempt benchmark execution first; add scalar FP support reactively when benchmarks fail on unimplemented opcodes. SPEC integer benchmarks may not need much FP.


H3: Accuracy Calibration ✅ FRAMEWORK COMPLETE → H5: 15+ Benchmark Goal

H3 Goal Achieved: <20% average CPI error on microbenchmarks vs real M2 hardware (14.1% achieved).

Strategic Transition (Issue #433): Human-specified goal of 15+ intermediate benchmarks with <20% average error.

Important distinction (issue #354): "Simulation time" = wall-clock time to run the simulator. "Virtual time" = the predicted execution time on the simulated M2 hardware. Our accuracy target is about virtual time matching real hardware.
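The two quantities can be kept apart in code by measuring one with the host clock and deriving the other from the simulated cycle count. This is a minimal sketch under stated assumptions: `assumedClockHz` is an illustrative frequency rather than a value from this spec, and `runSimulation` is a hypothetical stand-in for a simulator run.

```go
package main

import (
	"fmt"
	"time"
)

// assumedClockHz is an illustrative core frequency, not a value from this spec.
const assumedClockHz = 3.5e9

// VirtualTime converts simulated cycles into predicted execution time
// on the modeled hardware.
func VirtualTime(cycles uint64, clockHz float64) time.Duration {
	seconds := float64(cycles) / clockHz
	return time.Duration(seconds * float64(time.Second))
}

// runSimulation is a hypothetical placeholder that returns a fixed cycle count.
func runSimulation() uint64 { return 3_500_000 }

func main() {
	start := time.Now()
	cycles := runSimulation()
	simulationTime := time.Since(start) // wall-clock cost of running the simulator

	fmt.Println("simulation time (wall clock):", simulationTime)
	fmt.Println("virtual time (predicted):", VirtualTime(cycles, assumedClockHz))
}
```

Only the second printed value is compared against real M2 measurements; the first describes how long the simulator itself took.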

ACCURACY STATUS (February 16, 2026):

Current state: 46.21% average error across 16 benchmarks (11 microbenchmarks + 5 PolyBench). Does NOT meet <20% target.

Pipeline changes (PRs #65-74): Branch relaxation for OoO-style loop overlap, speculative store blocking, register checkpoint, AfterBranch clearing revert, ALU→Load address forwarding. These changes improved PolyBench accuracy but regressed microbenchmarks (memorystrided: 10.8% → 253.1%).

Current calibrated benchmark status:

| Benchmark | Error | Status |
|-----------|-------|--------|
| dependency | 7.2% | ✅ Calibrated |
| branch | 0.6% | ✅ Calibrated |
| memorystrided | 253.1% | ❌ Regressed (pipeline changes) |

Original 3-benchmark average: 86.9% (was 14.1% before PRs #65-74)

Error formula: abs(t_sim - t_real) / min(t_sim, t_real). Target: <20% average.
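As a quick check, the formula can be written as a small Go function and applied to CPI pairs from the microbenchmark results table. This is an illustrative sketch; the function name `ErrorPct` is not taken from the M2Sim codebase.

```go
package main

import (
	"fmt"
	"math"
)

// ErrorPct returns abs(sim - hw) / min(sim, hw) as a percentage.
// Both inputs must use the same unit (time or CPI).
func ErrorPct(sim, hw float64) float64 {
	return math.Abs(sim-hw) / math.Min(sim, hw) * 100
}

func main() {
	// Sim CPI and HW CPI values from the microbenchmark results table.
	fmt.Printf("arithmetic:    %.2f%%\n", ErrorPct(0.219, 0.296)) // 35.16%
	fmt.Printf("memorystrided: %.2f%%\n", ErrorPct(0.750, 2.648)) // 253.07%
}
```

Dividing by the smaller value makes the metric symmetric: predicting twice the real time and predicting half the real time both score 100%.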

Accuracy journey: 39.8% (baseline) → 34.2% (C1) → 22.8% (branch penalty fix) → 17.6% (fetch-stage branch target extraction) → 14.1% (H3 target achieved) → 46.2% (after pipeline rework PRs #65-74, 16 benchmarks)

H3.1: Calibration Infrastructure ✅ COMPLETE

  • H3 calibration framework deployed (PR #321 merged)
  • SIMD DUP + MRS system instructions implemented (PR #321)
  • Matrix multiply benchmark created (PR #315)
  • Microbenchmark ARM64 ELF compilation complete

H3.2: Fast Timing Mode & Calibration 🚧 IN PROGRESS

The full pipeline timing simulation is ~30,000x slower than emulation, making iterative calibration impractical. A "fast timing" mode approximates cycle counts using latency-weighted instruction mix without full pipeline simulation.
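The idea of a latency-weighted instruction mix can be sketched as follows. The instruction classes, the latency values, and the division by issue width are all illustrative assumptions; they are not the contents of timing/pipeline/fast_timing.go.

```go
package main

import "fmt"

// Class is a coarse instruction category (hypothetical grouping).
type Class int

const (
	ALU Class = iota
	Load
	Store
	Branch
)

// assumedLatency maps each class to an assumed average cost in cycles.
var assumedLatency = map[Class]float64{ALU: 1, Load: 4, Store: 1, Branch: 2}

// EstimateCycles sums count * latency over all classes and divides by the
// issue width, approximating a cycle count without simulating pipeline stages.
func EstimateCycles(mix map[Class]uint64, issueWidth float64) float64 {
	var weighted float64
	for class, count := range mix {
		weighted += float64(count) * assumedLatency[class]
	}
	return weighted / issueWidth
}

func main() {
	mix := map[Class]uint64{ALU: 600, Load: 200, Store: 100, Branch: 100}
	cycles := EstimateCycles(mix, 4)
	fmt.Printf("estimated cycles: %.0f, CPI: %.3f\n", cycles, cycles/1000)
}
```

Because the estimate needs only per-class instruction counts, it can run at close to emulation speed, at the cost of ignoring hazards and cache behavior.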

Status:

  • Fast timing engine merged (timing/pipeline/fast_timing.go — PR #361)
  • Instruction limit support added
  • Profile tool merged (cmd/profile/main.go — PR #361)
  • CI blockers fixed (PR #368 — gofmt + acceptance test timeout)
  • Root cause analysis merged (PR #367) — identifies arithmetic over-blocking as dominant error source
  • CPI comparison framework merged (PR #376)
  • Run matrix multiply with fast timing via GitHub Actions, collect CPI data (issue #359, PR #379 open)
  • Fix fast timing decoder: add MADD/UBFM instruction support (issue #380) — blocks matmul CPI data
  • Clearly label outputs: simulation speed vs virtual (predicted) time (issue #354)

Key insight from CPI comparison (PR #376): Fast timing is closer to M2 hardware on branch (4.3% error) and dependency (8.8% error) than the full pipeline (22.7% and 10.3%), confirming that the full pipeline's RAW hazard over-blocking is the primary accuracy bottleneck.

H3.3: Parameter Tuning ✅ TARGET MET

Root cause analysis complete (PR #367). All major tuning work done:

  • Arithmetic: 34.5% error — Accepted as in-order limitation (issue #386). WAW hazard blocking prevents co-issue. Fixing requires OOO/register renaming (future work).
  • Branch: 1.3% error — Fixed via fetch-stage branch target extraction (PR #393), benchmark scaling (PR #395), and fallback CPI update (PR #396). Down from 22.7%.
  • Dependency: 6.7% error — Improved via benchmark scaling (PR #394). Down from 10.3%.
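The co-issue restriction behind the arithmetic error can be illustrated with a simplified hazard check. This is a sketch under assumptions: the `Inst` type and `CanCoIssue` function are hypothetical, each instruction has at most one destination register, and both RAW and WAW dependences are treated as blocking, as they would be in an in-order pipeline without register renaming.

```go
package main

import "fmt"

// Inst is a simplified instruction with one destination and some sources.
// The type is hypothetical and does not mirror M2Sim's pipeline structs.
type Inst struct {
	Dst  int   // destination register, -1 if none
	Srcs []int // source registers
}

// CanCoIssue reports whether younger may issue in the same cycle as older.
func CanCoIssue(older, younger Inst) bool {
	if older.Dst < 0 {
		return true // older writes nothing, so no RAW or WAW is possible
	}
	if younger.Dst == older.Dst {
		return false // WAW: both write the same register
	}
	for _, s := range younger.Srcs {
		if s == older.Dst {
			return false // RAW: younger reads older's result
		}
	}
	return true
}

func main() {
	add1 := Inst{Dst: 0, Srcs: []int{1, 2}} // ADD X0, X1, X2
	add2 := Inst{Dst: 0, Srcs: []int{3, 4}} // ADD X0, X3, X4 (WAW with add1)
	add3 := Inst{Dst: 5, Srcs: []int{3, 4}} // ADD X5, X3, X4 (independent)
	fmt.Println(CanCoIssue(add1, add2))     // false
	fmt.Println(CanCoIssue(add1, add3))     // true
}
```

With register renaming, the WAW case would map the two writes to different physical registers and could co-issue, which is why the remaining arithmetic error is attributed to the in-order design.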

Completed work:

  • Fix branch misprediction penalty (14 → 12 cycles) — PR #372
  • Root cause analysis with tuning recommendations — PR #367
  • Investigate same-cycle forwarding (PR #381, zero impact due to WAW)
  • CPI comparison framework — PR #376
  • Fix branch prediction in all fetch slots — PR #385
  • Fetch-stage branch target extraction — PR #393
  • Scale benchmarks to reduce pipeline overhead — PRs #394, #395
  • Update fallback CPIs — PR #396
  • Separate calibrated vs uncalibrated benchmarks — PR #392
  • Normalized cycles PDF chart — PR #390

Remaining work:

  • Document in-order pipeline accuracy limitation (issue #386)
  • Review PR #397 (ALU execution port limit modeling)
  • Multi-scale validation (64x64 → 256x256 matrix multiply)
  • Expand to more benchmark types beyond arithmetic/dependency/branch

H3.4: SPEC-level calibration 🚧 NEXT PRIORITY

Microbenchmark accuracy target met (14.1%). Now validate on real SPEC workloads.

  • Set up CI workflow for long-running SPEC benchmark timing (issue #307)
  • Run SPEC integer benchmarks with full pipeline timing, compare to M2 hardware
  • All calibrated benchmarks <30% individual error, <20% average
  • Fill instruction coverage gaps discovered during SPEC execution (issue #304)
  • Add more medium-sized benchmarks for broader coverage (issue #291)

Prerequisites: SPEC binary validation (H2.3.2) must progress — need confirmed-working ARM64 ELF binaries for at least one SPEC benchmark.

Strategy: Start with the simplest SPEC benchmark (548.exchange2_r — Sudoku solver, pure integer). Run in CI with sufficient timeout (issue #362 — no direct agent execution). Compare CPI against M2 hardware measurements.


H4: Multi-Core Support ⬜ NOT STARTED

Goal: Extend M2Sim to simulate multi-core M2 architectures with cache coherence and shared memory.

Status: Not started. Previous "strategic planning" produced analysis documents but no actual multi-core simulation code. The real work — cache coherence protocol, shared memory subsystem, inter-core communication — has not begun.

Prerequisites: H5 must be CI-verified before H4 work begins.

Required implementation (future):

  • Cache coherence protocol (e.g., MOESI)
  • Shared memory subsystem
  • Inter-core communication and synchronization
  • Multi-core timing validation
  • Leverage Akita's multi-component patterns for cache coherence modeling

H5: 16 Benchmarks with Error Data (February 16, 2026)

Goal: Achieve <20% average error across 15+ benchmarks with hardware CPI comparison.

STATUS: 16 benchmarks with error data (11 microbenchmarks + 5 PolyBench). Data from PR #74 CI (AfterBranch revert + ALU→Load forwarding). GEMM exceeded 5B cycle limit; 2MM timed out (55m Go test timeout).

Results:

  • Total benchmarks with error data: 16 (11 microbenchmarks + 5 PolyBench)
  • Overall average error: 46.21% — does NOT meet <20% target
  • Microbenchmark average error: 38.73% (11 benchmarks) — does NOT meet target (memorystrided regressed to 253%)
  • PolyBench average error: 62.66% (5 benchmarks) — does NOT meet target
  • Data source: h5_accuracy_results.json — CI-verified from PR #74 PolyBench run 22074740889

Microbenchmark Results (38.73% average error)

| Benchmark | Sim CPI | HW CPI | Error |
|-----------|---------|--------|-------|
| arithmetic | 0.219 | 0.296 | 35.16% |
| dependency | 1.015 | 1.088 | 7.19% |
| branch | 1.311 | 1.303 | 0.61% |
| memorystrided | 0.750 | 2.648 | 253.07% |
| loadheavy | 0.349 | 0.429 | 22.92% |
| storeheavy | 0.522 | 0.612 | 17.24% |
| branchheavy | 0.941 | 0.714 | 31.79% |
| vectorsum | 0.362 | 0.402 | 11.05% |
| vectoradd | 0.290 | 0.329 | 13.45% |
| reductiontree | 0.406 | 0.480 | 18.23% |
| strideindirect | 0.609 | 0.528 | 15.34% |

PolyBench Results (62.66% average error — 5/7 benchmarks completed, 2 infeasible)

| Benchmark | Sim CPI | HW CPI | Error | Dataset | Status |
|-----------|---------|--------|-------|---------|--------|
| atax | 0.186 | 0.2185 | 17.5% | SMALL | complete |
| bicg | 0.392 | 0.2295 | 70.8% | SMALL | complete |
| gemm | | 0.2332 | | SMALL | infeasible (18.8B insts in 5B cycles) |
| mvt | 0.279 | 0.2156 | 29.4% | SMALL | complete |
| jacobi-1d | 0.349 | 0.1510 | 131.1% | SMALL | complete |
| 2mm | | 0.1435 | | MINI | infeasible (55m Go test timeout) |
| 3mm | 0.239 | 0.1453 | 64.5% | MINI | complete |

Data from PR #74 CI (PolyBench run 22074740889). GEMM exceeded 5B cycle limit executing 18.8B instructions. 2MM hit Go test timeout (55m) before reaching cycle limit. Pipeline changes (ALU→Load forwarding, unconditional AfterBranch clearing) significantly reduced sim CPI compared to pre-PR#65 values.

Known Gap: Pipeline Accuracy

Key regressions: PRs #65-74 introduced branch relaxation, speculative store blocking, register checkpointing, and ALU→Load forwarding. These changes significantly improved PolyBench CPI (reduced from 0.34-0.55 to 0.19-0.39) but caused memorystrided to regress from 10.8% to 253.1% error (sim CPI dropped from 2.933 to 0.750 vs HW 2.648).

The overall error (46.21%) is dominated by memorystrided (253%) and jacobi-1d (131%). Without these two outliers, the remaining 14 benchmarks average 25.4% error, which still misses the <20% target.
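The effect of the two outliers can be recomputed directly from the per-benchmark errors in the H5 tables. The sketch below is illustrative: the `Result` type and `Mean` helper are hypothetical names, and the error values are copied from the microbenchmark and full coverage tables.

```go
package main

import "fmt"

// Result pairs a benchmark with its error percentage from the H5 tables.
type Result struct {
	Name     string
	ErrorPct float64
}

var results = []Result{
	{"arithmetic", 35.16}, {"dependency", 7.19}, {"branch", 0.61},
	{"memorystrided", 253.07}, {"loadheavy", 22.92}, {"storeheavy", 17.24},
	{"branchheavy", 31.79}, {"vectorsum", 11.05}, {"vectoradd", 13.45},
	{"reductiontree", 18.23}, {"strideindirect", 15.34},
	{"atax", 17.47}, {"bicg", 70.81}, {"mvt", 29.41},
	{"jacobi-1d", 131.13}, {"3mm", 64.49},
}

// Mean returns the average error, skipping any benchmark named in exclude.
func Mean(rs []Result, exclude ...string) float64 {
	skip := make(map[string]bool, len(exclude))
	for _, name := range exclude {
		skip[name] = true
	}
	var sum float64
	var n int
	for _, r := range rs {
		if skip[r.Name] {
			continue
		}
		sum += r.ErrorPct
		n++
	}
	if n == 0 {
		return 0
	}
	return sum / float64(n)
}

func main() {
	fmt.Printf("all benchmarks:   %.2f%%\n", Mean(results))
	fmt.Printf("without outliers: %.2f%%\n", Mean(results, "memorystrided", "jacobi-1d"))
}
```

An arithmetic mean is sensitive to single large values, so reporting the median alongside it may give a fairer picture of typical accuracy.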

Infeasible Benchmarks — Detailed Analysis (CI Run #22019560953)

All 6 EmBench benchmarks below are at LOCAL_SCALE_FACTOR=1, CPU_MHZ=1 (minimum workload). Four hit the 5B cycle limit and two hit CI timeouts; all six were skipped in CI.

| Benchmark | Category | Insts at 5B Cycles | CPI at Limit | Wall Time | Reduction Possible? |
|-----------|----------|--------------------|--------------|-----------|---------------------|
| crc32 | embench | 12,499,991,230 | 0.400 | 44m29s | No — single iteration of 1024 CRC ops |
| edn | embench | 13,000,000,936 | 0.385 | 45m10s | Maybe — N=100, ORDER=50 (needs rebuild) |
| statemate | embench | 9,210,526,236 | 0.543 | 35m17s | No — complex state machine, no size knob |
| primecount | embench | 9,999,999,968 | 0.500 | 42m49s | No — already at SZ=3, NPRIMES=9 (minimum) |
| huffbench | embench | CI timeout (2h30m) | | 2h30m | Maybe — TEST_SIZE=500 (needs rebuild) |
| matmult-int | embench | CI timeout (never started) | | | Maybe — UPPERLIMIT=20 (needs rebuild) |

Key finding: Even at minimum workload, the four benchmarks that reached the 5B cycle limit had already executed 9-13 billion instructions without completing. The working EmBench benchmark (aha_mont64) completes in just 4,378 instructions / 1,518 cycles, roughly 6 orders of magnitude fewer instructions. The infeasible benchmarks' workloads are inherently large even at scale factor 1.
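The "CPI at Limit" column follows from dividing the 5B cycle budget by the instructions retired when the run was cut off. A minimal sketch, with `CPIAtLimit` as a hypothetical helper name and instruction counts taken from the table above:

```go
package main

import "fmt"

// cycleLimit is the 5B cycle budget described in this section.
const cycleLimit = 5_000_000_000

// CPIAtLimit returns cycles per instruction for a run stopped at the cycle limit.
func CPIAtLimit(instsRetired uint64) float64 {
	return float64(cycleLimit) / float64(instsRetired)
}

func main() {
	runs := []struct {
		name  string
		insts uint64
	}{
		{"crc32", 12_499_991_230},
		{"edn", 13_000_000_936},
		{"statemate", 9_210_526_236},
		{"primecount", 9_999_999_968},
	}
	for _, r := range runs {
		fmt.Printf("%-10s CPI at limit: %.3f\n", r.name, CPIAtLimit(r.insts))
	}
}
```

These values describe only the portion of each benchmark that ran, so they are not comparable to the completed-run CPIs in the results tables.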

Benchmarks with potential further reduction (edn, huffbench, matmult-int) would require an ARM64 cross-compiler to rebuild the ELF binaries with smaller parameters.

Previously infeasible, now resolved: 3mm (PolyBench) was rebuilt with MINI_DATASET and completes successfully (CPI 0.239).

Newly infeasible after PRs #65-74: GEMM exceeded 5B cycle limit (18.8B insts at CPI ~0.27); 2MM hit 55m Go test timeout before reaching 5B cycles. Both were previously completing successfully.

Full Benchmark Coverage Table

| Benchmark | Category | Status | Sim CPI | HW CPI | Error | Notes |
|-----------|----------|--------|---------|--------|-------|-------|
| arithmetic | microbenchmark | complete | 0.219 | 0.296 | 35.16% | Regressed from 9.6% |
| dependency | microbenchmark | complete | 1.015 | 1.088 | 7.19% | |
| branch | microbenchmark | complete | 1.311 | 1.303 | 0.61% | |
| memorystrided | microbenchmark | complete | 0.750 | 2.648 | 253.07% | Regressed from 10.8% |
| loadheavy | microbenchmark | complete | 0.349 | 0.429 | 22.92% | |
| storeheavy | microbenchmark | complete | 0.522 | 0.612 | 17.24% | |
| branchheavy | microbenchmark | complete | 0.941 | 0.714 | 31.79% | Regressed from 16.1% |
| vectorsum | microbenchmark | complete | 0.362 | 0.402 | 11.05% | |
| vectoradd | microbenchmark | complete | 0.290 | 0.329 | 13.45% | |
| reductiontree | microbenchmark | complete | 0.406 | 0.480 | 18.23% | |
| strideindirect | microbenchmark | complete | 0.609 | 0.528 | 15.34% | |
| atax | polybench | complete | 0.186 | 0.2185 | 17.47% | Improved from 105.9% |
| bicg | polybench | complete | 0.392 | 0.2295 | 70.81% | Improved from 104.8% |
| gemm | polybench | infeasible | | 0.2332 | | 18.8B insts in 5B cycles; was 87.4% |
| mvt | polybench | complete | 0.279 | 0.2156 | 29.41% | Improved from 119.9% |
| jacobi-1d | polybench | complete | 0.349 | 0.1510 | 131.13% | Improved from 263.6% |
| 2mm | polybench | infeasible | | 0.1435 | | MINI dataset; 55m Go test timeout |
| 3mm | polybench | complete | 0.239 | 0.1453 | 64.49% | MINI dataset; improved from 136.1% |
| crc32 | embench | infeasible | | | | 12.5B insts in 5B cycles; min workload |
| edn | embench | infeasible | | | | 13.0B insts in 5B cycles; N/ORDER reducible |
| statemate | embench | infeasible | | | | 9.2B insts in 5B cycles; no size knob |
| primecount | embench | infeasible | | | | 10.0B insts in 5B cycles; SZ=3 is minimum |
| huffbench | embench | infeasible | | | | CI timeout after 2h30m; memory-intensive loop; TEST_SIZE reducible |
| matmult-int | embench | infeasible | | | | CI timeout; never started; UPPERLIMIT reducible |

Scope

In Scope

  • ARM64 user-space instructions
  • CPU core simulation (single-core MVP, multi-core later)
  • Cache hierarchy
  • Timing prediction

Out of Scope

  • GPU / Neural Engine
  • Kernel-space execution
  • Full OS simulation
  • I/O devices beyond basic syscalls

Technical Constraints

  • Use Akita v4 simulation framework
  • Follow MGPUSim architecture patterns where they fit a CPU simulator (see Design Philosophy)
  • Go programming language
  • Tests use Ginkgo/Gomega

References