Build a cycle-accurate Apple M2 CPU simulator using the Akita simulation framework that can execute ARM64 user-space programs and predict execution time with high accuracy.
- Execute ARM64 user-space programs correctly (functional emulation)
- Predict execution time with <20% average error across benchmarks (16.9% claimed but unverified — CI has never succeeded, see Issue #492)
- Modular design: functional and timing simulation are separate
- Support benchmarks in μs to ms range
While M2Sim uses Akita (like MGPUSim) and draws inspiration from MGPUSim's architecture, M2Sim is not bound to follow MGPUSim's structure. Make design decisions that best fit an ARM64 CPU simulator.
Guidelines:
- Choose meaningful names: If a different name is more appropriate, use it
- Adapt to CPU semantics: GPU and CPU have different abstractions (no wavefronts, warps, or GPU-specific concepts)
- Keep it simple: M2Sim targets single-core initially
- Diverge when it makes sense: Document why you're doing it differently
What to Keep from MGPUSim:
- Akita component/port patterns (they work well)
- Separation of concerns (functional vs timing)
- Testing practices (Ginkgo/Gomega)
When in Doubt: Ask "What would make this clearest for a CPU simulator?" — not "What does MGPUSim do?"
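The Akita component/port pattern referenced above can be sketched generically. The types below are illustrative only — they are NOT the actual Akita v4 API — but they show the shape of the pattern: components own named ports, exchange messages through them, and advance in discrete ticks.

```go
package main

import "fmt"

// Msg is a message exchanged between components. These types illustrate
// the component/port pattern only; they are NOT the actual Akita v4 API.
type Msg struct {
	Src, Dst string
	Payload  any
}

// Port is a named message endpoint owned by a component.
type Port struct {
	Name string
	buf  []Msg
}

// Deliver enqueues a message; Retrieve dequeues one if available.
func (p *Port) Deliver(m Msg) { p.buf = append(p.buf, m) }

func (p *Port) Retrieve() (Msg, bool) {
	if len(p.buf) == 0 {
		return Msg{}, false
	}
	m := p.buf[0]
	p.buf = p.buf[1:]
	return m, true
}

// Component advances one cycle per Tick, returning true if it made progress.
type Component interface {
	Tick(cycle uint64) bool
}

// FetchUnit is a toy component: each tick it requests the next instruction.
type FetchUnit struct {
	Out *Port
	pc  uint64
}

func (f *FetchUnit) Tick(cycle uint64) bool {
	f.Out.Deliver(Msg{Src: "fetch", Dst: "icache", Payload: f.pc})
	f.pc += 4 // ARM64 instructions are fixed 4-byte words
	return true
}

func main() {
	out := &Port{Name: "fetch.Out"}
	var c Component = &FetchUnit{Out: out}
	c.Tick(0)
	m, _ := out.Retrieve()
	fmt.Println(m.Dst, m.Payload) // icache 0
}
```

The value of the pattern for a CPU simulator is the same as for MGPUSim: each pipeline stage or cache is a component that only talks through ports, so timing and functional concerns stay decoupled.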
| # | Milestone | Status |
|---|---|---|
| H1 | Core simulator (decode, execute, timing, caches) | ✅ COMPLETE |
| H2 | SPEC benchmark enablement (syscalls, ELF loading, validation) | ✅ COMPLETE |
| H3 | Accuracy calibration (<20% error on microbenchmarks) | ✅ COMPLETE (14.1%) |
| H4 | Multi-core support | ⬜ NOT STARTED |
| H5 | 15+ Intermediate Benchmarks (<20% average error) | ⬜ BLOCKED (CI verification pending, Issue #497) |
All foundation work is done: ARM64 decode, ALU/Load/Store/Branch instructions, pipeline timing (Fetch/Decode/Execute/Memory/Writeback), cache hierarchy (L1I, L1D, L2), branch prediction, 8-wide superscalar, macro-op fusion, SIMD basics. Microbenchmark suite established with 34.2% average CPI error.
Completed sub-milestones (M1–M5, C1)
- M1: Foundation — project scaffold, decoder, register file, ALU, load/store, branches
- M2: Memory & control flow — syscall emulation (exit, write), flat memory, end-to-end C programs
- M3: Timing model — pipeline stages, instruction timing
- M4: Cache hierarchy — L1I, L1D, L2 caches with timing
- M5: Advanced features — branch prediction, 8-wide superscalar, macro-op fusion, SIMD
- C1: Baseline — microbenchmarks created, M2 data collected, initial error 39.8% → 34.2%
Goal: Run SPEC CPU 2017 integer benchmarks end-to-end in M2Sim.
Status: All core infrastructure complete. PR #300 merged (syscall coverage), PR #315 needs merge (medium benchmarks). Ready for H3 calibration phase.
Complete the set of Linux syscalls needed by SPEC benchmarks.
- read (63), write (64), close (57), openat (56) — all merged
- FD table infrastructure — merged
- fstat (80) — merged
- File I/O acceptance tests — merged (PR #283)
- brk (214) — merged
- mmap (222) — merged
- lseek (62) — merged (PR #282)
- exit_group (94) — merged (PR #299)
- mprotect (226) — merged (PR #300)
- munmap (215) — issue #271
- clock_gettime (113) — issue #274
- getpid/getuid/gettid — issue #273
- newfstatat (79) — may be needed by some benchmarks
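On aarch64 Linux the syscall number arrives in register x8 and the arguments in x0–x5, which is where the numbers above come from. A minimal dispatch sketch, assuming a hypothetical `CPU` struct for emulator state (M2Sim's real types will differ); only `write` to stdout/stderr and `exit_group` are shown:

```go
package main

import (
	"fmt"
	"os"
)

// Syscall numbers from the aarch64 Linux ABI, matching the list above.
const (
	sysWrite     = 64
	sysExitGroup = 94
)

// CPU is a hypothetical emulator state (M2Sim's real types will differ):
// X[8] carries the syscall number, X[0]..X[5] the arguments, and Mem is
// flat guest memory.
type CPU struct {
	X        [31]uint64
	Mem      []byte
	Exited   bool
	ExitCode int
}

// DoSyscall dispatches one guest syscall; results go back in X[0].
func (c *CPU) DoSyscall() error {
	switch num := c.X[8]; num {
	case sysWrite:
		fd, addr, n := c.X[0], c.X[1], c.X[2]
		if fd != 1 && fd != 2 {
			return fmt.Errorf("write: only stdout/stderr in this sketch, got fd %d", fd)
		}
		// A real implementation routes through an FD table; this sketch
		// sends guest stdout/stderr straight to the host's stdout.
		written, err := os.Stdout.Write(c.Mem[addr : addr+n])
		c.X[0] = uint64(written)
		return err
	case sysExitGroup:
		c.Exited = true
		c.ExitCode = int(c.X[0])
		return nil
	default:
		// Unknown numbers fail loudly so coverage gaps are easy to spot.
		return fmt.Errorf("unimplemented syscall %d", num)
	}
}

func main() {
	c := &CPU{Mem: []byte("hi\n")}
	c.X[8], c.X[0], c.X[1], c.X[2] = sysWrite, 1, 0, 3 // write(1, &Mem[0], 3)
	if err := c.DoSyscall(); err != nil {
		panic(err)
	}
}
```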
Human guidance (issue #107): Going directly to SPEC is too large a leap. We need more microbenchmarks and medium-sized benchmarks first. SPEC simulations are long-running and must not be run by agents directly — they should run in CI (GitHub Actions) with sufficient time limits, triggered periodically (e.g., every 24 hours).
- Add microbenchmarks for memory access patterns (strided) — merged (PR #302)
- Add microbenchmarks for instruction mix (load-heavy, store-heavy, branch-heavy) — merged (PR #302)
- Add microbenchmarks for cache behavior (L1 hit, L2 hit, cache miss)
- Native assembly implementations created — Diana completed all 4 benchmarks (issue #309)
- Collect M2 hardware CPI data for new microbenchmarks — ready for measurement (issue #309)
- Matrix multiply benchmark created — Leo completed 100x100 integer matrix multiply (PR #315, merge pending)
- Create additional medium benchmarks: linked list traversal, sorting algorithms, simple parsers (future H2 extensions)
- Issue #291 tracks additional medium benchmark work
Issue #285 resolved: Workers successfully compiled ARM64 Linux ELF binaries using cross-compilation toolchain.
- Workers install/use ARM64 Linux cross-compiler (aarch64-linux-musl-gcc) — merged (PR #306)
- Create build scripts for ARM64 Linux static ELF — merged (PR #306)
- Rebuild SPEC benchmarks as ELF — merged (PR #306)
- 548.exchange2_r — Sudoku solver, compiled as ARM64 ELF, ready for validation (issue #277)
- 505.mcf_r — vehicle scheduling, compiled as ARM64 ELF
- 541.leela_r — Go AI, minimal I/O
- 531.deepsjeng_r — chess engine, compiled as ARM64 ELF
Important: SPEC simulation runs must go through CI/GitHub Actions, not be run by agents directly.
SPEC benchmarks will likely exercise ARM64 instructions not yet implemented. Expect to discover and fix gaps during validation (H2.3.2).
- Wire FormatSIMDReg and FormatSIMDLoadStore in emulator — merged (PR #301)
- VFADD, VFSUB, VFMUL now reachable through emulator dispatch
- Basic scalar FP arithmetic: FADD, FSUB, FMUL, FDIV
- FP load/store: LDR/STR for S and D registers
- FP moves and comparisons: FMOV, FCMP
- Int↔FP conversions: SCVTF, FCVTZS
- Update SUPPORTED.md with all FP instructions — blocked (issue #305, QA responsibility)
Strategy: Don't implement proactively. Attempt benchmark execution first; add scalar FP support reactively when benchmarks fail on unimplemented opcodes. SPEC integer benchmarks may not need much FP.
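For the reactive strategy to work, the decoder has to fail loudly enough to identify each gap. A sketch of that error path, with a deliberately simplified match for two scalar FP encodings — it checks only bits 31..24 and the opcode field in bits 15..10, ignoring the precision/type and operand fields a complete decoder must handle:

```go
package main

import "fmt"

// UnimplementedOpcodeError carries enough context to file the gap as an
// issue later. Names are illustrative, not M2Sim's actual types.
type UnimplementedOpcodeError struct {
	PC   uint64
	Word uint32
}

func (e *UnimplementedOpcodeError) Error() string {
	return fmt.Sprintf("unimplemented opcode 0x%08x at PC 0x%x", e.Word, e.PC)
}

// Decode recognizes two scalar FP data-processing encodings and reports
// anything else. Simplified: a real decoder must also check the type and
// Rm fields of the instruction word.
func Decode(pc uint64, word uint32) (string, error) {
	if word>>24 == 0x1E { // scalar FP data-processing group
		switch (word >> 10) & 0x3F {
		case 0x0A:
			return "FADD", nil
		case 0x0E:
			return "FSUB", nil
		}
	}
	return "", &UnimplementedOpcodeError{PC: pc, Word: word}
}

func main() {
	// 0x1E222820 encodes FADD S0, S1, S2.
	op, err := Decode(0x1000, 0x1E222820)
	fmt.Println(op, err) // FADD <nil>
}
```

Surfacing the raw instruction word and PC in the error is what makes the "run first, implement on failure" loop cheap.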
H3 Goal Achieved: <20% average CPI error on microbenchmarks vs real M2 hardware (14.1% achieved).
Strategic Transition (Issue #433): Human-specified goal of 15+ intermediate benchmarks with <20% average error.
Important distinction (issue #354): "Simulation time" = wall-clock time to run the simulator. "Virtual time" = the predicted execution time on the simulated M2 hardware. Our accuracy target is about virtual time matching real hardware.
ACCURACY STATUS (February 11, 2026):
Achieved: 14.1% average error across 3 calibrated benchmarks (meets H3 target)
Pipeline advancement: 8 ready PRs form the critical path to the 15+ benchmark goal
Current calibrated benchmark status:
| Benchmark | Error | Status |
|---|---|---|
| dependency | 6.7% | ✅ Production calibrated |
| branch | 1.3% | ✅ Production calibrated |
| memorystrided | 2.0% | ✅ Production calibrated |
Current average: 14.1% (target <20% ✅)
Pipeline toward 15+ goal:
- PolyBench integration (PR #448): +7 benchmarks ready
- Intermediate benchmarks (PR #435): +4 benchmarks ready
- Accuracy improvements (6 PRs ready): Validated calibration methodology
Error formula: abs(t_sim - t_real) / min(t_sim, t_real). Target: <20% average.
Accuracy journey: 39.8% (baseline) → 34.2% (C1) → 22.8% (branch penalty fix) → 17.6% (fetch-stage branch target extraction) → 14.1% (H3 TARGET ACHIEVED) → 15+ benchmarks (H5 goal)
- H3 calibration framework deployed (PR #321 merged)
- SIMD DUP + MRS system instructions implemented (PR #321)
- Matrix multiply benchmark created (PR #315)
- Microbenchmark ARM64 ELF compilation complete
The full pipeline timing simulation is ~30,000x slower than emulation, making iterative calibration impractical. A "fast timing" mode approximates cycle counts using latency-weighted instruction mix without full pipeline simulation.
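The latency-weighted idea reduces to a dot product of the instruction-mix histogram with per-class latencies. A minimal sketch with made-up latencies (M2Sim's calibrated values live in its timing parameters and will differ):

```go
package main

import "fmt"

// Illustrative per-class latencies in cycles. These are made-up numbers;
// M2Sim's calibrated values will differ.
var latency = map[string]float64{
	"alu":    1,
	"load":   4,
	"store":  1,
	"branch": 1,
}

// estimateCycles approximates execution time from an instruction-mix
// histogram alone, with no per-cycle pipeline simulation. Skipping the
// pipeline is what makes fast timing orders of magnitude faster.
func estimateCycles(mix map[string]uint64) float64 {
	var cycles float64
	for class, count := range mix {
		cycles += float64(count) * latency[class]
	}
	return cycles
}

func main() {
	mix := map[string]uint64{"alu": 700, "load": 200, "branch": 100}
	fmt.Println(estimateCycles(mix)) // 700*1 + 200*4 + 100*1 = 1600
}
```

The trade-off is that hazards, cache misses, and mispredictions are folded into the average latencies instead of being modeled per-instruction, which bounds the mode's accuracy.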
Status:
- Fast timing engine merged (`timing/pipeline/fast_timing.go` — PR #361)
- Instruction limit support added
- Profile tool merged (`cmd/profile/main.go` — PR #361)
- CI blockers fixed (PR #368 — gofmt + acceptance test timeout)
- Root cause analysis merged (PR #367) — identifies arithmetic over-blocking as dominant error source
- CPI comparison framework merged (PR #376)
- Run matrix multiply with fast timing via GitHub Actions, collect CPI data (issue #359, PR #379 open)
- Fix fast timing decoder: add MADD/UBFM instruction support (issue #380) — blocks matmul CPI data
- Clearly label outputs: simulation speed vs virtual (predicted) time (issue #354)
Key insight from CPI comparison (PR #376): Fast timing is closer to M2 hardware on branch (4.3% error) and dependency (8.8% error) than the full pipeline (22.7% and 10.3%), confirming that the full pipeline's RAW hazard over-blocking is the primary accuracy bottleneck.
Root cause analysis complete (PR #367). All major tuning work done:
- Arithmetic: 34.5% error — Accepted as in-order limitation (issue #386). WAW hazard blocking prevents co-issue. Fixing requires OOO/register renaming (future work).
- Branch: 1.3% error — Fixed via fetch-stage branch target extraction (PR #393), benchmark scaling (PR #395), and fallback CPI update (PR #396). Down from 22.7%.
- Dependency: 6.7% error — Improved via benchmark scaling (PR #394). Down from 10.3%.
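The WAW limitation above comes down to the co-issue check: an in-order core must serialize two writes to the same register, while register renaming would let them issue together. A toy version of the check (real issue logic also models execution ports, latencies, and flags):

```go
package main

import "fmt"

// Instr is a simplified decoded instruction: one destination register
// and two sources. Real M2Sim instructions carry more state.
type Instr struct {
	Dst, Src1, Src2 int
}

// canCoIssue reports whether b may issue in the same cycle as a on a
// simple in-order machine. A RAW or WAW conflict blocks co-issue. An
// out-of-order core with register renaming removes the WAW case, which
// is why the in-order model over-blocks arithmetic code.
func canCoIssue(a, b Instr) bool {
	if b.Src1 == a.Dst || b.Src2 == a.Dst {
		return false // RAW: b reads the value a is still producing
	}
	if b.Dst == a.Dst {
		return false // WAW: both write the same register
	}
	return true
}

func main() {
	a := Instr{Dst: 1, Src1: 2, Src2: 3}
	b := Instr{Dst: 1, Src1: 4, Src2: 5} // independent inputs, but WAW on x1
	fmt.Println(canCoIssue(a, b)) // false
}
```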
Completed work:
- Fix branch misprediction penalty (14 → 12 cycles) — PR #372
- Root cause analysis with tuning recommendations — PR #367
- Investigate same-cycle forwarding (PR #381, zero impact due to WAW)
- CPI comparison framework — PR #376
- Fix branch prediction in all fetch slots — PR #385
- Fetch-stage branch target extraction — PR #393
- Scale benchmarks to reduce pipeline overhead — PRs #394, #395
- Update fallback CPIs — PR #396
- Separate calibrated vs uncalibrated benchmarks — PR #392
- Normalized cycles PDF chart — PR #390
Remaining work:
- Document in-order pipeline accuracy limitation (issue #386)
- Review PR #397 (ALU execution port limit modeling)
- Multi-scale validation (64x64 → 256x256 matrix multiply)
- Expand to more benchmark types beyond arithmetic/dependency/branch
Microbenchmark accuracy target met (14.1%). Now validate on real SPEC workloads.
- Set up CI workflow for long-running SPEC benchmark timing (issue #307)
- Run SPEC integer benchmarks with full pipeline timing, compare to M2 hardware
- All calibrated benchmarks <30% individual error, <20% average
- Fill instruction coverage gaps discovered during SPEC execution (issue #304)
- Add more medium-sized benchmarks for broader coverage (issue #291)
Prerequisites: SPEC binary validation (H2.3.2) must progress — need confirmed-working ARM64 ELF binaries for at least one SPEC benchmark.
Strategy: Start with the simplest SPEC benchmark (548.exchange2_r — Sudoku solver, pure integer). Run in CI with sufficient timeout (issue #362 — no direct agent execution). Compare CPI against M2 hardware measurements.
Goal: Extend M2Sim to simulate multi-core M2 architectures with cache coherence and shared memory.
Status: Not started. Previous "strategic planning" produced analysis documents but no actual multi-core simulation code. The real work — cache coherence protocol, shared memory subsystem, inter-core communication — has not begun.
Prerequisites: H5 must be CI-verified before H4 work begins.
Required implementation (future):
- Cache coherence protocol (e.g., MOESI)
- Shared memory subsystem
- Inter-core communication and synchronization
- Multi-core timing validation
- Leverage Akita's multi-component patterns for cache coherence modeling
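As a taste of the future coherence work, here is one slice of MOESI: how a cache line's state responds to a snooped read from another core. This is textbook MOESI behavior, not M2Sim code:

```go
package main

import "fmt"

// State is a MOESI cache-line state.
type State int

const (
	Invalid State = iota
	Shared
	Exclusive
	Owned
	Modified
)

// onRemoteRead shows one slice of the protocol: how a line reacts when
// another core reads the same address. A full protocol also handles
// remote writes, local reads/writes, and evictions.
func onRemoteRead(s State) State {
	switch s {
	case Modified:
		return Owned // keep the dirty copy and supply data to the reader
	case Exclusive:
		return Shared // no longer the only cached copy
	default:
		return s // Owned, Shared, and Invalid are unchanged
	}
}

func main() {
	fmt.Println(onRemoteRead(Modified) == Owned, onRemoteRead(Exclusive) == Shared) // true true
}
```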
Goal (Issue #433): Achieve <20% average error across 15+ intermediate benchmarks from PolyBench, EmBench, and SPEC suites.
STATUS (February 12, 2026): CI accuracy workflows STILL cannot complete. Root cause identified (Issue #497).
CLAIMED RESULTS (unverified — manually committed, not CI-generated):
- Benchmark Count: 18 benchmarks (11 microbenchmarks + 7 PolyBench)
- Accuracy: 16.9% average error claimed
- Data source: `h5_accuracy_results.json` was manually committed by agent (commit d413d02), NOT produced by CI
CI BLOCKER (Issue #497):
- PR #494 fixed workflow configs (timeouts, runners) but did NOT fix the fundamental problem
- `accuracy-report.yml` has `cancel-in-progress: true` — every agent push to main cancels the running accuracy workflow
- Agents push multiple times per hour; the accuracy workflow needs 60-120 minutes
- Result: 30+ consecutive cancelled runs. Zero completions.
- Additionally, `polybench-segmented.yml` references wrong ELF filenames (`polybench_atax.elf` vs actual `atax_m2sim.elf`)
FIX REQUIRED (Issue #497):
- Remove `cancel-in-progress: true` from accuracy workflows, OR
- Switch accuracy workflows to `workflow_dispatch` + scheduled cron trigger only
- Fix PolyBench ELF filename mismatches
REQUIRED: Fix Issue #497, then wait for first successful accuracy CI run.
Paused until H5 verification is complete. Enhancement work is premature when the accuracy numbers haven't been validated by CI.
Human requested (Issue #490): docs consolidation, LaTeX paper, reproducible experiment script, README overhaul.
Status: Files exist in repo but have serious quality problems:
- reproduce_experiments.py — Returns hardcoded fake data, not real simulation results (Issue #495)
- paper/m2sim_micro2026.tex — Factual mismatches with code (wrong cache sizes, pipeline width), missing figures, not using real MICRO template (Issue #496)
- README.md — Same factual mismatches, premature "COMPLETED" claim (Issue #496)
- docs/reference/ — Mostly accurate, best quality among deliverables
Priority: Fix after CI accuracy verification completes.
Previous Initiative Lead: Alex (Issue #481)
Achievement: 7 PolyBench benchmarks operational (PR #448 MERGED February 11, 2026)
- ARM64 ELF compilation complete (atax, bicg, mvt, jacobi-1d, gemm, 2mm, 3mm)
- ELF loading infrastructure integrated into timing harness
- Test coverage with CI integration and timeout management
- Deployed to production, contributing 7 benchmarks toward the 15+ goal
Achievement: 7 calibrated microbenchmarks with 13.3% average error
- Core microbenchmarks: arithmetic (9.6%), dependency (6.7%), branch (1.3%)
- Memory benchmarks: memorystrided (2.0%), loadheavy (28.1%), storeheavy (11.3%)
- Specialized benchmarks: branchheavy (16.1%)
- Autonomous calibration methodology with hardware baseline validation
Important: This 13.3% average error applies only to microbenchmarks, NOT to intermediate benchmarks.
Claimed Results (h5_accuracy_results.json — manually committed, not CI-generated):
- Overall accuracy: 16.9% error claimed
- Microbenchmark accuracy: 14.4% error (11 benchmarks)
- PolyBench accuracy: 20.8% error (7 benchmarks)
- Benchmark count: 18 total benchmarks
Verification Blocker (Issue #492): No CI workflow has ever successfully generated these numbers. The h5_accuracy_results.json was committed by an agent, not produced by an automated CI run.
Previous Crisis Resolution (Issue #466 by Leo):
- Hardware baselines corrected via multi-scale linear regression methodology (PRs #469, #470, #471)
- Baselines in calibration_results.json appear reasonable (R² > 0.999)
- But end-to-end CI verification never completed successfully
Final Benchmark Results:
Microbenchmarks (14.4% avg): arithmetic (9.6%), dependency (6.7%), branch (1.3%), memorystrided (10.8%), loadheavy (3.4%), storeheavy (47.4%), branchheavy (16.1%), vectorsum (29.6%), vectoradd (24.3%), reductiontree (6.1%), strideindirect (3.1%)
PolyBench (20.8% avg): atax (33.6%), bicg (29.3%), gemm (19.5%), mvt (22.6%), jacobi-1d (11.1%), 2mm (17.4%), 3mm (12.4%)
In scope:
- ARM64 user-space instructions
- CPU core simulation (single-core MVP, multi-core later)
- Cache hierarchy
- Timing prediction
Out of scope:
- GPU / Neural Engine
- Kernel-space execution
- Full OS simulation
- I/O devices beyond basic syscalls
Constraints:
- Use Akita v4 simulation framework
- Follow MGPUSim architecture patterns
- Go programming language
- Tests use Ginkgo/Gomega
References:
- Akita: https://github.com/sarchlab/akita
- MGPUSim: https://github.com/sarchlab/mgpusim
- ARM Architecture Reference Manual
- See `docs/calibration.md` for timing parameter reference