From 6b1d17d56da1e633f8bf8309e41acdc84713f64f Mon Sep 17 00:00:00 2001
From: Cursor Agent <cursoragent@cursor.com>
Date: Mon, 18 May 2026 01:06:39 +0000
Subject: [PATCH] sync: add documentation for cache hierarchy, SCU, Embench,
 and Goban configs

Co-authored-by: Shiroha <whmio0115@gmail.com>
---
 .../Cache Hierarchy and Private Data Cache.md | 200 +++++++++++++
 .../Goban Multi-Core Architecture.md          |  23 +-
 .../en/ToolChain Guide/System Control Unit.md | 256 ++++++++++++++++
 .../en/Tutorial/Embench Benchmark Suite.md    | 279 ++++++++++++++++++
 .../Cache Hierarchy and Private Data Cache.md | 200 +++++++++++++
 .../Goban Multi-Core Architecture.md          |  23 +-
 .../zh/ToolChain Guide/System Control Unit.md | 256 ++++++++++++++++
 .../zh/Tutorial/Embench Benchmark Suite.md    | 279 ++++++++++++++++++
 8 files changed, 1504 insertions(+), 12 deletions(-)
 create mode 100644 content/en/Architecture/Cache Hierarchy and Private Data Cache.md
 create mode 100644 content/en/ToolChain Guide/System Control Unit.md
 create mode 100644 content/en/Tutorial/Embench Benchmark Suite.md
 create mode 100644 content/zh/Architecture/Cache Hierarchy and Private Data Cache.md
 create mode 100644 content/zh/ToolChain Guide/System Control Unit.md
 create mode 100644 content/zh/Tutorial/Embench Benchmark Suite.md

diff --git a/content/en/Architecture/Cache Hierarchy and Private Data Cache.md b/content/en/Architecture/Cache Hierarchy and Private Data Cache.md
new file mode 100644
index 0000000..4c45503
--- /dev/null
+++ b/content/en/Architecture/Cache Hierarchy and Private Data Cache.md	
@@ -0,0 +1,200 @@
+# Cache Hierarchy and Private Data Cache
+
+## Overview
+
+Buckyball employs a configurable cache hierarchy to optimize memory access patterns for diverse workloads. Each tile supports private per-core instruction and data caches, with optional per-tile inclusive L2 caches. The recent redesign of the private data cache (dcache) simplifies coherency semantics when private caches are present, shifting memory management responsibility to software.
+
+## Cache Architecture
+
+### Core-Level Caches
+
+Each Rocket core in a tile has:
+
+- **Instruction Cache (I-Cache)**: Private, read-only, typically 16–32 KB
+- **Data Cache (D-Cache)**: Private, read-write, typically 16–32 KB
+
+The D-cache uses write-through semantics to the L1 miss handling logic; the I-cache is read-only and never writes data back.
+
+### Per-Tile Inclusive L2 Cache
+
+An optional per-tile L2 cache sits between the core's L1 caches and the system interconnect. The L2 is **inclusive** — it holds a superset of L1 cache lines.
+
+**L2 Configuration Parameters**:
+
+- **ways**: Cache associativity (typically 8–16)
+- **sets**: Number of sets; each set holds `ways` lines (typically 256–512)
+- **writeBytes**: Write granularity of the cache's backing store, in bytes
+- **portFactor**: Memory port provisioning factor
+- **memCycles**: Estimated memory latency for performance modeling
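+
+As a rough sizing aid, the data capacity implied by `ways` and `sets` can be computed directly. The sketch below is plain Scala with no Chisel dependency. The 64-byte line size is an assumption made for illustration (it is not stated above), and `L2Sizing` is a hypothetical helper name, not part of the Buckyball sources.
+
+```scala
+// Hypothetical sizing helper; not part of the Buckyball sources.
+object L2Sizing {
+  // Data capacity in bytes = ways * sets * bytes per cache line.
+  def capacityBytes(ways: Int, sets: Int, lineBytes: Int): Long =
+    ways.toLong * sets.toLong * lineBytes.toLong
+
+  def main(args: Array[String]): Unit = {
+    val lineBytes = 64 // Assumed line size; check your configuration.
+    val bytes = capacityBytes(ways = 8, sets = 512, lineBytes = lineBytes)
+    println(s"8 ways x 512 sets x $lineBytes B = ${bytes / 1024} KiB")
+  }
+}
+```
+
+Because the L2 is inclusive, this figure is the total cache capacity available to the tile: lines held in L1 are also held in L2, so the two capacities do not add.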
+
+**Topology with L2**:
+
+```
+Core L1 I-Cache → ──┐
+Core L1 D-Cache → ──┤ Inclusive L2 → Cork Unit → System Memory
+                  ──┘
+```
+
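+The cork unit (TLCacheCork) terminates the TileLink coherence protocol below the L2: it translates the cache's acquire and release traffic into plain reads and writes toward system memory. It does not keep the L2 coherent with other system agents.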
+
+### System-Level Coherency
+
+When a tile has a private L2 cache, the system disables distributed coherency for that tile's memory domain. The reasoning is:
+
+1. The inclusive L2 acts as a **single point of coherency** for all L1 accesses from cores in that tile
+2. External agents (other tiles, DMA) cannot directly observe L1 state
+3. Coherency across tiles is therefore left to software, which must use memory barriers and cache flush operations explicitly
+
+This design is known as **software-managed coherency** and is appropriate for:
+
+- Workloads with static data partitioning (e.g., SPMD)
+- Systems where inter-tile communication is infrequent
+- Configurations prioritizing energy efficiency over automatic coherency
+
+## Private Data Cache Redesign
+
+### Motivation
+
+Prior Buckyball versions could maintain hardware-managed coherency even with private L1 caches. This required complex coherency protocols and increased logic overhead.
+
+The new design simplifies the tile by requiring software to handle coherency when private caches are configured. Hardware no longer maintains a last-level cache (LLC) in the coherent subsystem when tile caches are private.
+
+### Configuration Model
+
+**Private Cache Mode**:
+
+When a tile is configured with private dcache:
+
+- No LLC is present in the coherent subsystem for that tile's memory domain
+- All L1 misses flow through the tile's memory backend
+- Software must flush caches and use memory barriers to ensure coherency; a barrier alone orders accesses but does not write cached data back
+
+**Implications**:
+
+1. **Data Consistency**: Software must explicitly synchronize caches before sharing data with other tiles
+2. **Memory Barriers**: Add barriers around shared-memory access to enforce ordering
+3. **Cache Flush**: Use platform-specific cache flush instructions before publishing data
+
+### Example: Multi-Tile SPMD with Private Caches
+
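+```c
+#include <stdint.h>
+
+// SCU layout as described in the System Control Unit guide
+#define SCU_BASE 0x60000000UL
+#define SCU_STRIDE 0x40000UL       // Per-hart stride
+#define SCU_BARRIER_OFFSET 0x08    // Barrier register within each hart's region
+#define SHARED_DATA_ADDR 0x80100000UL
+#define NUM_TILES 4                // Assumed tile count (4t16c)
+
+typedef struct {
+  int tile_id;
+  int result;
+} SharedData;
+
+// Provided elsewhere by the platform and the application
+int bb_get_tile_id(void);
+int compute_tile_result(int tile_id);
+
+void scu_barrier(void) {
+  unsigned long hart_id;
+  asm volatile("csrr %0, mhartid" : "=r"(hart_id));
+  // Each hart writes its own barrier register, not a shared address
+  volatile int *barrier = (volatile int *)
+    (SCU_BASE + hart_id * SCU_STRIDE + SCU_BARRIER_OFFSET);
+  *barrier = 1;  // Blocks until all participating harts have written
+}
+
+void cache_flush_dcache(void) {
+  // Platform-specific: write back and invalidate the D-cache here.
+  // A RISC-V `fence` only orders memory accesses; it does not flush cache
+  // lines, so it stands in for the platform's real flush operation.
+  asm volatile("fence" ::: "memory");
+}
+
+int main() {
+  int tile_id = bb_get_tile_id();
+  SharedData *shared = (SharedData *)SHARED_DATA_ADDR;  // One slot per tile
+
+  // Phase 1: Per-tile computation
+  int local_result = compute_tile_result(tile_id);
+
+  // Phase 2: Publish into this tile's own slot, then flush
+  shared[tile_id].tile_id = tile_id;
+  shared[tile_id].result = local_result;
+  cache_flush_dcache();  // Ensure the writes reach memory
+
+  // Phase 3: Global synchronization
+  scu_barrier();
+
+  // Phase 4: Read data published by a peer tile
+  cache_flush_dcache();  // Drop any stale cached copy before reading
+  int peer_result = shared[(tile_id + 1) % NUM_TILES].result;
+  (void)peer_result;
+
+  return 0;
+}
+```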
+
+## Configuration Examples
+
+### Enabling Private L2 Cache
+
+Define a custom config with L2:
+
+```scala
+import org.chipsalliance.cde.config._
+import freechips.rocketchip.subsystem._
+
+class GobanWithPrivateL2 extends Config(
+  new examples.goban.BuckyballGoban4T16CConfig ++
+    new WithL2Cache(
+      ways = 8,
+      sets = 512,
+      writeBytes = 64,
+      portFactor = 1,
+      memCycles = 10
+    )
+)
+```
+
+### Disabling L2 (Default)
+
+By default, Goban uses only core L1 caches:
+
+```scala
+class GobanL1Only extends Config(
+  new examples.goban.BuckyballGoban4T16CConfig
+)
+```
+
+With this configuration, each tile operates independently with software-managed inter-tile coherency.
+
+## Performance Considerations
+
+### Private Cache Benefits
+
+- **Reduced coherency overhead**: No broadcast bus for L1 evictions
+- **Predictable memory timing**: Private L2 eliminates conflict misses from other tiles
+- **Energy efficiency**: Lower coherency traffic reduces power consumption
+
+### Software Coherency Costs
+
+- **Explicit flushes**: Cache management instructions increase code size and latency
+- **False sharing**: Software must partition data carefully to avoid cache line conflicts
+- **Synchronization latency**: Barriers impose serialization points
+
+## Troubleshooting
+
+### Data Coherency Issues
+
+**Symptom**: Values written by one tile are not visible to other tiles after synchronization.
+
+**Check**:
+
+1. Verify cache flush instructions are present before publishing data
+2. Use memory barriers (`fence` in RISC-V) before and after shared-memory access
+3. Confirm synchronization point (e.g., `scu_barrier()`) is called after flush
+
+### Performance Degradation
+
+**Symptom**: Execution time increases significantly with L2 disabled compared to hardware-coherent mode.
+
+**Check**:
+
+1. Profile memory access patterns to identify frequent cache misses
+2. Consider enabling L2 if L1 miss rate is high (>10%)
+3. Verify the working set fits within the L2 capacity (the inclusive L2 duplicates L1 contents, so L1 and L2 capacities do not add)
+
+## References
+
+- **Inclusive Cache**: SiFive InclusiveCache documentation
+- **RISC-V Memory Ordering**: RISC-V ISA specification, RVWMO memory consistency model chapter
+- **Cork Unit**: TileLink cache management in Rocket Chip
diff --git a/content/en/Architecture/Goban Multi-Core Architecture.md b/content/en/Architecture/Goban Multi-Core Architecture.md
index 9644102..c9aa9e5 100644
--- a/content/en/Architecture/Goban Multi-Core Architecture.md	
+++ b/content/en/Architecture/Goban Multi-Core Architecture.md	
@@ -22,15 +22,26 @@ Goban is a multi-core BBTile configuration in Buckyball that enables parallel ex
 
 ### Configuration Variants
 
-**BuckyballGobanConfig**
-- 1 BBTile × 4 cores
+Goban supports multiple configuration sizes:
+
+**1t4c** — 1 tile × 4 cores
 - 4 Rocket cores + 4 BuckyballAccelerators
 - Single SharedMem + BarrierUnit
+- Minimal memory footprint, suitable for single-tile testing
+
+**4t16c** — 4 tiles × 4 cores = 16 total cores
+- 16 Rocket cores + 16 BuckyballAccelerators
+- Per-tile memory domains and synchronization
+
+**8t8c** — 8 tiles × 8 cores = 64 total cores
+- 64 Rocket cores + 64 BuckyballAccelerators
+- Per-tile synchronization, scaled memory system
+
+**Legacy configurations:**
+- `BuckyballGobanConfig` — 1 BBTile × 4 cores
+- `BuckyballGoban2TileConfig` — 2 BBTiles × 4 cores = 8 total cores
 
-**BuckyballGoban2TileConfig**
-- 2 BBTiles × 4 cores = 8 total cores
-- 8 Rocket cores + 8 BuckyballAccelerators
-- Per-tile SharedMem + BarrierUnit
+All variants maintain the same per-core execution model and barrier synchronization semantics across tiles.
 
 ## Core Components
 
diff --git a/content/en/ToolChain Guide/System Control Unit.md b/content/en/ToolChain Guide/System Control Unit.md
new file mode 100644
index 0000000..d5c7adf
--- /dev/null
+++ b/content/en/ToolChain Guide/System Control Unit.md	
@@ -0,0 +1,256 @@
+# System Control Unit (SCU)
+
+## Overview
+
+The System Control Unit (SCU) is a global multi-hart device in Buckyball that provides simulation and inter-hart control functionality. Unlike earlier single-hart designs, the current SCU serves all harts in the system and is accessed through a per-hart addressable memory-mapped I/O interface.
+
+## Architecture
+
+### Global Multi-Hart Design
+
+The SCU is instantiated once at the system level and provides a unified interface for all harts. Each hart has a dedicated sub-region of the SCU address space, calculated as:
+
+```
+hart_address = baseAddress + hartId * strideBytes
+```
+
+**Parameters**:
+
+- `baseAddress`: Base address of SCU memory region (typically 0x6000_0000)
+- `strideBytes`: Per-hart address stride (must be a power of two, e.g., 0x40000)
+- `totalSizeBytes`: Total addressable SCU region (must be a power of two, e.g., 0x1000_0000)
+- `maxHarts`: Maximum number of harts the SCU supports (e.g., 64)
+
+**Validation**:
+
+```
+maxHarts * strideBytes <= totalSizeBytes
+```
+
+Addresses for hart IDs ≥ `maxHarts` fall through to the system bus unmapped address handler.
+
+### Address Space Layout
+
+For a system with `baseAddress=0x60000000`, `strideBytes=0x40000`, and `maxHarts=64`:
+
+```
+Hart 0:  0x60000000 – 0x6003FFFF
+Hart 1:  0x60040000 – 0x6007FFFF
+Hart 2:  0x60080000 – 0x600BFFFF
+...
+Hart 63: 0x60FC0000 – 0x60FFFFFF
+```
+
+## Functionality
+
+### UART Output
+
+Each hart can write characters to simulation UART via its SCU region:
+
+```c
+// The SCU infers the hart ID from the per-hart sub-region being accessed
+volatile uint8_t *scu_uart = (volatile uint8_t *)0x60000000;  // Hart 0
+*scu_uart = 'A';  // Write character
+
+// From hart 1:
+scu_uart = (volatile uint8_t *)0x60040000;
+*scu_uart = 'B';
+```
+
+The DPI-C bridge (`SCUWriteDPI`) receives the hart ID and character, routing output to the simulation console.
+
+### Simulation Exit
+
+Harts can terminate simulation by writing an exit code:
+
+```c
+volatile uint32_t *scu_exit = (volatile uint32_t *)0x60000004;  // Hart 0
+*scu_exit = 0;  // Exit with code 0
+```
+
+The SCU captures the hart ID and exit code, triggering simulation termination.
+
+### Barrier Synchronization (Multi-Hart)
+
+The SCU provides a per-hart barrier register. When a hart writes to its barrier address, it blocks until all participating harts have written:
+
+```c
+// Hart 0
+volatile int *scu_barrier_h0 = (volatile int *)0x60000008;
+*scu_barrier_h0 = 1;  // Block until all harts reach barrier
+
+// Hart 1 (on same tile, different hart ID)
+volatile int *scu_barrier_h1 = (volatile int *)0x60040008;
+*scu_barrier_h1 = 1;  // Block until all harts reach barrier
+```
+
+This differs from the per-tile **BarrierUnit** (used in Goban), which synchronizes only cores within a single tile.
+
+## Integration with P2E Harness
+
+The SCU is a standard component in P2E simulation configurations:
+
+```scala
+class WithSCU(
+  baseAddress:    BigInt = BigInt("60000000", 16),
+  strideBytes:    BigInt = BigInt("40000", 16),
+  totalSizeBytes: BigInt = BigInt("10000000", 16),
+  maxHarts:       Int = 64
+) extends Config(
+  new sims.scu.CanHavePeripherySCU ++
+    new chipyard.config.WithTLSimpleUART
+)
+```
+
+The configuration sets:
+
+1. SCU parameters (base, stride, total size, max harts)
+2. DigitalTop replacement to include SCU on the control bus (CBUS)
+3. Optional TileLink UART for character output
+
+### Elaboration into P2E
+
+The SCU is wired to the system's TileLink interconnect:
+
+```scala
+val scu = LazyModule(new TLSCU(SCUParams(...), beatBytes))
+cbus.attach(scu.node)
+```
+
+All harts on the system can access their respective SCU regions via normal TileLink reads/writes.
+
+## DPI-C Bridge
+
+The `SCUWriteDPI` Verilog module acts as a single black-box bridge for all harts:
+
+```verilog
+module SCUWriteDPI(
+  input clock,
+  input reset,
+  input [31:0] uart_hart_id,
+  input uart_valid,
+  input [7:0] uart_data,
+  input [31:0] exit_hart_id,
+  input exit_valid,
+  input [31:0] exit_code
+);
+
+  import "DPI-C" context function void scu_uart_write(
+    input int unsigned hart_id,
+    input int unsigned ch
+  );
+  
+  import "DPI-C" context function void scu_sim_exit(
+    input int unsigned hart_id,
+    input int unsigned code
+  );
+```
+
+This single module replaces per-hart DPI modules, reducing Verilog elaboration and C import duplication.
+
+## Programming Example: Multi-Hart Test
+
+```c
+#include <stdint.h>
+
+#define SCU_BASE 0x60000000UL
+#define SCU_STRIDE 0x40000UL
+#define SCU_UART_OFFSET 0x00
+#define SCU_EXIT_OFFSET 0x04
+#define SCU_BARRIER_OFFSET 0x08
+
+// Read this hart's ID from the mhartid CSR (machine mode only)
+int get_hart_id() {
+  unsigned long hart_id;  // CSRs are XLEN bits wide
+  asm volatile("csrr %0, mhartid" : "=r"(hart_id));
+  return (int)hart_id;
+}
+
+// Write one character to this hart's SCU UART register
+void scu_write_char(int hart_id, char c) {
+  volatile uint8_t *uart = (volatile uint8_t *)
+    (SCU_BASE + hart_id * SCU_STRIDE + SCU_UART_OFFSET);
+  *uart = c;
+}
+
+// Terminate the simulation with the given exit code
+void scu_exit(int hart_id, int code) {
+  volatile uint32_t *exit_reg = (volatile uint32_t *)
+    (SCU_BASE + hart_id * SCU_STRIDE + SCU_EXIT_OFFSET);
+  *exit_reg = code;
+}
+
+// Write this hart's barrier register and wait for the other harts
+void scu_barrier_wait(int hart_id) {
+  volatile int *barrier = (volatile int *)
+    (SCU_BASE + hart_id * SCU_STRIDE + SCU_BARRIER_OFFSET);
+  *barrier = 1;  // Block until all harts write
+}
+
+int main() {
+  int hart_id = get_hart_id();
+
+  scu_write_char(hart_id, 'H');
+  scu_write_char(hart_id, 'i');
+
+  scu_barrier_wait(hart_id);
+
+  if (hart_id == 0) {
+    scu_write_char(hart_id, '\n');
+    // Exit ends the whole simulation, so only hart 0 calls it,
+    // and only after its final character has been written
+    scu_exit(hart_id, 0);
+  }
+
+  for (;;) {}  // Other harts idle until hart 0 terminates the simulation
+}
+```
+
+## Configuration in P2E
+
+To enable SCU in a P2E configuration:
+
+```scala
+class P2EWithSCU extends Config(
+  new sims.scu.WithSCU(
+    baseAddress = BigInt("60000000", 16),
+    strideBytes = BigInt("40000", 16),
+    totalSizeBytes = BigInt("10000000", 16),
+    maxHarts = 64
+  ) ++
+    new sims.p2e.P2EBaseConfig
+)
+```
+
+## Performance Characteristics
+
+- **UART Write**: ~1–2 cycles to propagate through DPI-C callback
+- **Barrier**: ~10–20 cycles overhead depending on interconnect latency
+- **Exit**: Immediate simulation termination
+
+## Troubleshooting
+
+### Hart Cannot Find SCU Address Space
+
+**Symptom**: Load or store to SCU address results in exception.
+
+**Check**:
+
+1. Verify SCU is enabled in system configuration
+2. Confirm hart ID is < `maxHarts` parameter
+3. Calculate expected address: `baseAddress + hartId * strideBytes`
+
+### Barrier Hangs Indefinitely
+
+**Symptom**: Simulation does not advance; multiple harts blocked on barrier.
+
+**Check**:
+
+1. Verify all participating harts have reached the barrier instruction
+2. Check that hart IDs are correctly calculated and < `maxHarts`
+3. Inspect VCD waveforms to see which harts have written to barrier register
+
+### Characters Not Appearing in Console
+
+**Symptom**: UART writes to SCU do not print.
+
+**Check**:
+
+1. Verify UART is routed to simulation console (check simulation log)
+2. Confirm writes are to correct offset (`SCU_UART_OFFSET = 0x00`)
+3. Check that DPI-C bridge is properly connected in top-level Verilog
diff --git a/content/en/Tutorial/Embench Benchmark Suite.md b/content/en/Tutorial/Embench Benchmark Suite.md
new file mode 100644
index 0000000..33ffe5f
--- /dev/null
+++ b/content/en/Tutorial/Embench Benchmark Suite.md	
@@ -0,0 +1,279 @@
+# Embench Benchmark Suite
+
+## Overview
+
+Embench is a comprehensive embedded systems benchmark suite integrated into Buckyball for workload and performance analysis. It provides standardized performance tests across diverse algorithms including cryptography, compression, mathematical computation, and signal processing.
+
+## Benchmark Categories
+
+### Cryptography and Hashing
+
+- **aha-mont64**: Montgomery multiplication for elliptic curve cryptography
+- **nettle-aes**: Advanced Encryption Standard implementation
+- **nettle-sha256**: SHA-256 cryptographic hash function
+- **md5sum**: MD5 hash computation
+
+### Compression and Encoding
+
+- **huffbench**: Huffman compression algorithm
+- **slre**: Regular expression engine
+- **picojpeg**: JPEG image decoder
+
+### Mathematical and Signal Processing
+
+- **cubic**: Cubic equation solver with floating-point arithmetic
+- **nbody**: N-body physics simulation
+- **matmult-int**: Integer matrix multiplication
+- **edn**: Signal-processing kernels such as FIR filtering and vector multiplication
+
+### Data Structures and Algorithms
+
+- **sglib-combined**: Generic sorting and data structure library
+- **statemate**: Finite state machine simulation
+- **tarfind**: Archive search algorithm
+- **qrduino**: QR code generation
+- **crc32**: Cyclic redundancy check
+- **primecount**: Prime number counting
+- **ud**: LU decomposition of an integer matrix
+- **minver**: Floating-point matrix inversion
+- **wikisort**: Stable sorting algorithm
+- **nsichneu**: Large Petri net simulation (automatically generated, branch-heavy code)
+
+## Directory Structure
+
+```
+bb-tests/workloads/src/CTest/toy/embench/
+├── README.md                    # Embench documentation
+├── CMakeLists.txt              # Build configuration
+├── crt0.S                       # Startup code
+├── src/                         # Individual benchmark implementations
+│   ├── aha-mont64/
+│   ├── nettle-aes/
+│   ├── nettle-sha256/
+│   ├── md5sum/
+│   └── (other benchmarks)
+└── support/                     # Common utilities
+    ├── main.c                   # Unified entry point
+    ├── boardsupport.c          # Platform initialization
+    ├── chipsupport.c           # Chip-specific features
+    ├── beebsc.c                # Benchmark control interface
+    └── support.h               # Common definitions
+```
+
+## Building Embench
+
+### Prerequisites
+
+Ensure the Buckyball toolchain is properly configured:
+
+```bash
+source sourceme.sh
+```
+
+### Build All Benchmarks
+
+```bash
+cd bb-tests/workloads
+cmake -B build -DCTEST_TARGET=toy -DCTEST_NAME=embench
+cmake --build build
+```
+
+This generates individual benchmark binaries:
+
+```
+build/toy/embench/mont64
+build/toy/embench/aes
+build/toy/embench/sha256
+build/toy/embench/matrix-mult
+...
+```
+
+### Build Single Benchmark
+
+```bash
+cd bb-tests/workloads
+cmake -B build -DCTEST_TARGET=toy -DCTEST_NAME=embench -DBENCH_FILTER=sha256
+cmake --build build
+```
+
+## Running Benchmarks
+
+### Verilator Simulation
+
+```bash
+bbdev verilator --run \
+  '--binary embench/mont64-baremetal \
+    --config sims.verilator.BuckyballToyVerilatorConfig \
+    --batch'
+```
+
+### P2E Simulation
+
+```bash
+bbdev p2e --run \
+  '--binary embench/aes-baremetal \
+    --config sims.p2e.P2EToyConfig \
+    --batch'
+```
+
+## Performance Metrics
+
+Each benchmark measures:
+
+- **Cycle count**: Total cycles to completion
+- **Instructions executed**: Dynamic instruction count
+- **Memory traffic**: Load/store operations
+- **Time-to-completion**: Wall-clock time in simulation
+
+### Extracting Results
+
+Simulation output includes benchmark statistics:
+
+```
+Benchmark: sha256
+Cycles: 142857
+Instructions: 98765
+Memory ops: 12345
+```
+
+Parse these metrics to evaluate:
+
+1. **Instruction efficiency**: Instructions per cycle (IPC)
+2. **Memory efficiency**: Cache hit rates and bandwidth utilization
+3. **Compute density**: Operations per watt in post-silicon analysis
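+
+The IPC figure can be derived from the statistics format shown above. The sketch below is plain Scala; it assumes each statistic appears on its own `Name: value` line exactly as in the sample, and `EmbenchStats` is a hypothetical helper, not part of the Buckyball tooling.
+
+```scala
+// Hypothetical parser for the sample statistics format shown above.
+object EmbenchStats {
+  // Collect "Name: value" lines whose value is a non-negative integer.
+  def parse(output: String): Map[String, Long] =
+    output.split("\n").toList
+      .map(_.split(":", 2))
+      .collect {
+        case Array(k, v) if v.trim.nonEmpty && v.trim.forall(_.isDigit) =>
+          k.trim -> v.trim.toLong
+      }
+      .toMap
+
+  // Instructions per cycle, if both counters are present and cycles > 0.
+  def ipc(stats: Map[String, Long]): Option[Double] =
+    for {
+      insts  <- stats.get("Instructions")
+      cycles <- stats.get("Cycles") if cycles > 0
+    } yield insts.toDouble / cycles.toDouble
+
+  def main(args: Array[String]): Unit = {
+    val sample = "Benchmark: sha256\nCycles: 142857\nInstructions: 98765\nMemory ops: 12345\n"
+    val stats = parse(sample)
+    println(ipc(stats).map(v => f"IPC = $v%.3f").getOrElse("IPC unavailable"))
+  }
+}
+```
+
+For the sample output this gives an IPC of roughly 0.69. Cache hit rates and bandwidth utilization cannot be derived from these three counters alone and need additional simulator statistics.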
+
+## Benchmark Details
+
+### aha-mont64 (Cryptography)
+
+Montgomery multiplication for elliptic curve operations. Tests:
+- Modular arithmetic performance
+- Register pressure under heavy computation
+- Numeric stability with large integers
+
+Expected cycles: 50,000–100,000 on Buckyball Toy
+
+### nettle-aes (Encryption)
+
+AES block cipher implementation. Tests:
+- Lookup table efficiency (S-box access patterns)
+- Tight loop performance
+- Data-dependent cache behavior
+
+Expected cycles: 200,000–300,000
+
+### nettle-sha256 (Hashing)
+
+SHA-256 cryptographic hash. Tests:
+- Bitwise operation efficiency
+- Memory access patterns during state updates
+- Branch prediction with loop-heavy code
+
+Expected cycles: 80,000–150,000
+
+### matmult-int (Matrix Multiplication)
+
+Integer matrix multiplication with configurable sizes. Tests:
+- Loop nest optimization
+- Cache locality for 2D data access
+- Arithmetic pipeline utilization
+
+Expected cycles: 10,000–50,000 (size-dependent)
+
+### nbody (Physics Simulation)
+
+N-body gravitational simulation. Tests:
+- Floating-point compute intensity
+- Irregular memory access patterns
+- Compiler optimization of numerical kernels
+
+Expected cycles: 500,000–1,000,000
+
+## Customization
+
+### Adding a Custom Benchmark
+
+1. Create a directory under `src/`:
+   ```bash
+   mkdir -p bb-tests/workloads/src/CTest/toy/embench/src/my-bench
+   ```
+
+2. Implement the benchmark in C:
+   ```c
+   // my-bench/mybench.c
+   #include "../../support/support.h"
+   
+   int main() {
+     int result = 0;
+     // Benchmark computation
+     return result;
+   }
+   ```
+
+3. Update `CMakeLists.txt` to include the new benchmark in the build
+
+4. Rebuild and run via standard test infrastructure
+
+### Modifying Benchmark Parameters
+
+Some benchmarks support configurable parameters (matrix size, iteration count, etc.). Modify via:
+
+- Preprocessor defines in `CMakeLists.txt`
+- Constants defined in `crt0.S` (bare-metal targets have no environment variables)
+- Direct source modification in benchmark `.c` files
+
+## Interpreting Results
+
+### Performance Regression Detection
+
+Track benchmark cycle counts across Buckyball releases:
+
+```bash
+# Baseline (run from a checkout of the previous release)
+baseline_cycles=$(bbdev verilator --run --binary embench/sha256-baremetal | awk '/^Cycles:/ {print $2}')
+
+# Current release (run from the current checkout)
+current_cycles=$(bbdev verilator --run --binary embench/sha256-baremetal | awk '/^Cycles:/ {print $2}')
+
+# Calculate regression as an integer percentage (shell arithmetic truncates)
+regression=$(( (current_cycles - baseline_cycles) * 100 / baseline_cycles ))
+echo "Performance change: ${regression}%"
+```
+
+A regression > 5% indicates potential architecture or compiler issues.
+
+### Workload Classification
+
+Use Embench to classify Buckyball's suitability for different application domains:
+
+- **Cryptography-heavy**: Run aha-mont64, nettle-aes, nettle-sha256; compare against target ASIC
+- **Data processing**: Run sglib-combined, huffbench; measure memory efficiency
+- **Numerical**: Run nbody, cubic; evaluate floating-point pipeline utilization
+
+## Known Issues
+
+### Benchmark Hangs on Large Problem Sizes
+
+Some benchmarks (e.g., nsichneu) can time out with large datasets in simulation. Reduce the iteration count or problem size:
+
+```bash
+# Edit benchmark source to reduce problem size
+sed -i 's/MAX_ITERATIONS 1000000/MAX_ITERATIONS 10000/' src/nsichneu/libnsichneu.c
+```
+
+### Memory Overflow in Embedded Context
+
+Embench benchmarks were designed for standard C environments. Some (e.g., picojpeg) require large buffers. Verify available memory:
+
+```bash
+# Check linker script DRAM size
+grep -A 2 "DRAM :" *.ld
+```
+
+If insufficient, enable out-of-core simulation or increase DDR size in simulation configuration.
+
+## References
+
+- **Embench Official**: https://www.embench.org/
+- **RISC-V Software Conventions**: https://github.com/riscv-non-isa/riscv-elf-psabi-doc
+- **Buckyball Workload Integration**: `bb-tests/workloads/README.md`
diff --git a/content/zh/Architecture/Cache Hierarchy and Private Data Cache.md b/content/zh/Architecture/Cache Hierarchy and Private Data Cache.md
new file mode 100644
index 0000000..fb40410
--- /dev/null
+++ b/content/zh/Architecture/Cache Hierarchy and Private Data Cache.md	
@@ -0,0 +1,200 @@
+# 缓存层级结构与私有数据缓存
+
+## 概述
+
+Buckyball 采用可配置的缓存层级结构来优化不同工作负载的内存访问模式。每个瓦片支持私有的每核指令缓存和数据缓存，以及可选的每瓦片包含式 L2 缓存。最近对私有数据缓存（dcache）的重设计简化了存在私有缓存时的一致性语义，将内存管理责任转移到软件。
+
+## 缓存架构
+
+### 核心级缓存
+
+每个 Rocket 核拥有：
+
+- **指令缓存（I-Cache）**：私有、只读，通常 16–32 KB
+- **数据缓存（D-Cache）**：私有、读写，通常 16–32 KB
+
+两个缓存都采用写透语义到 L1 缺失处理逻辑。
+
+### 每瓦片包含式 L2 缓存
+
+可选的每瓦片 L2 缓存位于核的 L1 缓存和系统互连之间。L2 是**包含式**的—它保存 L1 缓存行的超集。
+
+**L2 缓存配置参数**：
+
+- **ways**：缓存关联度（通常 8–16）
+- **sets**：每路的集合数（通常 256–512）
+- **writeBytes**：写缓冲区深度
+- **portFactor**：内存端口配置因子
+- **memCycles**：性能建模估计内存延迟
+
+**具有 L2 的拓扑结构**：
+
+```
+核心 L1 I-Cache → ──┐
+核心 L1 D-Cache → ──┤ 包含式 L2 → Cork 单元 → 系统内存
+                  ──┘
+```
+
+Cork 单元（TLCacheCork）管理 L2 和其他系统代理之间的一致性。
+
+### 系统级一致性
+
+当瓦片拥有私有 L2 缓存时，系统禁用该瓦片内存域的分布式一致性。原因如下：
+
+1. 包含式 L2 充当所有核 L1 访问的**单一一致性点**
+2. 外部代理（其他瓦片、DMA）无法直接观察 L1 状态
+3. 软件必须通过内存屏障和缓存刷新操作显式管理一致性
+
+这种设计称为**软件管理一致性**，适合于：
+
+- 具有静态数据分割的工作负载（如 SPMD）
+- 瓦片间通信不频繁的系统
+- 优先考虑能效而非自动一致性的配置
+
+## 私有数据缓存重设计
+
+### 动机
+
+早期 Buckyball 版本即使在私有 L1 缓存下也能维持硬件管理一致性。这需要复杂的一致性协议和增加的逻辑开销。
+
+新设计通过要求软件在配置私有缓存时处理一致性来简化瓦片。当配置瓦片缓存为私有时，硬件不再在一致子系统中维持最后一级缓存（LLC）。
+
+### 配置模型
+
+**私有缓存模式**：
+
+当瓦片配置为私有 dcache：
+
+- 该瓦片内存域的一致子系统中不存在 LLC
+- 所有 L1 缺失通过瓦片的内存后端流动
+- 软件必须刷新缓存或使用内存屏障来确保一致性
+
+**影响**：
+
+1. **数据一致性**：软件必须在与其他瓦片共享数据前显式同步缓存
+2. **内存屏障**：在共享内存访问周围添加屏障以强制排序
+3. **缓存刷新**：在发布数据前使用平台特定的缓存刷新指令
+
+### 示例：具有私有缓存的多瓦片 SPMD
+
+```c
+#include <stdint.h>
+
+#define BARRIER_ADDR 0x60000008  // SCU 屏障地址（多 hart SCU，hart 0：基地址 + 0x08）
+#define SHARED_DATA_ADDR 0x80100000
+
+typedef struct {
+  int tile_id;
+  int result;
+} SharedData;
+
+void scu_barrier() {
+  volatile int *barrier = (volatile int *)BARRIER_ADDR;
+  *barrier = 1;  // 写屏障地址以阻挡所有核心
+}
+
+void cache_flush_dcache() {
+  // 平台特定：刷新整个 D-cache
+  // RISC-V 示例，带自定义 CSR：
+  asm volatile("fence" ::: "memory");
+}
+
+int main() {
+  int tile_id = bb_get_tile_id();
+  SharedData *shared = (SharedData *)SHARED_DATA_ADDR;
+  
+  // 阶段 1：每瓦片计算
+  int local_result = compute_tile_result(tile_id);
+  
+  // 阶段 2：发布前刷新缓存
+  cache_flush_dcache();
+  shared->tile_id = tile_id;
+  shared->result = local_result;
+  cache_flush_dcache();  // 确保写入可见
+  
+  // 阶段 3：全局同步
+  scu_barrier();
+  
+  // 阶段 4：从其他瓦片读取已发布的数据
+  // 数据现在全局可见
+  int peer_result = shared->result;
+  
+  return 0;
+}
+```
+
+## 配置示例
+
+### 启用私有 L2 缓存
+
+定义带 L2 的自定义配置：
+
+```scala
+import org.chipsalliance.cde.config._
+import freechips.rocketchip.subsystem._
+
+class GobanWithPrivateL2 extends Config(
+  new examples.goban.BuckyballGoban4T16CConfig ++
+    new WithL2Cache(
+      ways = 8,
+      sets = 512,
+      writeBytes = 64,
+      portFactor = 1,
+      memCycles = 10
+    )
+)
+```
+
+### 禁用 L2（默认）
+
+默认情况下，Goban 仅使用核 L1 缓存：
+
+```scala
+class GobanL1Only extends Config(
+  new examples.goban.BuckyballGoban4T16CConfig
+)
+```
+
+使用此配置，每个瓦片独立运行，采用软件管理的瓦片间一致性。
+
+## 性能考虑
+
+### 私有缓存的优势
+
+- **降低一致性开销**：L1 驱逐时无广播总线
+- **可预测的内存时序**：私有 L2 消除其他瓦片的冲突缺失
+- **能效**：较低的一致性流量降低功耗
+
+### 软件一致性成本
+
+- **显式刷新**：缓存管理指令增加代码大小和延迟
+- **假共享**：软件必须仔细分割数据以避免缓存行冲突
+- **同步延迟**：屏障强制序列化点
+
+## 故障排查
+
+### 数据一致性问题
+
+**症状**：由一个瓦片写入的值在同步后对其他瓦片不可见。
+
+**检查**：
+
+1. 验证在发布数据前存在缓存刷新指令
+2. 在共享内存访问前后使用内存屏障（RISC-V 中的 `fence`）
+3. 确认调用了同步点（如 `scu_barrier()`）
+
+### 性能下降
+
+**症状**：禁用 L2 时执行时间相对硬件一致模式显著增加。
+
+**检查**：
+
+1. 分析内存访问模式以识别频繁的缓存缺失
+2. 如果 L1 缺失率高（>10%），考虑启用 L2
+3. 验证工作集是否符合组合 L1+L2 容量
+
+## 参考资源
+
+- **包含式缓存**：SiFive InclusiveCache 文档
+- **RISC-V 内存排序**：RISC-V ISA 规范，第 8 章（内存模型）
+- **Cork 单元**：Rocket Chip 中的 TileLink 缓存管理
diff --git a/content/zh/Architecture/Goban Multi-Core Architecture.md b/content/zh/Architecture/Goban Multi-Core Architecture.md
index 88d8ae1..c944d56 100644
--- a/content/zh/Architecture/Goban Multi-Core Architecture.md	
+++ b/content/zh/Architecture/Goban Multi-Core Architecture.md	
@@ -22,15 +22,26 @@ Goban 是 Buckyball 中的一个多核 BBTile 配置，支持 SPMD（单程序
 
 ### 配置变体
 
-**BuckyballGobanConfig**
-- 1 个 BBTile × 4 核
+Goban 支持多种配置大小：
+
+**1t4c** — 1 瓦片 × 4 核
 - 4 个 Rocket 核 + 4 个 BuckyballAccelerator
 - 单个 SharedMem + BarrierUnit
+- 最小内存占用，适合单瓦片测试
+
+**4t16c** — 4 瓦片 × 4 核 = 16 个核心
+- 16 个 Rocket 核 + 16 个 BuckyballAccelerator
+- 每瓦片内存域和同步
+
+**8t8c** — 8 瓦片 × 8 核 = 64 个核心
+- 64 个 Rocket 核 + 64 个 BuckyballAccelerator
+- 每瓦片同步，扩展内存系统
+
+**遗留配置：**
+- `BuckyballGobanConfig` — 1 BBTile × 4 核
+- `BuckyballGoban2TileConfig` — 2 BBTile × 4 核 = 8 个核心
 
-**BuckyballGoban2TileConfig**
-- 2 个 BBTile × 4 核 = 8 个核心
-- 8 个 Rocket 核 + 8 个 BuckyballAccelerator
-- 每瓦片的 SharedMem + BarrierUnit
+所有变体在瓦片间维持相同的每核执行模型和屏障同步语义。
 
 ## 核心组件
 
diff --git a/content/zh/ToolChain Guide/System Control Unit.md b/content/zh/ToolChain Guide/System Control Unit.md
new file mode 100644
index 0000000..dfd165a
--- /dev/null
+++ b/content/zh/ToolChain Guide/System Control Unit.md	
@@ -0,0 +1,256 @@
+# 系统控制单元（SCU）
+
+## 概述
+
+系统控制单元（SCU）是 Buckyball 中的全局多心设备，提供模拟和核间控制功能。与早期的单心设计不同，当前 SCU 服务系统中的所有核，并通过每核可寻址的内存映射 I/O 接口访问。
+
+## 架构
+
+### 全局多心设计
+
+SCU 在系统级别实例化一次，为所有核提供统一接口。每个核都有一个专用的 SCU 地址空间子区域，计算如下：
+
+```
+hart_address = baseAddress + hartId * strideBytes
+```
+
+**参数**：
+
+- `baseAddress`：SCU 内存区域的基地址（通常 0x6000_0000）
+- `strideBytes`：每核地址步长（必须是 2 的幂，如 0x40000）
+- `totalSizeBytes`：总可寻址 SCU 区域（必须是 2 的幂，如 0x1000_0000）
+- `maxHarts`：SCU 支持的最大核数（如 64）
+
+**验证**：
+
+```
+maxHarts * strideBytes <= totalSizeBytes
+```
+
+Hart ID ≥ `maxHarts` 的地址属于系统总线未映射地址处理程序。
+
+### 地址空间布局
+
+对于 `baseAddress=0x60000000`、`strideBytes=0x40000` 和 `maxHarts=64` 的系统：
+
+```
+Hart 0:  0x60000000 – 0x6003FFFF
+Hart 1:  0x60040000 – 0x6007FFFF
+Hart 2:  0x60080000 – 0x600BFFFF
+...
+Hart 63: 0x60FC0000 – 0x60FFFFFF
+```
+
+## 功能
+
+### UART 输出
+
+每个核可以通过其 SCU 区域向模拟 UART 写入字符：
+
+```c
+// Hart ID 从访问核的地址空间自动推导
+volatile uint8_t *scu_uart = (volatile uint8_t *)0x60000000;  // Hart 0
+*scu_uart = 'A';  // 写入字符
+
+// 从 Hart 1：
+scu_uart = (volatile uint8_t *)0x60040000;
+*scu_uart = 'B';
+```
+
+DPI-C 桥（`SCUWriteDPI`）接收 hart ID 和字符，将输出路由到模拟控制台。
+
+### 模拟退出
+
+核可以通过写入退出代码来终止模拟：
+
+```c
+volatile uint32_t *scu_exit = (volatile uint32_t *)0x60000004;  // Hart 0
+*scu_exit = 0;  // 以代码 0 退出
+```
+
+SCU 捕获 hart ID 和退出代码，触发模拟终止。
+
+### 屏障同步（多心）
+
+SCU 提供每核屏障寄存器。当核写入其屏障地址时，它阻塞直到所有参与的核都已写入：
+
+```c
+// Hart 0
+volatile int *scu_barrier_h0 = (volatile int *)0x60000008;
+*scu_barrier_h0 = 1;  // 阻塞直到所有核到达屏障
+
+// Hart 1（同一瓦片，不同 hart ID）
+volatile int *scu_barrier_h1 = (volatile int *)0x60040008;
+*scu_barrier_h1 = 1;  // 阻塞直到所有核到达屏障
+```
+
+这不同于每瓦片 **BarrierUnit**（在 Goban 中使用），后者只同步单个瓦片内的核。
+
+## 与 P2E 线束的集成
+
+SCU 是 P2E 模拟配置中的标准组件：
+
+```scala
+class WithSCU(
+  baseAddress:    BigInt = BigInt("60000000", 16),
+  strideBytes:    BigInt = BigInt("40000", 16),
+  totalSizeBytes: BigInt = BigInt("10000000", 16),
+  maxHarts:       Int = 64
+) extends Config(
+  new sims.scu.CanHavePeripherySCU ++
+    new chipyard.config.WithTLSimpleUART
+)
+```
+
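+The configuration sets:
+
+1. The SCU parameters (base address, stride, total size, maximum number of cores)
+2. A DigitalTop replacement that includes the SCU on the control bus (CBUS)
+3. An optional TileLink UART for character output
+
+### Elaboration for P2E
+
+The SCU attaches to the system's TileLink interconnect: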
+
+```scala
+val scu = LazyModule(new TLSCU(SCUParams(...), beatBytes))
+cbus.attach(scu.node)
+```
+
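+Every core in the system can reach its own SCU region through ordinary TileLink reads and writes.
+
+## DPI-C Bridge
+
+The `SCUWriteDPI` Verilog module acts as a single black-box bridge for all cores: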
+
+```verilog
+module SCUWriteDPI(
+  input clock,
+  input reset,
+  input [31:0] uart_hart_id,
+  input uart_valid,
+  input [7:0] uart_data,
+  input [31:0] exit_hart_id,
+  input exit_valid,
+  input [31:0] exit_code
+);
+
+  import "DPI-C" context function void scu_uart_write(
+    input int unsigned hart_id,
+    input int unsigned ch
+  );
+  
+  import "DPI-C" context function void scu_sim_exit(
+    input int unsigned hart_id,
+    input int unsigned code
+  );
+```
+
+This single module replaces the per-core DPI modules, reducing duplication in Verilog elaboration and C imports.
+
+## Programming Example: Multi-Core Test
+
+```c
+#include <stdint.h>
+#include <stdio.h>
+
+#define SCU_BASE 0x60000000
+#define SCU_STRIDE 0x40000
+#define SCU_UART_OFFSET 0x00
+#define SCU_EXIT_OFFSET 0x04
+#define SCU_BARRIER_OFFSET 0x08
+
+int get_hart_id() {
+  int hart_id;
+  asm volatile("csrr %0, mhartid" : "=r"(hart_id));
+  return hart_id;
+}
+
+void scu_write_char(int hart_id, char c) {
+  volatile uint8_t *uart = (volatile uint8_t *)
+    (SCU_BASE + hart_id * SCU_STRIDE + SCU_UART_OFFSET);
+  *uart = c;
+}
+
+void scu_exit(int hart_id, int code) {
+  volatile uint32_t *exit = (volatile uint32_t *)
+    (SCU_BASE + hart_id * SCU_STRIDE + SCU_EXIT_OFFSET);
+  *exit = code;
+}
+
+void scu_barrier_wait(int hart_id) {
+  volatile int *barrier = (volatile int *)
+    (SCU_BASE + hart_id * SCU_STRIDE + SCU_BARRIER_OFFSET);
+  *barrier = 1;  // blocks until every participating core has written
+}
+
+int main() {
+  int hart_id = get_hart_id();
+  
+  scu_write_char(hart_id, 'H');
+  scu_write_char(hart_id, 'i');
+  
+  scu_barrier_wait(hart_id);
+  
+  if (hart_id == 0) {
+    scu_write_char(hart_id, '\n');
+  }
+  
+  scu_exit(hart_id, 0);
+  return 0;
+}
+```
+
+## Configuration in P2E
+
+To enable the SCU in a P2E configuration:
+
+```scala
+class P2EWithSCU extends Config(
+  new sims.scu.WithSCU(
+    baseAddress = BigInt("60000000", 16),
+    strideBytes = BigInt("40000", 16),
+    totalSizeBytes = BigInt("10000000", 16),
+    maxHarts = 64
+  ) ++
+    new sims.p2e.P2EBaseConfig
+)
+```
+
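+## Performance Characteristics
+
+- **UART write**: roughly 1–2 cycles to propagate through the DPI-C callback
+- **Barrier**: roughly 10–20 cycles of overhead, depending on interconnect latency
+- **Exit**: terminates the simulation immediately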
+
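+## Troubleshooting
+
+### Core Cannot Find the SCU Address Space
+
+**Symptom**: a load or store to an SCU address raises an exception.
+
+**Checks**:
+
+1. Verify that the SCU is enabled in the system configuration
+2. Confirm that the hart ID is < the `maxHarts` parameter
+3. Compute the expected address: `baseAddress + hartId * strideBytes`
+
+### Barrier Hangs Indefinitely
+
+**Symptom**: the simulation does not advance; multiple cores are blocked at the barrier.
+
+**Checks**:
+
+1. Verify that every participating core has reached the barrier write
+2. Check that the hart ID is computed correctly and is < `maxHarts`
+3. Inspect the VCD waveform to see which cores have written the barrier register
+
+### Characters Do Not Appear on the Console
+
+**Symptom**: UART writes to the SCU print nothing.
+
+**Checks**:
+
+1. Verify that the UART is routed to the simulation console (check the simulation log)
+2. Confirm that the write targets the correct offset (`SCU_UART_OFFSET = 0x00`)
+3. Check that the DPI-C bridge is wired up correctly in the top-level Verilog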
diff --git a/content/zh/Tutorial/Embench Benchmark Suite.md b/content/zh/Tutorial/Embench Benchmark Suite.md
new file mode 100644
index 0000000..8ebe0d4
--- /dev/null
+++ b/content/zh/Tutorial/Embench Benchmark Suite.md	
@@ -0,0 +1,279 @@
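+# Embench Benchmark Suite
+
+## Overview
+
+Embench is a benchmark suite for embedded systems, integrated into Buckyball for workload and performance analysis. It provides standardized performance tests across a range of algorithms, including cryptography, compression, numerical computation, and signal processing.
+
+## Benchmark Categories
+
+### Cryptography and Hashing
+
+- **aha-mont64**: Montgomery multiplication, as used in elliptic-curve cryptography
+- **nettle-aes**: Advanced Encryption Standard (AES) implementation
+- **nettle-sha256**: SHA-256 cryptographic hash function
+- **md5sum**: MD5 hash computation
+
+### Compression and Encoding
+
+- **huffbench**: Huffman compression algorithm
+- **slre**: regular expression engine
+- **picojpeg**: JPEG image decoder
+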
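+### Math and Signal Processing
+
+- **cubic**: cubic equation solver using floating-point arithmetic
+- **nbody**: N-body physics simulation
+- **matmult-int**: integer matrix multiplication
+- **edn**: DSP kernels (vector multiplication, FIR filtering, and similar)
+
+### Data Structures and Algorithms
+
+- **sglib-combined**: generic sorting and data structure library
+- **statemate**: finite state machine simulation
+- **tarfind**: archive search algorithm
+- **qrduino**: QR code generation
+- **crc32**: cyclic redundancy check
+- **primecount**: prime number counting
+- **ud**: LU decomposition
+- **minver**: floating-point matrix inversion
+- **wikisort**: stable sorting algorithm
+- **nsichneu**: large Petri net simulation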
+
+## Directory Layout
+
+```
+bb-tests/workloads/src/CTest/toy/embench/
+├── README.md                    # Embench documentation
+├── CMakeLists.txt              # Build configuration
+├── crt0.S                       # Startup code
+├── src/                         # Individual benchmark implementations
+│   ├── aha-mont64/
+│   ├── nettle-aes/
+│   ├── nettle-sha256/
+│   ├── md5sum/
+│   └── (other benchmarks)
+└── support/                     # Shared utilities
+    ├── main.c                   # Unified entry point
+    ├── boardsupport.c          # Platform initialization
+    ├── chipsupport.c           # Chip-specific functions
+    ├── beebsc.c                # Benchmark control interface
+    └── support.h               # Common definitions
+```
+
+## Building Embench
+
+### Prerequisites
+
+Make sure the Buckyball toolchain is set up correctly:
+
+```bash
+source sourceme.sh
+```
+
+### Building All Benchmarks
+
+```bash
+cd bb-tests/workloads
+cmake -B build -DCTEST_TARGET=toy -DCTEST_NAME=embench
+cmake --build build
+```
+
+This produces one binary per benchmark:
+
+```
+build/toy/embench/mont64
+build/toy/embench/aes
+build/toy/embench/sha256
+build/toy/embench/matrix-mult
+...
+```
+
+### Building a Single Benchmark
+
+```bash
+cd bb-tests/workloads
+cmake -B build -DCTEST_TARGET=toy -DCTEST_NAME=embench -DBENCH_FILTER=sha256
+cmake --build build
+```
+
+## Running Benchmarks
+
+### Verilator Simulation
+
+```bash
+bbdev verilator --run \
+  '--binary embench/mont64-baremetal \
+    --config sims.verilator.BuckyballToyVerilatorConfig \
+    --batch'
+```
+
+### P2E Simulation
+
+```bash
+bbdev p2e --run \
+  '--binary embench/aes-baremetal \
+    --config sims.p2e.P2EToyConfig \
+    --batch'
+```
+
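+## Performance Metrics
+
+Each benchmark measures:
+
+- **Cycle count**: total cycles to completion
+- **Instructions executed**: dynamic instruction count
+- **Memory traffic**: load/store operations
+- **Completion time**: wall-clock time of the simulation
+
+### Extracting Results
+
+The simulation output includes the benchmark statistics: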
+
+```
+Benchmark: sha256
+Cycles: 142857
+Instructions: 98765
+Memory ops: 12345
+```
+
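+Parse these metrics to evaluate:
+
+1. **Instruction efficiency**: instructions per cycle (IPC)
+2. **Memory efficiency**: cache hit rate and bandwidth utilization
+3. **Compute density**: operations per watt, for post-silicon analysis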
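+
+As a minimal sketch of the first two derived metrics, the C helpers below compute IPC and a rough memory-intensity figure from the counters in the sample output. The function names are hypothetical, and memory operations per thousand instructions is an illustrative stand-in: cache hit rate and bandwidth cannot be derived from these three counters alone.
+
+```c
+#include <stdint.h>
+
+/* Instructions per cycle; returns 0.0 when no cycles were recorded. */
+static inline double ipc(uint64_t instructions, uint64_t cycles) {
+  return cycles == 0 ? 0.0 : (double)instructions / (double)cycles;
+}
+
+/* Memory operations per 1000 instructions; returns 0.0 for an empty run. */
+static inline double mem_ops_per_kilo_instr(uint64_t mem_ops,
+                                            uint64_t instructions) {
+  return instructions == 0 ? 0.0
+                           : 1000.0 * (double)mem_ops / (double)instructions;
+}
+```
+
+For the sample output above (98765 instructions in 142857 cycles), `ipc` returns roughly 0.69.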
+
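+## Benchmark Details
+
+### aha-mont64 (Cryptography)
+
+Montgomery multiplication for elliptic-curve operations. Exercises:
+- Modular arithmetic performance
+- Register pressure under heavy computation
+- Numerical stability with large integers
+
+Expected cycles on Buckyball Toy: 50,000–100,000
+
+### nettle-aes (Encryption)
+
+AES block cipher implementation. Exercises:
+- Lookup-table efficiency (S-box access patterns)
+- Tight-loop performance
+- Data-dependent cache behavior
+
+Expected cycles: 200,000–300,000
+
+### nettle-sha256 (Hashing)
+
+SHA-256 cryptographic hash. Exercises:
+- Bitwise operation efficiency
+- Memory access patterns during state updates
+- Branch prediction on loop-heavy code
+
+Expected cycles: 80,000–150,000
+
+### matmult-int (Matrix Multiplication)
+
+Integer matrix multiplication with a configurable size. Exercises:
+- Loop-nest optimization
+- Cache locality of 2D data accesses
+- Arithmetic pipeline utilization
+
+Expected cycles: 10,000–50,000 (size dependent)
+
+### nbody (Physics Simulation)
+
+N-body gravitational simulation. Exercises:
+- Floating-point compute density
+- Irregular memory access patterns
+- Compiler optimization of numerical kernels
+
+Expected cycles: 500,000–1,000,000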
+
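+## Customization
+
+### Adding a Custom Benchmark
+
+1. Create a directory under `src/`:
+   ```bash
+   mkdir -p bb-tests/workloads/src/CTest/toy/embench/src/my-bench
+   ```
+
+2. Implement the benchmark in C:
+   ```c
+   // my-bench/mybench.c
+   #include "../../support/support.h"
+   
+   int main() {
+     int result = 0;
+     // benchmark computation
+     return result;
+   }
+   ```
+
+3. Update `CMakeLists.txt` to include the new benchmark in the build
+
+4. Rebuild and run it through the standard test infrastructure
+
+### Modifying Benchmark Parameters
+
+Some benchmarks support configurable parameters (matrix size, iteration count, and so on). Change them through:
+
+- Preprocessor definitions in `CMakeLists.txt`
+- Environment settings in `crt0.S`
+- Direct source edits in the benchmark's `.c` file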
+
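+## Interpreting Results
+
+### Performance Regression Detection
+
+Track benchmark cycle counts across Buckyball releases:
+
+```bash
+# Baseline: run this from a checkout of the previous release
+baseline_cycles=$(bbdev verilator --run --binary embench/sha256-baremetal | awk '/Cycles:/ {print $2}')
+
+# Current release: run this from the current checkout
+current_cycles=$(bbdev verilator --run --binary embench/sha256-baremetal | awk '/Cycles:/ {print $2}')
+
+# Compute the change as an integer percentage (positive means slower)
+regression=$(( (current_cycles - baseline_cycles) * 100 / baseline_cycles ))
+echo "Performance change: ${regression}%"
+```
+
+A regression greater than 5% points to a potential architectural or compiler issue.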
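+
+The same check can be written with fractional precision, which integer shell arithmetic truncates away. This is a minimal C sketch under the assumption that both cycle counts have already been extracted from the simulation output; the function names are hypothetical, and the threshold is a parameter so the 5% figure above is not hard-coded.
+
+```c
+#include <stdint.h>
+
+/* Percentage change in cycle count; positive means the current run is slower.
+ * Returns 0.0 when the baseline is zero to avoid dividing by zero. */
+static inline double cycle_change_percent(uint64_t baseline, uint64_t current) {
+  if (baseline == 0) {
+    return 0.0;
+  }
+  return ((double)current - (double)baseline) * 100.0 / (double)baseline;
+}
+
+/* Non-zero when the slowdown exceeds threshold_percent (e.g. 5.0). */
+static inline int is_regression(uint64_t baseline, uint64_t current,
+                                double threshold_percent) {
+  return cycle_change_percent(baseline, current) > threshold_percent;
+}
+```
+
+For example, a baseline of 100000 cycles and a current count of 106000 cycles is a 6% slowdown, so `is_regression(100000, 106000, 5.0)` is non-zero.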
+
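+### Workload Classification
+
+Use Embench to classify how well Buckyball suits different application domains:
+
+- **Cryptography-heavy**: run aha-mont64, nettle-aes, and nettle-sha256; compare against the target ASIC
+- **Data processing**: run sglib-combined and huffbench; measure memory efficiency
+- **Numerical**: run nbody and cubic; evaluate floating-point pipeline utilization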
+
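+## Known Issues
+
+### Benchmarks Hang on Large Problem Sizes
+
+Some benchmarks (such as nsichneu) can time out in simulation when the problem data set is large. Reduce the iteration count or the problem size:
+
+```bash
+# Edit the benchmark source to reduce the problem size
+sed -i 's/MAX_ITERATIONS 1000000/MAX_ITERATIONS 10000/' src/nsichneu/libnsichneu.c
+```
+
+### Memory Overflow in Embedded Contexts
+
+Embench benchmarks are designed for a standard C environment. Some (such as picojpeg) need large buffers. Verify the available memory:
+
+```bash
+# Check the DRAM size in the linker script
+grep -A 2 "DRAM :" *.ld
+```
+
+If it is insufficient, enable out-of-core simulation or increase the DDR size in the simulation configuration.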
+
+## References
+
+- **Embench (official site)**: https://www.embench.org/
+- **RISC-V software conventions**: https://github.com/riscv-non-isa/riscv-elf-psabi-doc
+- **Buckyball workload integration**: `bb-tests/workloads/README.md`