From 6b1d17d56da1e633f8bf8309e41acdc84713f64f Mon Sep 17 00:00:00 2001 From: Cursor Agent Date: Mon, 18 May 2026 01:06:39 +0000 Subject: [PATCH] sync: add documentation for cache hierarchy, SCU, Embench, and Goban configs Co-authored-by: Shiroha --- .../Cache Hierarchy and Private Data Cache.md | 200 +++++++++++++ .../Goban Multi-Core Architecture.md | 23 +- .../en/ToolChain Guide/System Control Unit.md | 256 ++++++++++++++++ .../en/Tutorial/Embench Benchmark Suite.md | 279 ++++++++++++++++++ .../Cache Hierarchy and Private Data Cache.md | 200 +++++++++++++ .../Goban Multi-Core Architecture.md | 23 +- .../zh/ToolChain Guide/System Control Unit.md | 256 ++++++++++++++++ .../zh/Tutorial/Embench Benchmark Suite.md | 279 ++++++++++++++++++ 8 files changed, 1504 insertions(+), 12 deletions(-) create mode 100644 content/en/Architecture/Cache Hierarchy and Private Data Cache.md create mode 100644 content/en/ToolChain Guide/System Control Unit.md create mode 100644 content/en/Tutorial/Embench Benchmark Suite.md create mode 100644 content/zh/Architecture/Cache Hierarchy and Private Data Cache.md create mode 100644 content/zh/ToolChain Guide/System Control Unit.md create mode 100644 content/zh/Tutorial/Embench Benchmark Suite.md diff --git a/content/en/Architecture/Cache Hierarchy and Private Data Cache.md b/content/en/Architecture/Cache Hierarchy and Private Data Cache.md new file mode 100644 index 0000000..4c45503 --- /dev/null +++ b/content/en/Architecture/Cache Hierarchy and Private Data Cache.md @@ -0,0 +1,200 @@ +# Cache Hierarchy and Private Data Cache + +## Overview + +Buckyball employs a configurable cache hierarchy to optimize memory access patterns for diverse workloads. Each tile supports private per-core instruction and data caches, with optional per-tile inclusive L2 caches. The recent redesign of the private data cache (dcache) simplifies coherency semantics when private caches are present, shifting memory management responsibility to software. 
+ +## Cache Architecture + +### Core-Level Caches + +Each Rocket core in a tile has: + +- **Instruction Cache (I-Cache)**: Private, read-only, typically 16–32 KB +- **Data Cache (D-Cache)**: Private, read-write, typically 16–32 KB + +Both caches use write-through semantics to the L1 miss handling logic. + +### Per-Tile Inclusive L2 Cache + +An optional per-tile L2 cache sits between the core's L1 caches and the system interconnect. The L2 is **inclusive** — it holds a superset of L1 cache lines. + +**L2 Configuration Parameters**: + +- **ways**: Cache associativity (typically 8–16) +- **sets**: Number of sets per way (typically 256–512) +- **writeBytes**: Write buffer depth +- **portFactor**: Memory port provisioning factor +- **memCycles**: Estimated memory latency for performance modeling + +**Topology with L2**: + +``` +Core L1 I-Cache → ──┐ +Core L1 D-Cache → ──┤ Inclusive L2 → Cork Unit → System Memory + ──┘ +``` + +The cork unit (TLCacheCork) manages coherency between the L2 and other system agents. + +### System-Level Coherency + +When a tile has a private L2 cache, the system disables distributed coherency for that tile's memory domain. This is because: + +1. The inclusive L2 acts as a **single point of coherency** for all L1 accesses from cores in that tile +2. External agents (other tiles, DMA) cannot directly observe L1 state +3. Software must manage coherency explicitly through memory barriers and cache flush operations + +This design is known as **software-managed coherency** and is appropriate for: + +- Workloads with static data partitioning (e.g., SPMD) +- Systems where inter-tile communication is infrequent +- Configurations prioritizing energy efficiency over automatic coherency + +## Private Data Cache Redesign + +### Motivation + +Prior Buckyball versions could maintain hardware-managed coherency even with private L1 caches. This required complex coherency protocols and increased logic overhead. 
+
+The new design simplifies the tile by requiring software to handle coherency when private caches are configured. Hardware no longer maintains a last-level cache (LLC) in the coherent subsystem when tile caches are private.
+
+### Configuration Model
+
+**Private Cache Mode**:
+
+When a tile is configured with a private dcache:
+
+- No LLC is present in the coherent subsystem for that tile's memory domain
+- All L1 misses flow through the tile's memory backend
+- Software must flush caches and use memory barriers to ensure coherency
+
+**Implications**:
+
+1. **Data Consistency**: Software must explicitly synchronize caches before sharing data with other tiles
+2. **Memory Barriers**: Add barriers around shared-memory access to enforce ordering
+3. **Cache Flush**: Use platform-specific cache flush instructions before publishing data
+
+### Example: Multi-Tile SPMD with Private Caches
+
+```c
+// Assumed to be provided by the platform or workload.
+int bb_get_tile_id(void);
+int compute_tile_result(int tile_id);
+
+#define SCU_BASE           0x60000000UL  // SCU base address (multi-hart SCU)
+#define SCU_STRIDE         0x40000UL     // Per-hart SCU stride
+#define SCU_BARRIER_OFFSET 0x08UL        // Barrier register within a hart's SCU region
+#define SHARED_DATA_ADDR   0x80100000UL
+#define NUM_TILES          4             // Assumed: matches a 4-tile configuration
+
+typedef struct {
+    int tile_id;
+    int result;
+} SharedData;
+
+static unsigned long read_hart_id(void) {
+    unsigned long hart_id;
+    asm volatile("csrr %0, mhartid" : "=r"(hart_id));
+    return hart_id;
+}
+
+void scu_barrier(void) {
+    // Each hart writes its own barrier register and blocks until all harts arrive
+    volatile int *barrier = (volatile int *)
+        (SCU_BASE + read_hart_id() * SCU_STRIDE + SCU_BARRIER_OFFSET);
+    *barrier = 1;
+}
+
+void cache_flush_dcache(void) {
+    // Platform-specific placeholder: `fence` only orders this hart's memory
+    // accesses and does not write back dirty lines. Substitute the platform's
+    // D-cache flush operation here.
+    asm volatile("fence" ::: "memory");
+}
+
+int main() {
+    int tile_id = bb_get_tile_id();
+    volatile SharedData *shared = (volatile SharedData *)SHARED_DATA_ADDR;
+
+    // Phase 1: Per-tile computation
+    int local_result = compute_tile_result(tile_id);
+
+    // Phase 2: Publish into this tile's own slot, then flush so the write reaches memory
+    shared[tile_id].tile_id = tile_id;
+    shared[tile_id].result = local_result;
+    cache_flush_dcache();
+
+    // Phase 3: Global synchronization
+    scu_barrier();
+
+    // Phase 4: Read data published by a peer tile
+    // (flush/invalidate first so stale local lines are not read)
+    cache_flush_dcache();
+    int peer_result = shared[(tile_id + 1) % NUM_TILES].result;
+    (void)peer_result;
+
+    return 0;
+}
+```
+
+## Configuration Examples
+
+### Enabling Private L2 Cache + +Define a custom config with L2: + +```scala +import org.chipsalliance.cde.config._ +import freechips.rocketchip.subsystem._ + +class GobanWithPrivateL2 extends Config( + new examples.goban.BuckyballGoban4T16CConfig ++ + new WithL2Cache( + ways = 8, + sets = 512, + writeBytes = 64, + portFactor = 1, + memCycles = 10 + ) +) +``` + +### Disabling L2 (Default) + +By default, Goban uses only core L1 caches: + +```scala +class GobanL1Only extends Config( + new examples.goban.BuckyballGoban4T16CConfig +) +``` + +With this configuration, each tile operates independently with software-managed inter-tile coherency. + +## Performance Considerations + +### Private Cache Benefits + +- **Reduced coherency overhead**: No broadcast bus for L1 evictions +- **Predictable memory timing**: Private L2 eliminates conflict misses from other tiles +- **Energy efficiency**: Lower coherency traffic reduces power consumption + +### Software Coherency Costs + +- **Explicit flushes**: Cache management instructions increase code size and latency +- **False sharing**: Software must partition data carefully to avoid cache line conflicts +- **Synchronization latency**: Barriers impose serialization points + +## Troubleshooting + +### Data Coherency Issues + +**Symptom**: Values written by one tile are not visible to other tiles after synchronization. + +**Check**: + +1. Verify cache flush instructions are present before publishing data +2. Use memory barriers (`fence` in RISC-V) before and after shared-memory access +3. Confirm synchronization point (e.g., `scu_barrier()`) is called after flush + +### Performance Degradation + +**Symptom**: Execution time increases significantly with L2 disabled compared to hardware-coherent mode. + +**Check**: + +1. Profile memory access patterns to identify frequent cache misses +2. Consider enabling L2 if L1 miss rate is high (>10%) +3. 
Verify working set fits within combined L1+L2 capacity + +## References + +- **Inclusive Cache**: SiFive InclusiveCache documentation +- **RISC-V Memory Ordering**: RISC-V ISA specification, Chapter 8 (Memory Model) +- **Cork Unit**: TileLink cache management in Rocket Chip diff --git a/content/en/Architecture/Goban Multi-Core Architecture.md b/content/en/Architecture/Goban Multi-Core Architecture.md index 9644102..c9aa9e5 100644 --- a/content/en/Architecture/Goban Multi-Core Architecture.md +++ b/content/en/Architecture/Goban Multi-Core Architecture.md @@ -22,15 +22,26 @@ Goban is a multi-core BBTile configuration in Buckyball that enables parallel ex ### Configuration Variants -**BuckyballGobanConfig** -- 1 BBTile × 4 cores +Goban supports multiple configuration sizes: + +**1t4c** — 1 tile × 4 cores - 4 Rocket cores + 4 BuckyballAccelerators - Single SharedMem + BarrierUnit +- Minimal memory footprint, suitable for single-tile testing + +**4t16c** — 4 tiles × 4 cores = 16 total cores +- 16 Rocket cores + 16 BuckyballAccelerators +- Per-tile memory domains and synchronization + +**8t8c** — 8 tiles × 8 cores = 64 total cores +- 64 Rocket cores + 64 BuckyballAccelerators +- Per-tile synchronization, scaled memory system + +**Legacy configurations:** +- `BuckyballGobanConfig` — 1 BBTile × 4 cores +- `BuckyballGoban2TileConfig` — 2 BBTiles × 4 cores = 8 total cores -**BuckyballGoban2TileConfig** -- 2 BBTiles × 4 cores = 8 total cores -- 8 Rocket cores + 8 BuckyballAccelerators -- Per-tile SharedMem + BarrierUnit +All variants maintain the same per-core execution model and barrier synchronization semantics across tiles. 
## Core Components diff --git a/content/en/ToolChain Guide/System Control Unit.md b/content/en/ToolChain Guide/System Control Unit.md new file mode 100644 index 0000000..d5c7adf --- /dev/null +++ b/content/en/ToolChain Guide/System Control Unit.md @@ -0,0 +1,256 @@ +# System Control Unit (SCU) + +## Overview + +The System Control Unit (SCU) is a global multi-hart device in Buckyball that provides simulation and inter-hart control functionality. Unlike earlier single-hart designs, the current SCU serves all harts in the system and is accessed through a per-hart addressable memory-mapped I/O interface. + +## Architecture + +### Global Multi-Hart Design + +The SCU is instantiated once at the system level and provides a unified interface for all harts. Each hart has a dedicated sub-region of the SCU address space, calculated as: + +``` +hart_address = baseAddress + hartId * strideBytes +``` + +**Parameters**: + +- `baseAddress`: Base address of SCU memory region (typically 0x6000_0000) +- `strideBytes`: Per-hart address stride (must be a power of two, e.g., 0x40000) +- `totalSizeBytes`: Total addressable SCU region (must be a power of two, e.g., 0x1000_0000) +- `maxHarts`: Maximum number of harts the SCU supports (e.g., 64) + +**Validation**: + +``` +maxHarts * strideBytes <= totalSizeBytes +``` + +Addresses for hart IDs ≥ `maxHarts` fall through to the system bus unmapped address handler. + +### Address Space Layout + +For a system with `baseAddress=0x60000000`, `strideBytes=0x40000`, and `maxHarts=64`: + +``` +Hart 0: 0x60000000 – 0x6003FFFF +Hart 1: 0x60040000 – 0x6007FFFF +Hart 2: 0x60080000 – 0x600BFFFF +... 
+Hart 63: 0x60FC0000 – 0x60FFFFFF
+```
+
+## Functionality
+
+### UART Output
+
+Each hart can write characters to the simulation UART via its SCU region:
+
+```c
+// The SCU infers the hart ID from the per-hart region being addressed
+volatile uint8_t *scu_uart = (volatile uint8_t *)0x60000000; // Hart 0
+*scu_uart = 'A'; // Write character
+
+// From hart 1:
+scu_uart = (volatile uint8_t *)0x60040000;
+*scu_uart = 'B';
+```
+
+The DPI-C bridge (`SCUWriteDPI`) receives the hart ID and character, routing output to the simulation console.
+
+### Simulation Exit
+
+Harts can terminate simulation by writing an exit code:
+
+```c
+volatile uint32_t *scu_exit = (volatile uint32_t *)0x60000004; // Hart 0
+*scu_exit = 0; // Exit with code 0
+```
+
+The SCU captures the hart ID and exit code, triggering simulation termination.
+
+### Barrier Synchronization (Multi-Hart)
+
+The SCU provides a per-hart barrier register. When a hart writes to its barrier address, it blocks until all participating harts have written:
+
+```c
+// Hart 0
+volatile int *scu_barrier_h0 = (volatile int *)0x60000008;
+*scu_barrier_h0 = 1; // Block until all harts reach barrier
+
+// Hart 1 (on same tile, different hart ID)
+volatile int *scu_barrier_h1 = (volatile int *)0x60040008;
+*scu_barrier_h1 = 1; // Block until all harts reach barrier
+```
+
+This differs from the per-tile **BarrierUnit** (used in Goban), which synchronizes only cores within a single tile.
+
+## Integration with P2E Harness
+
+The SCU is a standard component in P2E simulation configurations:
+
+```scala
+class WithSCU(
+  baseAddress: BigInt = BigInt("60000000", 16),
+  strideBytes: BigInt = BigInt("40000", 16),
+  totalSizeBytes: BigInt = BigInt("10000000", 16),
+  maxHarts: Int = 64
+) extends Config(
+  new sims.scu.CanHavePeripherySCU ++
+  new chipyard.config.WithTLSimpleUART
+)
+```
+
+The configuration sets:
+
+1. SCU parameters (base, stride, total size, max harts)
+2. DigitalTop replacement to include SCU on the control bus (CBUS)
+3. Optional TileLink UART for character output
+
+### Elaboration into P2E
+
+The SCU is wired to the system's TileLink interconnect:
+
+```scala
+val scu = LazyModule(new TLSCU(SCUParams(...), beatBytes))
+cbus.attach(scu.node)
+```
+
+All harts in the system can access their respective SCU regions via normal TileLink reads/writes.
+
+## DPI-C Bridge
+
+The `SCUWriteDPI` Verilog module acts as a single black-box bridge for all harts:
+
+```verilog
+module SCUWriteDPI(
+  input clock,
+  input reset,
+  input [31:0] uart_hart_id,
+  input uart_valid,
+  input [7:0] uart_data,
+  input [31:0] exit_hart_id,
+  input exit_valid,
+  input [31:0] exit_code
+);
+
+  import "DPI-C" context function void scu_uart_write(
+    input int unsigned hart_id,
+    input int unsigned ch
+  );
+
+  import "DPI-C" context function void scu_sim_exit(
+    input int unsigned hart_id,
+    input int unsigned code
+  );
+```
+
+This single module replaces per-hart DPI modules, reducing Verilog elaboration and C import duplication.
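
The C side of the two DPI-C imports is not shown in this document. The following is a minimal sketch of what it could look like, assuming the simulator links plain C functions named after the imports. The per-hart line buffering, the `[hart N]` prefix, the buffer sizes, and the `scu_exit_*` variables polled by the simulator main loop are all illustrative assumptions, not the actual Buckyball implementation.

```c
#include <stdio.h>

#define SCU_MAX_HARTS 64   /* assumption: mirrors the maxHarts parameter */
#define SCU_LINE_MAX  128  /* assumption: arbitrary per-hart line buffer size */

/* One line buffer per hart so output from different harts does not interleave. */
static char scu_line[SCU_MAX_HARTS][SCU_LINE_MAX];
static unsigned int scu_len[SCU_MAX_HARTS];

/* Exit request recorded for the simulator main loop to poll (assumed design). */
static int scu_exit_pending = 0;
static unsigned int scu_exit_hart = 0;
static unsigned int scu_exit_status = 0;

/* Print one buffered line, tagged with the hart that produced it. */
static void scu_flush_line(unsigned int hart_id) {
    printf("[hart %u] %.*s\n", hart_id, (int)scu_len[hart_id], scu_line[hart_id]);
    fflush(stdout);
    scu_len[hart_id] = 0;
}

/* Matches: import "DPI-C" context function void scu_uart_write(hart_id, ch) */
void scu_uart_write(unsigned int hart_id, unsigned int ch) {
    if (hart_id >= SCU_MAX_HARTS) {
        return;  /* ignore out-of-range hart IDs */
    }
    char c = (char)(ch & 0xFFu);
    if (c == '\n') {
        scu_flush_line(hart_id);
        return;
    }
    scu_line[hart_id][scu_len[hart_id]++] = c;
    if (scu_len[hart_id] == SCU_LINE_MAX) {
        scu_flush_line(hart_id);  /* buffer full: emit what we have */
    }
}

/* Matches: import "DPI-C" context function void scu_sim_exit(hart_id, code) */
void scu_sim_exit(unsigned int hart_id, unsigned int code) {
    if (hart_id < SCU_MAX_HARTS && scu_len[hart_id] > 0) {
        scu_flush_line(hart_id);  /* do not lose a partial line */
    }
    scu_exit_pending = 1;
    scu_exit_hart = hart_id;
    scu_exit_status = code;
}
```

SystemVerilog `int unsigned` arguments map to C `unsigned int`, which is why both functions take that type. If the simulator harness is built as C++ (as with Verilator), these definitions would need `extern "C"` linkage.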
+
+## Programming Example: Multi-Hart Test
+
+```c
+#include <stdint.h>
+
+#define SCU_BASE           0x60000000UL
+#define SCU_STRIDE         0x40000UL
+#define SCU_UART_OFFSET    0x00UL
+#define SCU_EXIT_OFFSET    0x04UL
+#define SCU_BARRIER_OFFSET 0x08UL
+
+// Read this hart's ID from the mhartid CSR (requires M-mode)
+unsigned long get_hart_id(void) {
+    unsigned long hart_id;
+    asm volatile("csrr %0, mhartid" : "=r"(hart_id));
+    return hart_id;
+}
+
+// Address of a register inside the given hart's SCU region
+static inline uintptr_t scu_reg(unsigned long hart_id, unsigned long offset) {
+    return (uintptr_t)(SCU_BASE + hart_id * SCU_STRIDE + offset);
+}
+
+void scu_write_char(unsigned long hart_id, char c) {
+    volatile uint8_t *uart = (volatile uint8_t *)scu_reg(hart_id, SCU_UART_OFFSET);
+    *uart = (uint8_t)c;
+}
+
+void scu_exit(unsigned long hart_id, uint32_t code) {
+    volatile uint32_t *exit_reg = (volatile uint32_t *)scu_reg(hart_id, SCU_EXIT_OFFSET);
+    *exit_reg = code;
+}
+
+void scu_barrier_wait(unsigned long hart_id) {
+    volatile int *barrier = (volatile int *)scu_reg(hart_id, SCU_BARRIER_OFFSET);
+    *barrier = 1; // Block until all harts write
+}
+
+int main() {
+    unsigned long hart_id = get_hart_id();
+
+    scu_write_char(hart_id, 'H');
+    scu_write_char(hart_id, 'i');
+
+    scu_barrier_wait(hart_id);
+
+    if (hart_id == 0) {
+        scu_write_char(hart_id, '\n');
+    }
+
+    scu_exit(hart_id, 0);
+    return 0;
+}
+```
+
+## Configuration in P2E
+
+To enable SCU in a P2E configuration:
+
+```scala
+class P2EWithSCU extends Config(
+  new sims.scu.WithSCU(
+    baseAddress = BigInt("60000000", 16),
+    strideBytes = BigInt("40000", 16),
+    totalSizeBytes = BigInt("10000000", 16),
+    maxHarts = 64
+  ) ++
+  new sims.p2e.P2EBaseConfig
+)
+```
+
+## Performance Characteristics
+
+- **UART Write**: ~1–2 cycles to propagate through DPI-C callback
+- **Barrier**: ~10–20 cycles overhead depending on interconnect latency
+- **Exit**: Immediate simulation termination
+
+## Troubleshooting
+
+### Hart Cannot Find SCU Address Space
+
+**Symptom**: Load or store to SCU address results in exception.
+
+**Check**:
+
+1. Verify SCU is enabled in system configuration
+2. Confirm hart ID is < `maxHarts` parameter
+3. Calculate expected address: `baseAddress + hartId * strideBytes`
+
+### Barrier Hangs Indefinitely
+
+**Symptom**: Simulation does not advance; multiple harts blocked on barrier.
+
+**Check**:
+
+1. Verify all participating harts have reached the barrier instruction
+2. Check that hart IDs are correctly calculated and < `maxHarts`
+3. Inspect VCD waveforms to see which harts have written to barrier register
+
+### Characters Not Appearing in Console
+
+**Symptom**: UART writes to SCU do not print.
+
+**Check**:
+
+1. Verify UART is routed to simulation console (check simulation log)
+2. Confirm writes are to correct offset (`SCU_UART_OFFSET = 0x00`)
+3. Check that DPI-C bridge is properly connected in top-level Verilog
diff --git a/content/en/Tutorial/Embench Benchmark Suite.md b/content/en/Tutorial/Embench Benchmark Suite.md
new file mode 100644
index 0000000..33ffe5f
--- /dev/null
+++ b/content/en/Tutorial/Embench Benchmark Suite.md
@@ -0,0 +1,279 @@
+# Embench Benchmark Suite
+
+## Overview
+
+Embench is a comprehensive embedded systems benchmark suite integrated into Buckyball for workload and performance analysis. It provides standardized performance tests across diverse algorithms including cryptography, compression, mathematical computation, and signal processing.
+
+## Benchmark Categories
+
+### Cryptography and Hashing
+
+- **aha-mont64**: Montgomery multiplication for elliptic curve cryptography
+- **nettle-aes**: Advanced Encryption Standard implementation
+- **nettle-sha256**: SHA-256 cryptographic hash function
+- **md5sum**: MD5 hash computation
+
+### Compression and Encoding
+
+- **huffbench**: Huffman compression algorithm
+- **slre**: Regular expression engine
+- **picojpeg**: JPEG image decoder
+
+### Mathematical and Signal Processing
+
+- **cubic**: Cubic equation solver with floating-point arithmetic
+- **nbody**: N-body physics simulation
+- **matmult-int**: Integer matrix multiplication
+- **edn**: Digital signal processing kernels (FIR filter and related vector operations)
+
+### Data Structures and Algorithms
+
+- **sglib-combined**: Generic sorting and data structure library
+- **statemate**: Finite state machine simulation
+- **tarfind**: Archive search algorithm
+- **qrduino**: QR code generation
+- **crc32**: Cyclic redundancy check
+- **primecount**: Prime number counting
+- **ud**: LU decomposition
+- **minver**: Floating-point matrix inversion
+- **wikisort**: Stable sorting algorithm
+- **nsichneu**: Petri net simulation
+
+## Directory Structure
+
+```
+bb-tests/workloads/src/CTest/toy/embench/
+├── README.md           # Embench documentation
+├── CMakeLists.txt      # Build configuration
+├── crt0.S              # Startup code
+├── src/                # Individual benchmark implementations
+│   ├── aha-mont64/
+│   ├── nettle-aes/
+│   ├── nettle-sha256/
+│   ├── md5sum/
+│   └── (other benchmarks)
+└── support/            # Common utilities
+    ├── main.c          # Unified entry point
+    ├── boardsupport.c  # Platform initialization
+    ├── chipsupport.c   # Chip-specific features
+    ├── beebsc.c        # Benchmark control interface
+    └── support.h       # Common definitions
+```
+
+## Building Embench
+
+### Prerequisites
+
+Ensure the Buckyball toolchain is properly configured:
+
+```bash
+source sourceme.sh
+```
+
+### Build All Benchmarks
+
+```bash
+cd bb-tests/workloads
+cmake -B build -DCTEST_TARGET=toy 
-DCTEST_NAME=embench +cmake --build build +``` + +This generates individual benchmark binaries: + +``` +build/toy/embench/mont64 +build/toy/embench/aes +build/toy/embench/sha256 +build/toy/embench/matrix-mult +... +``` + +### Build Single Benchmark + +```bash +cd bb-tests/workloads +cmake -B build -DCTEST_TARGET=toy -DCTEST_NAME=embench -DBENCH_FILTER=sha256 +cmake --build build +``` + +## Running Benchmarks + +### Verilator Simulation + +```bash +bbdev verilator --run \ + '--binary embench/mont64-baremetal \ + --config sims.verilator.BuckyballToyVerilatorConfig \ + --batch' +``` + +### P2E Simulation + +```bash +bbdev p2e --run \ + '--binary embench/aes-baremetal \ + --config sims.p2e.P2EToyConfig \ + --batch' +``` + +## Performance Metrics + +Each benchmark measures: + +- **Cycle count**: Total cycles to completion +- **Instructions executed**: Dynamic instruction count +- **Memory traffic**: Load/store operations +- **Time-to-completion**: Wall-clock time in simulation + +### Extracting Results + +Simulation output includes benchmark statistics: + +``` +Benchmark: sha256 +Cycles: 142857 +Instructions: 98765 +Memory ops: 12345 +``` + +Parse these metrics to evaluate: + +1. **Instruction efficiency**: Instructions per cycle (IPC) +2. **Memory efficiency**: Cache hit rates and bandwidth utilization +3. **Compute density**: Operations per watt in post-silicon analysis + +## Benchmark Details + +### aha-mont64 (Cryptography) + +Montgomery multiplication for elliptic curve operations. Tests: +- Modular arithmetic performance +- Register pressure under heavy computation +- Numeric stability with large integers + +Expected cycles: 50,000–100,000 on Buckyball Toy + +### nettle-aes (Encryption) + +AES block cipher implementation. Tests: +- Lookup table efficiency (S-box access patterns) +- Tight loop performance +- Data-dependent cache behavior + +Expected cycles: 200,000–300,000 + +### nettle-sha256 (Hashing) + +SHA-256 cryptographic hash. 
Tests: +- Bitwise operation efficiency +- Memory access patterns during state updates +- Branch prediction with loop-heavy code + +Expected cycles: 80,000–150,000 + +### matmult-int (Matrix Multiplication) + +Integer matrix multiplication with configurable sizes. Tests: +- Loop nest optimization +- Cache locality for 2D data access +- Arithmetic pipeline utilization + +Expected cycles: 10,000–50,000 (size-dependent) + +### nbody (Physics Simulation) + +N-body gravitational simulation. Tests: +- Floating-point compute intensity +- Irregular memory access patterns +- Compiler optimization of numerical kernels + +Expected cycles: 500,000–1,000,000 + +## Customization + +### Adding a Custom Benchmark + +1. Create a directory under `src/`: + ```bash + mkdir -p bb-tests/workloads/src/CTest/toy/embench/src/my-bench + ``` + +2. Implement the benchmark in C: + ```c + // my-bench/mybench.c + #include "../support/support.h" + + int main() { + int result = 0; + // Benchmark computation + return result; + } + ``` + +3. Update `CMakeLists.txt` to include the new benchmark in the build + +4. Rebuild and run via standard test infrastructure + +### Modifying Benchmark Parameters + +Some benchmarks support configurable parameters (matrix size, iteration count, etc.). 
Modify via:
+
+- Preprocessor defines in `CMakeLists.txt`
+- Environment variables in `crt0.S`
+- Direct source modification in benchmark `.c` files
+
+## Interpreting Results
+
+### Performance Regression Detection
+
+Track benchmark cycle counts across Buckyball releases:
+
+```bash
+# Run the sha256 benchmark and print only the numeric cycle count
+run_sha256_cycles() {
+  bbdev verilator --run \
+    '--binary embench/sha256-baremetal --config sims.verilator.BuckyballToyVerilatorConfig --batch' \
+    | awk '/^Cycles:/ {print $2}'
+}
+
+# Baseline: run from a checkout of the previous release
+baseline_cycles=$(run_sha256_cycles)
+
+# Current: run from a checkout of the current release
+current_cycles=$(run_sha256_cycles)
+
+# Percentage change in cycles (integer arithmetic, truncated)
+regression=$(( (current_cycles - baseline_cycles) * 100 / baseline_cycles ))
+echo "Performance change: ${regression}%"
+```
+
+An increase of more than 5% in cycle count indicates a potential architecture or compiler issue.
+
+### Workload Classification
+
+Use Embench to classify Buckyball's suitability for different application domains:
+
+- **Cryptography-heavy**: Run aha-mont64, nettle-aes, nettle-sha256; compare against target ASIC
+- **Data processing**: Run sglib-combined, huffbench; measure memory efficiency
+- **Numerical**: Run nbody, cubic; evaluate floating-point pipeline utilization
+
+## Known Issues
+
+### Benchmark Hangs on Large Problem Sizes
+
+Some benchmarks (e.g., nsichneu) can time out with large datasets in simulation. Reduce iteration counts or problem size:
+
+```bash
+# Edit benchmark source to reduce problem size
+sed -i 's/MAX_ITERATIONS 1000000/MAX_ITERATIONS 10000/' src/nsichneu/libnsichneu.c
+```
+
+### Memory Overflow in Embedded Context
+
+Embench benchmarks were designed for standard C environments. Some (e.g., picojpeg) require large buffers. Verify available memory:
+
+```bash
+# Check linker script DRAM size
+grep -A 2 "DRAM :" *.ld
+```
+
+If insufficient, enable out-of-core simulation or increase DDR size in simulation configuration.
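
The metrics listed under Extracting Results and the 5% threshold from Performance Regression Detection can also be computed outside the shell. The sketch below assumes the console output uses exactly the `Cycles: N` and `Instructions: N` line format shown earlier. The `BenchStats` type and the function names are illustrative and are not part of the Buckyball tooling.

```c
#include <stdio.h>
#include <string.h>

/* Counters taken from one benchmark run's console output. */
typedef struct {
    unsigned long cycles;
    unsigned long instructions;
} BenchStats;

/* Scan a log for "Cycles: N" and "Instructions: N" lines.
 * Returns 1 if both were found, 0 otherwise. */
int parse_stats(const char *log, BenchStats *out) {
    int have_cycles = 0, have_instructions = 0;
    const char *line = log;
    while (line != NULL && *line != '\0') {
        if (sscanf(line, "Cycles: %lu", &out->cycles) == 1) {
            have_cycles = 1;
        } else if (sscanf(line, "Instructions: %lu", &out->instructions) == 1) {
            have_instructions = 1;
        }
        line = strchr(line, '\n');  /* advance to the next line, if any */
        if (line != NULL) {
            line++;
        }
    }
    return have_cycles && have_instructions;
}

/* Instructions per cycle; 0.0 when no cycles were recorded. */
double ipc(const BenchStats *s) {
    return s->cycles ? (double)s->instructions / (double)s->cycles : 0.0;
}

/* Signed percentage change in cycle count relative to the baseline. */
double cycle_change_percent(unsigned long baseline, unsigned long current) {
    if (baseline == 0) {
        return 0.0;
    }
    return ((double)current - (double)baseline) * 100.0 / (double)baseline;
}
```

Doing the arithmetic in floating point avoids the truncation of shell integer division, which can hide small regressions near the threshold.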
+ +## References + +- **Embench Official**: https://www.embench.org/ +- **RISC-V Software Conventions**: https://github.com/riscv-non-isa/riscv-elf-psabi-doc +- **Buckyball Workload Integration**: `bb-tests/workloads/README.md` diff --git a/content/zh/Architecture/Cache Hierarchy and Private Data Cache.md b/content/zh/Architecture/Cache Hierarchy and Private Data Cache.md new file mode 100644 index 0000000..fb40410 --- /dev/null +++ b/content/zh/Architecture/Cache Hierarchy and Private Data Cache.md @@ -0,0 +1,200 @@ +# 缓存层级结构与私有数据缓存 + +## 概述 + +Buckyball 采用可配置的缓存层级结构来优化不同工作负载的内存访问模式。每个瓦片支持私有的每核指令缓存和数据缓存,以及可选的每瓦片包含式 L2 缓存。最近对私有数据缓存(dcache)的重设计简化了存在私有缓存时的一致性语义,将内存管理责任转移到软件。 + +## 缓存架构 + +### 核心级缓存 + +每个 Rocket 核拥有: + +- **指令缓存(I-Cache)**:私有、只读,通常 16–32 KB +- **数据缓存(D-Cache)**:私有、读写,通常 16–32 KB + +两个缓存都采用写透语义到 L1 缺失处理逻辑。 + +### 每瓦片包含式 L2 缓存 + +可选的每瓦片 L2 缓存位于核的 L1 缓存和系统互连之间。L2 是**包含式**的—它保存 L1 缓存行的超集。 + +**L2 缓存配置参数**: + +- **ways**:缓存关联度(通常 8–16) +- **sets**:每路的集合数(通常 256–512) +- **writeBytes**:写缓冲区深度 +- **portFactor**:内存端口配置因子 +- **memCycles**:性能建模估计内存延迟 + +**具有 L2 的拓扑结构**: + +``` +核心 L1 I-Cache → ──┐ +核心 L1 D-Cache → ──┤ 包含式 L2 → Cork 单元 → 系统内存 + ──┘ +``` + +Cork 单元(TLCacheCork)管理 L2 和其他系统代理之间的一致性。 + +### 系统级一致性 + +当瓦片拥有私有 L2 缓存时,系统禁用该瓦片内存域的分布式一致性。原因如下: + +1. 包含式 L2 充当所有核 L1 访问的**单一一致性点** +2. 外部代理(其他瓦片、DMA)无法直接观察 L1 状态 +3. 软件必须通过内存屏障和缓存刷新操作显式管理一致性 + +这种设计称为**软件管理一致性**,适合于: + +- 具有静态数据分割的工作负载(如 SPMD) +- 瓦片间通信不频繁的系统 +- 优先考虑能效而非自动一致性的配置 + +## 私有数据缓存重设计 + +### 动机 + +早期 Buckyball 版本即使在私有 L1 缓存下也能维持硬件管理一致性。这需要复杂的一致性协议和增加的逻辑开销。 + +新设计通过要求软件在配置私有缓存时处理一致性来简化瓦片。当配置瓦片缓存为私有时,硬件不再在一致子系统中维持最后一级缓存(LLC)。 + +### 配置模型 + +**私有缓存模式**: + +当瓦片配置为私有 dcache: + +- 该瓦片内存域的一致子系统中不存在 LLC +- 所有 L1 缺失通过瓦片的内存后端流动 +- 软件必须刷新缓存或使用内存屏障来确保一致性 + +**影响**: + +1. **数据一致性**:软件必须在与其他瓦片共享数据前显式同步缓存 +2. **内存屏障**:在共享内存访问周围添加屏障以强制排序 +3. 
**缓存刷新**:在发布数据前使用平台特定的缓存刷新指令 + +### 示例:具有私有缓存的多瓦片 SPMD + +```c +#include + +#define BARRIER_ADDR 0x60000000 // SCU 屏障地址(多心 SCU) +#define SHARED_DATA_ADDR 0x80100000 + +typedef struct { + int tile_id; + int result; +} SharedData; + +void scu_barrier() { + volatile int *barrier = (volatile int *)BARRIER_ADDR; + *barrier = 1; // 写屏障地址以阻挡所有核心 +} + +void cache_flush_dcache() { + // 平台特定:刷新整个 D-cache + // RISC-V 示例,带自定义 CSR: + asm volatile("fence" ::: "memory"); +} + +int main() { + int tile_id = bb_get_tile_id(); + SharedData *shared = (SharedData *)SHARED_DATA_ADDR; + + // 阶段 1:每瓦片计算 + int local_result = compute_tile_result(tile_id); + + // 阶段 2:发布前刷新缓存 + cache_flush_dcache(); + shared->tile_id = tile_id; + shared->result = local_result; + cache_flush_dcache(); // 确保写入可见 + + // 阶段 3:全局同步 + scu_barrier(); + + // 阶段 4:从其他瓦片读取已发布的数据 + // 数据现在全局可见 + int peer_result = shared->result; + + return 0; +} +``` + +## 配置示例 + +### 启用私有 L2 缓存 + +定义带 L2 的自定义配置: + +```scala +import org.chipsalliance.cde.config._ +import freechips.rocketchip.subsystem._ + +class GobanWithPrivateL2 extends Config( + new examples.goban.BuckyballGoban4T16CConfig ++ + new WithL2Cache( + ways = 8, + sets = 512, + writeBytes = 64, + portFactor = 1, + memCycles = 10 + ) +) +``` + +### 禁用 L2(默认) + +默认情况下,Goban 仅使用核 L1 缓存: + +```scala +class GobanL1Only extends Config( + new examples.goban.BuckyballGoban4T16CConfig +) +``` + +使用此配置,每个瓦片独立运行,采用软件管理的瓦片间一致性。 + +## 性能考虑 + +### 私有缓存的优势 + +- **降低一致性开销**:L1 驱逐时无广播总线 +- **可预测的内存时序**:私有 L2 消除其他瓦片的冲突缺失 +- **能效**:较低的一致性流量降低功耗 + +### 软件一致性成本 + +- **显式刷新**:缓存管理指令增加代码大小和延迟 +- **假共享**:软件必须仔细分割数据以避免缓存行冲突 +- **同步延迟**:屏障强制序列化点 + +## 故障排查 + +### 数据一致性问题 + +**症状**:由一个瓦片写入的值在同步后对其他瓦片不可见。 + +**检查**: + +1. 验证在发布数据前存在缓存刷新指令 +2. 在共享内存访问前后使用内存屏障(RISC-V 中的 `fence`) +3. 确认调用了同步点(如 `scu_barrier()`) + +### 性能下降 + +**症状**:禁用 L2 时执行时间相对硬件一致模式显著增加。 + +**检查**: + +1. 分析内存访问模式以识别频繁的缓存缺失 +2. 如果 L1 缺失率高(>10%),考虑启用 L2 +3. 
验证工作集是否符合组合 L1+L2 容量 + +## 参考资源 + +- **包含式缓存**:SiFive InclusiveCache 文档 +- **RISC-V 内存排序**:RISC-V ISA 规范,第 8 章(内存模型) +- **Cork 单元**:Rocket Chip 中的 TileLink 缓存管理 diff --git a/content/zh/Architecture/Goban Multi-Core Architecture.md b/content/zh/Architecture/Goban Multi-Core Architecture.md index 88d8ae1..c944d56 100644 --- a/content/zh/Architecture/Goban Multi-Core Architecture.md +++ b/content/zh/Architecture/Goban Multi-Core Architecture.md @@ -22,15 +22,26 @@ Goban 是 Buckyball 中的一个多核 BBTile 配置,支持 SPMD(单程序 ### 配置变体 -**BuckyballGobanConfig** -- 1 个 BBTile × 4 核 +Goban 支持多种配置大小: + +**1t4c** — 1 瓦片 × 4 核 - 4 个 Rocket 核 + 4 个 BuckyballAccelerator - 单个 SharedMem + BarrierUnit +- 最小内存占用,适合单瓦片测试 + +**4t16c** — 4 瓦片 × 4 核 = 16 个核心 +- 16 个 Rocket 核 + 16 个 BuckyballAccelerator +- 每瓦片内存域和同步 + +**8t8c** — 8 瓦片 × 8 核 = 64 个核心 +- 64 个 Rocket 核 + 64 个 BuckyballAccelerator +- 每瓦片同步,扩展内存系统 + +**遗留配置:** +- `BuckyballGobanConfig` — 1 BBTile × 4 核 +- `BuckyballGoban2TileConfig` — 2 BBTile × 4 核 = 8 个核心 -**BuckyballGoban2TileConfig** -- 2 个 BBTile × 4 核 = 8 个核心 -- 8 个 Rocket 核 + 8 个 BuckyballAccelerator -- 每瓦片的 SharedMem + BarrierUnit +所有变体在瓦片间维持相同的每核执行模型和屏障同步语义。 ## 核心组件 diff --git a/content/zh/ToolChain Guide/System Control Unit.md b/content/zh/ToolChain Guide/System Control Unit.md new file mode 100644 index 0000000..dfd165a --- /dev/null +++ b/content/zh/ToolChain Guide/System Control Unit.md @@ -0,0 +1,256 @@ +# 系统控制单元(SCU) + +## 概述 + +系统控制单元(SCU)是 Buckyball 中的全局多心设备,提供模拟和核间控制功能。与早期的单心设计不同,当前 SCU 服务系统中的所有核,并通过每核可寻址的内存映射 I/O 接口访问。 + +## 架构 + +### 全局多心设计 + +SCU 在系统级别实例化一次,为所有核提供统一接口。每个核都有一个专用的 SCU 地址空间子区域,计算如下: + +``` +hart_address = baseAddress + hartId * strideBytes +``` + +**参数**: + +- `baseAddress`:SCU 内存区域的基地址(通常 0x6000_0000) +- `strideBytes`:每核地址步长(必须是 2 的幂,如 0x40000) +- `totalSizeBytes`:总可寻址 SCU 区域(必须是 2 的幂,如 0x1000_0000) +- `maxHarts`:SCU 支持的最大核数(如 64) + +**验证**: + +``` +maxHarts * strideBytes <= totalSizeBytes +``` + +Hart ID ≥ `maxHarts` 的地址属于系统总线未映射地址处理程序。 + +### 地址空间布局 + 
+对于 `baseAddress=0x60000000`、`strideBytes=0x40000` 和 `maxHarts=64` 的系统:
+
+```
+Hart 0: 0x60000000 – 0x6003FFFF
+Hart 1: 0x60040000 – 0x6007FFFF
+Hart 2: 0x60080000 – 0x600BFFFF
+...
+Hart 63: 0x60FC0000 – 0x60FFFFFF
+```
+
+## 功能
+
+### UART 输出
+
+每个核可以通过其 SCU 区域向模拟 UART 写入字符:
+
+```c
+// SCU 根据被访问的每核地址区域推导 hart ID
+volatile uint8_t *scu_uart = (volatile uint8_t *)0x60000000; // Hart 0
+*scu_uart = 'A'; // 写入字符
+
+// 从 Hart 1:
+scu_uart = (volatile uint8_t *)0x60040000;
+*scu_uart = 'B';
+```
+
+DPI-C 桥(`SCUWriteDPI`)接收 hart ID 和字符,将输出路由到模拟控制台。
+
+### 模拟退出
+
+核可以通过写入退出代码来终止模拟:
+
+```c
+volatile uint32_t *scu_exit = (volatile uint32_t *)0x60000004; // Hart 0
+*scu_exit = 0; // 以代码 0 退出
+```
+
+SCU 捕获 hart ID 和退出代码,触发模拟终止。
+
+### 屏障同步(多 hart)
+
+SCU 提供每核屏障寄存器。当核写入其屏障地址时,它阻塞直到所有参与的核都已写入:
+
+```c
+// Hart 0
+volatile int *scu_barrier_h0 = (volatile int *)0x60000008;
+*scu_barrier_h0 = 1; // 阻塞直到所有核到达屏障
+
+// Hart 1(同一瓦片,不同 hart ID)
+volatile int *scu_barrier_h1 = (volatile int *)0x60040008;
+*scu_barrier_h1 = 1; // 阻塞直到所有核到达屏障
+```
+
+这不同于每瓦片 **BarrierUnit**(在 Goban 中使用),后者只同步单个瓦片内的核。
+
+## 与 P2E 线束的集成
+
+SCU 是 P2E 模拟配置中的标准组件:
+
+```scala
+class WithSCU(
+  baseAddress: BigInt = BigInt("60000000", 16),
+  strideBytes: BigInt = BigInt("40000", 16),
+  totalSizeBytes: BigInt = BigInt("10000000", 16),
+  maxHarts: Int = 64
+) extends Config(
+  new sims.scu.CanHavePeripherySCU ++
+  new chipyard.config.WithTLSimpleUART
+)
+```
+
+配置设置:
+
+1. SCU 参数(基地址、步长、总大小、最大核数)
+2. DigitalTop 替换以在控制总线(CBUS)上包括 SCU
+3. 
可选的 TileLink UART 用于字符输出 + +### 引出为 P2E + +SCU 接入系统的 TileLink 互连: + +```scala +val scu = LazyModule(new TLSCU(SCUParams(...), beatBytes)) +cbus.attach(scu.node) +``` + +系统中的所有核都可以通过正常 TileLink 读/写访问其各自的 SCU 区域。 + +## DPI-C 桥 + +`SCUWriteDPI` Verilog 模块充当所有核的单一黑盒桥: + +```verilog +module SCUWriteDPI( + input clock, + input reset, + input [31:0] uart_hart_id, + input uart_valid, + input [7:0] uart_data, + input [31:0] exit_hart_id, + input exit_valid, + input [31:0] exit_code +); + + import "DPI-C" context function void scu_uart_write( + input int unsigned hart_id, + input int unsigned ch + ); + + import "DPI-C" context function void scu_sim_exit( + input int unsigned hart_id, + input int unsigned code + ); +``` + +这个单一模块取代了每核 DPI 模块,减少了 Verilog 引出和 C 导入重复。 + +## 编程示例:多心测试 + +```c +#include +#include + +#define SCU_BASE 0x60000000 +#define SCU_STRIDE 0x40000 +#define SCU_UART_OFFSET 0x00 +#define SCU_EXIT_OFFSET 0x04 +#define SCU_BARRIER_OFFSET 0x08 + +int get_hart_id() { + int hart_id; + asm volatile("csrr %0, mhartid" : "=r"(hart_id)); + return hart_id; +} + +void scu_write_char(int hart_id, char c) { + volatile uint8_t *uart = (volatile uint8_t *) + (SCU_BASE + hart_id * SCU_STRIDE + SCU_UART_OFFSET); + *uart = c; +} + +void scu_exit(int hart_id, int code) { + volatile uint32_t *exit = (volatile uint32_t *) + (SCU_BASE + hart_id * SCU_STRIDE + SCU_EXIT_OFFSET); + *exit = code; +} + +void scu_barrier_wait(int hart_id) { + volatile int *barrier = (volatile int *) + (SCU_BASE + hart_id * SCU_STRIDE + SCU_BARRIER_OFFSET); + *barrier = 1; // 阻塞直到所有核写入 +} + +int main() { + int hart_id = get_hart_id(); + + scu_write_char(hart_id, 'H'); + scu_write_char(hart_id, 'i'); + + scu_barrier_wait(hart_id); + + if (hart_id == 0) { + scu_write_char(hart_id, '\n'); + } + + scu_exit(hart_id, 0); + return 0; +} +``` + +## P2E 中的配置 + +要在 P2E 配置中启用 SCU: + +```scala +class P2EWithSCU extends Config( + new sims.scu.WithSCU( + baseAddress = BigInt("60000000", 16), + strideBytes = 
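BigInt("40000", 16),
+    totalSizeBytes = BigInt("10000000", 16),
+    maxHarts = 64
+  ) ++
+  new sims.p2e.P2EBaseConfig
+)
+```
+
+## Performance Characteristics
+
+- **UART write**: about 1–2 cycles to propagate through the DPI-C callback
+- **Barrier**: about 10–20 cycles of overhead, depending on interconnect latency
+- **Exit**: immediate simulation termination
+
+## Troubleshooting
+
+### Core Cannot Find the SCU Address Space
+
+**Symptom**: A load or store to an SCU address raises an exception.
+
+**Checks**:
+
+1. Verify that the SCU is enabled in the system configuration
+2. Confirm that the hart ID is less than the `maxHarts` parameter
+3. Compute the expected address: `baseAddress + hartId * strideBytes`
+
+### Barrier Hangs Indefinitely
+
+**Symptom**: The simulation makes no progress; multiple cores are blocked at the barrier.
+
+**Checks**:
+
+1. Verify that every participating core actually reaches the barrier write
+2. Check that the hart ID is computed correctly and is less than `maxHarts`
+3. Inspect the VCD waveform to see which cores have written their barrier registers
+
+### Characters Do Not Appear on the Console
+
+**Symptom**: UART writes to the SCU print nothing.
+
+**Checks**:
+
+1. Verify that the UART is routed to the simulation console (check the simulation log)
+2. Confirm that the write targets the correct offset (`SCU_UART_OFFSET = 0x00`)
+3. Check that the DPI-C bridge is connected correctly in the top-level Verilog
diff --git a/content/zh/Tutorial/Embench Benchmark Suite.md b/content/zh/Tutorial/Embench Benchmark Suite.md
new file mode 100644
index 0000000..8ebe0d4
--- /dev/null
+++ b/content/zh/Tutorial/Embench Benchmark Suite.md
@@ -0,0 +1,279 @@
+# Embench Benchmark Suite
+
+## Overview
+
+Embench is a benchmark suite for embedded systems, integrated into Buckyball for workload and performance analysis. It provides standardized performance tests across algorithms spanning cryptography, compression, mathematical computation, and signal processing.
+
+## Benchmark Categories
+
+### Cryptography and Hashing
+
+- **aha-mont64**: Montgomery multiplication for elliptic-curve cryptography
+- **nettle-aes**: Advanced Encryption Standard implementation
+- **nettle-sha256**: SHA-256 cryptographic hash function
+- **md5sum**: MD5 hash computation
+
+### Compression and Encoding
+
+- **huffbench**: Huffman compression algorithm
+- **slre**: Regular expression engine
+- **picojpeg**: JPEG image decoder
+
+### Math and Signal Processing
+
+- **cubic**: Cubic equation solver with floating-point arithmetic
+- **nbody**: N-body physics simulation
+- **matmult-int**: Integer matrix multiplication
+- **edn**: DSP kernels (vector multiply, FIR and IIR filters)
+
+### Data Structures and Algorithms
+
+- **sglib-combined**: Generic sorting and data structure library
+- **statemate**: Finite state machine simulation
+- **tarfind**: Archive search algorithm
+- **qrduino**: QR code generation
+- **crc32**: Cyclic redundancy check
+- **primecount**: Prime number counting
+- **ud**: LU decomposition of a matrix
+- **minver**: Floating-point matrix inversion
+- **wikisort**: Stable sorting algorithm
+- **nsichneu**: Petri net simulation
+
+## Directory Structure
+
+```
+bb-tests/workloads/src/CTest/toy/embench/
+├── README.md           # Embench documentation
+├── CMakeLists.txt      # Build configuration
+├── crt0.S              # Startup code
+├── src/                # Individual benchmark implementations
+│   ├── aha-mont64/
+│   ├── nettle-aes/
+│   ├── nettle-sha256/
+│   ├── md5sum/
+│   └── (other benchmarks)
+└── support/            # Common utilities
+    ├── main.c          # Unified entry point
+    ├── boardsupport.c  # Platform initialization
+    ├── chipsupport.c   # 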
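Chip-specific functions
+    ├── beebsc.c        # Benchmark control interface
+    └── support.h       # Common definitions
+```
+
+## Building Embench
+
+### Prerequisites
+
+Make sure the Buckyball toolchain is set up correctly:
+
+```bash
+source sourceme.sh
+```
+
+### Build All Benchmarks
+
+```bash
+cd bb-tests/workloads
+cmake -B build -DCTEST_TARGET=toy -DCTEST_NAME=embench
+cmake --build build
+```
+
+This produces one binary per benchmark:
+
+```
+build/toy/embench/mont64
+build/toy/embench/aes
+build/toy/embench/sha256
+build/toy/embench/matrix-mult
+...
+```
+
+### Build a Single Benchmark
+
+```bash
+cd bb-tests/workloads
+cmake -B build -DCTEST_TARGET=toy -DCTEST_NAME=embench -DBENCH_FILTER=sha256
+cmake --build build
+```
+
+## Running Benchmarks
+
+### Verilator Simulation
+
+```bash
+bbdev verilator --run \
+  '--binary embench/mont64-baremetal \
+  --config sims.verilator.BuckyballToyVerilatorConfig \
+  --batch'
+```
+
+### P2E Simulation
+
+```bash
+bbdev p2e --run \
+  '--binary embench/aes-baremetal \
+  --config sims.p2e.P2EToyConfig \
+  --batch'
+```
+
+## Performance Metrics
+
+Each benchmark measures:
+
+- **Cycle count**: Total cycles to completion
+- **Instructions executed**: Dynamic instruction count
+- **Memory traffic**: Load/store operations
+- **Completion time**: Wall-clock time in simulation
+
+### Extracting Results
+
+The simulation output includes benchmark statistics:
+
+```
+Benchmark: sha256
+Cycles: 142857
+Instructions: 98765
+Memory ops: 12345
+```
+
+Parse these metrics to evaluate:
+
+1. **Instruction efficiency**: Instructions per cycle (IPC); for the sample output above, 98765 / 142857 ≈ 0.69
+2. **Memory efficiency**: Cache hit rate and bandwidth utilization
+3. **Compute density**: Operations per watt in post-silicon analysis
+
+## Benchmark Details
+
+### aha-mont64 (Cryptography)
+
+Montgomery multiplication for elliptic-curve operations. Exercises:
+- Modular arithmetic performance
+- Register pressure under heavy computation
+- Numerical stability with large integers
+
+Expected cycles on Buckyball Toy: 50,000–100,000
+
+### nettle-aes (Encryption)
+
+AES block cipher implementation. Exercises:
+- Lookup-table efficiency (S-box access patterns)
+- Tight-loop performance
+- Data-dependent cache behavior
+
+Expected cycles: 200,000–300,000
+
+### nettle-sha256 (Hashing)
+
+SHA-256 cryptographic hash. Exercises:
+- Bitwise operation efficiency
+- Memory access patterns during state updates
+- Branch prediction on loop-heavy code
+
+Expected cycles: 80,000–150,000
+
+### matmult-int (Matrix Multiplication)
+
+Integer matrix multiplication with a configurable size. Exercises:
+- Loop-nest optimization
+- Cache locality for 2D data access
+- Arithmetic pipeline utilization
+
+Expected cycles: 10,000–50,000 (size-dependent)
+
+### nbody (Physics Simulation)
+
+N-body gravitational simulation. Exercises:
+- Floating-point compute density
+- Irregular memory access patterns
+- Compiler optimization of numerical kernels
+
+Expected cycles: 500,000–1,000,000
+
+## Customization
+
+### Adding a Custom Benchmark
+
+1. Create a directory under `src/`:
+   ```bash
+   mkdir -p bb-tests/workloads/src/CTest/toy/embench/src/my-bench
+   ```
+
+2. 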
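Implement the benchmark in C:
+   ```c
+   // my-bench/mybench.c
+   #include "../support/support.h"
+
+   int main() {
+     int result = 0;
+     // Benchmark computation
+     return result;
+   }
+   ```
+
+3. Update `CMakeLists.txt` to include the new benchmark in the build
+
+4. Rebuild and run through the standard test infrastructure
+
+### Modifying Benchmark Parameters
+
+Some benchmarks support configurable parameters (matrix size, iteration count, and so on). Change them through:
+
+- Preprocessor definitions in `CMakeLists.txt`
+- Environment variables in `crt0.S`
+- Direct source edits in the benchmark's `.c` file
+
+## Interpreting Results
+
+### Performance Regression Detection
+
+Track benchmark cycle counts across Buckyball releases:
+
+```bash
+# Baseline (previous release; run against that release's build)
+baseline_cycles=$(bbdev verilator --run --binary embench/sha256-baremetal | awk '/^Cycles:/ {print $2}')
+
+# Current release
+current_cycles=$(bbdev verilator --run --binary embench/sha256-baremetal | awk '/^Cycles:/ {print $2}')
+
+# Compute the change as an integer percentage of the baseline
+regression=$(( (current_cycles - baseline_cycles) * 100 / baseline_cycles ))
+echo "Performance change: ${regression}%"
+```
+
+A regression greater than 5% indicates a potential architecture or compiler problem.
+
+### Workload Classification
+
+Use Embench to classify Buckyball's suitability for different application domains:
+
+- **Cryptography-heavy**: Run aha-mont64, nettle-aes, nettle-sha256; compare against the target ASIC
+- **Data processing**: Run sglib-combined, huffbench; measure memory efficiency
+- **Numerical**: Run nbody, cubic; evaluate floating-point pipeline utilization
+
+## Known Issues
+
+### Benchmarks Hang at Large Problem Sizes
+
+Some benchmarks (such as nsichneu) may time out in simulation when the problem data set is large. Reduce the iteration count or the problem size:
+
+```bash
+# Edit the benchmark source to reduce the problem size
+sed -i 's/MAX_ITERATIONS 1000000/MAX_ITERATIONS 10000/' src/nsichneu/libnsichneu.c
+```
+
+### Out-of-Memory in Embedded Contexts
+
+Embench benchmarks are designed for a standard C environment. Some (such as picojpeg) need large buffers. Verify the available memory:
+
+```bash
+# Check the DRAM size in the linker script
+grep -A 2 "DRAM :" *.ld
+```
+
+If it is insufficient, enable out-of-core simulation or increase the DDR size in the simulation configuration.
+
+## References
+
+- **Embench official site**: https://www.embench.org/
+- **RISC-V software conventions**: https://github.com/riscv-non-isa/riscv-elf-psabi-doc
+- **Buckyball workload integration**: `bb-tests/workloads/README.md`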