DangoSys · cursor · May 18, 2026
diff --git a/content/en/Architecture/Cache Hierarchy and Private Data Cache.md b/content/en/Architecture/Cache Hierarchy and Private Data Cache.md
@@ -0,0 +1,200 @@
+# Cache Hierarchy and Private Data Cache
+
+## Overview
+
+Buckyball employs a configurable cache hierarchy to optimize memory access patterns for diverse workloads. Each tile supports private per-core instruction and data caches, with an optional per-tile inclusive L2 cache. The recent redesign of the private data cache (dcache) simplifies coherency semantics when private caches are present by shifting responsibility for cross-tile coherency to software.
+
+## Cache Architecture
+
+### Core-Level Caches
+
+Each Rocket core in a tile has:
+
+- **Instruction Cache (I-Cache)**: Private, read-only, typically 16–32 KB
+- **Data Cache (D-Cache)**: Private, read-write, typically 16–32 KB
+
+The D-cache uses write-through semantics toward the L1 miss handling logic. The I-cache is read-only, so it only issues refill requests.
+
+### Per-Tile Inclusive L2 Cache
+
+An optional per-tile L2 cache sits between the cores' L1 caches and the system interconnect. The L2 is **inclusive**: every line held in an L1 cache is also present in the L2.
+
+**L2 Configuration Parameters**:
+
+- **ways**: Cache associativity (typically 8–16)
+- **sets**: Number of sets in the cache, each holding `ways` lines (typically 256–512)
+- **writeBytes**: Write buffer depth
+- **portFactor**: Memory port provisioning factor
+- **memCycles**: Estimated memory latency for performance modeling
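+
+The `ways` and `sets` parameters together determine the L2 capacity. As a rough sketch, assuming a 64-byte cache line (a value this page does not state), capacity is `ways × sets × line size`. The helper name `l2_capacity_bytes` is illustrative only and not part of Buckyball:
+
+```c
+#include <stddef.h>
+
+// Assumption: 64-byte cache lines; adjust to the actual line size.
+#define CACHE_LINE_BYTES 64u
+
+// Capacity in bytes = ways * sets * line size.
+static size_t l2_capacity_bytes(size_t ways, size_t sets) {
+  return ways * sets * CACHE_LINE_BYTES;
+}
+```
+
+Under that assumption, `l2_capacity_bytes(8, 512)` evaluates to 262144 bytes (256 KiB).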
+
+**Topology with L2**:
+
+```
+Core L1 I-Cache ──┐
+                  ├──→ Inclusive L2 → Cork Unit → System Memory
+Core L1 D-Cache ──┘
+```
+
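+The cork unit (`TLCacheCork`) terminates the TileLink coherence protocol below the L2: it converts the L2's acquire and release traffic into plain reads and writes, so memory and other downstream agents do not need to track cache state.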
+
+### System-Level Coherency
+
+When a tile has a private L2 cache, the system disables distributed coherency for that tile's memory domain. This is because:
+
+1. The inclusive L2 acts as a **single point of coherency** for all L1 accesses from cores in that tile
+2. External agents (other tiles, DMA) cannot directly observe L1 state
+3. Software must manage coherency explicitly through memory barriers and cache flush operations
+
+This design is known as **software-managed coherency** and is appropriate for:
+
+- Workloads with static data partitioning (e.g., SPMD)
+- Systems where inter-tile communication is infrequent
+- Configurations prioritizing energy efficiency over automatic coherency
+
+## Private Data Cache Redesign
+
+### Motivation
+
+Prior Buckyball versions could maintain hardware-managed coherency even with private L1 caches. This required complex coherency protocols and increased logic overhead.
+
+The new design simplifies the tile by requiring software to handle coherency when private caches are configured. Hardware no longer maintains a last-level cache (LLC) in the coherent subsystem when tile caches are private.
+
+### Configuration Model
+
+**Private Cache Mode**:
+
+When a tile is configured with private dcache:
+
+- No LLC is present in the coherent subsystem for that tile's memory domain
+- All L1 misses flow through the tile's memory backend
+- Software must flush caches or use memory barriers to ensure coherency
+
+**Implications**:
+
+1. **Data Consistency**: Software must explicitly synchronize caches before sharing data with other tiles
+2. **Memory Barriers**: Add barriers around shared-memory access to enforce ordering
+3. **Cache Flush**: Use platform-specific cache flush instructions before publishing data
+
+### Example: Multi-Tile SPMD with Private Caches
+
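+```c
+#include <stdint.h>
+
+#define BARRIER_ADDR     0x60000000u  // SCU barrier address (multi-hart SCU)
+#define SHARED_DATA_ADDR 0x80100000u  // Base of the per-tile result array
+#define NUM_TILES        4            // Assumption: matches the 4-tile Goban config
+
+// Provided by the runtime and the application; declared here so the
+// example compiles as a unit.
+int bb_get_tile_id(void);
+int compute_tile_result(int tile_id);
+
+// Assumption: 64-byte cache lines. Aligning each slot to its own line keeps
+// tiles from writing back over each other's data.
+typedef struct {
+  int tile_id;
+  int result;
+} __attribute__((aligned(64))) SharedData;
+
+static void scu_barrier(void) {
+  volatile int *barrier = (volatile int *)BARRIER_ADDR;
+  *barrier = 1;  // Write to the barrier address to block at the SCU barrier
+}
+
+static void cache_flush_dcache(void) {
+  // Platform-specific: write back and invalidate the D-cache here.
+  // NOTE: `fence` only orders memory accesses; it does not write back or
+  // invalidate cache lines. It stands in for the platform's flush sequence.
+  asm volatile("fence" ::: "memory");
+}
+
+int main(void) {
+  int tile_id = bb_get_tile_id();
+  // One slot per tile, so tiles do not overwrite each other's results
+  volatile SharedData *shared = (volatile SharedData *)SHARED_DATA_ADDR;
+
+  // Phase 1: Per-tile computation
+  int local_result = compute_tile_result(tile_id);
+
+  // Phase 2: Publish into this tile's slot, then flush so the write is visible
+  shared[tile_id].tile_id = tile_id;
+  shared[tile_id].result = local_result;
+  cache_flush_dcache();
+
+  // Phase 3: Global synchronization
+  scu_barrier();
+
+  // Phase 4: Drop stale cached copies, then read a peer tile's published data
+  cache_flush_dcache();
+  int peer = (tile_id + 1) % NUM_TILES;
+  int peer_result = shared[peer].result;
+  (void)peer_result;  // A real program would consume the value here
+
+  return 0;
+}
+```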
+
+## Configuration Examples
+
+### Enabling Private L2 Cache
+
+Define a custom config with L2:
+
+```scala
+import org.chipsalliance.cde.config._
+import freechips.rocketchip.subsystem._
+
+class GobanWithPrivateL2 extends Config(
+  // Fragments listed first take precedence, so the L2 fragment precedes the base config
+  new WithL2Cache(
+    ways = 8,
+    sets = 512,
+    writeBytes = 64,
+    portFactor = 1,
+    memCycles = 10
+  ) ++
+    new examples.goban.BuckyballGoban4T16CConfig
+)
+```
+
+### Disabling L2 (Default)
+
+By default, Goban uses only core L1 caches:
+
+```scala
+class GobanL1Only extends Config(
+  new examples.goban.BuckyballGoban4T16CConfig
+)
+```
+
+With this configuration, each tile operates independently with software-managed inter-tile coherency.
+
+## Performance Considerations
+
+### Private Cache Benefits
+
+- **Reduced coherency overhead**: No coherence broadcasts to other tiles when L1 lines are evicted
+- **Predictable memory timing**: A private L2 is never evicted by other tiles' accesses, so cross-tile conflict misses disappear
+- **Energy efficiency**: Lower coherency traffic reduces power consumption
+
+### Software Coherency Costs
+
+- **Explicit flushes**: Cache management instructions increase code size and latency
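+- **False sharing**: Software must keep data written by different tiles on separate cache lines; without hardware coherency, two tiles that write back the same line can overwrite each other's updates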
+- **Synchronization latency**: Barriers impose serialization points
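+
+One way to avoid the false-sharing hazard above is to give each tile's data its own cache line. The sketch below assumes 64-byte lines, and `TileSlot` is a hypothetical type introduced here for illustration:
+
+```c
+#include <stdalign.h>
+#include <stddef.h>
+
+#define CACHE_LINE_BYTES 64  // Assumption: 64-byte cache lines
+
+// Hypothetical per-tile slot. Aligning the first member to the line size
+// raises the struct's alignment and pads its size to one full line.
+typedef struct {
+  alignas(CACHE_LINE_BYTES) int tile_id;
+  int result;
+} TileSlot;
+
+// Compile-time checks that each slot occupies exactly one line.
+_Static_assert(sizeof(TileSlot) == CACHE_LINE_BYTES, "one slot per cache line");
+_Static_assert(alignof(TileSlot) == CACHE_LINE_BYTES, "slot starts on a line boundary");
+```
+
+An array of `TileSlot` indexed by tile ID then places consecutive tiles on consecutive cache lines, at the cost of padding each slot to a full line.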
+
+## Troubleshooting
+
+### Data Coherency Issues
+
+**Symptom**: Values written by one tile are not visible to other tiles after synchronization.
+
+**Check**:
+
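+1. Verify that the writing tile flushes its D-cache after the final store to shared data and before the synchronization point
+2. Use memory barriers (`fence` in RISC-V) before and after shared-memory accesses; note that `fence` orders accesses but does not by itself write back cache lines
+3. Confirm that the synchronization point (e.g., `scu_barrier()`) is reached after the flush
+4. Verify that the reading tile invalidates or flushes its own cached copy of the shared lines before reading them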
+
+### Performance Degradation
+
+**Symptom**: Execution time increases significantly after moving from hardware-coherent mode to private caches with L2 disabled.
+
+**Check**:
+
+1. Profile memory access patterns to identify frequent cache misses
+2. Consider enabling L2 if L1 miss rate is high (>10%)
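+3. Verify that the working set fits within the available cache capacity; with an inclusive L2, L1 contents are duplicated in the L2, so the L1 and L2 capacities do not add up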
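+
+The checks above can be expressed as small helpers. This is a minimal sketch: the function names are hypothetical, the miss and access counts are assumed to come from whatever profiling source is available, and the 10% threshold is the rule of thumb from the list above:
+
+```c
+#include <stdbool.h>
+#include <stdint.h>
+
+// L1 miss rate in percent; returns 0 when no accesses were recorded.
+static double l1_miss_rate_percent(uint64_t misses, uint64_t accesses) {
+  if (accesses == 0) return 0.0;
+  return 100.0 * (double)misses / (double)accesses;
+}
+
+// True when the miss rate exceeds the 10% rule of thumb.
+static bool should_consider_l2(uint64_t misses, uint64_t accesses) {
+  return l1_miss_rate_percent(misses, accesses) > 10.0;
+}
+
+// True when the working set fits in the given cache capacity (in bytes).
+static bool working_set_fits(uint64_t working_set_bytes, uint64_t cache_bytes) {
+  return working_set_bytes <= cache_bytes;
+}
+```
+
+For example, 150 misses out of 1000 accesses is a 15% miss rate, so `should_consider_l2(150, 1000)` returns `true`.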
+
+## References
+
+- **Inclusive Cache**: SiFive InclusiveCache documentation
+- **RISC-V Memory Ordering**: RISC-V Unprivileged ISA specification, RVWMO Memory Consistency Model chapter
+- **Cork Unit**: TileLink cache management in Rocket Chip
diff --git a/content/en/Architecture/Goban Multi-Core Architecture.md b/content/en/Architecture/Goban Multi-Core Architecture.md
@@ -22,15 +22,26 @@ Goban is a multi-core BBTile configuration in Buckyball that enables parallel ex
 
 ### Configuration Variants
 
-**BuckyballGobanConfig**
-- 1 BBTile × 4 cores
+Goban supports multiple configuration sizes:
+
+**1t4c** — 1 tile × 4 cores
 - 4 Rocket cores + 4 BuckyballAccelerators
 - Single SharedMem + BarrierUnit
+- Minimal memory footprint, suitable for single-tile testing
+
+**4t16c** — 4 tiles × 4 cores = 16 total cores
+- 16 Rocket cores + 16 BuckyballAccelerators
+- Per-tile memory domains and synchronization
+
+**8t8c** — 8 tiles × 8 cores = 64 total cores
+- 64 Rocket cores + 64 BuckyballAccelerators
+- Per-tile synchronization, scaled memory system
+
+**Legacy configurations:**
+- `BuckyballGobanConfig` — 1 BBTile × 4 cores
+- `BuckyballGoban2TileConfig` — 2 BBTiles × 4 cores = 8 total cores
 
-**BuckyballGoban2TileConfig**
-- 2 BBTiles × 4 cores = 8 total cores
-- 8 Rocket cores + 8 BuckyballAccelerators
-- Per-tile SharedMem + BarrierUnit
+All variants maintain the same per-core execution model and barrier synchronization semantics across tiles.
 
 ## Core Components