200 changes: 200 additions & 0 deletions content/en/Architecture/Cache Hierarchy and Private Data Cache.md
@@ -0,0 +1,200 @@
# Cache Hierarchy and Private Data Cache

## Overview

Buckyball employs a configurable cache hierarchy to optimize memory access patterns for diverse workloads. Each tile supports private per-core instruction and data caches, with optional per-tile inclusive L2 caches. The recent redesign of the private data cache (dcache) simplifies coherency semantics when private caches are present, shifting memory management responsibility to software.

## Cache Architecture

### Core-Level Caches

Each Rocket core in a tile has:

- **Instruction Cache (I-Cache)**: Private, read-only, typically 16–32 KB
- **Data Cache (D-Cache)**: Private, read-write, typically 16–32 KB

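Misses from both caches go to the L1 miss-handling logic. Only the D-cache issues writes, since the I-cache is read-only. Because stores can remain in the D-cache until they are flushed, software must write dirty lines back before sharing data.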

### Per-Tile Inclusive L2 Cache

An optional per-tile L2 cache sits between the core's L1 caches and the system interconnect. The L2 is **inclusive** — it holds a superset of L1 cache lines.

**L2 Configuration Parameters**:

- **ways**: Cache associativity (typically 8–16)
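- **sets**: Number of sets (typically 256–512); each set holds `ways` lines
- **writeBytes**: Write granularity of the backing store, in bytes
- **portFactor**: Memory port provisioning factor (in SiFive's InclusiveCache, it determines the number of sub-banks together with `writeBytes`)
- **memCycles**: Estimated memory round-trip latency, in L2 clock cycles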
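
As a rough sizing aid, the data capacity of a set-associative cache is ways × sets × block size. The sketch below computes it for the parameter ranges listed above. The 64-byte block size and the function name `l2_capacity_bytes` are assumptions for illustration, not values taken from the Buckyball sources.

```c
#include <assert.h>
#include <stddef.h>

/* Data capacity of a set-associative cache, in bytes.
 * block_bytes is an assumption here (64 is a common line size). */
static size_t l2_capacity_bytes(size_t ways, size_t sets, size_t block_bytes) {
    assert(ways > 0 && sets > 0 && block_bytes > 0);
    return ways * sets * block_bytes;
}
```

For example, `l2_capacity_bytes(8, 512, 64)` gives 262144 bytes (256 KiB), and `l2_capacity_bytes(16, 256, 64)` gives the same capacity with higher associativity.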

**Topology with L2**:

```
Core L1 I-Cache ──┐
                  ├──→ Inclusive L2 → Cork Unit → System Memory
Core L1 D-Cache ──┘
```

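The cork unit (`TLCacheCork`) terminates the coherent TileLink domain below the L2: it converts cache acquire and release traffic into plain reads and writes toward system memory and does not probe other system agents.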

### System-Level Coherency

When a tile has a private L2 cache, the system disables distributed coherency for that tile's memory domain. This is because:

1. The inclusive L2 acts as a **single point of coherency** for all L1 accesses from cores in that tile
2. External agents (other tiles, DMA) cannot directly observe L1 state
3. Software must manage coherency explicitly through memory barriers and cache flush operations

This design is known as **software-managed coherency** and is appropriate for:

- Workloads with static data partitioning (e.g., SPMD)
- Systems where inter-tile communication is infrequent
- Configurations prioritizing energy efficiency over automatic coherency
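
To see why explicit flushes are needed under software-managed coherency, consider the host-side toy model below. It models one word of memory and a one-entry write-back cache per tile. Every name in it (`ToyCache`, `toy_store`, `toy_load`, `toy_flush`, `toy_reader_sees`) is hypothetical and exists only for illustration; it is not Buckyball code and does not model TileLink.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: one word of backing memory and a one-entry write-back cache
 * per tile. All names are hypothetical. */
typedef struct {
    bool valid;  /* cache holds a copy of the word */
    bool dirty;  /* copy is newer than memory */
    int  data;
} ToyCache;

static int toy_memory = 0;

/* A store updates only the local cache; memory is left unchanged. */
static void toy_store(ToyCache *c, int value) {
    c->valid = true;
    c->dirty = true;
    c->data  = value;
}

/* A load returns the cached copy if present, otherwise refills from memory. */
static int toy_load(ToyCache *c) {
    if (!c->valid) {
        c->data  = toy_memory;
        c->valid = true;
        c->dirty = false;
    }
    return c->data;
}

/* Flush = write back if dirty, then invalidate the local copy. */
static void toy_flush(ToyCache *c) {
    if (c->valid && c->dirty) {
        toy_memory = c->data;
    }
    c->valid = false;
    c->dirty = false;
}

/* Scenario: the reader caches the old value (0), the writer stores 42, then
 * each side optionally flushes before the reader loads again. */
static int toy_reader_sees(bool writer_flushes, bool reader_flushes) {
    ToyCache writer = {0}, reader = {0};
    toy_memory = 0;
    int first = toy_load(&reader);
    assert(first == 0);               /* reader starts with the old value */
    toy_store(&writer, 42);
    if (writer_flushes) toy_flush(&writer);
    if (reader_flushes) toy_flush(&reader);
    return toy_load(&reader);
}
```

In this model the reader observes 42 only when both sides act: the writer flushes to push its dirty data to memory, and the reader flushes to discard its stale copy. If either step is skipped, the reader keeps seeing 0.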

## Private Data Cache Redesign

### Motivation

Prior Buckyball versions could maintain hardware-managed coherency even with private L1 caches. This required complex coherency protocols and increased logic overhead.

The new design simplifies the tile by requiring software to handle coherency when private caches are configured. Hardware no longer maintains a last-level cache (LLC) in the coherent subsystem when tile caches are private.

### Configuration Model

**Private Cache Mode**:

When a tile is configured with private dcache:

- No LLC is present in the coherent subsystem for that tile's memory domain
- All L1 misses flow through the tile's memory backend
- Software must flush caches or use memory barriers to ensure coherency

**Implications**:

1. **Data Consistency**: Software must explicitly synchronize caches before sharing data with other tiles
2. **Memory Barriers**: Add barriers around shared-memory access to enforce ordering
3. **Cache Flush**: Use platform-specific cache flush instructions before publishing data
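
Flushing the whole D-cache is often more than is needed. One common approach is to walk only the cache lines that cover the buffer being published. The sketch below shows just that address arithmetic. The 64-byte line size, the `flush_line_fn` callback type, `flush_range`, and the `record_flush` stand-in are all assumptions for illustration; the callback takes the place of whatever per-line writeback instruction the platform provides, which this document does not specify.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u  /* Assumption: 64-byte L1 lines */

/* Hypothetical hook: writes back (and invalidates) the line at line_addr. */
typedef void (*flush_line_fn)(uintptr_t line_addr);

/* Calls flush_line once for every cache line overlapping [addr, addr + len).
 * Returns the number of lines visited. */
static size_t flush_range(uintptr_t addr, size_t len, flush_line_fn flush_line) {
    assert(flush_line != NULL);
    if (len == 0) {
        return 0;
    }
    uintptr_t mask  = ~(uintptr_t)(CACHE_LINE_BYTES - 1);
    uintptr_t first = addr & mask;
    uintptr_t last  = (addr + len - 1) & mask;
    size_t count = 0;
    for (uintptr_t line = first; ; line += CACHE_LINE_BYTES) {
        flush_line(line);
        count++;
        if (line == last) {
            break;  /* stop on equality so the loop cannot wrap past `last` */
        }
    }
    return count;
}

/* Host-side stand-in for the hook: remembers the last line it was given. */
static uintptr_t last_flushed_line;
static void record_flush(uintptr_t line_addr) {
    last_flushed_line = line_addr;
}
```

An 8-byte object that starts 0x3C bytes into a line straddles a line boundary, so `flush_range` visits two lines for it, while the same object at a line-aligned address needs only one.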

### Example: Multi-Tile SPMD with Private Caches

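```c
#include <stdint.h>

#define BARRIER_ADDR     0x60000000UL // SCU barrier address (multi-hart SCU)
#define SHARED_DATA_ADDR 0x80100000UL
#define NUM_TILES        4            // Assumption: 4-tile configuration
#define CACHE_LINE_BYTES 64           // Assumption: 64-byte cache lines

// One slot per tile, padded to a full line so tiles never share a line.
typedef struct {
    int tile_id;
    int result;
} __attribute__((aligned(CACHE_LINE_BYTES))) SharedData;

// Provided by the platform and application; not defined in this document.
extern int bb_get_tile_id(void);
extern int compute_tile_result(int tile_id);

void scu_barrier(void) {
    volatile int *barrier = (volatile int *)BARRIER_ADDR;
    *barrier = 1; // Write barrier address to block all harts
}

void cache_flush_dcache(void) {
    // Platform-specific: insert the D-cache writeback/invalidate sequence here.
    // The fence below only orders memory accesses; on its own it does not
    // write dirty lines back to memory.
    asm volatile("fence" ::: "memory");
}

int main(void) {
    int tile_id = bb_get_tile_id();
    volatile SharedData *shared = (volatile SharedData *)SHARED_DATA_ADDR;

    // Phase 1: Per-tile computation
    int local_result = compute_tile_result(tile_id);

    // Phase 2: Write this tile's slot, then flush so the stores reach memory
    shared[tile_id].tile_id = tile_id;
    shared[tile_id].result = local_result;
    cache_flush_dcache();

    // Phase 3: Global synchronization
    scu_barrier();

    // Phase 4: Drop stale cached copies, then read a peer tile's slot
    cache_flush_dcache();
    int peer_result = shared[(tile_id + 1) % NUM_TILES].result;
    (void)peer_result;

    return 0;
}
```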

## Configuration Examples

### Enabling Private L2 Cache

Define a custom config with L2:

```scala
import org.chipsalliance.cde.config._
import freechips.rocketchip.subsystem._

// Fragments on the left of `++` take precedence, so the L2 fragment comes first.
class GobanWithPrivateL2 extends Config(
  new WithL2Cache(
    ways = 8,
    sets = 512,
    writeBytes = 64,
    portFactor = 1,
    memCycles = 10
  ) ++
  new examples.goban.BuckyballGoban4T16CConfig
)
```

### Disabling L2 (Default)

By default, Goban uses only core L1 caches:

```scala
class GobanL1Only extends Config(
  new examples.goban.BuckyballGoban4T16CConfig
)
```

With this configuration, each tile operates independently with software-managed inter-tile coherency.

## Performance Considerations

### Private Cache Benefits

- **Reduced coherency overhead**: No broadcast bus for L1 evictions
- **Predictable memory timing**: Private L2 eliminates conflict misses from other tiles
- **Energy efficiency**: Lower coherency traffic reduces power consumption

### Software Coherency Costs

- **Explicit flushes**: Cache management instructions increase code size and latency
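- **False sharing**: Software must align and pad per-tile data to cache-line boundaries; if two tiles write different words of the same line, one tile's writeback can overwrite the other's update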
- **Synchronization latency**: Barriers impose serialization points
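
One way to partition data so that tiles never share a line is to pad each tile's slot to a full cache line. The following is a minimal sketch, assuming 64-byte lines and a C11 compiler; the `TileSlot` type, the `NUM_TILES` value, and `slot_line_index` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64  /* Assumption: 64-byte cache lines */
#define NUM_TILES        4   /* Assumption: e.g. a 4-tile configuration */

/* One slot per tile. Aligning the first member to the line size raises the
 * struct's alignment to 64, so sizeof is padded up to a full line. */
typedef struct {
    alignas(CACHE_LINE_BYTES) int32_t tile_id;
    int32_t result;
} TileSlot;

static_assert(sizeof(TileSlot) == CACHE_LINE_BYTES,
              "each slot must occupy exactly one cache line");

static TileSlot tile_slots[NUM_TILES];

/* Index of the cache line (relative to the array start) that holds slot i. */
static size_t slot_line_index(size_t i) {
    assert(i < NUM_TILES);
    uintptr_t base = (uintptr_t)&tile_slots[0];
    uintptr_t addr = (uintptr_t)&tile_slots[i];
    return (size_t)((addr - base) / CACHE_LINE_BYTES);
}
```

With this layout every slot maps to its own line, so a writeback from one tile touches only that tile's data. The cost is memory: each 8-byte payload occupies 64 bytes.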

## Troubleshooting

### Data Coherency Issues

**Symptom**: Values written by one tile are not visible to other tiles after synchronization.

**Check**:

1. Verify cache flush instructions are present before publishing data
2. Use memory barriers (`fence` in RISC-V) before and after shared-memory access
3. Confirm synchronization point (e.g., `scu_barrier()`) is called after flush

### Performance Degradation

**Symptom**: Execution time increases significantly with L2 disabled compared to hardware-coherent mode.

**Check**:

1. Profile memory access patterns to identify frequent cache misses
2. Consider enabling L2 if L1 miss rate is high (>10%)
3. Verify the working set fits within the L2 capacity; because the L2 is inclusive, L1 contents are duplicated in it and add no extra capacity
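
The miss-rate check above can be sketched as two small helpers. The counts are assumed to come from whatever performance counters or simulator statistics are available; the function names and the idea of passing raw counts are illustrative assumptions. The 10% threshold is the rule of thumb given in the list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* L1 miss rate as a fraction in [0, 1]; returns 0 when there were no accesses. */
static double l1_miss_rate(uint64_t misses, uint64_t accesses) {
    assert(misses <= accesses);
    if (accesses == 0) {
        return 0.0;
    }
    return (double)misses / (double)accesses;
}

/* True when misses / accesses is strictly greater than 10%.
 * For integers, misses > accesses / 10 is equivalent to 10 * misses > accesses
 * and cannot overflow. */
static bool l2_worth_considering(uint64_t misses, uint64_t accesses) {
    assert(misses <= accesses);
    return misses > accesses / 10;
}
```

For instance, 11 misses in 100 accesses crosses the threshold, while exactly 10 in 100 does not.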

## References

- **Inclusive Cache**: SiFive InclusiveCache documentation
- **RISC-V Memory Ordering**: RISC-V ISA specification, RVWMO memory consistency model chapter
- **Cork Unit**: TileLink cache management in Rocket Chip
23 changes: 17 additions & 6 deletions content/en/Architecture/Goban Multi-Core Architecture.md
@@ -22,15 +22,26 @@ Goban is a multi-core BBTile configuration in Buckyball that enables parallel ex

### Configuration Variants

**BuckyballGobanConfig**
- 1 BBTile × 4 cores
Goban supports multiple configuration sizes:

**1t4c** — 1 tile × 4 cores
- 4 Rocket cores + 4 BuckyballAccelerators
- Single SharedMem + BarrierUnit
- Minimal memory footprint, suitable for single-tile testing

**4t16c** — 4 tiles × 4 cores = 16 total cores
- 16 Rocket cores + 16 BuckyballAccelerators
- Per-tile memory domains and synchronization

**8t8c** — 8 tiles × 8 cores = 64 total cores
- 64 Rocket cores + 64 BuckyballAccelerators
- Per-tile synchronization, scaled memory system

**Legacy configurations:**
- `BuckyballGobanConfig` — 1 BBTile × 4 cores
- `BuckyballGoban2TileConfig` — 2 BBTiles × 4 cores = 8 total cores

**BuckyballGoban2TileConfig**
- 2 BBTiles × 4 cores = 8 total cores
- 8 Rocket cores + 8 BuckyballAccelerators
- Per-tile SharedMem + BarrierUnit
All variants maintain the same per-core execution model and barrier synchronization semantics across tiles.

## Core Components
