Core Functionality:
- Convolution (1x1 to 7x7 kernels)
- Pooling (max/avg)
- Activation (ReLU, Sigmoid, Softmax)
- DMA streaming from/to DDR
Platform Constraints:
- XC7Z020: 220 DSPs, 140 BRAM36K, ~53K LUTs
- 2-month timeline with 4-5 students
- Student project (simplicity matters)
Evaluation Priorities (from requirements):
- Architecture (45%): Parameterization, scalability, pipelining, throughput/resource
- RTL quality (30%): Code quality, timing
- Verification (10%)
- SoC demo (15%)
Over-engineered for a student project:
- 700+ line top_controller.v with complex stall logic
- Scoreboarding, hazard detection
- 13 instruction types
- Multiple FSMs interacting
Row-stationary dataflow complexity:
- 55-cycle psum drain latency
- Complex PE interconnect
- Valid signal propagation through 14 rows
ISA complexity:
- 13 instructions, some dual-word
- Hardware loop counters
- Scoreboarding
Instead of a 2D PE array with complex dataflow, use a 1D processing pipeline:
Input Stream → Line Buffer → Sliding Window → MAC Array → Accumulator → Activation → Output Stream
Advantages:
- No complex 2D interconnect
- Natural streaming (matches DMA model)
- Easier to understand and verify
- No psum chain latency issues
For convolution:
- Stream input pixels through line buffers (BRAM)
- Sliding window extracts KxK patch
- MAC array computes dot product in parallel
- Accumulate across input channels
- Apply activation
- Stream out
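The streaming path above is easy to check against software before any RTL exists. A minimal Python golden model of the line-buffer/window-generator stage (a hypothetical helper for the testbench; single channel, stride 1, 'valid' padding assumed):

```python
from collections import deque

def sliding_windows(pixels, width, k):
    """Golden model of the line-buffer + window-generator stage:
    stream pixels in row-major order and emit each KxK window as it
    becomes valid. rows[0] is the oldest buffered row."""
    rows = [deque() for _ in range(k)]
    windows = []
    for i, p in enumerate(pixels):
        # new pixel enters the bottom row buffer; each full buffer
        # evicts its oldest pixel into the buffer above (FIFO chain)
        v = p
        for r in range(k - 1, -1, -1):
            rows[r].append(v)
            if len(rows[r]) <= width:
                break
            v = rows[r].popleft()
        row, col = divmod(i, width)
        if row >= k - 1 and col >= k - 1:
            # column x of buffered row r sits at deque index len-1-(col-x)
            windows.append([[rows[r][len(rows[r]) - 1 - (col - x)]
                             for x in range(col - k + 1, col + 1)]
                            for r in range(k)])
    return windows
```

Feeding the same pixel stream to the RTL and comparing its window outputs against this model is a natural first testbench.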
Keep the 2D array but simplify:
Output-stationary dataflow instead of row-stationary:
- Weights stream through
- Inputs broadcast
- Outputs accumulate locally
- Simpler control, no psum chain
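The output-stationary idea can be sketched in a few lines. An illustrative Python model (1-D case, not the actual RTL) of a PE row where each PE owns one output position, inputs are broadcast, and each PE selects its own weight tap:

```python
def output_stationary_conv1d(x, w, num_pe):
    """Illustrative output-stationary dataflow for a 1-D convolution:
    each PE owns one output position and accumulates locally while
    inputs are broadcast to the whole row."""
    k = len(w)
    n_out = len(x) - k + 1
    out = []
    for base in range(0, n_out, num_pe):          # one tile of outputs
        tile = min(num_pe, n_out - base)
        acc = [0] * tile                          # per-PE local accumulators
        for i in range(base, base + tile + k - 1):
            xi = x[i]                             # input broadcast to every PE
            for p in range(tile):                 # PE p owns output base+p
                tap = i - (base + p)
                if 0 <= tap < k:                  # PE selects its weight tap
                    acc[p] += w[tap] * xi
        out.extend(acc)                           # drain finished outputs
    return out
```

Note there is no partial-sum chain: each accumulator is read out only after its output is complete, which is exactly why the control stays simple.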
Simpler control:
- Remove scoreboarding (use explicit sync instructions)
- Remove hardware loops (software handles iteration)
- Single-word instructions only
Single FSM controller:
- IDLE → LOAD_WEIGHTS → LOAD_INPUTS → COMPUTE → STORE_OUTPUTS → IDLE
- No pipeline hazards to handle
Focus on what actually needs to be demonstrated:
- One working convolution layer
- One working pooling layer
- Activation functions
- DMA transfers
Design:
Single convolution unit (not array):
- 49 parallel MACs (max 7x7 kernel)
- Input buffer with sliding window logic
- Weight BRAM
- Output accumulator
Simple state machine:
- ARM programs kernel size, input dims, output dims
- ARM triggers start
- Hardware processes entire tile
- Hardware signals done
- ARM reads results
Fewer instructions:
- START_CONV (with parameters in registers)
- START_POOL
- STATUS_CHECK
- That's it.
For a 2-month student project, I'd recommend:
DDR → DMA → Line Buffers → Window Gen → Systolic MAC Chain → Accumulator → Activation → DMA → DDR
                                                ↑
                                           Weight BRAM

Key Components:
Line Buffer (BRAM-based):
- Stores K rows for KxK convolution
- Natural sliding window extraction
- Double-buffered for continuous streaming
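The BRAM cost of this choice is small. A back-of-envelope estimate (assuming one BRAM18 per row FIFO since each needs its own port; the actual packing is up to synthesis):

```python
# Back-of-envelope line-buffer BRAM estimate for the XC7Z020.
WIDTH, MAX_K, DATA_W = 224, 7, 8        # pixels/row, max kernel, bits/pixel
bits_per_row = WIDTH * DATA_W           # 1792 bits, fits easily in a BRAM18
row_buffers  = 2 * MAX_K                # K rows, double-buffered
bram36_equiv = row_buffers / 2          # two BRAM18 fit one BRAM36 site
utilization  = bram36_equiv / 140       # of the XC7Z020's 140 BRAM36
assert bits_per_row <= 18 * 1024        # one row fits a single BRAM18
print(row_buffers, bram36_equiv, f"{utilization:.1%}")
```

Roughly 7 BRAM36 equivalents, about 5% of the device, leaving plenty for weight and output buffering.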
1D Systolic MAC Array (1×N, not N×N):
- N = 14 to 49 (depending on resources)
- Each PE: weight_reg × input → accumulator
- Simpler than 2D: only horizontal data flow
- Weights shift horizontally, inputs broadcast
Simple Control:
- AXI-Lite registers for configuration
- Single FSM: IDLE → CONV → POOL → DONE
- No instruction memory (configuration registers only)
Pooling Unit:
- Separate 2×2 comparator tree
- Shares line buffer with convolution
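The pooling unit's golden model is nearly a one-liner in software; a hypothetical helper like this can serve as the testbench reference:

```python
def maxpool2x2(img):
    """Golden model for the 2x2 max-pooling unit (stride 2).
    `img` is a list of rows; odd trailing rows/columns are dropped."""
    h, w = len(img) // 2 * 2, len(img[0]) // 2 * 2
    return [[max(img[r][c], img[r][c + 1],
                 img[r + 1][c], img[r + 1][c + 1])
             for c in range(0, w, 2)]
            for r in range(0, h, 2)]
```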
Instead of an instruction memory and decoder, use AXI-Lite configuration registers:
| Register | Name | Description |
|---|---|---|
| 0x00 | CTRL | [0]=start, [1]=pool_mode, [2]=relu_en |
| 0x04 | STATUS | [0]=busy, [1]=done |
| 0x08 | INPUT_ADDR | DDR address for input |
| 0x0C | OUTPUT_ADDR | DDR address for output |
| 0x10 | WEIGHT_ADDR | DDR address for weights |
| 0x14 | INPUT_SIZE | {height[15:8], width[7:0]} |
| 0x18 | KERNEL_SIZE | {kh[3:0], kw[3:0]} |
| 0x1C | CHANNELS | {out_ch[15:8], in_ch[7:0]} |
Why this is better:
- ARM software handles layer iteration (it's doing this anyway)
- No instruction decoder complexity
- No instruction memory BRAM cost
- Simpler verification (just test each config)
- Easier to demonstrate (change register, see result)
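With this register map, the ARM-side driver reduces to a handful of register writes and a status poll. A Python sketch of the sequence (`write_reg`/`read_reg` are placeholders for whatever MMIO mechanism the demo environment provides):

```python
# Register offsets from the AXI-Lite map above.
CTRL, STATUS = 0x00, 0x04
INPUT_ADDR, OUTPUT_ADDR, WEIGHT_ADDR = 0x08, 0x0C, 0x10
INPUT_SIZE, KERNEL_SIZE, CHANNELS = 0x14, 0x18, 0x1C

def run_conv_layer(write_reg, read_reg, in_addr, out_addr, w_addr,
                   h, w, kh, kw, in_ch, out_ch, relu=True):
    """Program one convolution layer, start it, and wait for done."""
    write_reg(INPUT_ADDR, in_addr)
    write_reg(OUTPUT_ADDR, out_addr)
    write_reg(WEIGHT_ADDR, w_addr)
    write_reg(INPUT_SIZE, (h & 0xFF) << 8 | (w & 0xFF))    # {height, width}
    write_reg(KERNEL_SIZE, (kh & 0xF) << 4 | (kw & 0xF))   # {kh, kw}
    write_reg(CHANNELS, (out_ch & 0xFF) << 8 | (in_ch & 0xFF))
    write_reg(CTRL, (int(relu) << 2) | 1)   # relu_en + start (pool_mode=0)
    while read_reg(STATUS) & 0x1:           # poll busy
        pass
    assert read_reg(STATUS) & 0x2           # done bit
```

ARM software calls this once per layer, changing addresses and dimensions each time; that loop over layers is the "instruction sequencing" the hardware no longer needs.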
For each output pixel (r, c):
    For each output channel (oc):
        acc = 0
        For each input channel (ic):
            For each kernel position (kr, kc):
                acc += weight[oc][ic][kr][kc] * input[ic][r+kr][c+kc]
        output[oc][r][c] = relu(acc)
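This loop nest transcribes directly into a software reference model for verification; a Python version (stride 1, 'valid' padding, square kernels assumed):

```python
def conv_golden(inp, weight, relu=True):
    """Direct transcription of the loop nest above, for use as the
    testbench reference. inp[ic][r][c], weight[oc][ic][kr][kc]."""
    in_ch = len(inp)
    out_ch, k = len(weight), len(weight[0][0])
    h, w = len(inp[0]), len(inp[0][0])
    out = [[[0] * (w - k + 1) for _ in range(h - k + 1)]
           for _ in range(out_ch)]
    for r in range(h - k + 1):
        for c in range(w - k + 1):
            for oc in range(out_ch):
                acc = 0
                for ic in range(in_ch):
                    for kr in range(k):
                        for kc in range(k):
                            acc += weight[oc][ic][kr][kc] * inp[ic][r + kr][c + kc]
                out[oc][r][c] = max(acc, 0) if relu else acc
    return out
```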
Hardware mapping:
- Outer loops (r, c, oc) controlled by FSM counter
- Inner loops (ic, kr, kc) unrolled in parallel MACs
- Line buffer provides sliding window access
- Weight BRAM pre-loaded before layer starts
Simpler to Understand:
- Linear data flow (input → process → output)
- Single FSM, not multiple interacting FSMs
- No complex dataflow (row/weight/output stationary)
Easier to Verify:
- Fewer states
- No hazard detection needed
- Test one layer at a time
Better Resource Utilization:
- Line buffers reuse BRAM efficiently
- No wasted PE cycles waiting for psum chain
Matches DMA Model:
- Streaming naturally aligns with DMA
- No ping-pong tile-buffer management in software (the only double buffering is inside the line buffers)
Achievable in 2 Months:
- Each component is independently testable
- Incremental development possible
- Demo-ready faster
Cons:
- Lower peak throughput than 2D array
- Less "impressive" on paper
- Won't process as many MACs per cycle
Pros:
- Actually works and can be demonstrated
- Easier to debug when things go wrong
- Better timing (shorter critical paths)
- More time for optimization if schedule allows
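To quantify the throughput trade-off: assuming a 100 MHz fabric clock and a 14×14 PE array in the original design (both are assumptions), the peak gap is about 4×, before accounting for the 2D array's drain and stall cycles:

```python
# Back-of-envelope peak-throughput comparison. Both the 100 MHz fabric
# clock and the 14x14 size of the original PE array are assumptions;
# sustained throughput also depends on DMA bandwidth and stalls.
F_CLK = 100e6
macs_1d = 49                  # fully unrolled 7x7 MAC tree
macs_2d = 14 * 14             # assumed original 2D PE array
gmacs_1d = macs_1d * F_CLK / 1e9
gmacs_2d = macs_2d * F_CLK / 1e9
print(gmacs_1d, gmacs_2d, macs_2d / macs_1d)
```

Whether the 2D array actually sustains its peak depends on the psum drain latency noted earlier; the 1D pipeline's simpler schedule makes its utilization easier to keep near 100%.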
Let me sketch out what each component would look like...
module line_buffer #(
    parameter WIDTH  = 224,
    parameter DEPTH  = 7,    // max kernel height
    parameter DATA_W = 8
) (
    input                            clk, rst,
    input  [DATA_W-1:0]              pixel_in,
    input                            pixel_valid,
    output [DATA_W*DEPTH*DEPTH-1:0]  window_out,  // KxK window
    output                           window_valid
);
    // K row FIFOs implemented in BRAM
    // Sliding window extraction combinational logic
endmodule

module mac_unit (
    input clk, rst,
    input signed [7:0] input_val,
    input signed [7:0] weight_val,
    input clear,
    output reg signed [31:0] accum
);
    always @(posedge clk) begin
        if (rst || clear)
            accum <= 0;
        else
            accum <= accum + input_val * weight_val;
    end
endmodule

module conv_engine #(
    parameter MAX_K    = 7,
    parameter NUM_MACS = 49   // MAX_K * MAX_K
) (
    input clk, rst,
    // Configuration
    input  [2:0] kernel_h, kernel_w,
    input        start,
    output       done,
    // Window input (from line buffer)
    input  [8*MAX_K*MAX_K-1:0] window,
    input        window_valid,
    // Weights (pre-loaded)
    input  [8*NUM_MACS-1:0] weights,
    // Output
    output signed [31:0] result,
    output       result_valid
);
    // Instantiate NUM_MACS mac_units
    // Parallel multiply-add tree
    // Single-cycle dot product (combinational reduction tree)
endmodule

module conv_ctrl (
    input clk, rst,
    // AXI-Lite config (simplified)
    input  [31:0] cfg_input_addr,
    input  [31:0] cfg_output_addr,
    input  [15:0] cfg_input_h, cfg_input_w,
    input  [3:0]  cfg_kernel_h, cfg_kernel_w,
    input         start,
    output reg    done, busy,
    // DMA interface
    output reg        dma_rd_start,
    output reg [31:0] dma_rd_addr,
    input      [63:0] dma_rd_data,
    input             dma_rd_valid,
    output reg        dma_wr_start,
    output reg [31:0] dma_wr_addr,
    output reg [63:0] dma_wr_data,
    input             dma_wr_ready
);
    localparam IDLE = 0, LOAD_WEIGHTS = 1, PROCESS = 2, DONE = 3;
    reg [1:0] state;

    // weights_loaded / output_complete come from the datapath counters
    always @(posedge clk) begin
        if (rst)
            state <= IDLE;
        else case (state)
            IDLE:         if (start)           state <= LOAD_WEIGHTS;
            LOAD_WEIGHTS: if (weights_loaded)  state <= PROCESS;
            PROCESS:      if (output_complete) state <= DONE;
            DONE:         state <= IDLE;
        endcase
    end
endmodule

This is MUCH simpler than the current 700-line top_controller.v with scoreboarding and multiple FSMs.
The current design is impressive from an academic standpoint but over-engineered for the requirements. A simpler design would:
- Use 1D systolic array or parallel MAC tree (not 2D PE grid)
- Use register-based configuration (not instruction memory)
- Use line buffers for sliding window (natural for convolution)
- Use single simple FSM (not multiple interacting FSMs)
- Focus on demonstrable functionality over peak throughput
This would score well on:
- Architecture (45%): Still demonstrates parallelism, pipelining, scalability
- RTL quality (30%): Cleaner, simpler code that's easier to time
- Verification (10%): Much easier to test
- Demo (15%): More likely to actually work