Core Functionality:
- Convolution (1x1 to 7x7 kernels)
- Pooling (max/avg)
- Activation (ReLU, Sigmoid, Softmax)
- DMA streaming from/to DDR
Platform Constraints:
- XC7Z020: 220 DSPs, 140 BRAM36K, ~53K LUTs
- 2-month timeline with 4-5 students
- Student project (simplicity matters)
Evaluation Priorities (from requirements):
- Architecture (45%): Parameterization, scalability, pipelining, throughput/resource
- RTL quality (30%): Code quality, timing
- Verification (10%)
- SoC demo (15%)
Over-engineered for a student project:
- 700+ line top_controller.v with complex stall logic
- Scoreboarding, hazard detection
- 13 instruction types
- Multiple FSMs interacting
Row-stationary dataflow complexity:
- 55-cycle psum drain latency
- Complex PE interconnect
- Valid signal propagation through 14 rows
ISA complexity:
- 13 instructions, some dual-word
- Hardware loop counters
- Scoreboarding
Instead of a 2D PE array with complex dataflow, use a 1D processing pipeline:
Input Stream → Line Buffer → Sliding Window → MAC Array → Accumulator → Activation → Output Stream
Advantages:
- No complex 2D interconnect
- Natural streaming (matches DMA model)
- Easier to understand and verify
- No psum chain latency issues
For convolution:
- Stream input pixels through line buffers (BRAM)
- Sliding window extracts KxK patch
- MAC array computes dot product in parallel
- Accumulate across input channels
- Apply activation
- Stream out
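The streaming path above is easy to check against software before any RTL exists. A minimal Python golden model of the line-buffer/window-generator stage (a hypothetical helper for the testbench; single channel, stride 1, 'valid' padding assumed):

```python
from collections import deque

def sliding_windows(pixels, width, k):
    """Golden model of the line-buffer + window-generator stage:
    stream pixels in row-major order and emit each KxK window as it
    becomes valid. rows[0] is the oldest buffered row."""
    rows = [deque() for _ in range(k)]
    windows = []
    for i, p in enumerate(pixels):
        # new pixel enters the bottom row buffer; each full buffer
        # evicts its oldest pixel into the buffer above (FIFO chain)
        v = p
        for r in range(k - 1, -1, -1):
            rows[r].append(v)
            if len(rows[r]) <= width:
                break
            v = rows[r].popleft()
        row, col = divmod(i, width)
        if row >= k - 1 and col >= k - 1:
            # column x of buffered row r sits at deque index len-1-(col-x)
            windows.append([[rows[r][len(rows[r]) - 1 - (col - x)]
                             for x in range(col - k + 1, col + 1)]
                            for r in range(k)])
    return windows
```

Feeding the same pixel stream to the RTL and comparing its window outputs against this model is a natural first testbench.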
Keep the 2D array but simplify:
Output-stationary dataflow instead of row-stationary:
- Weights stream through
- Inputs broadcast
- Outputs accumulate locally
- Simpler control, no psum chain
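The output-stationary idea can be sketched in a few lines. An illustrative Python model (1-D case, not the actual RTL) of a PE row where each PE owns one output position, inputs are broadcast, and each PE selects its own weight tap:

```python
def output_stationary_conv1d(x, w, num_pe):
    """Illustrative output-stationary dataflow for a 1-D convolution:
    each PE owns one output position and accumulates locally while
    inputs are broadcast to the whole row."""
    k = len(w)
    n_out = len(x) - k + 1
    out = []
    for base in range(0, n_out, num_pe):          # one tile of outputs
        tile = min(num_pe, n_out - base)
        acc = [0] * tile                          # per-PE local accumulators
        for i in range(base, base + tile + k - 1):
            xi = x[i]                             # input broadcast to every PE
            for p in range(tile):                 # PE p owns output base+p
                tap = i - (base + p)
                if 0 <= tap < k:                  # PE selects its weight tap
                    acc[p] += w[tap] * xi
        out.extend(acc)                           # drain finished outputs
    return out
```

Note there is no partial-sum chain: each accumulator is read out only after its output is complete, which is exactly why the control stays simple.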
Simpler control:
- Remove scoreboarding (use explicit sync instructions)
- Remove hardware loops (software handles iteration)
- Single-word instructions only
Single FSM controller:
- IDLE → LOAD_WEIGHTS → LOAD_INPUTS → COMPUTE → STORE_OUTPUTS → IDLE
- No pipeline hazards to handle
Focus on what actually needs to be demonstrated:
- One working convolution layer
- One working pooling layer
- Activation functions
- DMA transfers
Design:
Single convolution unit (not array):
- 49 parallel MACs (max 7x7 kernel)
- Input buffer with sliding window logic
- Weight BRAM
- Output accumulator
Simple state machine:
- ARM programs kernel size, input dims, output dims
- ARM triggers start
- Hardware processes entire tile
- Hardware signals done
- ARM reads results
Fewer instructions:
- START_CONV (with parameters in registers)
- START_POOL
- STATUS_CHECK
- That's it.
For a 2-month student project, I'd recommend:
DDR → DMA → Line Buffers → Window Gen → Systolic MAC Chain → Accumulator → Activation → DMA → DDR
                                                ↑
                                           Weight BRAM

Key Components:
Line Buffer (BRAM-based):
- Stores K rows for KxK convolution
- Natural sliding window extraction
- Double-buffered for continuous streaming
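The BRAM cost of this choice is small. A back-of-envelope estimate (assuming one BRAM18 per row FIFO since each needs its own port; the actual packing is up to synthesis):

```python
# Back-of-envelope line-buffer BRAM estimate for the XC7Z020.
WIDTH, MAX_K, DATA_W = 224, 7, 8        # pixels/row, max kernel, bits/pixel
bits_per_row = WIDTH * DATA_W           # 1792 bits, fits easily in a BRAM18
row_buffers  = 2 * MAX_K                # K rows, double-buffered
bram36_equiv = row_buffers / 2          # two BRAM18 fit one BRAM36 site
utilization  = bram36_equiv / 140       # of the XC7Z020's 140 BRAM36
assert bits_per_row <= 18 * 1024        # one row fits a single BRAM18
print(row_buffers, bram36_equiv, f"{utilization:.1%}")
```

Roughly 7 BRAM36 equivalents, about 5% of the device, leaving plenty for weight and output buffering.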
1D Systolic MAC Array (1×N, not N×N):
- N = 14 to 49 (depending on resources)
- Each PE: weight_reg × input → accumulator
- Simpler than 2D: only horizontal data flow
- Weights shift horizontally, inputs broadcast
Simple Control:
- AXI-Lite registers for configuration
- Single FSM: IDLE → CONV → POOL → DONE
- No instruction memory (configuration registers only)
Pooling Unit:
- Separate 2×2 comparator tree
- Shares line buffer with convolution
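The pooling unit's golden model is nearly a one-liner in software; a hypothetical helper like this can serve as the testbench reference:

```python
def maxpool2x2(img):
    """Golden model for the 2x2 max-pooling unit (stride 2).
    `img` is a list of rows; odd trailing rows/columns are dropped."""
    h, w = len(img) // 2 * 2, len(img[0]) // 2 * 2
    return [[max(img[r][c], img[r][c + 1],
                 img[r + 1][c], img[r + 1][c + 1])
             for c in range(0, w, 2)]
            for r in range(0, h, 2)]
```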
Instead of an instruction memory and decoder, use AXI-Lite configuration registers:
| Register | Name | Description |
|---|---|---|
| 0x00 | CTRL | [0]=start, [1]=pool_mode, [2]=relu_en |
| 0x04 | STATUS | [0]=busy, [1]=done |
| 0x08 | INPUT_ADDR | DDR address for input |
| 0x0C | OUTPUT_ADDR | DDR address for output |
| 0x10 | WEIGHT_ADDR | DDR address for weights |
| 0x14 | INPUT_SIZE | {height[15:8], width[7:0]} |
| 0x18 | KERNEL_SIZE | {kh[3:0], kw[3:0]} |
| 0x1C | CHANNELS | {out_ch[15:8], in_ch[7:0]} |
Why this is better:
- ARM software handles layer iteration (it's doing this anyway)
- No instruction decoder complexity
- No instruction memory BRAM cost
- Simpler verification (just test each config)
- Easier to demonstrate (change register, see result)
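With this register map, the ARM-side driver reduces to a handful of register writes and a status poll. A Python sketch of the sequence (`write_reg`/`read_reg` are placeholders for whatever MMIO mechanism the demo environment provides):

```python
# Register offsets from the AXI-Lite map above.
CTRL, STATUS = 0x00, 0x04
INPUT_ADDR, OUTPUT_ADDR, WEIGHT_ADDR = 0x08, 0x0C, 0x10
INPUT_SIZE, KERNEL_SIZE, CHANNELS = 0x14, 0x18, 0x1C

def run_conv_layer(write_reg, read_reg, in_addr, out_addr, w_addr,
                   h, w, kh, kw, in_ch, out_ch, relu=True):
    """Program one convolution layer, start it, and wait for done."""
    write_reg(INPUT_ADDR, in_addr)
    write_reg(OUTPUT_ADDR, out_addr)
    write_reg(WEIGHT_ADDR, w_addr)
    write_reg(INPUT_SIZE, (h & 0xFF) << 8 | (w & 0xFF))    # {height, width}
    write_reg(KERNEL_SIZE, (kh & 0xF) << 4 | (kw & 0xF))   # {kh, kw}
    write_reg(CHANNELS, (out_ch & 0xFF) << 8 | (in_ch & 0xFF))
    write_reg(CTRL, (int(relu) << 2) | 1)   # relu_en + start (pool_mode=0)
    while read_reg(STATUS) & 0x1:           # poll busy
        pass
    assert read_reg(STATUS) & 0x2           # done bit
```

ARM software calls this once per layer, changing addresses and dimensions each time; that loop over layers is the "instruction sequencing" the hardware no longer needs.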
For each output pixel (r, c):
    For each output channel (oc):
        acc = 0
        For each input channel (ic):
            For each kernel position (kr, kc):
                acc += weight[oc][ic][kr][kc] * input[ic][r+kr][c+kc]
        output[oc][r][c] = relu(acc)
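This loop nest transcribes directly into a software reference model for verification; a Python version (stride 1, 'valid' padding, square kernels assumed):

```python
def conv_golden(inp, weight, relu=True):
    """Direct transcription of the loop nest above, for use as the
    testbench reference. inp[ic][r][c], weight[oc][ic][kr][kc]."""
    in_ch = len(inp)
    out_ch, k = len(weight), len(weight[0][0])
    h, w = len(inp[0]), len(inp[0][0])
    out = [[[0] * (w - k + 1) for _ in range(h - k + 1)]
           for _ in range(out_ch)]
    for r in range(h - k + 1):
        for c in range(w - k + 1):
            for oc in range(out_ch):
                acc = 0
                for ic in range(in_ch):
                    for kr in range(k):
                        for kc in range(k):
                            acc += weight[oc][ic][kr][kc] * inp[ic][r + kr][c + kc]
                out[oc][r][c] = max(acc, 0) if relu else acc
    return out
```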
Hardware mapping:
- Outer loops (r, c, oc) controlled by FSM counter
- Inner loops (ic, kr, kc) unrolled in parallel MACs
- Line buffer provides sliding window access
- Weight BRAM pre-loaded before layer starts
Simpler to Understand:
- Linear data flow (input → process → output)
- Single FSM, not multiple interacting FSMs
- No complex dataflow (row/weight/output stationary)
Easier to Verify:
- Fewer states
- No hazard detection needed
- Test one layer at a time
Better Resource Utilization:
- Line buffers reuse BRAM efficiently
- No wasted PE cycles waiting for psum chain
Matches DMA Model:
- Streaming naturally aligns with DMA
- No ping-pong tile-buffer management in software (the only double buffering is inside the line buffers)
Achievable in 2 Months:
- Each component is independently testable
- Incremental development possible
- Demo-ready faster
Cons:
- Lower peak throughput than 2D array
- Less "impressive" on paper
- Won't process as many MACs per cycle
Pros:
- Actually works and can be demonstrated
- Easier to debug when things go wrong
- Better timing (shorter critical paths)
- More time for optimization if schedule allows
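To quantify the throughput trade-off: assuming a 100 MHz fabric clock and a 14×14 PE array in the original design (both are assumptions), the peak gap is about 4×, before accounting for the 2D array's drain and stall cycles:

```python
# Back-of-envelope peak-throughput comparison. Both the 100 MHz fabric
# clock and the 14x14 size of the original PE array are assumptions;
# sustained throughput also depends on DMA bandwidth and stalls.
F_CLK = 100e6
macs_1d = 49                  # fully unrolled 7x7 MAC tree
macs_2d = 14 * 14             # assumed original 2D PE array
gmacs_1d = macs_1d * F_CLK / 1e9
gmacs_2d = macs_2d * F_CLK / 1e9
print(gmacs_1d, gmacs_2d, macs_2d / macs_1d)
```

Whether the 2D array actually sustains its peak depends on the psum drain latency noted earlier; the 1D pipeline's simpler schedule makes its utilization easier to keep near 100%.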
Let me sketch out what each component would look like...
module line_buffer #(
    parameter WIDTH  = 224,
    parameter DEPTH  = 7,    // max kernel height
    parameter DATA_W = 8
) (
    input                            clk, rst,
    input  [DATA_W-1:0]              pixel_in,
    input                            pixel_valid,
    output [DATA_W*DEPTH*DEPTH-1:0]  window_out,  // KxK window
    output                           window_valid
);
    // K row FIFOs implemented in BRAM
    // Sliding window extraction combinational logic
endmodule

module mac_unit (
    input clk, rst,
    input signed [7:0] input_val,
    input signed [7:0] weight_val,
    input clear,
    output reg signed [31:0] accum
);
    always @(posedge clk) begin
        if (rst || clear)
            accum <= 0;
        else
            accum <= accum + input_val * weight_val;
    end
endmodule

module conv_engine #(
    parameter MAX_K    = 7,
    parameter NUM_MACS = 49   // MAX_K * MAX_K
) (
    input clk, rst,
    // Configuration
    input  [2:0] kernel_h, kernel_w,
    input        start,
    output       done,
    // Window input (from line buffer)
    input  [8*MAX_K*MAX_K-1:0] window,
    input        window_valid,
    // Weights (pre-loaded)
    input  [8*NUM_MACS-1:0] weights,
    // Output
    output signed [31:0] result,
    output       result_valid
);
    // Instantiate NUM_MACS mac_units
    // Parallel multiply-add tree
    // Single-cycle dot product (combinational reduction tree)
endmodule

module conv_ctrl (
    input clk, rst,
    // AXI-Lite config (simplified)
    input  [31:0] cfg_input_addr,
    input  [31:0] cfg_output_addr,
    input  [15:0] cfg_input_h, cfg_input_w,
    input  [3:0]  cfg_kernel_h, cfg_kernel_w,
    input         start,
    output reg    done, busy,
    // DMA interface
    output reg        dma_rd_start,
    output reg [31:0] dma_rd_addr,
    input      [63:0] dma_rd_data,
    input             dma_rd_valid,
    output reg        dma_wr_start,
    output reg [31:0] dma_wr_addr,
    output reg [63:0] dma_wr_data,
    input             dma_wr_ready
);
    localparam IDLE = 0, LOAD_WEIGHTS = 1, PROCESS = 2, DONE = 3;
    reg [1:0] state;

    // weights_loaded / output_complete come from the datapath counters
    always @(posedge clk) begin
        if (rst)
            state <= IDLE;
        else case (state)
            IDLE:         if (start)           state <= LOAD_WEIGHTS;
            LOAD_WEIGHTS: if (weights_loaded)  state <= PROCESS;
            PROCESS:      if (output_complete) state <= DONE;
            DONE:         state <= IDLE;
        endcase
    end
endmodule

This is MUCH simpler than the current 700-line top_controller.v with scoreboarding and multiple FSMs.
The current design is impressive from an academic standpoint but over-engineered for the requirements. A simpler design would:
- Use 1D systolic array or parallel MAC tree (not 2D PE grid)
- Use register-based configuration (not instruction memory)
- Use line buffers for sliding window (natural for convolution)
- Use single simple FSM (not multiple interacting FSMs)
- Focus on demonstrable functionality over peak throughput
This would score well on:
- Architecture (45%): Still demonstrates parallelism, pipelining, scalability
- RTL quality (30%): Cleaner, simpler code that's easier to time
- Verification (10%): Much easier to test
- Demo (15%): More likely to actually work