Requirements Analysis

  1. Core Functionality:

    • Convolution (1x1 to 7x7 kernels)
    • Pooling (max/avg)
    • Activation (ReLU, Sigmoid, Softmax)
    • DMA streaming from/to DDR
  2. Platform Constraints:

    • XC7Z020: 220 DSPs, 140 BRAM36K, ~53K LUTs
    • 2-month timeline with 4-5 students
    • Student project (simplicity matters)
  3. Evaluation Priorities (from requirements):

    • Architecture (45%): Parameterization, scalability, pipelining, throughput/resource
    • RTL quality (30%): Code quality, timing
    • Verification (10%)
    • SoC demo (15%)

Current Design Issues

  1. Over-engineered for a student project:

    • 700+ line top_controller.v with complex stall logic
    • Scoreboarding, hazard detection
    • 13 instruction types
    • Multiple FSMs interacting
  2. Row-stationary dataflow complexity:

    • 55-cycle psum drain latency
    • Complex PE interconnect
    • Valid signal propagation through 14 rows
  3. ISA complexity:

    • 13 instructions, some dual-word
    • Hardware loop counters
    • Scoreboarding

What Would Be Simpler/Better?

Option A: Streaming Architecture (Simpler)

Instead of a 2D PE array with complex dataflow, use a 1D processing pipeline:

Input Stream → Line Buffer → Sliding Window → MAC Array → Accumulator → Activation → Output Stream

Advantages:

  • No complex 2D interconnect
  • Natural streaming (matches DMA model)
  • Easier to understand and verify
  • No psum chain latency issues

For convolution:

  1. Stream input pixels through line buffers (BRAM)
  2. Sliding window extracts KxK patch
  3. MAC array computes dot product in parallel
  4. Accumulate across input channels
  5. Apply activation
  6. Stream out
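
Steps 1–3 above can be sketched as a small window extractor. This is an illustrative 3×3 version (module and port names are assumptions, not existing RTL); the two row-delay FIFOs are presumed to exist upstream, and only the shift-register stage that forms the KxK patch is shown:

```verilog
// Illustrative 3x3 sliding-window extractor. Assumes two row FIFOs
// (BRAM delay lines) already supply pixels from the previous two
// image rows; this stage only holds the 3x3 patch in registers.
module window_3x3 #(
    parameter DATA_W = 8
) (
    input                 clk,
    input                 en,        // advance one pixel per cycle
    input  [DATA_W-1:0]   row0_in,   // current row pixel
    input  [DATA_W-1:0]   row1_in,   // pixel from one row ago
    input  [DATA_W-1:0]   row2_in,   // pixel from two rows ago
    output [9*DATA_W-1:0] window     // 3x3 patch, row-major
);
    reg [DATA_W-1:0] w [0:2][0:2];
    integer r;
    always @(posedge clk) begin
        if (en) begin
            for (r = 0; r < 3; r = r + 1) begin
                w[r][2] <= w[r][1];   // shift patch one column right
                w[r][1] <= w[r][0];
            end
            w[0][0] <= row0_in;
            w[1][0] <= row1_in;
            w[2][0] <= row2_in;
        end
    end
    assign window = {w[2][2], w[2][1], w[2][0],
                     w[1][2], w[1][1], w[1][0],
                     w[0][2], w[0][1], w[0][0]};
endmodule
```

Each accepted pixel shifts the patch one column; once the pipeline is primed, one valid window emerges per cycle, which is what makes the dot product stream at full rate.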

Option B: Simplified PE Array (Medium Complexity)

Keep the 2D array but simplify:

  1. Output-stationary dataflow instead of row-stationary:

    • Weights stream through
    • Inputs broadcast
    • Outputs accumulate locally
    • Simpler control, no psum chain
  2. Simpler control:

    • Remove scoreboarding (use explicit sync instructions)
    • Remove hardware loops (software handles iteration)
    • Single-word instructions only
  3. Single FSM controller:

    • IDLE → LOAD_WEIGHTS → LOAD_INPUTS → COMPUTE → STORE_OUTPUTS → IDLE
    • No pipeline hazards to handle
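
As a rough sketch of the output-stationary PE described in point 1 (names and widths here are illustrative assumptions, not the current RTL): the activation is broadcast to every PE, the weight streams through to the right neighbor, and the partial sum never leaves the PE.

```verilog
// Illustrative output-stationary PE: activation broadcast, weight
// shifted horizontally, accumulation kept local (no psum chain).
module os_pe #(
    parameter DATA_W = 8,
    parameter ACC_W  = 32
) (
    input                          clk,
    input                          rst,
    input                          clear,   // start of a new output pixel
    input                          en,
    input  signed [DATA_W-1:0]     act_in,  // broadcast activation
    input  signed [DATA_W-1:0]     w_in,    // weight from left neighbor
    output reg signed [DATA_W-1:0] w_out,   // weight to right neighbor
    output reg signed [ACC_W-1:0]  acc      // locally accumulated output
);
    always @(posedge clk) begin
        if (rst) begin
            w_out <= 0;
            acc   <= 0;
        end else if (en) begin
            w_out <= w_in;                  // weights stream through
            if (clear)
                acc <= act_in * w_in;       // first term of new output
            else
                acc <= acc + act_in * w_in; // accumulate locally
        end
    end
endmodule
```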

Option C: Memory-Centric Design (Pragmatic for Demo)

Focus on what actually needs to be demonstrated:

  • One working convolution layer
  • One working pooling layer
  • Activation functions
  • DMA transfers

Design:

  1. Single convolution unit (not array):

    • 49 parallel MACs (max 7x7 kernel)
    • Input buffer with sliding window logic
    • Weight BRAM
    • Output accumulator
  2. Simple state machine:

    • ARM programs kernel size, input dims, output dims
    • ARM triggers start
    • Hardware processes entire tile
    • Hardware signals done
    • ARM reads results
  3. Fewer instructions:

    • START_CONV (with parameters in registers)
    • START_POOL
    • STATUS_CHECK
    • That's it.

My Recommendation: Hybrid Approach

For a 2-month student project, I'd recommend:

Architecture: 1D Systolic Array with Line Buffers

DDR → DMA → Line Buffers → Window Gen → Systolic MAC Chain → Accumulator → Activation → DMA → DDR
                              ↑
                          Weight BRAM

Key Components:

  1. Line Buffer (BRAM-based):

    • Stores K rows for KxK convolution
    • Natural sliding window extraction
    • Double-buffered for continuous streaming
  2. 1D Systolic MAC Array (1×N, not N×N):

    • N = 14 to 49 (depending on resources)
    • Each PE: weight_reg × input → accumulator
    • Simpler than 2D: only horizontal data flow
    • Weights shift horizontally, inputs broadcast
  3. Simple Control:

    • AXI-Lite registers for configuration
    • Single FSM: IDLE → CONV → POOL → DONE
    • No instruction memory (configuration registers only)
  4. Pooling Unit:

    • Separate 2×2 comparator tree
    • Shares line buffer with convolution
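
The 2×2 comparator tree in component 4 is small enough to show in full; this is a minimal sketch (port names assumed) for signed pixels:

```verilog
// Illustrative 2x2 max-pooling comparator tree: two first-level
// comparators feed one final comparator.
module max_pool_2x2 #(
    parameter DATA_W = 8
) (
    input  signed [DATA_W-1:0] p00, p01,  // top row of 2x2 window
    input  signed [DATA_W-1:0] p10, p11,  // bottom row of 2x2 window
    output signed [DATA_W-1:0] max_out
);
    wire signed [DATA_W-1:0] m0 = (p00 > p01) ? p00 : p01;
    wire signed [DATA_W-1:0] m1 = (p10 > p11) ? p10 : p11;
    assign max_out = (m0 > m1) ? m0 : m1;
endmodule
```

Average pooling would swap the comparators for an adder tree plus a shift, at similar cost.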

ISA: Register-Based Configuration (No Instructions!)

Instead of an instruction memory and decoder, use AXI-Lite configuration registers:

Offset  Name         Description
0x00    CTRL         [0]=start, [1]=pool_mode, [2]=relu_en
0x04    STATUS       [0]=busy, [1]=done
0x08    INPUT_ADDR   DDR address for input
0x0C    OUTPUT_ADDR  DDR address for output
0x10    WEIGHT_ADDR  DDR address for weights
0x14    INPUT_SIZE   {height[15:8], width[7:0]}
0x18    KERNEL_SIZE  {kh[3:0], kw[3:0]}
0x1C    CHANNELS     {out_ch[15:8], in_ch[7:0]}

Why this is better:

  1. ARM software handles layer iteration (it's doing this anyway)
  2. No instruction decoder complexity
  3. No instruction memory BRAM cost
  4. Simpler verification (just test each config)
  5. Easier to demonstrate (change register, see result)
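
The write side of this register file reduces to a simple address decode. A sketch (assuming the AXI-Lite handshake is handled upstream and delivers a decoded write strobe; module and port names are illustrative):

```verilog
// Illustrative write-side decode for the configuration registers.
// STATUS (0x04) is read-only and driven by the FSM, so it is omitted.
module cfg_regs (
    input             clk,
    input             rst,
    input             wr_en,        // decoded AXI-Lite write strobe
    input      [7:0]  wr_addr,      // byte offset of the register
    input      [31:0] wr_data,
    output reg [31:0] ctrl,         // 0x00
    output reg [31:0] input_addr,   // 0x08
    output reg [31:0] output_addr,  // 0x0C
    output reg [31:0] weight_addr,  // 0x10
    output reg [15:0] input_size,   // 0x14 {height, width}
    output reg [7:0]  kernel_size,  // 0x18 {kh, kw}
    output reg [15:0] channels      // 0x1C {out_ch, in_ch}
);
    always @(posedge clk) begin
        if (rst) begin
            ctrl        <= 0;  input_addr  <= 0;
            output_addr <= 0;  weight_addr <= 0;
            input_size  <= 0;  kernel_size <= 0;
            channels    <= 0;
        end else if (wr_en) begin
            case (wr_addr)
                8'h00: ctrl        <= wr_data;
                8'h08: input_addr  <= wr_data;
                8'h0C: output_addr <= wr_data;
                8'h10: weight_addr <= wr_data;
                8'h14: input_size  <= wr_data[15:0];
                8'h18: kernel_size <= wr_data[7:0];
                8'h1C: channels    <= wr_data[15:0];
            endcase
        end
    end
endmodule
```

From the ARM side, programming a layer is then a handful of `Xil_Out32` writes followed by polling STATUS, which is exactly the demo flow described above.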

Dataflow: Output-Stationary with Streaming

For each output pixel (r, c):
    For each output channel (oc):
        acc = 0
        For each input channel (ic):
            For each kernel position (kr, kc):
                acc += weight[oc][ic][kr][kc] * input[ic][r+kr][c+kc]
        output[oc][r][c] = relu(acc)

Hardware mapping:

  • Outer loops (r, c, oc) controlled by FSM counter
  • Inner loops (ic, kr, kc) unrolled in parallel MACs
  • Line buffer provides sliding window access
  • Weight BRAM pre-loaded before layer starts
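
The FSM-controlled outer loops amount to three chained counters. A minimal sketch (counter widths and port names are assumptions): `step` pulses once per finished output pixel, and the counters walk (c, r, oc) in raster order.

```verilog
// Illustrative nested loop counters for the outer loops (oc, r, c).
// c is the fastest-changing index, oc the slowest.
module out_loop_ctr #(
    parameter W = 10
) (
    input              clk, rst,
    input              step,                 // one output pixel finished
    input  [W-1:0]     out_w, out_h, out_ch, // loop bounds (exclusive)
    output reg [W-1:0] c, r, oc,
    output             layer_done            // pulses on the last pixel
);
    wire c_last  = (c  == out_w  - 1);
    wire r_last  = (r  == out_h  - 1);
    wire oc_last = (oc == out_ch - 1);
    assign layer_done = step && c_last && r_last && oc_last;

    always @(posedge clk) begin
        if (rst) begin
            c <= 0; r <= 0; oc <= 0;
        end else if (step) begin
            c <= c_last ? {W{1'b0}} : c + 1;
            if (c_last)           r  <= r_last  ? {W{1'b0}} : r + 1;
            if (c_last && r_last) oc <= oc_last ? {W{1'b0}} : oc + 1;
        end
    end
endmodule
```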

Why This Design is Better

  1. Simpler to Understand:

    • Linear data flow (input → process → output)
    • Single FSM, not multiple interacting FSMs
    • No complex dataflow (row/weight/output stationary)
  2. Easier to Verify:

    • Fewer states
    • No hazard detection needed
    • Test one layer at a time
  3. Better Resource Utilization:

    • Line buffers reuse BRAM efficiently
    • No wasted PE cycles waiting for psum chain
  4. Matches DMA Model:

    • Streaming naturally aligns with DMA
    • No full-frame ping-pong buffering; only the small line buffers are double-buffered
  5. Achievable in 2 Months:

    • Each component is independently testable
    • Incremental development possible
    • Demo-ready faster

Trade-offs

Cons:

  • Lower peak throughput than 2D array
  • Less "impressive" on paper
  • Won't process as many MACs per cycle

Pros:

  • Actually works and can be demonstrated
  • Easier to debug when things go wrong
  • Better timing (shorter critical paths)
  • More time for optimization if schedule allows

Detailed Component Design

Let me sketch out what each component would look like...

Line Buffer

module line_buffer #(
    parameter WIDTH = 224,
    parameter DEPTH = 7,  // max kernel height
    parameter DATA_W = 8
) (
    input clk, rst,
    input [DATA_W-1:0] pixel_in,
    input pixel_valid,
    output [DATA_W*DEPTH*DEPTH-1:0] window_out,  // KxK window
    output window_valid
);
    // K row FIFOs implemented in BRAM
    // Sliding window extraction combinational logic
endmodule

MAC Unit (Simplified)

module mac_unit (
    input clk, rst,
    input signed [7:0] input_val,
    input signed [7:0] weight_val,
    input clear,
    output reg signed [31:0] accum
);
    always @(posedge clk) begin
        if (rst || clear)
            accum <= 0;
        else
            accum <= accum + input_val * weight_val;
    end
endmodule

Convolution Engine

module conv_engine #(
    parameter MAX_K = 7,
    parameter NUM_MACS = 49  // MAX_K * MAX_K
) (
    input clk, rst,

    // Configuration
    input [2:0] kernel_h, kernel_w,
    input start,
    output done,

    // Window input (from line buffer)
    input [8*MAX_K*MAX_K-1:0] window,
    input window_valid,

    // Weights (pre-loaded)
    input [8*NUM_MACS-1:0] weights,

    // Output
    output signed [31:0] result,
    output result_valid
);
    // NUM_MACS parallel multipliers feed an adder (reduction) tree,
    // producing one KxK dot product per valid window; a single output
    // register accumulates the partial sums across input channels
endmodule

Top Controller (Simple FSM)

module conv_ctrl (
    input clk, rst,

    // AXI-Lite config (simplified)
    input [31:0] cfg_input_addr,
    input [31:0] cfg_output_addr,
    input [15:0] cfg_input_h, cfg_input_w,
    input [3:0] cfg_kernel_h, cfg_kernel_w,
    input start,
    output reg done, busy,

    // DMA interface
    output reg dma_rd_start,
    output reg [31:0] dma_rd_addr,
    input [63:0] dma_rd_data,
    input dma_rd_valid,

    output reg dma_wr_start,
    output reg [31:0] dma_wr_addr,
    output reg [63:0] dma_wr_data,
    input dma_wr_ready
);
    localparam IDLE = 2'd0, LOAD_WEIGHTS = 2'd1, PROCESS = 2'd2, DONE = 2'd3;
    reg [1:0] state;

    // Status strobes from the datapath (weight loader / output writer,
    // not shown in this sketch)
    wire weights_loaded;    // weight BRAM fully loaded
    wire output_complete;   // last output word handed to DMA

    always @(posedge clk) begin
        if (rst)
            state <= IDLE;
        else case (state)
            IDLE:         if (start) state <= LOAD_WEIGHTS;
            LOAD_WEIGHTS: if (weights_loaded) state <= PROCESS;
            PROCESS:      if (output_complete) state <= DONE;
            DONE:         state <= IDLE;
        endcase
    end

    always @(posedge clk) begin
        busy <= (state != IDLE);
        done <= (state == DONE);
    end
endmodule

This is MUCH simpler than the current 700-line top_controller.v with scoreboarding and multiple FSMs.

Summary

The current design is impressive from an academic standpoint but over-engineered for the requirements. A simpler design would:

  1. Use 1D systolic array or parallel MAC tree (not 2D PE grid)
  2. Use register-based configuration (not instruction memory)
  3. Use line buffers for sliding window (natural for convolution)
  4. Use single simple FSM (not multiple interacting FSMs)
  5. Focus on demonstrable functionality over peak throughput

This would score well on:

  • Architecture (45%): Still demonstrates parallelism, pipelining, scalability
  • RTL quality (30%): Cleaner, simpler code that's easier to time
  • Verification (10%): Much easier to test
  • Demo (15%): More likely to actually work