ring_buffer

Introduction

This project aims to implement a mock ring buffer using C multithreading. The implementation will subsequently be ported to an FPGA to support the efficient execution of "Configurable DSP-Based CAM Architecture for Data-Intensive Applications on FPGAs".

In this project, we focus on a one-to-one ring buffer implementation. This model is ideal for our target FPGA platform, which requires high-speed communication between the Processing System (PS) and Programmable Logic (PL).

The original implementation used General Purpose (GP) AXI-Lite ports to send instructions from the PS to the DSP-based CAM hardware in the PL. However, every AXI-Lite transaction pays for bus arbitration and handshake latency, and this per-transaction overhead becomes a system bottleneck. By implementing a ring buffer in On-Chip Memory (OCM), we leverage dual-port RAM characteristics, allowing the PS and PL to access shared data with minimal latency and significantly higher throughput than standard bus transactions.

Technical Deep Dive: AXI-Lite Bottleneck vs. OCM Efficiency

What is AXI Bus Arbitration?

In an SoC like the Zynq-7000, multiple "Master" devices (e.g., CPU cores, DMA, Video Engines) share the same "Slave" resources via an Interconnect. AXI Bus Arbitration is the management mechanism that decides which Master gets control of the bus when multiple requests occur simultaneously.

Think of it as a traffic light:

Latencies: Every transaction must request access, wait for the arbiter to grant it, and perform a multi-step handshake (VALID/READY).
Jitter: If other parts of the system are busy, the CPU might wait longer to send a single command, causing unpredictable timing.

Why OCM is Faster

By moving to an OCM-based Ring Buffer, we effectively bypass the "traffic light":

Dedicated Path: OCM in Zynq is designed for low-latency, high-priority access. It often features multiple ports, allowing the PS and PL to access memory simultaneously without competing for the same bus cycle.
Asynchronous Decoupling: Instead of a "Stop-and-Wait" approach (where the CPU waits for an AXI response), the PS simply writes data to the buffer and moves on. The PL reads the data at its own clock rate.
Reduced Overhead: We eliminate the per-transaction arbitration overhead, replacing many small single-beat AXI-Lite transactions (AXI-Lite does not support bursts) with a continuous stream of data through shared memory.

What is a Ring Buffer?

A ring buffer (or circular buffer) is a fundamental data transfer mechanism, particularly useful for asynchronous processes where a producer and consumer operate at different speeds.

Key Characteristics

Fixed Size: Memory is pre-allocated, avoiding dynamic allocation overhead during runtime.
Lock-Free Potential: In a Single-Producer Single-Consumer (SPSC) model, the ring buffer can be implemented without expensive mutexes or semaphores, provided that pointer updates are atomic and memory barriers are respected.
FIFO Logic: Data is processed in the order it was received.

A typical ring buffer is managed by head and tail pointers:

Producer (PS): Writes data to the head and then increments it.
Consumer (PL): Reads data from the tail and then increments it.

The buffer is "Full" when (head + 1) % SIZE == tail and "Empty" when head == tail.
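
The head/tail rules above can be sketched in a few lines of single-threaded C. This is a minimal illustration, not the project's actual code: the names ring_t, ring_push, ring_pop and the slot count RB_SLOTS are assumptions made for the example, and the payload is reduced to a uint32_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RB_SLOTS 8u /* hypothetical slot count, kept small for illustration */
static_assert(RB_SLOTS > 1u, "one slot always stays empty, so at least two are needed");

typedef struct {
    uint32_t slots[RB_SLOTS];
    size_t head; /* next slot the producer writes */
    size_t tail; /* next slot the consumer reads */
} ring_t;

/* Empty: the consumer has caught up with the producer. */
static bool ring_is_empty(const ring_t *rb) { return rb->head == rb->tail; }

/* Full: advancing head would make it collide with tail. */
static bool ring_is_full(const ring_t *rb) {
    return (rb->head + 1) % RB_SLOTS == rb->tail;
}

/* Producer: write at head, then advance head. Returns false when full. */
static bool ring_push(ring_t *rb, uint32_t value) {
    if (ring_is_full(rb)) return false;
    rb->slots[rb->head] = value;
    rb->head = (rb->head + 1) % RB_SLOTS;
    return true;
}

/* Consumer: read at tail, then advance tail. Returns false when empty. */
static bool ring_pop(ring_t *rb, uint32_t *out) {
    if (ring_is_empty(rb)) return false;
    *out = rb->slots[rb->tail];
    rb->tail = (rb->tail + 1) % RB_SLOTS;
    return true;
}
```

Because "Full" is detected one step before head reaches tail, a ring with RB_SLOTS slots stores at most RB_SLOTS - 1 items. That is the price of telling "Full" and "Empty" apart without an extra counter.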

Implementation Details

Language Selection

While C++ is increasingly common in embedded systems development, we have chosen C for the following reasons:

Toolchain Compatibility: The PS-side control for our FPGA (Zynq®-7000) is natively supported by C-based Xilinx drivers.
Memory Determinism: C structs have a predictable memory layout. By using __attribute__((packed)) and specific alignment pragmas, we can ensure the software structure maps exactly to the hardware-defined OCM addresses. C++ features such as virtual tables can add hidden fields (for example, a vtable pointer) that shift member offsets.
Execution Predictability: C avoids implicit overheads (e.g., hidden constructors or exception handling logic). This is critical for meeting the strict timing requirements of PL-PS synchronization.
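
As a sketch of the memory-determinism point, the layout of a 64-byte instruction can be pinned down and checked at compile time. The struct name cam_instr_t and its fields (opcode, key_id, payload) are hypothetical placeholders, not the real instruction format, and the attribute syntax assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 64-byte instruction layout; field names are illustrative only. */
typedef struct {
    uint32_t opcode;      /* assumed field, offset 0 */
    uint32_t key_id;      /* assumed field, offset 4 */
    uint8_t  payload[56]; /* assumed field, offset 8 */
} __attribute__((packed, aligned(64))) cam_instr_t;

/* Compile-time checks: the build fails if the layout drifts. */
static_assert(sizeof(cam_instr_t) == 64, "instruction must be exactly 512 bits");
static_assert(offsetof(cam_instr_t, opcode) == 0, "opcode must sit at offset 0");
static_assert(offsetof(cam_instr_t, key_id) == 4, "key_id must sit at offset 4");
static_assert(offsetof(cam_instr_t, payload) == 8, "payload must sit at offset 8");
```

With checks like these, a layout change that would no longer match the hardware view shows up as a compile error rather than as a bring-up bug.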

Technical Specifications

Concurrency: Single-Producer Single-Consumer (SPSC).
Synchronization: The mock implementation uses atomic operations (stdatomic.h) to simulate hardware memory consistency. In the final FPGA port, explicit memory barriers (DMB/DSB instructions) will be used to ensure the PL sees the data write before the head pointer update.
Cache Coherence: Shared OCM ring-buffer memory is treated as non-cacheable on the PS side for the first implementation stage. This matches the intended PL direct memory access behavior and avoids stale cache-line visibility issues during bring-up.
Ring Buffer Size: 16 KB (16,384 Bytes)
Instruction Unit Size: 512 bits (64 Bytes)
Overflow Policy: "Drop Latest". If the buffer is full, the producer discards the new instruction instead of overwriting unread data, preserving data integrity for the CAM hardware.
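
The specifications above can be combined into a minimal sketch of the mock's producer and consumer paths. It assumes C11 stdatomic.h with release/acquire ordering standing in for the hardware barriers. The names spsc_ring_t, spsc_push, spsc_pop and the dropped counter are illustrative assumptions, not the project's actual API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define INSTR_BYTES 64u  /* 512-bit instruction unit */
#define NUM_SLOTS   256u /* 16 KB / 64 B */
static_assert((NUM_SLOTS & (NUM_SLOTS - 1u)) == 0u, "slot count must be a power of two");
static_assert(NUM_SLOTS * INSTR_BYTES == 16384u, "ring must be exactly 16 KB");

typedef struct {
    uint8_t data[NUM_SLOTS][INSTR_BYTES];
    _Atomic uint32_t head; /* written only by the producer */
    _Atomic uint32_t tail; /* written only by the consumer */
    uint32_t dropped;      /* producer-side count of discarded instructions */
} spsc_ring_t;

/* Producer side. Drop-latest: when full, discard the new instruction. */
static bool spsc_push(spsc_ring_t *rb, const uint8_t instr[INSTR_BYTES]) {
    uint32_t head = atomic_load_explicit(&rb->head, memory_order_relaxed);
    uint32_t next = (head + 1u) & (NUM_SLOTS - 1u);
    if (next == atomic_load_explicit(&rb->tail, memory_order_acquire)) {
        rb->dropped++; /* unread data is left untouched */
        return false;
    }
    memcpy(rb->data[head], instr, INSTR_BYTES);
    /* Release: the payload write becomes visible before the new head. */
    atomic_store_explicit(&rb->head, next, memory_order_release);
    return true;
}

/* Consumer side. Returns false when the ring is empty. */
static bool spsc_pop(spsc_ring_t *rb, uint8_t out[INSTR_BYTES]) {
    uint32_t tail = atomic_load_explicit(&rb->tail, memory_order_relaxed);
    if (tail == atomic_load_explicit(&rb->head, memory_order_acquire)) {
        return false;
    }
    memcpy(out, rb->data[tail], INSTR_BYTES);
    /* Release: the slot is handed back only after it has been copied out. */
    atomic_store_explicit(&rb->tail, (tail + 1u) & (NUM_SLOTS - 1u),
                          memory_order_release);
    return true;
}
```

The release store on head plays the role that the DMB/DSB barrier is expected to play in the FPGA port: the consumer must never observe the new head before the instruction bytes it points past.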

Future Testing Note: AXI Performance & Instruction Width

During on-board testing on the FPGA, we may evaluate increasing the Instruction Unit Size to 544 bits.

The Rationale: There is a potential concern that exceeding the standard 512-bit boundary might trigger additional AXI Packaging/Re-alignment overhead in the interconnect, which could impact throughput or latency.
Goal: We will benchmark both 512-bit and 544-bit configurations to verify if AXI re-packing occurs and choose the size that yields the optimal system performance.
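
Independent of any AXI effect, the two widths also divide the 16 KB ring differently. The small sketch below does that arithmetic. The helper names (whole_slots, leftover_bytes) are made up for this example, and it assumes instructions are stored back to back with no padding.

```c
#include <assert.h>
#include <stdint.h>

#define RING_BYTES 16384u /* 16 KB ring, as specified above */
static_assert(RING_BYTES % 64u == 0u, "ring must hold a whole number of 64-byte units");

/* Convert an instruction width in bits to bytes (widths here are multiples of 8). */
static uint32_t bits_to_bytes(uint32_t bits) { return bits / 8u; }

/* Number of complete instructions that fit in the ring. */
static uint32_t whole_slots(uint32_t instr_bits) {
    return RING_BYTES / bits_to_bytes(instr_bits);
}

/* Bytes left over after the last complete instruction. */
static uint32_t leftover_bytes(uint32_t instr_bits) {
    return RING_BYTES % bits_to_bytes(instr_bits);
}
```

For 512 bits this gives 256 slots with nothing left over. For 544 bits (68 bytes) it gives 240 slots with 64 bytes left over, so a 544-bit configuration would need either padding at the end of the ring or a slot count that is no longer a power of two.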

Optimization

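Using a power-of-two size (16 KB) allows us to replace the costly modulo operator (%) with a bitwise AND (&). With a 64-byte instruction, the buffer divides into exactly 256 slots, so no instruction is split across the wrap-around boundary. Because the full condition (head + 1) % SIZE == tail always leaves one slot empty, 255 of those slots are usable at any time. In the snippet below, head is a slot index and SIZE is the slot count (256).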

// Instead of:
head = (head + 1) % SIZE;
// We use:
head = (head + 1) & (SIZE - 1);
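
The equivalence only holds when SIZE is a power of two. A short sketch that guards the assumption at compile time and exposes both forms for comparison is shown below. The function names next_mod and next_mask are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define SIZE 256u /* slot count: 16 KB / 64 B */

/* The mask trick is only valid for powers of two, so enforce it at build time. */
static_assert(SIZE != 0u && (SIZE & (SIZE - 1u)) == 0u, "SIZE must be a power of two");

/* Reference form using the modulo operator. */
static uint32_t next_mod(uint32_t head) { return (head + 1u) % SIZE; }

/* Optimized form using a bitwise AND with SIZE - 1. */
static uint32_t next_mask(uint32_t head) { return (head + 1u) & (SIZE - 1u); }
```

For every index in 0..SIZE-1 the two functions return the same value, including the wrap from 255 back to 0. With a non-power-of-two SIZE the two forms diverge, which is why the static assertion is worth keeping next to the macro.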

Current Test Workflow

The current test suite focuses on fast, deterministic functional validation of the SPSC ring buffer.

How to Run

Build all test binaries:
- make all
Run all tests:
- make test
Run a specific test:
- make run-boundary
- make run-wrap-drop
- make run-integration

What Each Test Covers

boundary_test
- Validates empty-buffer consume behavior, full-buffer write rejection, FIFO drain correctness, and post-drain empty behavior.
wrap_drop_test
- Validates wrap-around index correctness across ring boundaries and confirms drop-latest does not overwrite unread data.
integration_test
- Runs threaded producer/consumer behavior for a short duration (default 2 seconds) and validates end-to-end ordering/integrity counters (Produced, Consumed, Dropped, Errors).
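
To give a feel for what a wrap-around and drop-latest check involves, here is a small single-threaded sketch. It is not the repository's wrap_drop_test. The tiny 4-slot ring, the stats_t counters, and run_wrap_drop_check are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SLOTS 4u /* hypothetical tiny ring so wrap-around happens quickly */
static_assert((SLOTS & (SLOTS - 1u)) == 0u, "SLOTS must be a power of two");

typedef struct { uint32_t produced, consumed, dropped, errors; } stats_t;
typedef struct { uint32_t buf[SLOTS]; uint32_t head, tail; } ring_t;

/* Drop-latest push: refuse the new value when the ring is full. */
static bool push(ring_t *r, uint32_t v) {
    uint32_t next = (r->head + 1u) & (SLOTS - 1u);
    if (next == r->tail) return false;
    r->buf[r->head] = v;
    r->head = next;
    return true;
}

static bool pop(ring_t *r, uint32_t *v) {
    if (r->head == r->tail) return false;
    *v = r->buf[r->tail];
    r->tail = (r->tail + 1u) & (SLOTS - 1u);
    return true;
}

/* Each round offers `burst` sequence numbers, then drains the ring and
 * checks that the surviving values are the oldest ones, in FIFO order. */
static stats_t run_wrap_drop_check(uint32_t rounds, uint32_t burst) {
    ring_t r = {0};
    stats_t s = {0};
    uint32_t seq = 0; /* next sequence number to offer */
    for (uint32_t i = 0; i < rounds; i++) {
        uint32_t first = seq; /* oldest value offered in this round */
        for (uint32_t j = 0; j < burst; j++) {
            if (push(&r, seq)) s.produced++; else s.dropped++;
            seq++;
        }
        uint32_t v, k = 0;
        while (pop(&r, &v)) {
            if (v != first + k) s.errors++; /* overwritten or out of order */
            k++;
            s.consumed++;
        }
    }
    return s;
}
```

Calling run_wrap_drop_check(10, 5) offers five values per round to a ring that can hold three, so the indices wrap repeatedly and two values per round are dropped. A correct ring reports zero errors, with produced equal to consumed.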
