TMMA: A Tiled Matrix Multiplication Accelerator for Self-Attention Projections in Transformer Models

Optimized for Edge Deployment on Xilinx KV260

🔬 Ongoing Research & Work in Progress

This project is an active research effort, and the implementation is currently under development. We plan to open-source the full code once our research paper is published. Some components may be incomplete or restricted for now.

If you're interested in this project, feel free to watch this repository for updates or reach out for discussions. 🚀

Overview

TMMA is an FPGA-based accelerator for dense matrix multiplication, with a primary focus on the self-attention projection computations in Transformer-based Large Language Models (LLMs). Our initial goal was to accelerate full DistilBERT inference, but time and resource constraints led us to concentrate on the projection computations (i.e., the matrix multiplications underlying the Q, K, and V projections) in the Multi-Head Self-Attention (MHA) module.
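To make the target workload concrete: the Q, K, and V projections are three dense products of the same input matrix with three different weight matrices. The following is a minimal pure-Python sketch with toy shapes, not the accelerator code. The names `matmul` and `qkv_projections` are hypothetical, and bias terms are omitted for brevity.

```python
def matmul(a, b):
    """Naive dense product of an (m x k) and a (k x n) matrix given as lists of rows."""
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def qkv_projections(x, w_q, w_k, w_v):
    """Project the token matrix x (seq_len x d_model) into queries, keys, and values."""
    return matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)


# Toy example: 2 tokens, d_model = 2 (illustrative values only).
x = [[1, 2], [3, 4]]
identity = [[1, 0], [0, 1]]
doubling = [[2, 0], [0, 2]]
swap = [[0, 1], [1, 0]]
q, k, v = qkv_projections(x, identity, doubling, swap)
print(q)  # [[1, 2], [3, 4]]
print(k)  # [[2, 4], [6, 8]]
print(v)  # [[2, 1], [4, 3]]
```

All three products share the input `x`, which is why a single matrix multiplication engine can serve the three projections.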

Optimized for resource-constrained edge devices such as the Xilinx KV260 Vision AI Starter Kit, TMMA combines efficient dataflow, pipelining, and fixed-point arithmetic to balance resource utilization, latency, and throughput. This makes it an alternative to conventional CPU- and GPU-based inference for the matrix operations that dominate the attention projections.
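As a rough illustration of what fixed-point arithmetic means for a multiply-accumulate, the sketch below quantizes floats to signed integers, accumulates the products in a wide integer, and rescales once at the end. The word width and number of fractional bits (`WORD_BITS`, `FRAC_BITS`) are assumptions chosen for the example, not the format used by the accelerator, and all function names are hypothetical.

```python
FRAC_BITS = 8    # assumed number of fractional bits (not taken from the project)
WORD_BITS = 16   # assumed signed word width (not taken from the project)
SCALE = 1 << FRAC_BITS
Q_MIN = -(1 << (WORD_BITS - 1))
Q_MAX = (1 << (WORD_BITS - 1)) - 1


def to_fixed(value):
    """Quantize a float to a signed fixed-point integer, saturating at the word limits."""
    return max(Q_MIN, min(Q_MAX, round(value * SCALE)))


def to_float(value_fixed):
    """Convert a fixed-point integer back to a float."""
    return value_fixed / SCALE


def fixed_dot(a_fixed, b_fixed):
    """Dot product using a wide integer accumulator, rescaled once at the end."""
    acc = sum(a * b for a, b in zip(a_fixed, b_fixed))  # products carry scale SCALE**2
    return acc >> FRAC_BITS                             # shift back to scale SCALE


a = [0.5, -1.25, 2.0]
b = [1.5, 0.75, -0.5]
result = to_float(fixed_dot([to_fixed(v) for v in a], [to_fixed(v) for v in b]))
exact = sum(p * q for p, q in zip(a, b))
print(result, exact)  # -1.1875 -1.1875
```

These inputs are exactly representable with 8 fractional bits, so the result matches the floating-point value. In general the quantization error depends on the chosen format and on where the rescaling happens.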

Key Features

Accelerated Self-Attention Projections: Optimizes the matrix multiplications critical to the Q, K, and V projections in Transformer models.
Edge-Friendly Deployment: Specifically designed for resource-constrained FPGAs like the Xilinx KV260, enabling on-device inference.
Efficient Memory Utilization: Minimizes external DRAM access by maximizing on-chip BRAM usage through tiling and data reuse strategies.
Vivado HLS-Based Implementation: Developed using high-level synthesis (HLS) for rapid prototyping and iterative design optimizations.
Energy Efficiency: Demonstrates significant energy savings compared to CPU-based implementations, making it suitable for power-constrained applications.
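The tiling and data reuse idea can be sketched in software. In the example below each output tile is held in a local accumulator while the shared dimension is swept, so only small tiles of the operands need to be resident at any time. This is a plain-Python model under assumed tile sizes, not the HLS implementation. The function name `tiled_matmul` and the `tile_loads` counter are hypothetical and stand in for transfers into on-chip buffers.

```python
def tiled_matmul(a, b, tile=2):
    """Compute a @ b tile by tile; returns the product and the number of tile loads."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    tile_loads = 0
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            # The output tile c[i0:i0+tile][j0:j0+tile] stays resident during this sweep.
            for p0 in range(0, k, tile):
                # Stand-ins for copying one tile of A and one tile of B into local buffers.
                a_tile = [row[p0:p0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[p0:p0 + tile]]
                tile_loads += 2
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        c[i0 + i][j0 + j] += sum(
                            a_row[p] * b_tile[p][j] for p in range(len(b_tile)))
    return c, tile_loads


# 4x4 example with 2x2 tiles (illustrative sizes only).
a = [[i * 4 + j for j in range(4)] for i in range(4)]
b = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
c, loads = tiled_matmul(a, b, tile=2)
print(c == a, loads)  # True 16
```

Each loaded tile is reused for a full tile-sized block of multiply-accumulates before being replaced, which is the property that reduces traffic to external memory relative to fetching operands element by element.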

Project Structure

TMMA/
│── hls/                   # Accelerator source code (Vivado HLS)
│── vivado/                # Vivado design snapshots and constraints
│── models/                # Benchmarking scripts and model integration (e.g., DistilBERT)
│── reference/             # Related reference materials and papers
│── pynq/                  # PYNQ runtime layer and example notebooks (run on KV260)
│── README.md              # This file
│── LICENSE                # License information

Setup and Installation

git clone https://github.com/yourusername/TMMA.git
cd TMMA

Follow the instructions in the hls/ directory to build the accelerator using Vivado HLS.
For FPGA deployment, refer to the example notebooks in the pynq/ directory.

Benchmarking

Detailed benchmarking scripts and performance evaluations are provided in the models/ and pynq/ directories. Our benchmarks include:

Standalone GEMM: Performance evaluation on random matrices.
DistilBERT Attention Throughput: Inference performance when offloading self-attention projection computations to the FPGA.
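For orientation, a standalone GEMM measurement usually times the product of random matrices and reports throughput as operations per second, counting one multiply and one add per inner-product term. The sketch below is a CPU-only illustration of that bookkeeping, not the project's benchmarking script. The names `gemm_ops` and `benchmark`, the matrix size, and the value range are assumptions.

```python
import random
import time


def matmul(a, b):
    """Naive dense product used as the function under test."""
    k, n = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)] for row in a]


def gemm_ops(m, k, n):
    """Operation count of an (m x k) @ (k x n) product: one multiply and one add per term."""
    return 2 * m * k * n


def benchmark(fn, m, k, n, repeats=3, seed=0):
    """Time fn on random integer matrices; returns (best seconds, giga-operations per second)."""
    rng = random.Random(seed)
    a = [[rng.randint(-128, 127) for _ in range(k)] for _ in range(m)]
    b = [[rng.randint(-128, 127) for _ in range(n)] for _ in range(k)]
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(a, b)
        best = min(best, time.perf_counter() - start)
    return best, gemm_ops(m, k, n) / best / 1e9


seconds, gops = benchmark(matmul, 32, 32, 32)
print(f"best of 3: {seconds * 1e3:.2f} ms, {gops:.4f} GOPS")
```

Reporting the best of several runs reduces the influence of one-off system noise. When timing an accelerator, data transfer time should be stated separately from compute time so the two are not conflated.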

Roadmap

✅ Completed:

Systolic Array Design for Matrix Multiplication
Vivado HLS Implementation
FPGA Deployment on Xilinx KV260
Memory Optimization for Edge Deployment

🚧 In Progress (Ongoing Research):

Extending Acceleration to Additional Transformer Components (e.g., Softmax, FFN)
Expanding Model Compatibility (e.g., GPT-based LLMs)
Optimizing for Larger Batch Sizes

📢 Release

Code published and tagged as v1.0.0

References

arXiv Paper: https://arxiv.org/abs/2503.16731
Ashish Vaswani et al., "Attention Is All You Need" (NeurIPS 2017)
Victor Sanh et al., "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter" (2019)
S. Lu et al., "Hardware Accelerator for Transformer" (IEEE SOCC 2020) DOI:10.1109/SOCC49529.2020.9524802
Xilinx, "Vitis AI Documentation" (2023)
Shulin Zeng et al., "FlightLLM: FPGA-Based LLM Acceleration" (FPGA '24)
Jinming Zhuang et al., "SSR: Spatial Sequential Hybrid Architecture for Latency Throughput Tradeoff in Transformer Acceleration" (FPGA '24)

Note

This repository does not include compiled bitstreams or full Vivado projects. To reproduce the design, recreate the hardware platform from the provided HLS code and integration diagrams, and make sure you hold the appropriate licenses for any Xilinx IP cores used in your design.

License

This project is licensed under the MIT License.
