1. Architecture Specification
This section contains a high-level architectural description of AGILE, its core functionality and its interface with Host Software.
Figure 1: AGILE Top Level Diagram. The main functional units are shown in green. Blue blocks represent various memories responsible for storing graph data and intermediate computations. Purple blocks represent AXI interconnects for Host and Memory communication. Xilinx IP blocks are shown in red.
As shown in Figure 1, AGILE is composed of the following functional units.
- Node Scoreboard (NSB): responsible for communication with the Host device and scheduling workload onto the accelerator. The Host issues instructions to AGILE through the AXI-Lite interconnect. Firstly, layer configuration is programmed to determine global parameters used by the accelerator. Subsequently, node information is programmed into Nodeslots, which maintain a state machine for each node currently in-flight within the accelerator.
- Aggregation Engine (AGE): responsible for performing permutation-invariant aggregation functions over all neighbours of a node. After receiving an aggregation request from the NSB for an arbitrary node, the AGE requests node features from the Feature Prefetcher and assigns the node to a subset of its Aggregation Cores (AGC) according to the required numerical precision. Subsequently, aggregation results are stored in the Aggregation Buffer.
- Prefetcher (PREF): at the start of layer computation, the Prefetcher receives a request from the NSB to fetch the weights required for updating the node feature embeddings from DRAM through its AXI interface. This operation takes place once per layer. Subsequently, the Prefetcher fetches neighbouring features and edge embeddings for each node in the graph. This is executed while the Aggregation and Transformation Engines are fully populated with work to hide the latency of memory access. Weights, features and edges are stored in local RAMs until required by subsequent steps in the pipeline.
- Feature Transformation Engine (FTE): whenever aggregation results are available in the Aggregation Buffer, the FTE computes the updated feature embeddings by performing matrix multiplication with the learned parameters stored in the Prefetcher. Results are subsequently written back out to DRAM or optionally written into the Transformation Buffer for further processing.
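The per-node dataflow implemented by the AGE and FTE can be sketched in software as aggregate-then-transform. This is a minimal illustration, not the RTL: the function names are invented here, and mean aggregation is just one example of a permutation-invariant function.

```python
def aggregate(neighbour_features):
    """Permutation-invariant mean over a node's incoming messages (one option)."""
    n = len(neighbour_features)
    dim = len(neighbour_features[0])
    return [sum(f[i] for f in neighbour_features) / n for i in range(dim)]

def transform(agg, weights):
    """Matrix-vector product with the layer weights held by the Prefetcher."""
    return [sum(w * a for w, a in zip(row, agg)) for row in weights]

neighbours = [[1.0, 2.0], [3.0, 4.0]]  # two incoming messages
weights = [[1.0, 0.0], [0.0, 1.0]]     # identity weights, for illustration only
updated = transform(aggregate(neighbours), weights)
```

The order of the two steps mirrors the pipeline: aggregation results land in the Aggregation Buffer before the FTE consumes them.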
The AXI-L interconnect links the host to several register banks within the design for configuration purposes. An AXI interconnect links the DRAM Controller to the Prefetcher for access to external memory. See Section [section:milestone1] for further discussion on AXI interconnects and other Xilinx IPs.
The Node Scoreboard (NSB) is the primary interface between the Host and AGILE. To perform inference on a graph, the Host must issue instructions into the accelerator via a sequence of register writes to the register bank in the NSB. The NSB decodes these instructions and drives functional units within the Microarchitecture to fetch feature data from memory and perform node aggregation and transformation before storing the results back into DRAM.
At the start of operation, the Node Scoreboard must be configured by the Host with a set of graph and layer parameters such as the total number of nodes and edges. Crucially, the input and output feature counts for each layer are required to determine the address range for each node in the memory storage elements (Prefetcher and Aggregation/Transformation/Output Buffers). Once the graph/layer configuration is marked as valid, the Host is free to start programming Nodeslots into the scoreboard.
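The address-range derivation mentioned above can be illustrated with a small sketch. The helper name, the 4-byte feature width, and the linear layout are assumptions for illustration, not the real register map.

```python
BYTES_PER_FEATURE = 4  # assumed width, e.g. single-precision float

def node_feature_address(base_addr, node_idx, feature_count):
    """With a known per-layer feature count, each node's features occupy a
    fixed-size range, so its address follows from the node index alone."""
    return base_addr + node_idx * feature_count * BYTES_PER_FEATURE

# Node 3 in a layer with 64 input features, from an assumed base address.
addr = node_feature_address(0x1000_0000, node_idx=3, feature_count=64)
```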
The NSB keeps a state machine for each node being processed by the accelerator in its scoreboard, which is a register bank containing up to 64 Node Slots (indexed by a 6-bit Slot ID). Each Node Slot (NS) contains all the information required by AGILE to compute the node aggregation and transformation. Based on these payloads, the Node Scoreboard drives the internal functional units to perform the computation and updates its internal state machine after each functional step. As such, no further intervention is required from the Host after the Node Slot programming, which reduces the overhead of workload distribution on the Host. When the computation of a node is complete, the NSB sends an interrupt to the Host to handle any required housekeeping.
Figure 2: Example state of the Node Scoreboard at runtime. NS0 and NS1 use floating-point activation precision due to their higher neighbour count. NS3 and NS4 are in the EMPTY state, but NS3 has already been programmed with the address for incoming messages. NS62 has received all its programming, but is stalled by backpressure from the Prefetcher. Nodeslots 5 → 61 follow a similar structure.
The fields shown in Figure 2 represent a subset of the total number of fields in the NSB scoreboard. Although data transfers between functional units in the accelerator take place in reference to each node’s Slot ID, the NSB also keeps a copy of the Node ID, which is a 20-bit field used to differentiate each individual node in the graph. As such, AGILE is capable of supporting graphs up to 1M nodes (although this number is parametrizable during configuration).
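The ID widths quoted above fix the scoreboard and graph size limits, which can be sanity-checked directly:

```python
# 6-bit Slot IDs index the Nodeslots; 20-bit Node IDs distinguish graph nodes.
SLOT_ID_WIDTH = 6
NODE_ID_WIDTH = 20

max_slots = 2 ** SLOT_ID_WIDTH   # 64 Node Slots
max_nodes = 2 ** NODE_ID_WIDTH   # 1,048,576 nodes, the ~1M upper bound
```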
AGILE supports defining custom arithmetic types for each node in the graph through the PRECISION field in the Node Scoreboard. The benefits of lower-precision numerical formats during neural network inference have been shown in works such as that of Nagel et al. The motivation for mixed-precision nodes in GNN inference lies in the observation by Tailor et al. that the accuracy cost of quantization is predominantly due to the aggregation phase and is directly correlated with a node’s degree. As such, quantization strategies where high-precision numerical types are used for “crowded” nodes and low-precision types are used for more isolated nodes lead to decreased inference latency at a significantly reduced cost to model accuracy.
It should be noted that the order in which nodes are programmed into the Nodeslots does not imply any priority or time correlation. Nodes may have vastly different execution times depending on neighbour count, numerical precision, etc. Whenever a Nodeslot finishes its computation, it can be immediately reprogrammed by the Host with the next node. Within the NSB, Round-Robin arbitration is performed in each clock cycle to determine which node gets access to resources within the Aggregation and Transformation Engines.
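The Round-Robin arbitration described above can be sketched as follows: starting from the slot after the last grant, pick the first slot with a pending request. This mirrors the fairness property, but is a software model, not the actual RTL arbiter.

```python
def round_robin_grant(requests, last_grant):
    """requests: one bool per Nodeslot. Returns the granted index, or None
    if no slot is requesting this cycle."""
    n = len(requests)
    for offset in range(1, n + 1):
        idx = (last_grant + offset) % n  # search starts after the last grant
        if requests[idx]:
            return idx
    return None

# Slots 1 and 3 both request; having just granted slot 1, slot 3 wins next.
grant = round_robin_grant([False, True, False, True], last_grant=1)
```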
Node Slot state machine. Each box represents a state, with the transition conditions shown alongside the arrows.
- EMPTY: in this state, the Nodeslot has no active workload on the accelerator and the Host is free to program payloads into the NSB. A mask of all empty slots is visible from the Host so that it can decide what slot to use from a single register poll.
- PROG_DONE: the Nodeslot transitions into this state when a write is received to the corresponding slot in the auto-clearing MAKE_VALID register. In this state, the Node Scoreboard waits until the Prefetcher can receive a fetch request. There may be backpressure from the Prefetcher in configurations where the number of Fetch Tags is lower than the number of Nodeslots.
- FETCH_NB_LIST: while transitioning to this state, the NSB issues a request to the Prefetcher to fetch the list of incoming message addresses for the given node from the start address defined in the Nodeslot programming. The Nodeslot is held in this state until the NSB receives a done response from the Prefetcher, signalling that either 1. (OK) all incoming addresses are stored in the associated Fetch Tags or 2. (PARTIAL) the Fetch Tag FIFOs are full, and the Prefetcher will continue to fetch addresses as these are consumed by the NEIGHBOUR_FETCH stage.
- FETCH_NEIGHBOURS: once the done response is received from the Prefetcher, the Nodeslot transitions to this state and the NSB issues another request to the Prefetcher, now to fetch the incoming messages at the addresses already stored in the Fetch Tag. The Nodeslot is held in this state until the NSB receives another done response from the Prefetcher, signalling that either 1. (OK) all incoming messages and edges have been stored in the associated Fetch Tags or 2. (PARTIAL) the Fetch Tag FIFOs are full, and the Prefetcher will continue to fetch incoming messages and edges once the data is consumed by the Aggregation Engine. If the last incoming message is fetched after the initial done response, the Prefetcher issues an additional status update to the Node Scoreboard so the ALL_MESSAGES_FETCHED mask can be updated in the Node Scoreboard for Debug purposes.
- AGGREGATION: when transitioning into this state, the Nodeslot issues a request to the Aggregation Engine to begin performing the aggregation function over all incoming messages of the node stored in the Feature Prefetcher. The Nodeslot is held in this state until a done response is received from the Aggregation Engine, signalling that the aggregated feature embedding has been stored in the Aggregation Buffer.
- TRANSFORMATION: with the aggregated feature embedding stored in the Aggregation Buffer, the Nodeslot transitions to this state when the transformation request is accepted by the Transformation Engine. The request is issued by the NSB for a combined group of Nodeslots when the number of available aggregations in the buffer matches the transformation wait count parameter. This state is held until a done response is received, signalling that the updated feature embedding has been stored in the Transformation Buffer.
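The Nodeslot lifecycle above can be sketched as a simple linear state machine. Transition triggers (PARTIAL responses, backpressure, the done/ACK handshakes) are deliberately omitted; only the state order is taken from the text.

```python
from enum import Enum, auto

class NodeslotState(Enum):
    """States in Nodeslot lifecycle order, per the list above."""
    EMPTY = auto()
    PROG_DONE = auto()
    FETCH_NB_LIST = auto()
    FETCH_NEIGHBOURS = auto()
    AGGREGATION = auto()
    TRANSFORMATION = auto()

ORDER = list(NodeslotState)  # Enum preserves definition order

def next_state(state):
    """Advance one step, wrapping to EMPTY after TRANSFORMATION so the
    slot can be reprogrammed by the Host."""
    return ORDER[(ORDER.index(state) + 1) % len(ORDER)]
```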
In addition to controlling the Nodeslot state machines, the NSB drives the Prefetcher to fetch multi-precision weights at the start of each layer computation through a set of auto-clearing registers. This flow requires the Host to first set CTRL_WEIGHTS_FETCH_PRECISION to the required precision, then write to the CTRL_FETCH_WEIGHTS register, which is automatically cleared by the register bank logic. When this pulse is detected, the NSB triggers its internal state machine, asserting the CTRL_WEIGHTS_FETCH_DONE register when finished. The Host periodically reads this register until it finds its value set to 1, at which point it writes to CTRL_FETCH_DONE_ACK, which is also auto-clearing. Finally, the hardware de-asserts the done register when the ACK is received.
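The weights-fetch handshake can be exercised against a mock register bank. The register names follow the text; the `RegBank` class is purely illustrative, and the fetch is modelled as completing instantly so the polling loop terminates.

```python
class RegBank:
    """Toy model of the NSB register bank's weights-fetch handshake."""
    def __init__(self):
        self.regs = {"CTRL_WEIGHTS_FETCH_PRECISION": 0,
                     "CTRL_WEIGHTS_FETCH_DONE": 0}

    def write(self, name, value):
        if name == "CTRL_FETCH_WEIGHTS":       # auto-clearing trigger pulse
            self.regs["CTRL_WEIGHTS_FETCH_DONE"] = 1  # fetch modelled as instant
        elif name == "CTRL_FETCH_DONE_ACK":    # auto-clearing ACK
            self.regs["CTRL_WEIGHTS_FETCH_DONE"] = 0  # hardware de-asserts done
        else:
            self.regs[name] = value

    def read(self, name):
        return self.regs[name]

bank = RegBank()
bank.write("CTRL_WEIGHTS_FETCH_PRECISION", 1)    # 1. set precision first
bank.write("CTRL_FETCH_WEIGHTS", 1)              # 2. pulse the fetch trigger
while not bank.read("CTRL_WEIGHTS_FETCH_DONE"):  # 3. host polls for done
    pass
bank.write("CTRL_FETCH_DONE_ACK", 1)             # 4. ACK; done is de-asserted
```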
Algorithm [alg:layer_config] shows how the inference workload can be offloaded onto AGILE from the Host device. First, the Node Scoreboard is programmed with a number of global parameters. Layer configuration registers are programmed into the NSB before each layer computation. Each node is then programmed into the first available Nodeslot, according to a mask of free Nodeslots, which is updated asynchronously whenever a Nodeslot is programmed or freed.
```
Input: global parameters 𝒫, layers ℒ, graph 𝒢

nsb_regbank.global_parameters ← 𝒫
for each layer ∈ ℒ:
    nsb_regbank.layer_config ← layer
    prefetch_layer_weights()
    for each node ∈ 𝒢:
        nodeslot_idx ← choose_nodeslot(free_nodeslots)
        nsb_regbank[nodeslot_idx] ← node_programming[node]
```
The requirement to handle large graphs ( > 1M nodes) demands a reformulation of how feature and adjacency data is stored and communicated within the accelerator. Previous works have made extensive use of on-chip memory. For example, GenGNN contains a CSR table populated with a list of neighbours for each node. However, even on high-end FPGAs such as the Ultrascale+, the on-chip memory capacity (shown in Table 1) is insufficient, making computation on large graphs infeasible.
| Resource | Blocks available | Storage/block | Total Capacity |
|---|---|---|---|
| BRAM | 2688 | 36 Kbits | 12.39 MB |
| URAM | 1280 | 288 Kbits | 47.19 MB |
| DRAM | 4 | 16 GB | 64 GB |
Table 1: Total on-chip and off-chip memory capacity on the Xilinx U250 board
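The on-chip capacity figures in Table 1 can be reproduced directly from the block counts (decimal megabytes, matching the table):

```python
def capacity_mb(blocks, kbits_per_block):
    """Total capacity in decimal MB from block count and Kbit per block."""
    return blocks * kbits_per_block * 1024 / 8 / 1e6

bram_mb = capacity_mb(2688, 36)   # 2688 BRAM blocks x 36 Kbit  -> ~12.39 MB
uram_mb = capacity_mb(1280, 288)  # 1280 URAM blocks x 288 Kbit -> ~47.19 MB
```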
It is worth comparing the resource capacity on the U250 board with the memory requirement for several large graphs, as shown in Table [table:large_graphs_size]. It can be seen that the adjacency list fits within off-chip DRAM for all the datasets, although only the Flickr graph can be feasibly stored on-chip. The input feature data can also be fully stored in off-chip memory for all the datasets except ogbn-papers, due to its high node count ( > 100M). It is worth noting that the reported figures show the memory requirement for only the input data to the GNN; however, the total memory requirement grows linearly with the number of GNN layers in the model, since intermediate activations must be stored.
In addition to on-chip RAM blocks and off-chip DRAM, AGILE can access Host RAM through a DMA Core that acts as an AXI/PCIe bridge. This extends the total available memory to any required capacity, at the cost of higher access latency.
AGILE requires access to the adjacency data of the graph to determine the memory address of incoming messages during the aggregation phase. The following approaches were considered for communicating graph data to the accelerator:
- Edge tensor: adjacency data is typically represented in datasets as a tensor of shape [2, EDGE_COUNT], which corresponds to a list of tuples where each linked node is encoded by a Node ID (the bit width for each node ID must be set to appropriately encode every node in the graph). In undirected graphs, the edge tensor is duplicated so that each node in each tuple is included as both a source and destination node. When sorted, this then corresponds to a list of all neighbours of each node, encoded by Node ID. As shown in Table [table:large_graphs_size], the full edge tensor for large graphs can be feasibly stored in off-chip memory.
- Adjacency matrix: graphs are often represented mathematically as an N × N binary matrix, with a 1 representing a link between the nodes associated with the corresponding row and column. Previous accelerators such as by Zhang et al. have used graph partitioning mechanisms to store sections of the adjacency matrix on-chip. However, the incurred memory cost grows in O(N²) with the number of nodes, N. For example, the memory requirement for the Reddit graph is 6.7GB compared to 487MB in the edge tensor format. Additionally, the majority of the data is zero due to the sparse structure of graphs.
- Neighbourhood bit-mask: each row of the adjacency matrix corresponds to a binary mask of all neighbouring nodes. This mask can be stored in the Node Scoreboard as part of the Nodeslot programming. Assuming support for up to 1M nodes, this incurs a requirement of 130kB of data per node, or 8.3MB for all 64 Nodeslots.
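A rough comparison of the dense representations above. The Reddit node count (≈233k) is an assumption taken from the public dataset, used here to reproduce the adjacency-matrix figure; the bit-mask figures follow from the 1M-node limit and 64 Nodeslots.

```python
N = 232_965  # assumed Reddit node count

# Binary N x N adjacency matrix: one bit per node pair.
adj_matrix_gb = N * N / 8 / 1e9           # ~6.8 GB, matching the ~6.7GB figure

# Neighbourhood bit-mask: one bit per possible neighbour, up to 1M nodes.
bitmask_per_node_kb = 2**20 / 8 / 1e3     # ~131 kB per node
bitmask_all_slots_mb = 64 * 2**20 / 8 / 1e6  # ~8.4 MB across 64 Nodeslots
```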
Storage of the adjacency matrix or per-node neighbourhood bit-mask was ruled out due to the high memory capacity requirements and poor bandwidth utilization. Having the Host program all neighbouring node IDs directly into the Node Scoreboard was also considered; however, it is not feasible to allocate enough on-chip RAM for the upper limit of 1M neighbours per node, which leads to a storage requirement of 167MB for all 64 Nodeslots. It would be possible to use a smaller storage element by allocating a lower neighbour limit, such as 1024, which leads to a memory requirement of 163kB. This would be enough to support most nodes, since the highest average node degree among the datasets in Table [table:large_graphs_size] is 488 (for the Reddit graph). However, the Nodeslot state machine would need to be updated so that, once incoming messages were aggregated for all programmed neighbours, the accelerator could request subsequent neighbours to be stored back into the Node Scoreboard. Since the previous neighbour IDs would be overwritten, the process would need to be repeated after feature transformation to compute the outgoing messages.
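The 167MB and 163kB figures above follow directly from the 20-bit Node ID width and the 64 Nodeslots:

```python
NODE_ID_BITS = 20  # Node ID width, as defined in the NSB
SLOTS = 64         # number of Nodeslots

# Upper limit of 1M (2^20) neighbour IDs per node, across all slots.
full_mb = SLOTS * 2**20 * NODE_ID_BITS / 8 / 1e6    # ~167 MB

# Capped limit of 1024 neighbour IDs per node, across all slots.
capped_kb = SLOTS * 1024 * NODE_ID_BITS / 8 / 1e3   # ~163 kB
```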
To limit the requirement for high-latency communication between the Host and accelerator through PCIe, the chosen approach was that AGILE will fetch the adjacency list for each node from DRAM. For this purpose, the Host will program the start address for each Nodeslot’s list of neighbouring node IDs as part of the Nodeslot programming.
An alternative approach was considered in which the accelerator does not require access to the graph adjacency data. By requiring the Host to program all incoming messages for each node in a contiguous range in off-chip memory, the Host can simply program each Nodeslot with an INCOMING_MESSAGES_START_ADDRESS field and the node degree (i.e. neighbour count). This leads to improved memory bandwidth, since all incoming messages can be fetched in a single AXI transaction with the appropriate burst length. However, the memory requirement grows with the average node degree, since the incoming messages for a given node must be stored once for each of its neighbours. The feature memory requirement for the AmazonProducts dataset grows from 1.2GB to 202GB for a single layer, assuming floating-point precision. It would not be feasible to perform this memory arrangement as a pre-processing step prior to inference on the graph, since a single layer is likely to saturate the memory capacity of Host RAM. To support this approach, the Host would be required to perform the memory manipulation in real time, prior to programming a node into one of the Nodeslots in the NSB. This would introduce a large increase in power consumption, as well as a major performance bottleneck if the memory arrangement cannot be pipelined while every Nodeslot is busy.
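The blow-up factor from this duplication equals the graph's average in-degree, which can be checked against the figures quoted above. The helper function and 4-byte feature width are illustrative assumptions.

```python
def duplicated_feature_bytes(edge_count, feat_dim, bytes_per_feat=4):
    """With per-neighbour duplication, each directed edge stores a full copy
    of the source node's feature vector."""
    return edge_count * feat_dim * bytes_per_feat

# AmazonProducts, single layer: 1.2 GB of unique features vs 202 GB duplicated,
# implying an average in-degree of roughly 168.
blow_up = 202 / 1.2
```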