This document describes how to build and use PoCL's MLIR-based HLS compilation flow, which automatically synthesizes FPGA accelerators from standard OpenCL C programs. The toolchain compiles OpenCL kernels through MLIR intermediate representations and generates FPGA bitstreams using vendor back-end tools (Vitis HLS / Vivado for AMD, Intel AOC for Altera).
For the background and evaluation results, see:
T. Leppänen, L. Leppänen, Z. Jamil, J. Solanti, J. Multanen, and P. Jääskeläinen, "Composable Open-Source Toolchain for Synthesizing Hardware Accelerators from OpenCL Command Buffers," ACM Trans. Reconfig. Technol. Syst., 2026. https://doi.org/10.1145/3786204
The compilation flow is:
```
OpenCL C --> Polygeist/ClangIR --> MLIR (upstream dialects)
         --> PoCL middle-end passes (workgroup generation, barrier elimination)
         --> ScaleHLS / Hida optimization passes
         --> scalehls-translate (emit HLS C++)
         --> Vitis HLS (C++ to RTL)
         --> Vivado (RTL wrapping as AlmaIF accelerator)
         --> v++ (bitstream generation)
```
The resulting bitstream is loaded at runtime by the AlmaIF device driver via XRT (AMD) or OPAE (Altera). Compilation is JIT: the FPGA bitstream is generated the first time a kernel (or command buffer) is enqueued, then cached.
| Component | Version | Purpose |
|---|---|---|
| ClangIR | github/ClangIR | Base MLIR infrastructure, and one of the front-ends |
| Polygeist (optional front-end) | github/cpc/Polygeist | OpenCL C to MLIR front-end |
| ScaleHLS | github/cpc/Hida | HLS optimization passes and C++ emission |
| OpenASIP | github/cpc/OpenASIP | Soft processor acting as a controller |
| ISL | 0.26-3build1.1 (tested version, from apt) | Used by affine passes ported from Enzyme-JAX |
| Vitis | 2022.1 | C++ to bitstream (xclbin) |
| XRT | 2023.2 | Runtime for AMD FPGAs |
| FPGA Platform shell | xilinx_u280_gen3x16_xdma_base_1 | Specific to Alveo U280 |
Tested with the following hardware:
- AMD: Alveo U280 (`xcu280-fsvh2892-2L-e`)
- Altera: BittWare IA-420f (Intel Agilex 7)
```
mkdir build && cd build
source /path/to/xrt/setup.sh
cmake .. \
  -DENABLE_CLANGIR=ON \
  -DENABLE_ALMAIF_DEVICE=ON \
  -DSCALEHLS_DIR=/path/to/scalehls/install \
  -DOPENASIP_LLVM_DIR=/path/to/openasip-llvm/install \
  -DWITH_LLVM_CONFIG=/path/to/llvm-config
make -j$(nproc)
```

| Variable | Description |
|---|---|
| `ENABLE_CLANGIR=ON` | Enables the MLIR compiler path (required) |
| `ENABLE_ALMAIF_DEVICE=ON` | Builds the AlmaIF accelerator device driver |
| `POLYGEIST_BINDIR` | Path to the directory containing the `cgeist` binary (optional) |
| `SCALEHLS_DIR` | Root of the ScaleHLS installation |
| `OPENASIP_LLVM_DIR` | LLVM installation root used by OpenASIP (it can differ from the main ClangIR LLVM) |
When an OpenCL program calls `clBuildProgram`, PoCL:

1. Front-end: ClangIR (or Polygeist) converts OpenCL C to MLIR using upstream dialects (scf, affine, arith, memref, func, gpu).

2. Middle-end: PoCL MLIR passes generate the workgroup function:
   - Links OpenCL built-in functions (implemented in MLIR)
   - Wraps the SPMD kernel in `affine.parallel` (local size bounds)
   - Eliminates barriers (Polygeist's barrier elimination pass)
   - Allocates local memory
   - Runs affine optimization passes (loop fusion, coalescing, LICM, CSE, mem2reg)

3. HLS back-end (at `clEnqueue` or `clFinalizeCommandBufferKHR`):
   - ScaleHLS optimization passes (dataflow, pipelining, array partitioning)
   - `scalehls-translate --scalehls-emit-hlscpp` emits Vitis-compatible C++
   - Vitis HLS synthesizes RTL from the C++
   - Vivado wraps the RTL in an AlmaIF accelerator block design with an OpenASIP command processor
   - `v++` generates the final `.xclbin` bitstream
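For orientation, here is a minimal host-side sketch of that flow. It is illustrative only: the kernel source, names, and work sizes are hypothetical placeholders, and error handling is omitted.

```c
/* Illustrative sketch only: kernel source, names and sizes are placeholders. */
#include <CL/cl.h>

static const char *src =
    "__kernel void vec_add(__global const float *a, __global const float *b,\n"
    "                      __global float *c) {\n"
    "  size_t i = get_global_id(0);\n"
    "  c[i] = a[i] + b[i];\n"
    "}\n";

void run(cl_context ctx, cl_device_id dev, cl_command_queue q,
         cl_mem a, cl_mem b, cl_mem c, size_t n) {
  cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
  /* Front-end + middle-end: OpenCL C -> MLIR -> workgroup function. */
  clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);

  cl_kernel k = clCreateKernel(prog, "vec_add", NULL);
  clSetKernelArg(k, 0, sizeof(cl_mem), &a);
  clSetKernelArg(k, 1, sizeof(cl_mem), &b);
  clSetKernelArg(k, 2, sizeof(cl_mem), &c);

  /* HLS back-end: the first enqueue triggers ScaleHLS, Vitis HLS, Vivado and
   * v++; the resulting bitstream is cached (see POCL_CACHE_DIR) for reuse. */
  size_t gws = n, lws = 64;
  clEnqueueNDRangeKernel(q, k, 1, NULL, &gws, &lws, 0, NULL, NULL);
  clFinish(q);
}
```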
When using `cl_khr_command_buffer`, at `clFinalizeCommandBufferKHR`:
- Each kernel in the command buffer is compiled to a workgroup function
- A command buffer function is generated that calls all kernels sequentially, with all arguments and launch parameters specialized as constants
- The fused function is compiled through the HLS back-end as a single accelerator
- The bitstream contains one combined accelerator for the entire command buffer
This enables cross-kernel optimizations: constant propagation of arguments, known loop bounds, and potential inter-kernel dataflow optimizations.
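As an illustration, the sketch below records two kernels into a command buffer using the standard `cl_khr_command_buffer` API; the finalize call is the point at which PoCL specializes and fuses them into a single accelerator. The kernel and queue names are hypothetical, kernel arguments are assumed to be set beforehand, error handling is omitted, and the exact extension prototypes depend on your OpenCL headers (`CL/cl_ext.h`).

```c
/* Illustrative sketch: assumes cl_khr_command_buffer prototypes from
 * CL/cl_ext.h; names are placeholders and error checks are omitted. */
#include <CL/cl.h>
#include <CL/cl_ext.h>

void record_and_run(cl_command_queue q, cl_kernel producer, cl_kernel consumer,
                    size_t n) {
  /* Record both kernels (arguments already set) into one command buffer. */
  cl_command_buffer_khr cb = clCreateCommandBufferKHR(1, &q, NULL, NULL);

  cl_sync_point_khr after_producer;
  clCommandNDRangeKernelKHR(cb, q, NULL, producer, 1, NULL, &n, NULL,
                            0, NULL, &after_producer, NULL);
  clCommandNDRangeKernelKHR(cb, q, NULL, consumer, 1, NULL, &n, NULL,
                            1, &after_producer, NULL, NULL);

  /* Finalization is where PoCL specializes arguments and launch parameters,
   * fuses the kernels into a command buffer function, and runs the HLS
   * back-end to produce one combined accelerator bitstream. */
  clFinalizeCommandBufferKHR(cb);

  /* Replaying the buffer launches the pre-built accelerator. */
  clEnqueueCommandBufferKHR(0, NULL, cb, 0, NULL, NULL);
  clFinish(q);
}
```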
Three setup scripts in tools/scripts/ configure the environment for different
execution modes. Source one before running a benchmark:
- Software emulation (`source tools/scripts/setup_emu.sh`): Runs on the host CPU (via the LLVM host target) without any FPGA hardware or Vitis simulation. Uses the AlmaIF emulation device (`POCL_ALMAIF0_PARAMETERS=0xE,...`). Useful for testing the compiler front- and middle-ends without synthesizing hardware.
- Hardware emulation (`source tools/scripts/setup_hw_emu.sh`): Runs in the Vitis `hw_emu` simulator. Simulates the actual RTL, but generating the xclbin can take up to 20 minutes, so use small dataset sizes.
- Real hardware (`source tools/scripts/setup_hw.sh`): Runs on physical FPGA hardware. Bitstream generation takes ~2 hours. To replicate the paper results, disable the small dataset size.
The PolybenchGPU benchmark suite is available with OpenCL command buffer support at: github/cpc/polybench
```
cd build
cmake .. -DENABLE_TESTSUITES=polybenchGPU   # add to your existing cmake args
make prepare_examples
```

This clones and builds the polybench suite as an external project.
For XRT-based execution, make sure you have sourced the XRT setup.sh and the Vivado settings.sh first.
Then, source one of the setup scripts above, and run the benchmarks:
```
cd build/examples/polybenchGPU/src/polybenchGPU-build

# Standard OpenCL version:
./OpenCL/GEMM/gemm

# Command buffer version:
./OpenCL-command-buffer/GEMM/gemm_cmd_buffer
```

For `hw_emu` mode, you should also set:

```
export XRT_INI_PATH=/path/to/pocl/lib/CL/devices/almaif/mlir/xrt.ini  # speeds up hw_emu
export EMCONFIG_PATH=/path/to/emconfig.json
```

To run the whole suite through CTest:

```
cd build
ctest -L polybenchGPU
```

| Variable | Description |
|---|---|
| `POCL_DEVICES` | Set to `almaif` to select the AlmaIF device driver |
| `POCL_ALMAIF0_PARAMETERS` | Device parameters: `0xE` for emulation or `0xA` for hw_emu/hw, the initial xclbin path (or `none`) when using `0xA`, and the kernel id (65535 represents HLS-generated kernels) |
| `POCL_ALMAIF_EXTERNALREGION` | External (DDR/HBM) memory region base address and size |
| `POCL_CACHE_DIR` | Directory for caching compiled kernels and bitstreams |
| `XCL_EMULATION_MODE` | Set to `hw_emu` for Vitis RTL simulation; unset for real FPGA |
PoCL caches intermediate compilation artifacts. Set POCL_CACHE_DIR=/where/you/want/to/cache.
The following intermediate files are generated:
- `parallel.mlir` -- workgroup function after middle-end passes
- `parallel_hls.mlir` -- after ScaleHLS HLS optimization passes
- `parallel_hls.cpp` -- emitted HLS C++ for Vitis HLS
- `parallel.xo` -- Vivado-packaged XO file
- `parallel.xclbin` -- final FPGA bitstream
- `firmware.img` -- OpenASIP command processor firmware
| Pass | Description |
|---|---|
| `pocl-workgroup` | Generates the workgroup function from an SPMD kernel |
| `pocl-distribute-barriers` | Barrier elimination using min-cut distribution (ported from Polygeist) |
| `pocl-mem2reg` | Memory-to-register promotion (ported from Polygeist) |
| `pocl-affine-cfg` | Raises scf/memref ops to affine equivalents (ported from Polygeist/Enzyme-JAX) |
| `pocl-detect-reduction` | Detects and marks reduction patterns for HLS (ported from intel/llvm/mlir) |
| `pocl-affine-parallel-to-for` | Converts `affine.parallel` to `affine.for` loops |
| `pocl-convert-memref-to-llvm-kernel-args` | Converts memref kernel arguments for the arg-buffer launcher (LLVM lowering) |
| `pocl-strip-mem-spaces` | Removes memory space attributes before LLVM lowering |
The system is structured around the AlmaIF accelerator interface, which provides a vendor-portable memory-mapped protocol for controlling accelerators:
The OpenASIP command processor reads work packets from the host, configures the
HLS-generated accelerator IP with kernel arguments and launch parameters, and
signals completion. This wrapper is generated by the Vivado TCL scripts
(generate_xo.tcl) and uses RTL generated by OpenASIP (generateprocessor).
ScaleHLS pass failures: The HLS pass pipeline has a fallback mechanism. If
the full pipeline (with affine raising and ScaleHLS dataflow passes) fails, it
automatically retries without affine raising, and then without the more fragile
ScaleHLS passes. Check POCL_DEBUG=almaif output for retry messages.
Affine verification errors: If you see "is not a valid symbol" errors from
the MLIR verifier, this is likely due to arith.index_cast results being used
in affine map positions. The pocl-raise-to-affine pass handles most cases,
but complex control flow may trigger this. See the fallback mechanism above.
PoCL is a conformant implementation (for CPU and Level Zero GPU targets) of the OpenCL 3.0 standard which can be easily adapted for new targets.
This section contains instructions for building PoCL in its default configuration and a subset of driver backends. You can find the full build instructions including a list of available options in the install guide.
In order to build PoCL, you need the following support libraries and tools:
- Latest released version of LLVM & Clang
- Development files for LLVM & Clang and their transitive dependencies (e.g. `libclang-dev`, `libclang-cpp-dev`, `libllvm-dev`, `zlib1g-dev`, `libtinfo-dev`, ...)
- CMake 3.15 or newer
- GNU make or ninja
- Optional: pkg-config
- Optional: hwloc v1.0 or newer (e.g. `libhwloc-dev`)
- Optional (but enabled by default): python3 (for support of LLVM bitcode with SPIR target)
- Optional: llvm-spirv (version-compatible with LLVM) and spirv-tools (required for SPIR-V support in CPU / CUDA; Vulkan driver supports SPIR-V through clspv)
For more details, consult the install guide.
Building PoCL follows the usual CMake build steps. Note, however, that PoCL can be used from the build directory (without installing it system-wide).
- 🔷 : Achieved status of OpenCL conformant implementation
- 🔶 : Tested in CI extensively, including OpenCL-CTS tests
- 🟢 : Tested in CI
- 🟡 : Should work, but is untested
- ❌ : Unsupported
| CPU device | LLVM 17 | LLVM 18 | LLVM 19 | LLVM 20 | LLVM 21 | LLVM 22 |
|---|---|---|---|---|---|---|
| x86-64 | 🟢 | 🟢 🔷 | 🟢 | 🔶 | 🔶 | 🟢 |
| ARM64 | 🟡 | 🟡 | 🟡 | 🟡 | 🟢 | 🟡 |
| i686 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 |
| ARM32 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 |
| RISC-V | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 |
| PowerPC | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 |
| GPU device | LLVM 17 | LLVM 18 | LLVM 19 | LLVM 20 | LLVM 21 |
|---|---|---|---|---|---|
| CUDA SM5.0 | 🟡 | 🟢 | 🟡 | 🟢 | ❌ |
| CUDA SM other than 5.0 | 🟡 | 🟡 | 🟡 | 🟡 | ❌ |
| Level Zero | 🟡 | 🟡 | 🟢 | 🟢 | 🔶 |
| Vulkan | 🟢 | ❌ | ❌ | ❌ | ❌ |
Note: CUDA with LLVM 21 is broken due to a bug in Clang (llvm/llvm-project#154772).
| Special device | LLVM 17 | LLVM 18 | LLVM 19 | LLVM 20 | LLVM 21 |
|---|---|---|---|---|---|
| OpenASIP | 🟢 | ❌ | ❌ | ❌ | ❌ |
| Remote | 🟢 | 🟢 | 🟢 | 🟢 | 🟡 |
| CPU device | LLVM 17 | LLVM 18 | LLVM 19 | LLVM 20 | LLVM 21 |
|---|---|---|---|---|---|
| Apple Silicon | 🟡 | 🟡 | 🟡 | 🟢 | 🟢 |
| Intel CPU | 🟡 | 🟡 | ❌ | ❌ | ❌ |
| CPU device | LLVM 18 | LLVM 19 | LLVM 20 | LLVM 21 |
|---|---|---|---|---|
| MinGW / x86-64 | 🟡 | 🟢 | 🟡 | 🟡 |
| MSVC / x86-64 | 🟡 | 🟢 | 🟢 | 🟡 |
PoCL with CPU device support can be found in many Linux distributions' package managers.
PoCL with CUDA driver support for Linux x86_64, aarch64 and ppc64le can be found on the conda-forge distribution and can be installed with:

```
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh"
bash Mambaforge-$(uname)-$(uname -m).sh   # install mambaforge
```

To install PoCL with the CUDA driver:

```
mamba install pocl-cuda
```

To install all drivers:

```
mamba install pocl
```
PoCL with CPU driver support for Intel and Apple Silicon chips can be found on Homebrew and can be installed with:

```
brew install pocl
```

Note that this installs an ICD loader from the Khronos Group, and the built-in OpenCL implementation will be invisible when your application is linked to this loader.
PoCL with CPU driver support for Intel and Apple Silicon chips can be found on the conda-forge distribution and can be installed with:

```
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh"
bash Mambaforge-$(uname)-$(uname -m).sh
```

To install the CPU driver:

```
mamba install pocl
```

Note that this installs an ICD loader from the Khronos Group, and the built-in OpenCL implementation will be invisible when your application is linked to this loader. To make both PoCL and the built-in OpenCL implementation visible, do:

```
mamba install pocl ocl_icd_wrapper_apple
```
PoCL is distributed under the terms of the MIT license. Contributions are expected to be made under the same terms.