225 changes: 225 additions & 0 deletions .github/copilot-instructions.md
@@ -0,0 +1,225 @@
# Copilot Instructions for FusedKernelLibrary (FKL)

## Project Overview

**FusedKernelLibrary (FKL)** is a header-only C++17 library that enables automatic GPU kernel fusion (Vertical, Horizontal, Backwards Vertical, and Divergent Horizontal Fusion) for CUDA and CPU backends. The library lives under `include/fused_kernel/`. All public types (except for vector types) are in the `fk` namespace.

- The primary entry point header is `include/fused_kernel/fused_kernel.h`.
- The main user-facing function is `fk::executeOperations<DPPType>(stream, iop1, iop2, ...)`.
- Current version: `0.1.14-LTS` (C++17 API freeze branch).

## Repository Structure

```
FusedKernelLibrary/
├── .github/workflows/ # CI workflows (Linux x86_64, Linux ARM64, Windows x64)
├── cmake/ # All CMake helper modules
│ ├── cmake_init.cmake # Global settings, output dirs, config types
│ ├── cuda_init.cmake # CUDA language enable + arch detection
│ ├── libs/cuda/ # CUDA-specific helpers (archs, deploy, debug, target generation)
│ ├── tests/ # Test discovery and generation (discover_tests.cmake, add_generated_test.cmake)
│ └── generators/ # Code generators (version_header.cmake, export_header.cmake)
├── include/fused_kernel/ # All library headers (header-only)
│ ├── fused_kernel.h # Top-level include + executeOperations free functions
│ ├── core/
│ │ ├── execution_model/ # Operation types, instantiable ops, DPPs, executors, stream
│ │ ├── data/ # Data types: Ptr2D, Tensor, Size, Rect, Point, Tuple, Array, etc.
│ │ ├── utils/ # Compiler macros, template utils, type lists, vector utils
│ │ ├── constexpr_libs/ # Constexpr math (constexpr_cmath.h)
│ │ └── core.h # Include everything in core folder
│ └── algorithms/
│ ├── basic_ops/ # Arithmetic, cast, logical, memory ops, set, static loop, vector ops
│ ├── image_processing/ # Crop, Resize, ColorConversion, BorderReader, Interpolation, Warp, etc.
│ └── algorithms.h # Include everything in Algorithms
├── lib/ # CMake INTERFACE library target (FKL::FKL) and install config
├── tests/ # Integration tests (discovered from .h files by CMake)
├── utests/ # Unit tests (discovered from .h files by CMake)
├── benchmarks/ # Performance benchmarks (off by default)
├── CMakeLists.txt # Root CMake, version 0.1.14, requires CMake 3.24+
└── .clang-format # LLVM-based style, 4-space indent, 120 column limit
```

## Build System

### ⚠️ Copilot Constraints
- **Build Directory:** Always output compiled binaries, artifacts, or generated files to a `build` directory located strictly **outside** the current source folder (e.g., `../build`). Never create the build folder within the project repository.
- **Source Directory:** Never add files to the source folder that should not be part of the repository. Always create generated folders outside the source tree.
- **Git Ignore:** As a consequence of the previous two rules, under no circumstances should you modify, append to, or suggest changes to the `.gitignore` file.

### Requirements
- **CMake ≥ 3.24** (CI uses cmake 4.2.1 custom install)
- **C++17** standard required (enforced via `CXX_STANDARD 17 CXX_STANDARD_REQUIRED YES CXX_EXTENSIONS NO`)
- **CUDA 12.x or 13.x**
- **Host compilers**: `g++-13`, `g++-11` (ARM64), `clang++-21`, `cl` (MSVC 14.44, MSVC 14.50), `clang-cl`
- Only **nvcc** is supported as the CUDA compiler
- **Ninja** generator is used in CI; Visual Studio generator also works on Windows

### CMake Options
| Option | Default | Description |
|---|---|---|
| `ENABLE_CPU` | ON | Enable tests on CPU backend |
| `ENABLE_CUDA` | ON (if nvcc found) | Enable tests on CUDA backend |
| `BUILD_TEST` | ON | Build integration tests under `tests/` |
| `BUILD_UTEST` | ON | Build unit tests under `utests/` |
| `ENABLE_BENCHMARK` | OFF | Build benchmarks under `benchmarks/` |
| `CUDA_ARCH` | `native` | Target CUDA architectures (e.g., `native`, `all`, `89`, `86;89`) |

### Build Commands (Linux)
```bash
# Set up compilers

export PATH=/home/cudeiro/cmake-4.2.1-linux-aarch64/bin/:$PATH
export CUDACXX=/usr/local/cuda-12.9/bin/nvcc # can be 13.0 or 13.2, but only on x86_64 Linux
export CC=g++-11  # e.g. "g++-13", "clang++-21" on x86_64; "g++-11", "clang++-21" on arm64
export CXX=g++-11 # e.g. "g++-13", "clang++-21" on x86_64; "g++-11", "clang++-21" on arm64
# Configure
cmake -G "Ninja" -B build -DCMAKE_BUILD_TYPE=Release -S .

# Build
cmake --build build --config Release

# Test
cd build && ctest --build-config Release --output-junit test_results.xml
```

### Build Commands (Windows, in VS Developer Shell with Ninja)
```powershell
# Set compilers via env vars (as CI does)
# Note: CUDA Toolkit v12.9 can also be 13.0 or 13.2, but only 13.2 supports the 14.50 developer tools (MSVC 2026)
$env:CUDACXX = "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.9\bin\nvcc.exe"
$env:CC = "cl" # or "clang-cl"
$env:CXX = "cl"

cmake -G "Ninja" -B build -DCMAKE_BUILD_TYPE=Release -S .
cmake --build build --config Release
```

### Known Windows Workaround
On Windows with Ninja, the generated `CMakeFiles/rules.ninja` may have an empty path for nvcc. The CI workaround patches it:
```powershell
(Get-Content build\CMakeFiles\rules.ninja) -replace "\\nvcc\\bin\\nvcc.exe", $env:CUDACXX | Set-Content build\CMakeFiles\rules.ninja
```

## CI Workflows

Workflows trigger on **pull requests** targeting branches matching `LTS-C*`. All runners are **self-hosted**.

| Workflow | Runner | Compilers | CUDA versions |
|---|---|---|---|
| `cmake-linux-amd64.yml` | `linux, x64` | `g++-13`, `clang++-21` | 12.9, 13.0, 13.2 |
| `cmake-linux-arm64.yml` | `linux, arm64` | `g++-11`, `clang++-21` | 12.9 |
| `cmake-windows-amd64.yml` | `windows, x64` | `cl`, `clang-cl` (LLVM 21.1.0) | 12.9, 13.0, 13.2 |

Compilers are set via `CC`, `CXX`, `CUDACXX` environment variables in the "Set reusable strings" step — not as CMake `-D` flags.

## Test Infrastructure

### How Tests Are Discovered
CMake auto-discovers tests from `.h` files in `tests/` and `utests/` subdirectories using `discover_tests()` in `cmake/tests/discover_tests.cmake`. For each `.h` file:
- A `.cpp` target is generated (CPU backend) unless the file contains `ONLY_CU`
- A `.cu` target is generated (CUDA backend) unless the file contains `ONLY_CPU`
- Files matching `*_common*` are excluded from auto-discovery

A `configure_file()` step generates a launcher from `tests/launcher.in` that includes the test header and calls `launch()`.

### Test Conventions
- Each test `.h` file must define a function `int launch()` that returns 0 on success
- Tests that are CPU-only contain the string `ONLY_CPU` (as a marker, not necessarily as a macro)
- Tests that are CUDA-only contain the string `ONLY_CU`
- Tests link against `FKL::FKL` (the header-only interface library)

### Adding a New Test
1. Create a `.h` file in an appropriate subdirectory of `tests/` or `utests/`
2. Include the necessary FKL headers
3. Define `int launch() { ... return 0; }`
4. Add `ONLY_CPU` or `ONLY_CU` in a comment if needed to restrict to one backend
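The steps above can be sketched as a minimal test header (hypothetical file name and contents; the real FKL includes and assertions are omitted so the sketch stays self-contained):

```cpp
// Hypothetical minimal test header, e.g. utests/my_feature/test_example.h.
// A real test would include the relevant fused_kernel headers and exercise IOps.

// ONLY_CPU  <- optional marker comment to restrict this test to the CPU backend

int launch() {
    // Replace this placeholder check with real FKL assertions.
    const int expected = 4;
    const int actual = 2 + 2;
    return (expected == actual) ? 0 : -1;
}
```

CMake's `discover_tests()` picks the file up automatically; the generated launcher includes it and calls `launch()`.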

## Core Concepts

### Operation Types
Operations are classified by their `InstanceType` member (defined in `operation_types.h`):

| Type | exec signature | Description |
|---|---|---|
| `ReadType` | `OutputType exec(Point, ParamsType)` | Reads from memory |
| `WriteType` | `void exec(Point, InputType, ParamsType)` | Writes to memory |
| `UnaryType` | `OutputType exec(InputType)` | Pure computation, no params |
| `BinaryType` | `OutputType exec(InputType, ParamsType)` | Computation with params |
| `ReadBackType` | `OutputType exec(Point, ParamsType, BackIOp)` | Read with backward-fused op |
| `IncompleteReadBackType` | *(none)* | A ReadBackType with no information about the BackIOp type and no exec function, but which can still store params |
| `TernaryType` | `OutputType exec(InputType, ParamsType, BackIOp)` | Compute with params and backward op |
| `MidWriteType` | `InputType exec(Point, InputType, ParamsType)` | Writes and passes input through |
| `OpenType` | `OutputType exec(Point, InputType, ParamsType)` | Receives the input in registers via the InputType parameter and returns the result in registers as OutputType; may contain a MidWrite operation internally |
| `ClosedType` | `void exec(Point, ParamsType)` | Reads from memory and writes the result back to memory for the coordinate passed in Point, effectively performing a transform at each coordinate |
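The `UnaryType`/`BinaryType` shapes in the table can be illustrated with plain C++17 stand-ins (the tag types and operations below are assumptions for illustration, not the real FKL definitions):

```cpp
#include <type_traits>

// Tag types standing in for FKL's InstanceType tags (illustrative only).
struct UnaryType {};
struct BinaryType {};

// A UnaryType-style operation: pure computation, static exec, no params.
template <typename T>
struct AddOne {
    using InstanceType = UnaryType;
    using InputType = T;
    using OutputType = T;
    static constexpr OutputType exec(const InputType& input) {
        return input + T{1};
    }
};

// A BinaryType-style operation: computation with runtime params.
template <typename T>
struct MulBy {
    using InstanceType = BinaryType;
    using InputType = T;
    using ParamsType = T;
    using OutputType = T;
    static constexpr OutputType exec(const InputType& input, const ParamsType& p) {
        return input * p;
    }
};

static_assert(AddOne<int>::exec(4) == 5);
static_assert(MulBy<int>::exec(3, 2) == 6);
```

The `InstanceType` member is what the library inspects to classify an operation and pick the right wrapper.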

### Instantiable Operations (IOps)
Operations are wrapped in `InstantiableOperation` structs that hold runtime parameters. Aliases:
- `fk::Read<Op>`, `fk::Write<Op>`, `fk::Unary<Op>`, `fk::Binary<Op>`, `fk::Ternary<Op>`, `fk::ReadBack<Op>`, `fk::MidWrite<Op>`, `fk::Open<Op>`, `fk::Closed<Op>`
- Use `fk::Instantiable<Op>` to automatically select the right wrapper based on `Op::InstanceType`

Instantiable Operations (IOps) are constructed via a static `build(...)` method on each Operation that returns the wrapped IOp.

### Data Parallel Patterns (DPPs)
DPPs determine how threads are organized. The main one is `TransformDPP<THREAD_FUSION>` (where `THREAD_FUSION` defaults to `false`). Pass the DPP as the first template argument to `executeOperations`.

### Key Data Types
- `fk::Ptr1D<T>` / `fk::Ptr2D<T>` / `fk::Ptr3D<T>` — 1D/2D/3D pitched GPU pointers
- `fk::Tensor<T>` — contiguous multi-plane GPU array
- `fk::Size` — width/height size
- `fk::Rect` — x, y, width, height rectangle
- `fk::Point` — thread index (x, y, z)
- `fk::Tuple<Ts...>` — GPU-safe tuple (use instead of `std::tuple` in device code)
- `fk::Stream` / `fk::Stream_<ParArch::GPU_NVIDIA>` — CUDA stream wrapper
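A minimal recursive tuple in the spirit of `fk::Tuple` can be sketched as follows (this is an assumption-level illustration, not the real implementation; the aggregate layout and lack of `std::tuple` machinery are what make such a type straightforward to use in device code):

```cpp
#include <cstddef>

// Illustrative GPU-friendly tuple: plain aggregate, recursive storage.
template <typename... Ts> struct Tuple;
template <> struct Tuple<> {};
template <typename T, typename... Ts>
struct Tuple<T, Ts...> {
    T head;
    Tuple<Ts...> tail;
};

// Compile-time indexed access via if constexpr (C++17).
template <std::size_t I, typename T, typename... Ts>
constexpr auto tuple_get(const Tuple<T, Ts...>& t) {
    if constexpr (I == 0) return t.head;
    else return tuple_get<I - 1>(t.tail);
}
```

Usage: `constexpr Tuple<int, float> t{1, {2.5f}};` then `tuple_get<0>(t)` yields the first element.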

### Fusion API (`.then()` and `operator&`)
IOps support chaining:
```cpp
auto fusedIOp = readIOp.then(unaryIOp1).then(unaryIOp2).then(writeIOp);
// equivalent to
auto fusedIOp = readIOp & unaryIOp1 & unaryIOp2 & writeIOp;
```
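The chaining idea can be sketched in plain C++17 as compile-time composition (hypothetical stand-in types, not the real `fk` IOps): each `then`/`operator&` builds one callable, so the whole pipeline can run in registers inside a single kernel.

```cpp
// Illustrative vertical-fusion sketch: composing two callables into one.
template <typename F, typename G>
struct FusedOp {
    F first;
    G second;
    template <typename T>
    constexpr auto operator()(const T& x) const { return second(first(x)); }
    // Chain another stage onto the already-fused pipeline.
    template <typename H>
    constexpr auto then(H next) const { return FusedOp<FusedOp, H>{*this, next}; }
};

// operator& as an alternative spelling for the same composition.
template <typename F, typename G>
constexpr FusedOp<F, G> operator&(F f, G g) { return {f, g}; }
```

For example, `((f & g).then(h))(x)` evaluates `h(g(f(x)))` without any intermediate memory traffic.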

### Compiler Macros (`compiler_macros.h`)
- `_MSC_VER_EXISTS` — 1 when compiling with MSVC
- `FK_HOST_DEVICE_CNST`, `FK_HOST_FUSE`, `FK_DEVICE_FUSE`, etc.
  - `CNST` means `__forceinline__ constexpr` with nvcc, `inline constexpr` with CPU compilers.
  - `FUSE` means `__forceinline__ static constexpr` with nvcc, `inline static constexpr` with CPU compilers.
  - `HOST` means `__host__` with nvcc, nothing with CPU compilers.
  - `DEVICE` means `__device__` with nvcc, nothing with CPU compilers.
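A sketch of how such dual-backend macros are commonly defined (the exact FKL definitions live in `compiler_macros.h`; the expansions below are assumptions, shown here with the CPU fallback path active):

```cpp
// Illustrative dual-backend macro definitions (guessed, not copied from FKL).
#if defined(__CUDACC__)
#define FK_HOST_DEVICE_CNST __host__ __device__ __forceinline__ constexpr
#define FK_HOST_DEVICE_FUSE __host__ __device__ __forceinline__ static constexpr
#else
#define FK_HOST_DEVICE_CNST inline constexpr
#define FK_HOST_DEVICE_FUSE inline static constexpr
#endif

// Example operation using the FUSE spelling for a static exec member.
struct Square {
    FK_HOST_DEVICE_FUSE int exec(int x) { return x * x; }
};
```

The same source then compiles under nvcc (with device annotations) and under a plain host compiler (without them).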

## Code Style

- **Formatting**: LLVM-based, 4-space indent, 120-column limit (`.clang-format` in repo root)
- **C++ Standard**: C++17 strictly (no extensions)
- **Copyright header**: Every file begins with an Apache 2.0 license header
- **Include guards**: `#ifndef FK_XXX_H` / `#define FK_XXX_H` (not `#pragma once`)
- **Namespace**: All public API is in namespace `fk` except for vector types
- **Templates**: Heavy use of SFINAE (`std::enable_if_t`), type traits, and variadic templates
- **No exceptions in device code**: Only host code uses `std::runtime_error`
- **Pointer alignment**: Right (i.e., `T* ptr`, not `T *ptr`)
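The SFINAE style mentioned above can be illustrated with stand-ins that mirror the library's wrapper-selection pattern (names mimic FKL's `opIs`/`Instantiable`; the definitions here are assumptions for illustration):

```cpp
#include <type_traits>

// Illustrative tag types and wrapper templates.
struct UnaryType {};
struct WriteType {};
template <typename Op> struct Unary {};
template <typename Op> struct Write {};

// True when Op declares the given InstanceType tag.
template <typename IT, typename Op>
constexpr bool opIs = std::is_same_v<typename Op::InstanceType, IT>;

// enable_if-based specialization selects the wrapper from Op::InstanceType.
template <typename Op, typename = void> struct InstantiableOperationType;
template <typename Op>
struct InstantiableOperationType<Op, std::enable_if_t<opIs<UnaryType, Op>>> {
    using type = Unary<Op>;
};
template <typename Op>
struct InstantiableOperationType<Op, std::enable_if_t<opIs<WriteType, Op>>> {
    using type = Write<Op>;
};
template <typename Op>
using Instantiable = typename InstantiableOperationType<Op>::type;

struct MyUnaryOp { using InstanceType = UnaryType; };
static_assert(std::is_same_v<Instantiable<MyUnaryOp>, Unary<MyUnaryOp>>);
```

This is the same mechanism the diff below this file extends with `OpenType` and `ClosedType` specializations.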

## CUDA Architecture Notes

- Minimum supported compute capability: **7.0** (sm_70, Volta)
- `CUDA_ARCH=native` (default) auto-detects via `nvidia-smi` for CUDA < 13
- For CUDA 12: curand DLL is `curand64_11`, cufft DLL is `cufft64_11`
- For CUDA 13: curand DLL is `curand64_10`, cufft DLL is `cufft64_12`; DLLs are in `x64/` subdirectory

## Common Errors and Workarounds

1. **Windows/Ninja: empty nvcc path in `rules.ninja`** — Apply the `rules.ninja` patch in CI (`cmake-windows-amd64.yml` step "Configure CMake").
2. **CUDA < 13 + `CUDA_ARCH=all`** — The build system automatically filters out GPU architectures below sm_70.
3. **Template depth** — `TEMPLATE_DEPTH` is set to 1000 via `cmake_init.cmake` for deeply nested fusion expressions.
4. **`/bigobj` on MSVC** — Required due to large generated test binaries; added automatically in `add_generated_test.cmake`.
5. **`/Zc:preprocessor` on MSVC** — Required to avoid traditional preprocessor warnings; added in `add_generated_test.cmake`.

## How to Add a New Operation

1. Create a struct in `include/fused_kernel/algorithms/` with:
- `private: using SelfType = StructName<TemplateTypes...>;`
- `using Parent = /*Parent operation according to the InstanceType of the operation*/;`
- `public: FK_STATIC_STRUCT(StructName, SelfType)`
- `DECLARE_/*use macro according to InstanceType*/_PARENT`
- A `FK_HOST_DEVICE_FUSE` `exec(...)` function matching the InstanceType signature
2. If the operation needs a `build()` factory, wrap it in an `Instantiable<YourOp>` specialization or provide a custom `build()` static method
3. Add a test `.h` in `utests/` with `int launch()` to exercise it
8 changes: 2 additions & 6 deletions CMakeLists.txt
@@ -18,13 +18,9 @@ HOMEPAGE_URL "https://github.com/morousg/FusedKernelLibrary" )

#cuda is optional, but if it is found, it will be used
option(ENABLE_CPU "Enable CPU support" ON)

#we don't need the CPU backend with Visual Studio 2017
if (MSVC_VERSION LESS 1920)
set(ENABLE_CPU OFF CACHE BOOL "Disable CPU support with MSVC < 2019" FORCE)
message(AUTHOR_WARNING "MSVC Compiler is older than Visual Studio 2019. Disabling CPU backend.")
if (MSVC AND MSVC_VERSION LESS 1930)
message(FATAL_ERROR "Visual Studio 2019 and earlier (MSVC_VERSION < 1930) are not supported. Please use Visual Studio 2022 or later.")
endif()

include(CheckLanguage)
check_language(CUDA)
if (CMAKE_CUDA_COMPILER)
@@ -363,6 +363,16 @@ FK_HOST_CNST auto then(const ContinuationIOp& cIOp, const ContinuationIOps&... c
using type = Write<Operation>;
};

template <typename Operation>
struct InstantiableOperationType<Operation, std::enable_if_t<opIs<OpenType, Operation>>> {
using type = Open<Operation>;
};

template <typename Operation>
struct InstantiableOperationType<Operation, std::enable_if_t<opIs<ClosedType, Operation>>> {
using type = Closed<Operation>;
};

template <typename Operation>
using Instantiable = typename InstantiableOperationType<Operation>::type;
} // namespace fk