Note: Mark vague entries that lack a measurable target, interface specification, or test strategy with <!-- TODO: add measurable target, interface spec, test strategy -->.
The Acceleration module (src/acceleration/) provides hardware-accelerated compute backends for vector similarity search, graph traversal, geospatial computation, and matrix operations. It encompasses CUDA (cuda_backend.cpp, cuda/), HIP/ROCm (hip_backend.cpp, hip/), Vulkan (vulkan_backend_full.cpp, vulkan/shaders/), DirectX 12 (directx_backend_full.cpp, directx/shaders/), Metal (metal_backend.mm), OpenCL (opencl_backend.cpp), OneAPI (oneapi_backend.cpp), multi-GPU collectives via NCCL/RCCL (nccl_vector_backend.cpp, rccl_vector_backend.cpp), FAISS GPU indexing (faiss_gpu_backend.cpp), and the BackendRegistry (backend_registry.cpp) that selects the best available backend at runtime. The plugin_loader.cpp / plugin_security.cpp subsystem extends the registry with dynamically loaded GPU backends. Enhancements to AQL execution planning or higher-level query routing are out of scope; CPU fallback paths are included only where they affect GPU parity or benchmarking.
- [ ] Hardware portability: all enhancements must preserve `BackendRegistry`'s fallback to `CPUVectorBackend`/`CPUGraphBackend`/`CPUGeoBackend`/`CPUMatrixBackend` when no GPU is present; verified by CI runs with `THEMIS_ENABLE_CUDA=OFF THEMIS_ENABLE_VULKAN=OFF`.
- [ ] Plugin ABI stability: `plugin_loader.cpp` and `plugin_security.cpp` define a versioned plugin contract; GPU backend plugins must not alter that ABI before v2.0.
- [ ] Memory budget: GPU device memory is finite and shared; all backends must honour the per-operation memory cap exposed via `BackendCapabilities::maxMemoryBytes` and the `VLLMResourceManager::Config::max_gpu_vram_mb` limit.
- [ ] CUDA/Vulkan/HIP/DirectX SDK optionality: the build must succeed without any GPU SDK installed (`#ifdef THEMIS_ENABLE_CUDA`/`THEMIS_ENABLE_VULKAN`/`THEMIS_ENABLE_HIP`/`THEMIS_ENABLE_DIRECTX` guards must remain in all GPU paths).
- [ ] `IComputeBackend` interface must never throw; all errors are returned via `ErrorContext` or empty-result sentinel values; the interface contract is documented in `compute_backend.h`.
- [ ] `BackendRegistry` is a process-wide singleton; all public methods must be safe to call from multiple threads after `initializeRuntime()` returns.
| Interface | Consumer | Notes |
|---|---|---|
| `BackendRegistry::instance()` | AQL executor, vector index | Singleton; must remain thread-safe after multi-GPU registration |
| `IVectorBackend::batchKnnSearch()` | faiss_gpu_backend.cpp, nccl_vector_backend.cpp, multi_gpu_backend.cpp | Signature frozen for v1.x |
| `PluginLoader::loadPlugin()` | Dynamic GPU backends (zluda_backend.cpp, oneapi_backend.cpp) | Plugin security sandbox enforced by plugin_security.cpp |
| `VLLMResourceManager::canUseGPU()` | Acceleration paths sharing GPU with vLLM inference | Must not block indefinitely; lease timeout required |
| `NCCLVectorBackend::mergeTopK()` | multi_gpu_backend.cpp distributed ANN search | Distributed multi-rank merge implemented; see the NCCL/RCCL mergeTopK feature below |
| `IAsyncComputeDispatch` | Future async search pipeline | Defined in include/acceleration/FUTURE_ENHANCEMENTS.md; not yet implemented in src |
Priority: High Target Version: v1.7.0 Status: ✅ Implemented
cuda_backend.cpp kernel launch surfaces (launchL2DistanceKernel, launchCosineDistanceKernel, launchTopKKernel, …) are implemented and wired through cuda/ann_kernels.cu, cuda/vector_kernels.cu, and cuda/cuda_hnsw_kernels.cu. cuBLAS batched GEMM remains the target path for L2/cosine distance; CUB DeviceSegmentedSort is used for top-k selection [6].
Implementation Notes:
- [x] `.cu` kernel files (`cuda/ann_kernels.cu`, `cuda/vector_kernels.cu`) are implemented and wired into `CUDAVectorBackend`, including HNSW traversal dispatch (`cuda/cuda_hnsw_kernels.cu`).
- [x] Cosine distance: fuse L2-norm and dot-product into a single tiled kernel to avoid a second pass over device memory (IO-aware pattern per FlashAttention [3]).
- [x] Top-k (k ≤ 1024): use CUB `DeviceSegmentedSort` [6]; for k > 1024 fall back to `thrust::partial_sort`.
- [x] Add `CUDA_ARCH` compile-time guard: require sm_70+ (Tensor Core availability); emit warning for sm_60.
Performance Targets:
- 1M × 128-dim float32 L2 search in < 8 ms on RTX 3090 (single GPU).
- Throughput ≥ 10× CPU AVX2 baseline measured by benchmarks/vector_bench.cpp.
- GPU memory footprint < 2 GB for 10M 128-dim vectors.
API Sketch:

```cpp
// cuda_backend.cpp — implemented signature
std::vector<SearchResult> CUDAVectorBackend::batchSimilaritySearch(
    const float* queries,   // host pointer, [numQueries × dim]
    size_t numQueries,
    size_t dim,
    DistanceMetric metric,
    size_t topK,
    const SearchOptions& opts) override;
```

Priority: High Target Version: v1.9.0 Status: ✅ Implemented
nccl_vector_backend.cpp and rccl_vector_backend.cpp now implement the distributed multi-rank mergeTopK() path using ncclAllGather/rcclAllGather + host-side std::partial_sort + ncclBcast/rcclBcast.
Implementation Notes:
- [x] NCCL/RCCL communicator initialized; single-rank copy path implemented.
- [x] Multi-rank gather in `NCCLVectorBackend::mergeTopK()`: uses `ncclAllGather` inside `ncclGroupStart`/`ncclGroupEnd` to collect per-GPU top-K (indices + distances) from all ranks; host-side `std::partial_sort` selects global top-k; `ncclBcast` from root broadcasts the result.
- [x] Mirror identical fix in `RCCLVectorBackend::mergeTopK()`: uses `rcclAllGather` + `rcclBcast` inside `rcclGroupStart`/`rcclGroupEnd`.
- [x] `ncclGroupStart()`/`ncclGroupEnd()` bracket pipelining both AllGather calls and both Bcast calls.
- [x] `(void)root; (void)stream;` suppression lines removed.
- [x] Tests added to `tests/test_collective_backends.cpp` validating single-rank copy correctness and k > localK rejection.
- [x] Integration test `tests/acceleration/test_nccl_merge_topk.cpp` added: 13 CPU-side merge-algorithm simulation tests (worldSize ∈ {2, 4, 8}, k ∈ {10, 100, 256}) plus 6 NCCL single-rank device tests (skipped without hardware) and 6 RCCL single-rank device tests (skipped without hardware); registered in `tests/CMakeLists.txt` as `NCCLMergeTopKFocusedTests`.
Performance Targets:
- 100M × 128-dim index distributed across 4× A100 80 GB; p99 query latency < 15 ms for k=100.
- `mergeTopK` overhead < 500 µs for worldSize=4, k=100 on NVLink-3 interconnect.
- Linear scaling efficiency ≥ 75% from 1→4 GPUs measured by benchmarks/multi_gpu_bench.cpp.
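The rank-merge step above can be simulated CPU-side. The sketch below is illustrative (`mergeGatheredTopK` is a hypothetical name, not the ThemisDB API): it concatenates the per-rank top-K lists an AllGather would produce, then selects the global top-k with `std::partial_sort`, as the root does before broadcasting.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch of the host-side merge in mergeTopK(): each rank
// contributes its local top-K (index, distance) pairs via AllGather; the
// root selects the global top-k before broadcasting it to all ranks.
std::vector<std::pair<uint32_t, float>> mergeGatheredTopK(
    const std::vector<std::vector<std::pair<uint32_t, float>>>& perRank,
    size_t k) {
  std::vector<std::pair<uint32_t, float>> all;
  for (const auto& rank : perRank)
    all.insert(all.end(), rank.begin(), rank.end());
  const size_t n = std::min(k, all.size());
  std::partial_sort(all.begin(), all.begin() + n, all.end(),
                    [](const auto& a, const auto& b) {
                      return a.second < b.second;  // smaller distance first
                    });
  all.resize(n);
  return all;
}
```

The real path does this once per query batch; `std::partial_sort` keeps the merge O(worldSize·K·log k) instead of a full sort.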
Priority: High Target Version: v1.8.0 Status: ✅ Implemented
plugin_security.cpp now enforces revocation checks in production code paths:
- `PluginSecurityVerifier::checkCRL()` fetches CRLs from distribution points, parses DER CRLs, verifies signatures and validity windows, checks certificate serials, and caches results by serial.
- `PluginSecurityVerifier::checkOCSP()` builds OCSP requests, performs HTTP POST to responders, verifies OCSP basic response signatures and validity windows, and caches results by serial.
- `EnhancedPluginSecurityVerifier::verifyFullChain()` now wires CRL/OCSP checks into full-chain verification and fails closed when revocation checking is required and neither check confirms not-revoked status.
Implementation Notes:
- [x] Implement `checkCRL()` with HTTP fetch, DER parse (`d2i_X509_CRL`), signature validation, serial check (`X509_CRL_get0_by_cert`), and `thisUpdate`/`nextUpdate` validation.
- [x] Implement `checkOCSP()` with OpenSSL request/response APIs (`OCSP_REQUEST_new`, `OCSP_resp_find_status`, `OCSP_basic_verify`) and timestamp checks.
- [x] Implement PE certificate table extraction for embedded PKCS#7 blobs in `extractEmbeddedCertificate()`.
- [x] Add CRL/OCSP cache keyed by certificate serial with expiry derived from CRL `nextUpdate` (or default) and OCSP `nextUpdate`/1h TTL.
- [x] Add focused revocation tests in `tests/test_plugin_security_crl_ocsp.cpp`.
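The serial-keyed cache behaviour can be sketched as follows. `RevocationCache` is a hypothetical stand-in for the real cache, showing the expiry-aware lookup where a stale entry is treated as a miss so a fresh CRL/OCSP network check runs:

```cpp
#include <chrono>
#include <map>
#include <optional>
#include <string>

// Hypothetical sketch of the serial-keyed revocation cache: each entry
// carries an expiry derived from the CRL/OCSP nextUpdate field (or a
// default TTL); expired entries behave as misses.
class RevocationCache {
 public:
  using Clock = std::chrono::steady_clock;

  void put(const std::string& serial, bool revoked, Clock::duration ttl) {
    entries_[serial] = {revoked, Clock::now() + ttl};
  }

  std::optional<bool> lookup(const std::string& serial) {
    auto it = entries_.find(serial);
    if (it == entries_.end() || Clock::now() >= it->second.expiry)
      return std::nullopt;  // miss or expired: caller re-checks CRL/OCSP
    return it->second.revoked;
  }

 private:
  struct Entry { bool revoked; std::chrono::steady_clock::time_point expiry; };
  std::map<std::string, Entry> entries_;
};
```

Returning `std::nullopt` rather than a default keeps the fail-closed property: an expired or missing entry forces a real revocation check.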
Priority: Medium Target Version: v1.8.0 Status: ✅ Implemented
EnhancedPluginSecurityVerifier::extractEmbeddedCertificate() in plugin_security.cpp iterates the full PE certificate table and returns the first PKCS#7 DER blob. The Linux ELF path parses the .note.gnu.signature section or falls back to a sidecar .sig file.
Implementation Notes:
- [x] Parse PE optional-header data directories: seek to `e_lfanew + 0x18 + offsetof(OptionalHeader, DataDirectory[4])`, read the `VirtualAddress` and `Size` fields for the Security directory (entry 4, `IMAGE_DIRECTORY_ENTRY_SECURITY`).
- [x] Map the certificate table: for each `WIN_CERTIFICATE` record in the table, check `wCertificateType == WIN_CERT_TYPE_PKCS_SIGNED_DATA` and extract `bCertificate[dwLength - offsetof(WIN_CERTIFICATE, bCertificate)]` as a DER blob.
- [x] For ELF plugins on Linux: look for a `.note.gnu.signature` section or a sidecar `plugin.so.sig` file; fall back to returning an empty string (unsigned) rather than leaving the code path unreachable.
- [x] Return the first valid PKCS#7 DER blob; log a warning if multiple certificates are present.
- [x] Add a fixture-based unit test with a pre-signed PE test binary to validate extraction end-to-end.
Priority: Medium Target Version: v1.8.0 Status: ✅ Implemented
VLLMResourceManager::getStats() now returns real OS-level CPU and RAM metrics.
Implementation Notes:
- [x] Linux CPU monitoring: reads `/proc/stat` on two 100 ms-apart snapshots and computes `(total - idle) / total * 100.0`.
- [x] Linux RAM monitoring: parses `/proc/meminfo` fields `MemTotal` and `MemAvailable`; computes `ram_used_mb = (MemTotal - MemAvailable) / 1024`.
- [x] Windows CPU monitoring: calls `GetSystemTimes()` with 100 ms delta; computes `(1 - idle/total) * 100.0`.
- [x] Windows RAM monitoring: calls `GlobalMemoryStatusEx()` and reads `dwMemoryLoad` and `ullTotalPhys - ullAvailPhys`.
- [x] Gated behind `#ifdef __linux__`/`#ifdef _WIN32`; macOS/unknown returns `0.0` (safe fallback).
- [x] Tests added in `tests/test_vllm_resource_stats.cpp`: `cpu_utilization ∈ [0, 100]`, `ram_used_mb > 0`, uninitialised guard returns zeros.
Performance Targets:
- Each `getStats()` call completes in < 2 ms (single `/proc/stat` + `/proc/meminfo` read on Linux).
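The `/proc/stat` delta computation reduces to a small pure function. The sketch below uses hypothetical names (`CpuTimes`, `cpuUtilizationPercent`), not the actual `getStats()` code, but implements the same `(total - idle) / total` formula over two snapshots:

```cpp
#include <cstdint>

// Hypothetical sketch of the delta computation behind getStats() on Linux:
// two /proc/stat snapshots taken ~100 ms apart yield aggregate jiffy
// counters; utilization is the non-idle share of elapsed jiffies.
struct CpuTimes {
  uint64_t total;  // sum of all fields on the aggregate "cpu" line
  uint64_t idle;   // idle + iowait jiffies
};

double cpuUtilizationPercent(const CpuTimes& before, const CpuTimes& after) {
  const uint64_t totalDelta = after.total - before.total;
  if (totalDelta == 0) return 0.0;  // guard against identical snapshots
  const uint64_t idleDelta = after.idle - before.idle;
  return (static_cast<double>(totalDelta - idleDelta) / totalDelta) * 100.0;
}
```

The zero-delta guard matters in practice: two snapshots inside the same jiffy would otherwise divide by zero.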
Priority: Medium Target Version: v1.8.0 Status: ✅ Implemented
VLLMResourceManager::initializeNVML() previously hard-coded nvmlDeviceGetHandleByIndex(0, &device).
Implementation Notes:
- [x] Added `gpu_device_index` field (default `0`) to `VLLMResourceManager::Config`; `initializeNVML()` now calls `nvmlDeviceGetHandleByIndex(config_.gpu_device_index, &device)`.
- [x] Added `gpu_device_indices` vector override to `Config`; when non-empty, `initializeNVML()` opens handles for all listed devices and stores them in `nvml_devices_`.
- [x] `queryGPUUtilization()` returns the maximum utilization across all monitored devices; a single busy GPU blocks new ThemisDB work.
- [x] `shutdownNVML()` clears `nvml_devices_` before calling `nvmlShutdown()`, ensuring all device handles are released first.
- [x] 5 tests in `test_vllm_resource_stats.cpp` validating config fields, multi-device init without CUDA, and single non-zero device index.
- [x] `canUseGPU()` fixed to iterate all `nvml_devices_` (not just the primary alias `nvml_device_`) in the async task, ensuring max-utilisation semantics apply to the busy-check as well as `queryGPUUtilization()`.
- [x] `setGpuUtilizationProviderForTesting()` injection seam added to `VLLMResourceManager`; `canUseGPU()` and `queryGPUUtilization()` call the provider instead of NVML when set.
- [x] 8 mock-provider tests added to `test_vllm_resource_stats.cpp` (CI-only, no GPU hardware required):
  - `MockProvider_CanUseGPU_ReturnsFalse_At90Percent` — configured device busy
  - `MockProvider_CanUseGPU_ReturnsTrue_WhenIdle` — configured device idle
  - `MockProvider_CanUseGPU_ReturnsFalse_WhenConfiguredDeviceAt90_Gpu0Idle` — core acceptance-criterion test: device 2 busy, GPU 0 idle
  - `MockProvider_CanUseGPU_ReturnsFalse_WhenAnyMonitoredDeviceBusy` — multi-device max semantics
  - `MockProvider_QueryGPUUtilization_ReflectsProvider` — `getStats()` propagates provider value
  - `MockProvider_CanUseGPU_ReturnsFalse_WhenNullopt` — nullopt treated as busy
  - `MockProvider_CanUseGPU_At79Percent_AllowsUse` — boundary below threshold
  - `MockProvider_CanUseGPU_At80Percent_Blocks` — boundary at threshold
- [x] CI workflow added: `.github/workflows/02-feature-modules_acceleration_vllm-multi-gpu-nvml-monitoring-ci.yml`
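The busy-check semantics (max across monitored devices, nullopt treated as busy, threshold-at-80 blocking) can be expressed as a small predicate. `canUseGpu` below is an illustrative stand-in, not the real `VLLMResourceManager` method:

```cpp
#include <optional>
#include <vector>

// Hypothetical sketch of the max-utilisation busy check: a single busy
// monitored device blocks new work, and a failed reading (nullopt) is
// treated as busy so the caller falls back to CPU.
bool canUseGpu(const std::vector<std::optional<double>>& utilizations,
               double busyThresholdPercent /* e.g. 80.0 */) {
  for (const auto& u : utilizations) {
    if (!u.has_value()) return false;              // unknown -> assume busy
    if (*u >= busyThresholdPercent) return false;  // any busy device blocks
  }
  return !utilizations.empty();
}
```

This mirrors the boundary behaviour exercised by the mock-provider tests: 79% allows use, 80% blocks.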
Priority: Medium Target Version: v1.8.0 Status: ✅ Fully implemented (Issue #132)
cuda/cuda_hnsw_kernels.cu previously defined static constexpr uint32_t kMaxK = 256u and silently truncated results when k > 256 was requested.
Implementation Notes:
- [x] kMaxK increased from 256 → 512 → 1024 in `cuda/cuda_hnsw_kernels.cu`.
- [x] Silent clamp replaced with an explicit `bool* h_overflow` output flag in `launchHnswSearchKernel`; overflow is set and the kernel is NOT launched when `k > kMaxK` so the caller can take corrective action.
- [x] Result buffers (`res_dist`, `res_id`) moved from fixed-size local arrays to dynamically allocated shared memory via `extern __shared__`; block size is computed at launch time as `min(128, 48KB / (k * 8))` to respect SM shared-memory limits.
- [x] `entry_node` parameter added to the kernel to support multi-pass searches from non-zero starting nodes.
- [x] `computeThreadsPerBlock(k)` helper added to compute the optimal block dimension for a given k.
- [x] For k > 1024 (extreme re-ranking): multi-pass strategy implemented in `CudaHnswTraversalEngine::batchSearch()` — runs `ceil(k / kMaxK)` GPU passes from diverse entry nodes, merges results on host using `std::partial_sort`, deduplicates by node ID.
- [x] Debug guard: `__trap()` fired in debug builds (`!NDEBUG`) if `k > kMaxK` reaches the launcher, ensuring callers do not inadvertently rely on overflow behavior.
- [x] Release builds: overflow condition propagated as `AccelerationErrorCode::InvalidInputShape` via `setError()` in `CUDAVectorBackend::annBatchSearch()`/`batchKnnSearch()`; makes `getHealthStatus()` return `BackendHealthStatus::makeDegraded()` automatically.
- [x] `kHnswSinglePassMaxK = 1024u` constant added to `cuda_backend.cpp` for consistent threshold checks.
- [x] Test: `tests/test_cuda_hnsw_large_k.cpp` with k=257, k=512, k=1024, k=1025 (multi-pass), health-degraded, sort-order, multi-query tests (7 test cases total).
- [x] CI workflow: `.github/workflows/02-feature-modules_acceleration_cuda-hnsw-large-k-ci.yml` triggers on changes to kernel / traversal / backend / test files.
Performance Targets:
- k=256: no regression vs. prior implementation (same block size, same shared memory layout).
- k=1024 with dynamic shared memory: block size reduced to 4 threads/block; total SM usage ≤ 32 KB; functional on all SM 2.0+ devices.
- k > 1024: multi-pass strategy returns correct result count at the cost of increased latency (documented trade-off; no RTX 3090 target for extreme-k path).
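The host-side merge in the multi-pass path can be sketched as follows (`mergeMultiPass` is a hypothetical name): dedup by node ID keeps the first occurrence of each candidate, then `std::partial_sort` selects the k best survivors across all passes.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical sketch of the host-side merge in the multi-pass strategy:
// results from ceil(k / kMaxK) GPU passes are concatenated, deduplicated
// by node ID, and the best k survivors selected with std::partial_sort.
std::vector<std::pair<uint32_t, float>> mergeMultiPass(
    const std::vector<std::vector<std::pair<uint32_t, float>>>& passes,
    size_t k) {
  std::unordered_set<uint32_t> seen;
  std::vector<std::pair<uint32_t, float>> merged;
  for (const auto& pass : passes)
    for (const auto& hit : pass)
      if (seen.insert(hit.first).second)  // keep first occurrence per node
        merged.push_back(hit);
  const size_t n = std::min(k, merged.size());
  std::partial_sort(merged.begin(), merged.begin() + n, merged.end(),
                    [](const auto& a, const auto& b) {
                      return a.second < b.second;  // smaller distance first
                    });
  merged.resize(n);
  return merged;
}
```

Deduplication is essential here: passes start from diverse entry nodes but can converge on the same neighbourhoods.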
Priority: Medium Target Version: v1.9.0 Status: ✅ Production Ready
cuda/cuda_hnsw_kernels.cu previously allocated num_queries × num_nodes × sizeof(uint8_t) bytes per kernel launch — 5 GB for 512 queries × 10M nodes.
Implementation Notes:
- [x] Switched from `uint8_t` per-node to 1-bit-per-node bitset: allocation is now `ceil(num_nodes / 8)` bytes per query (10M nodes → 1.25 MB per query, 512 queries → 640 MB — 8× reduction).
- [x] Kernel updated to use bitset read (`visited[nb >> 3] & (1u << (nb & 7u))`) and write (`visited[nb >> 3] |= (1u << (nb & 7u))`) operations.
- [x] Initialisation loop reduced from `num_nodes` to `ceil(num_nodes/8)` iterations.
- [x] Replace per-invocation `cudaMalloc`/`cudaFree` with a persistent pre-allocated pool owned by `CudaHnswTraversalEngine::Impl::d_visited_pool`; allocated once in `buildIndex()` at `maxBatchSize × ceil(numNodes/8)` bytes; eliminates per-launch allocation overhead.
- [x] Chunked batch processing for graphs where bitset pool cannot cover all queries: `batchSearch()` splits `numQueries` into sub-batches of at most `pool_capacity` queries, processes them serially, and concatenates results on the host.
- [x] Exposed `CudaHnswTraversalEngine::setMaxBatchSize(size_t n)` and `CUDAVectorBackend::setMaxBatchSize(size_t n)` so callers can tune pool allocation.
- [x] Pool allocation failure surfaces as `BackendHealthStatus::makeDegraded()` via `setError()` in `CUDAVectorBackend::buildHnswAnnIndex()`.
- [x] Pool size is clamped to 90% of `BackendCapabilities::maxMemoryBytes` during `buildHnswAnnIndex()`.
Performance Targets:
- Pool allocation must not exceed `BackendCapabilities::maxMemoryBytes` at construction time.
- Per-query `cudaMalloc`/`cudaFree` round trips eliminated; visited-pool reuse reduces HNSW launch overhead by ≥ 15% for repeated fixed-batch queries.
Priority: Medium Target Version: v1.9.0
Multiple CUDA and HIP kernel launchers use hard-coded block dimensions that are not tuned for actual device occupancy:
- `cuda/ann_kernels.cu:366`: `constexpr int kThreadsPerBlock = 256;`
- `cuda/vector_kernels.cu:359`: `int threadsPerBlock = 256;`
- `cuda/geo_kernels.cu:151,181`: `constexpr int kBlockSize = 256;`
- `cuda/graph_kernels.cu:248`: `static constexpr int kBFSBlockDim = 256;`
- `hip/ann_kernels.hip:367`: `constexpr int kThreadsPerBlock = 256;`
- `hip/geo_kernels.hip:154,184`: `constexpr int kBlockSize = 256;`
- `hip_backend.cpp:602`: `int threadsPerBlock = 256;`
A fixed block size of 256 is a reasonable default for NVIDIA sm_86 and AMD RDNA2, but may underperform on GPUs with 64-thread wavefronts (AMD GCN2) or on sm_90 (Hopper) where 128-thread blocks better utilize the warp scheduler.
Implementation Notes:
- [x] Replace hard-coded `threadsPerBlock = 256` in `cuda/vector_kernels.cu:359` and `hip_backend.cpp:602` with a runtime call to `cudaOccupancyMaxPotentialBlockSize()`/`hipOccupancyMaxPotentialBlockSize()` at `initialize()` time; store the result in the backend's `Impl` struct and pass it to all kernel launches.
- [x] For `constexpr` block sizes in `.cu`/`.hip` files (`ann_kernels.cu`, `geo_kernels.cu`, `graph_kernels.cu`), expose a launch wrapper that accepts `threadsPerBlock` as a parameter and is called from the backend with the occupancy-tuned value rather than hard-coding the constant at the launch site.
- [x] For AMD GCN targets (wavefront = 64): default to 64 threads when `hipGetDeviceProperties().warpSize == 64` to avoid half-occupancy.
- [x] Vulkan `l2_distance.comp` hard-codes `layout(local_size_x = 16, local_size_y = 16)`: expose this as a specialization constant (`layout(constant_id = 0) const uint LOCAL_SIZE_X = 16`) so the `VulkanVectorBackend` can inject the optimal value for the target device via `VkSpecializationInfo` at pipeline creation time. Also `batch_search.comp` `local_size_x = 256` is now a specialization constant.
- [x] Add a micro-benchmark (`benchmarks/kernel_block_size_bench.cpp`) that sweeps block sizes 64/128/256/512 for each kernel and reports achieved occupancy.
Performance Targets:
- ≥ 5% throughput improvement on AMD RDNA2 (wavefront=32) vs. 256-thread baseline.
- No regression on NVIDIA sm_86/sm_89 (Ampere/Ada).
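A wavefront-aware fallback of the kind described in the GCN note can be sketched as a pure function (`chooseThreadsPerBlock` is hypothetical): it rounds the occupancy suggestion down to a multiple of the device warp/wavefront size and never drops below one full wavefront.

```cpp
// Hypothetical sketch of a wavefront-aware block-size policy: the value
// suggested by cudaOccupancyMaxPotentialBlockSize()/hipOccupancy... is
// aligned down to a warp multiple, with one wavefront as the floor so
// 64-wide GCN devices never launch half-empty wavefronts.
int chooseThreadsPerBlock(int occupancySuggestion, int warpSize) {
  int block = (occupancySuggestion / warpSize) * warpSize;  // align to warp
  if (block < warpSize) block = warpSize;  // never below one wavefront
  return block;
}
```

On a warpSize-64 GCN device a suggestion of 96 threads collapses to 64, avoiding the half-occupancy case the notes call out.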
Priority: Low Target Version: v1.8.0
backend_registry.cpp uses std::cout for all diagnostic output (lines 136, 143, 167, 311, 335, 340, 359, 417, 438, 442) despite the codebase providing a structured logger via utils/logger.h (THEMIS_INFO, THEMIS_WARN, THEMIS_ERROR, THEMIS_DEBUG macros). The inconsistency means backend-selection events are invisible when the calling application redirects or suppresses std::cout, and they cannot be structured-logged (JSON, syslog) by the logging framework.
Implementation Notes:
- [x] Replace all `std::cout << "Registered backend: ..."` (line 136) with `THEMIS_INFO("Registered backend: {} (type={})", backend->name(), static_cast<int>(backend->type()))`.
- [x] Replace all `std::cout <<` calls in `autoDetect()`, `initializeRuntime()`, `shutdownAll()`, `loadPlugins()`, `loadPlugin()` with the appropriate severity-level macro (`THEMIS_INFO` for status, `THEMIS_WARN` for degraded paths, `THEMIS_DEBUG` for verbose capability dumps).
- [x] The `logSelection` lambda in `initializeRuntime()` (line 435) used `std::cout`; convert it to `THEMIS_INFO`/`THEMIS_WARN`.
- [x] Ensure `utils/logger.h` is included in `backend_registry.cpp` (it is: the header is already present and used for `THEMIS_ERROR` on line 180).
Priority: Medium Target Version: v1.8.0 Status: ✅ Implemented
BackendRegistry is now thread-safe. All mutable state is protected by mutable std::shared_mutex registryMutex_.
Implementation Notes:
- [x] Added `mutable std::shared_mutex registryMutex_` to `BackendRegistry` in `compute_backend.h`; `<shared_mutex>` and `<atomic>` included.
- [x] Exclusive lock (`std::unique_lock`) held in `registerBackend()`, `shutdownAll()`, and the write phase of `initializeRuntime()`.
- [x] Shared lock (`std::shared_lock`) held in all `getBackend*()`, `selectBackendFor*()`, `getBestBackend*()`, `getAvailableBackends()`, `deviceInfo()`, `getSelected*Backend()` methods.
- [x] `runtimeInitialized_` converted to `std::atomic<bool>`; read with `memory_order_acquire`, written with `memory_order_release`.
- [x] `selectTyped<T>()` documented with "callers must hold at least a shared lock" comment.
- [x] Thread-safety tests added to `test_backend_registry_startup.cpp`: 16-thread concurrent `getBestVectorBackend`, readers + `getAvailableBackends` writer, `isRuntimeInitialized` concurrency.
- [x] Dedicated thread-safety test file added at `tests/acceleration/test_backend_registry_thread_safety.cpp`: 16 reader threads calling `getBestVectorBackend()` concurrently while a background writer calls `registerBackend()` with a lightweight in-process stub (avoids plugin scanning noise); verifies no crashes. For data-race detection run locally with `-fsanitize=thread`.
Priority: Low Target Version: v1.9.0 Status: ✅ Implemented
The selectTyped<T>() helper in backend_registry.cpp:223–233 iterates the entire kFallbackOrder vector (13 entries) and for each entry scans all registered backends in backends_. In the current implementation with ~15 backends this is negligible, but it is called for every query that needs backend selection (selectVectorBackendFor, selectGraphBackendFor, selectGeoBackendFor, selectMatrixBackendFor, getBestVectorBackend, etc.). More importantly, the nested loop requires O(|kFallbackOrder| × |backends_|) dynamic_cast calls per selection.
Implementation Notes:
- [x] At the end of `initializeRuntime()`, build a `std::unordered_map<BackendType, IComputeBackend*>` index from `backends_`; replace the nested loop in `selectTyped<T>()` with a single map lookup per priority level. — `typeIndex_` (`unordered_map<BackendType, RegisteredBackend>`) is populated in `registerBackend()` and used by `selectTyped<T>()` for O(|kFallbackOrder|) typed selection; `getBackend()` also uses the map for O(1) lookup.
- [x] Pre-compute and cache `getBestVectorBackend()`/`getBestGraphBackend()`/`getBestGeoBackend()` results into `selectedVectorBackend_` etc. as is already partially done; ensure `getBackend(type)` also uses the map. — `getBestVectorBackend/GraphBackend/GeoBackend/MatrixBackend()` iterate `kFallbackOrder` and look up `typeIndex_` for O(|kFallbackOrder|) with no dynamic_cast; `getBackend()` uses `typeIndex_` for O(1).
- [x] Avoid `dynamic_cast` in the hot path: store typed pointers (`IVectorBackend*`, `IGraphBackend*`, `IGeoBackend*`) alongside the `IComputeBackend*` in a `RegisteredBackend` struct at `registerBackend()` time (one `dynamic_cast` per registration, not per query). — `RegisteredBackend` struct in `compute_backend.h` holds `base`, `vectorPtr`, `graphPtr`, `geoPtr`, `matrixPtr`; all casts done once in `registerBackend()`.
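The map-based selection can be sketched as follows; the enum, struct, and function names here are simplified stand-ins for `BackendType`, `RegisteredBackend`, and `selectTyped<T>()`. One map lookup per fallback-order entry replaces the nested scan over all registered backends:

```cpp
#include <unordered_map>
#include <vector>

// Hypothetical sketch of typeIndex_-based selection: the fallback order is
// walked once, each step is an O(1) hash lookup, and no dynamic_cast runs
// on the query path (casts happen once at registration time instead).
enum class BackendType { CUDA, Vulkan, CPU };

struct Backend { BackendType type; };  // stand-in for RegisteredBackend

const Backend* selectFirstAvailable(
    const std::vector<BackendType>& fallbackOrder,
    const std::unordered_map<BackendType, Backend>& typeIndex) {
  for (BackendType t : fallbackOrder) {
    auto it = typeIndex.find(t);  // O(1) per priority level
    if (it != typeIndex.end()) return &it->second;
  }
  return nullptr;  // no backend registered at all
}
```

With ~15 backends the constant factors are small either way; the real win is removing the per-query `dynamic_cast` calls.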
Priority: Medium Target Version: v1.9.0 Status: ✅ Implemented
Implementation Notes:
- [x] Added `MatrixPrecision::INT8 = 3` to the `MatrixPrecision` enum in `kernel_invocation.h`.
- [x] Added `INT8` case in `dispatchMatmul()` (`tensor_core_matmul.cpp`) that dispatches to `launchINT8MatmulKernel()`.
- [x] Implemented `launchINT8MatmulKernel()` in `cuda/tensor_core_matmul.cu` using `cublasGemmEx` with `CUDA_R_8I` inputs, `CUDA_R_32I` accumulator, and `CUBLAS_GEMM_DEFAULT_TENSOR_OP`; includes runtime SM 7.5+ guard (returns 1 on older hardware).
- [x] Updated `CUDAMatrixBackend::getCapabilities()` to advertise `PrecisionMode::INT8` only when `sm >= 75` (Turing+).
- [x] `quantize()`/`dequantize()` FP32↔INT8 helpers added to `tensor_core_matmul.h`/`tensor_core_matmul.cpp`; symmetric per-tensor quantisation with clamp and round, guard for null pointers / non-positive scale.
Performance Targets:
- INT8 matmul throughput ≥ 2× FP16 throughput on RTX 3090 (sm_86) for 4096×4096 matrices.
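Symmetric per-tensor quantisation as described above reduces to scale, round, clamp. The helpers below are a sketch of that scheme, not the actual `tensor_core_matmul.h` signatures:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical sketch of symmetric per-tensor FP32 -> INT8 quantisation:
// values are divided by the scale, rounded to nearest, and clamped to
// [-127, 127]; dequantisation multiplies back by the same scale.
std::vector<int8_t> quantize(const std::vector<float>& in, float scale) {
  std::vector<int8_t> out(in.size());
  for (size_t i = 0; i < in.size(); ++i) {
    float q = std::round(in[i] / scale);
    out[i] = static_cast<int8_t>(std::clamp(q, -127.0f, 127.0f));
  }
  return out;
}

std::vector<float> dequantize(const std::vector<int8_t>& in, float scale) {
  std::vector<float> out(in.size());
  for (size_t i = 0; i < in.size(); ++i)
    out[i] = static_cast<float>(in[i]) * scale;
  return out;
}
```

Clamping to ±127 (not -128) keeps the range symmetric, which is the usual convention for symmetric quantisation feeding INT8 GEMMs.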
Priority: Medium Target Version: v1.9.0 Status: ✅ IMPLEMENTED
faiss_gpu_backend.cpp now implements all six index types. IVF_SQ8 uses GpuIndexIVFScalarQuantizer with QT_8bit for higher recall than PQ at equivalent memory. HNSW_FLAT uses CPU-side faiss::IndexHNSWFlat, which exposes the same IVectorBackend interface and is preferred for low-latency single-query search. All switch statements include default: branches that set lastError_ via setError(). Input validation guards (null pointers, zero sizes, empty paths, negative dimension) added to all public methods. getCapabilities() now advertises FP32 | INT8 precisions and L2 | INNER_PRODUCT metric bits. Tests in tests/test_faiss_gpu_backend.cpp (25 GPU tests + 15 validation + 10 structural).
Resolved checklist:
- [x] Add `IndexType::IVF_SQ8` — `GpuIndexIVFScalarQuantizer` with `QT_8bit`
- [x] Add `IndexType::HNSW_FLAT` — CPU-side `faiss::IndexHNSWFlat` + `hnswM` config field
- [x] Add `default:` branches with `setError()` in all switch statements
- [x] Update `getCapabilities()` with `INT8` precision flag and metric bitmask
- [x] Input validation in `search()`, `addVectors()`, `trainIndex()`, `computeDistances()`, `batchKnnSearch()`, `initializeIndex()`, `saveIndex()`, `loadIndex()`
- [x] Introduce `setError()` helper; replace bare `std::cerr` error paths
- [x] Add 50 unit + integration tests in `tests/test_faiss_gpu_backend.cpp`
Priority: Low Target Version: v2.0.0 Status: ✅ IMPLEMENTED
graphics_backends.cpp header comment and status banner updated to "Stubs: 0". All five stubs are now implemented across three OpenGL backend classes.
Implementation Notes:
- [x] Implemented `OpenGLVectorBackend::batchKnnSearch()` using the existing EGL + compute-shader infrastructure from `computeDistances()`: dispatches the L2/cosine GLSL shader, reads back distances, performs top-K on CPU with `std::partial_sort`, returns `std::vector<std::vector<std::pair<uint32_t,float>>>`.
- [x] Implemented `OpenGLGraphBackend::batchBFS()` and `batchShortestPath()` in the new `OpenGLGraphBackend` class (implements `IGraphBackend`) using GLSL 4.30 compute shaders (wavefront-parallel BFS with two ping-pong frontier SSBOs; parallel Bellman-Ford with init + relax shaders); CPU fallback BFS queue and Bellman-Ford when EGL unavailable.
- [x] Implemented `OpenGLGeoBackend::batchDistances()` (Haversine) and `batchPointInPolygon()` (ray-casting) in the new `OpenGLGeoBackend` class (implements `IGeoBackend`) using GLSL 4.30 compute shaders ported from the Vulkan GLSL-compatible equivalents; CPU fallback uses the same algorithms as `VulkanGeoBackend`.
- [x] Updated the status banner comment to "Stubs: 0".
- [x] `OpenGLVectorBackend::getCapabilities().supportsAsync = false` remains accurate and is documented in `graphics_backends.h`. Both new backends (`OpenGLGeoBackend`, `OpenGLGraphBackend`) also set `supportsAsync = false`.
Priority: High Target Version: v1.7.0 Status: ✅ IMPLEMENTED
vulkan_backend_full.cpp is PRODUCTION-READY (0 stubs). All SPIR-V compute shaders for vector distance and geospatial operations are implemented in vulkan/shaders/: l2_distance.comp, cosine_distance.comp, inner_product_distance.comp, batch_search.comp, topk_selection.comp, haversine_distance.comp, point_in_polygon.comp. The LoRA shaders (matmul.comp, elementwise.comp, gradient.comp, etc.) are also complete.
Remaining Hardening:
- [x] MoltenVK path: verify `VK_KHR_buffer_device_address` capability probe on Apple M-series.
- [x] Benchmark on Mali-G710 and RDNA2 to validate workgroup size (256 threads for `batch_search.comp`, 16×16 for `l2_distance.comp`) occupancy targets; expose as SPIR-V specialization constants (see Kernel Block-Dimension Occupancy Tuning above).
- [x] Double-buffer staging buffers to overlap host→device DMA with shader dispatch.
Performance Targets:
- 500K × 128-dim cosine search in < 20 ms on Apple M2 Pro via MoltenVK.
- < 5% throughput regression versus CUDA path on AMD RX 7800 XT.
Priority: Medium Target Version: v1.9.0 Status: ✅ Implemented
MultiGPUVectorBackend in multi_gpu_backend.cpp implements range-based sharding across N GPUs with NCCL/RCCL collective operations for distributed top-k merge, falling back to host-side merge when collectives are unavailable. Registered in BackendRegistry::autoDetect() when detectGPUCount() >= 2.
Implementation Notes:
- [x] Introduce `MultiGPUVectorBackend` in `multi_gpu_backend.cpp`; register it in `BackendRegistry` when `cudaGetDeviceCount() > 1`.
- [x] Shard by contiguous vector-ID ranges; store shard metadata in a `std::vector<ShardDescriptor>` on the host.
- [x] Use `ncclGroupStart`/`ncclGroupEnd` to batch cross-GPU transfers. Both `NCCLVectorBackend::mergeTopK()` and `RCCLVectorBackend::mergeTopK()` bracket the `AllGather` pair and the `Bcast` pair inside `ncclGroupStart`/`ncclGroupEnd` (and `rcclGroupStart`/`rcclGroupEnd`) respectively.
- [x] RCCL mirror: `rccl_vector_backend.cpp` exposes the same `IVectorBackend` interface; `BackendRegistry` selects NCCL vs RCCL at runtime via `cudaGetDeviceProperties`.
- [x] Graceful degradation: if NCCL init fails, fall back to single-GPU or CPU backend.
- [x] Integration tests in `tests/acceleration/test_nccl_merge_topk.cpp` (registered as `NCCLMergeTopKFocusedTests`): 13 CPU-side merge simulation tests + 6 NCCL single-rank device tests + 6 RCCL single-rank device tests.
Performance Targets:
- 100M × 128-dim index distributed across 4× A100 80GB; query latency < 15 ms @ 99th percentile for k=100.
- Linear scaling efficiency ≥ 75% from 1→4 GPUs.
Priority: Medium Target Version: v1.8.0 Status: ✅ IMPLEMENTED
For workloads that repeatedly execute the same ANN kernel shape (same dim, numQueries, topK), CUDA Graph capture eliminates kernel-launch overhead and CPU-side stream synchronisation [6]. CUDAGraphCache is implemented in cuda_backend.h/cuda_backend.cpp and captures/replays graphs keyed on {dim, numQueries, topK, metric}.
Implementation Notes:
- [x] Add `CUDAGraphCache` struct to `cuda_backend.h`/`cuda_backend.cpp`; keyed by a `QueryShape` tuple (`numQueries`, `numVectors`, `dim`, `topK`, `metric`), value is a `CUDAGraphEntry` owning a `cudaGraph_t` + `cudaGraphExec_t` pair plus pre-allocated device buffers.
- [x] On cache miss: record a graph with `cudaStreamBeginCapture`/`cudaStreamEndCapture` on a temporary non-blocking capture stream; instantiate via `cudaGraphInstantiate` (CUDA 11/12 API variant guarded by `CUDART_VERSION`).
- [x] On cache hit: copy new input data into the entry's pre-allocated device buffers via `cudaMemcpyAsync` on the main stream, then replay with `cudaGraphLaunch`. Device-pointer addresses remain constant (pre-allocated at capture time) so no node-parameter update is required on every replay.
- [x] LRU evict graphs when cache exceeds 32 entries to bound device memory usage (`CUDAGraphCache::evictLRU` traverses all entries in O(n) — acceptable since n ≤ 32).
- [x] Variable-shape batches: callers with variable-length batches are directed to use `batchKnnSearch()` instead; documented in the `batchKnnSearchWithGraph()` method comment.
Performance Targets:
- ≥ 30% reduction in end-to-end ANN query latency for fixed-shape repeated queries (benchmarked via benchmarks/vector_bench.cpp).
- Zero CUDA API error rate under 10-thread concurrent graph replay (validated by tests/test_cuda_graph_capture.cpp).
Priority: High Target Version: v1.7.0 Status: ✅ IMPLEMENTED
BackendRegistry selects backends at startup by probing device capabilities (compute capability, available VRAM, driver version) through DeviceManager.
Implementation Notes:
- [x] Create `device_capability_probe.cpp`/`.h`; expose `DeviceInfo` struct with `computeCapabilityMajor`, `computeCapabilityMinor`, `totalMemoryBytes`, `driverVersion`, `backendType`. — implemented as `device_manager.h`/`device_manager.cpp`; `DeviceCapabilityInfo` struct in `compute_backend.h`
- [x] Probe order: CUDA → HIP → Vulkan → Metal → OpenCL → CPU. — delegated to `themis::gpu::DeviceDiscovery::Enumerate()`, which follows this order
- [x] Cache probe results for 60 s; re-probe on explicit `BackendRegistry::refresh()` call. — `DeviceManager::probeDevices()` caches for `kCacheTTL = 60 s`; `DeviceManager::refresh()` forces re-probe; `BackendRegistry::initializeRuntime()` calls `DeviceManager::refresh()`
- [x] Emit structured log line via `utils/logger.h` listing selected backend and device name on startup. — `DeviceManager::logDeviceInfo()` emits structured output; called from `BackendRegistry::initializeRuntime()`
- [x] Expose probe results via `BackendRegistry::deviceInfo()` for observability. — `BackendRegistry::deviceInfo()` returns the `DeviceCapabilityInfo` snapshot captured at `initializeRuntime()` time
Performance Targets:
- Probe completes in < 50 ms on a system with 4 GPUs.
- Zero false-positive backend selection failures in CI matrix covering CUDA 11.8, CUDA 12.x, ROCm 5.7, Vulkan 1.3.
| Test Type | Coverage Target | Notes |
|---|---|---|
| Unit | >80% new code | Mock cudaMemcpy / Vulkan dispatch via dependency-injected function pointers; test DeviceCapabilityProbe with mock device list |
| Integration | All backends registered and falling back to CPU when SDK absent | Run in CI with THEMIS_ENABLE_CUDA=OFF and THEMIS_ENABLE_VULKAN=OFF to validate CPU fallback path |
| Performance | Vector bench regression ≤ 5% | benchmarks/vector_bench.cpp; run on GPU runner; alert if p99 regresses |
| Thread-Safety | BackendRegistry concurrent access | 16-thread contention test; concurrent getBestVectorBackend() + registerBackend() writer (lightweight stub, no plugin I/O) — see BackendRegistry thread-safety feature above; run locally with -fsanitize=thread for data-race detection |
| Security | Plugin revocation (CRL/OCSP) | Fixture-based test with mock HTTP server; cover revoked, unknown, timeout, and invalid-signature paths |
| Edge-Cases | kMaxK overflow, k > 256 HNSW clamping | tests/acceleration/test_cuda_hnsw_large_k.cpp; assert result count == requested k for k ∈ {257, 512, 1024} |
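The "dependency-injected function pointers" approach named in the Unit row can be sketched like this. The names (TestableBackend, upload) are illustrative, not the actual test harness; the point is that the backend calls an injected function instead of cudaMemcpy directly, so unit tests run without a GPU.

```cpp
#include <cassert>
#include <cstring>
#include <functional>

// Injected copy primitive: in production it wraps cudaMemcpy; in unit tests
// it is plain std::memcpy or a failure simulator. Returns 0 on success,
// mirroring cudaSuccess.
using MemcpyFn = std::function<int(void* dst, const void* src, size_t bytes)>;

class TestableBackend {
public:
    explicit TestableBackend(MemcpyFn memcpyFn) : memcpy_(std::move(memcpyFn)) {}

    // Copies query vectors to "device" memory via the injected function;
    // returns false on copy failure instead of throwing, matching the
    // no-throw IComputeBackend contract described in this document.
    bool upload(const float* host, float* device, size_t count) {
        return memcpy_(device, host, count * sizeof(float)) == 0;
    }

private:
    MemcpyFn memcpy_;
};
```

A failing injected function lets the test cover the error path (e.g. simulated cudaErrorInvalidValue) without any CUDA SDK present.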
| Metric | Current | Target | Method |
|---|---|---|---|
| L2 search 1M×128 (CUDA) | Implemented (hardware benchmark required) | < 8 ms | benchmarks/vector_bench.cpp on RTX 3090 |
| Cosine search 500K×128 (Vulkan/MoltenVK) | < 20 ms ✅ | < 20 ms | Manual bench on M2 Pro |
| Multi-GPU scale-out efficiency | N/A | ≥ 75% (1→4× A100) | benchmarks/multi_gpu_bench.cpp |
| CUDA Graph replay latency reduction | ≥ 30% ✅ | ≥ 30% | benchmarks/vector_bench.cpp fixed-shape mode |
| Device probe latency (4-GPU system) | < 50 ms ✅ | < 50 ms | tests/acceleration/device_probe_test.cpp |
| NCCL mergeTopK overhead (worldSize=4, k=100) | Implemented (hardware benchmark required) | < 500 µs | benchmarks/multi_gpu_bench.cpp |
| INT8 matmul throughput vs FP16 | N/A (not implemented) | ≥ 2× on sm_86 | benchmarks/tensor_core_bench.cpp |
| getStats() call latency (Linux) | 0 ms (returns 0) | < 2 ms | tests/acceleration/test_vllm_resource_stats.cpp |
| Block-dim occupancy gain (RDNA2) | baseline (256 fixed) | ≥ 5% throughput gain | benchmarks/kernel_block_size_bench.cpp |
- [x] CRL/OCSP revocation enforced: PluginSecurityVerifier::checkCRL() and checkOCSP() perform network-backed revocation checks with OpenSSL validation and serial-based caching; EnhancedPluginSecurityVerifier::verifyFullChain() applies both checks when policy requires revocation checks.
- [x] PE certificate extraction completed: EnhancedPluginSecurityVerifier::extractEmbeddedCertificate() parses the PE security directory (IMAGE_DIRECTORY_ENTRY_SECURITY) and extracts PKCS#7 certificate blobs.
- [ ] The plugin_security.cpp sandbox must be applied to all dynamically loaded GPU backends (zluda_backend.cpp, oneapi_backend.cpp); verify the symbol allow-list before dlopen.
- [ ] GPU memory allocated via cudaMalloc/vkAllocateMemory must be zeroed before being exposed to query results, to prevent information leakage between tenants.
- [x] vllm_resource_manager.cpp canUseGPU(): wrapped queryGPUUtilization() with std::async + wait_for(500ms); returns false on timeout (safe CPU fallback). An NVML hang no longer blocks the caller.
- [x] BackendRegistry shared mutable state (backends_, selectedVectorBackend_) is now protected by std::shared_mutex registryMutex_ — data race fixed. See BackendRegistry: Thread-Safe Read Access above.
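The std::async + wait_for(500ms) guard described for canUseGPU() can be sketched as follows. This is a simplified illustration, not the vllm_resource_manager.cpp code: queryUtilization stands in for the NVML-backed queryGPUUtilization(), and the 90% threshold is purely illustrative.

```cpp
#include <chrono>
#include <future>

// Bounds the caller's wait on a potentially hanging utilization query.
// A timeout is treated as "GPU unavailable", giving a safe CPU fallback.
template <typename QueryFn>
bool canUseGPU(QueryFn queryUtilization) {
    auto fut = std::async(std::launch::async, queryUtilization);
    if (fut.wait_for(std::chrono::milliseconds(500)) !=
        std::future_status::ready) {
        // Caveat: a std::future obtained from std::async joins its task on
        // destruction, so a truly wedged driver call still blocks at scope
        // exit; a detached worker thread feeding a std::promise avoids that
        // if full non-blocking behaviour is required.
        return false;
    }
    double utilizationPercent = fut.get();
    return utilizationPercent < 90.0;  // illustrative admission threshold
}
```

The caveat in the comment is the main design trade-off of the std::async variant; it is harmless when the query eventually returns, which matches the slow-NVML case described above.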
All planned features in this document are grounded in the following peer-reviewed research and industry specifications (IEEE format):
- J. Johnson, M. Douze, and H. Jégou, "Billion-scale similarity search with GPUs," IEEE Transactions on Big Data, vol. 7, no. 3, pp. 535–547, 2021, doi: 10.1109/TBDATA.2019.2921572. [Online]. Available: https://faiss.ai/ [Accessed: 2026-02-22] — Informs the FAISS GPU backend (faiss_gpu_backend.cpp) and the GPU vector indexing roadmap.
- Y. A. Malkov and D. A. Yashunin, "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 4, pp. 824–836, Apr. 2020, doi: 10.1109/TPAMI.2018.2889473. [Online]. Available: https://ieeexplore.ieee.org/document/8613833 [Accessed: 2026-02-22] — Informs GPU-accelerated HNSW kernel design (cuda/cuda_hnsw_kernels.cu) and the kMaxK clamping issue.
- T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré, "FlashAttention: Fast and memory-efficient exact attention with IO-awareness," in Proc. Advances in Neural Information Processing Systems (NeurIPS), 2022, pp. 16344–16359. [Online]. Available: https://arxiv.org/abs/2205.14135 [Accessed: 2026-02-22] — Informs IO-aware tiled kernel design for batch vector search and Tensor Core optimizations.
- Y. Gao, K. Xiong, X. Gao, J. Ding, and C. D. Carothers, "NVIDIA Tensor Core for machine learning and deep learning," IEEE Micro, vol. 40, no. 6, pp. 33–45, Nov.–Dec. 2020, doi: 10.1109/MM.2020.3037720. [Online]. Available: https://ieeexplore.ieee.org/document/9269176 [Accessed: 2026-02-22] — Informs FP16/BF16/INT8 mixed-precision kernels in cuda_backend.cpp and tensor_core_matmul.cpp.
- C. Ding, A. Sharma, S. C. Suh, M. R. Amer, A. Bhattacharya, and S. Kumar, "ScaNN: Efficient vector similarity search at scale," in Proc. 37th Int. Conf. Machine Learning (ICML), 2020, pp. 2589–2599. [Online]. Available: https://arxiv.org/abs/1908.10396 [Accessed: 2026-02-22] — Informs quantization-aware ANN search and hybrid CPU/GPU search strategies.
- Khronos Group, "Vulkan API Specification v1.3," Khronos Registries. [Online]. Available: https://www.khronos.org/registry/vulkan/ [Accessed: 2026-02-22] — Informs the Vulkan compute shader pipeline and cross-platform GPU support (vulkan_backend_full.cpp), including specialization constants for workgroup-size tuning.
- AMD, "ROCm documentation: Software platform for GPU computing," AMD. [Online]. Available: https://rocmdocs.amd.com/ [Accessed: 2026-02-22] — Informs HIP API usage, rocBLAS, and RCCL multi-GPU collectives (hip_backend.cpp, rccl_vector_backend.cpp).
- src/gpu/ — Low-level GPU device discovery and driver wrappers used by the acceleration backends.
- src/geo/ — Geospatial operators whose GPU path calls through geo_acceleration_bridge.cpp.
- src/graph/ — Graph analytics engine; GPU-accelerated traversal delegates to backends registered here.
- src/index/ — Vector index layer; calls IVectorBackend::batchKnnSearch() for GPU ANN search.
- src/performance/ — Benchmarking infrastructure validating the ≥ 10× GPU speedup targets.
- include/acceleration/FUTURE_ENHANCEMENTS.md — Complementary enhancements to the public header interfaces.
Stubs:
- src/acceleration/nccl_vector_backend.cpp — !THEMIS_ENABLE_NCCL: all collective ops return false
- src/acceleration/rccl_vector_backend.cpp — !THEMIS_ENABLE_RCCL: all collective ops return false
Risk: Multi-GPU distributed ANN search (mergeTopK) and gradient allReduce unavailable; any multi-GPU training workload routes to CPU.
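The stub files above follow the build-flag guard pattern this document requires for all GPU paths. A minimal sketch, assuming illustrative class and method names (the actual nccl_vector_backend.cpp signatures may differ): without the SDK flag, a stub compiles so the build succeeds, and every collective op reports failure so the registry routes work to CPU.

```cpp
#include <cstddef>

// Stub compiled when THEMIS_ENABLE_NCCL is not defined: the build succeeds
// without the NCCL SDK, isAvailable() keeps the backend from ever being
// selected, and all collective ops return false (the documented behaviour).
#ifndef THEMIS_ENABLE_NCCL
class NCCLVectorBackendStub {
public:
    bool isAvailable() const { return false; }  // never selected by BackendRegistry
    bool allReduce(float* /*data*/, std::size_t /*n*/) { return false; }  // NCCL absent
    bool mergeTopK(int /*worldSize*/, int /*k*/) { return false; }        // NCCL absent
};
#endif
```

The real implementation sits in the matching #ifdef branch; keeping both branches in one translation unit is what makes the "build must succeed without any GPU SDK" constraint hold.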
- NCCL (NVIDIA): Install the NCCL library, set -DTHEMIS_ENABLE_NCCL=1. Validate communicator init and multi-rank mergeTopK via ncclAllGather, ncclBcast.
- RCCL (AMD/ROCm): Install ROCm + RCCL, set -DTHEMIS_ENABLE_RCCL=1. Mirror the NCCL fix using rcclAllGather + rcclBcast.
- allReduce on a 1 GB gradient tensor (4× A100 NVLink): ≤ 50 ms.
- mergeTopK (1M vectors, top-100, 4 GPUs): ≤ 2× single-GPU search latency.
- Multi-GPU integration: 2–4 GPU testbed; verify that the all-reduce gradient matches the single-GPU sum within float32 tolerance.
- mergeTopK: the distributed result matches a sequential merge of the per-GPU top-K lists.
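The "sequential merge of per-GPU top-K" oracle can be sketched as follows. This is a reference implementation for the parity test only, assuming smaller distance means a better match; Hit and mergeTopKSequential are hypothetical names, not the actual benchmark code.

```cpp
#include <algorithm>
#include <vector>

// One candidate returned by a single GPU's local top-k search.
struct Hit { float dist; long id; };

// Concatenate every GPU's local top-k list, then keep the k globally best
// (smallest-distance) candidates. The distributed NCCL/RCCL mergeTopK path
// is expected to produce the same set.
std::vector<Hit> mergeTopKSequential(const std::vector<std::vector<Hit>>& perGpu,
                                     size_t k) {
    std::vector<Hit> all;
    for (const auto& local : perGpu)
        all.insert(all.end(), local.begin(), local.end());
    auto byDist = [](const Hit& a, const Hit& b) { return a.dist < b.dist; };
    if (all.size() > k) {
        std::partial_sort(all.begin(), all.begin() + k, all.end(), byDist);
        all.resize(k);
    } else {
        std::sort(all.begin(), all.end(), byDist);
    }
    return all;
}
```

std::partial_sort keeps the oracle O(n log k) over the concatenated candidates, which is cheap enough that the test can run against full-size result sets.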
Stub: src/acceleration/opencl_backend.cpp — !THEMIS_ENABLE_OPENCL: computeDistances/batchKnnSearch return empty
Risk: Universal GPU support (AMD, Intel, Qualcomm, ARM Mali) via OpenCL 1.2+ unavailable; all queries fall through to CPU.
- Install an OpenCL runtime (Intel, ROCm OpenCL, or CUDA OpenCL) and set -DTHEMIS_ENABLE_OPENCL=1.
- The existing OpenCL kernel (openclKernelSource: L2 distance + KNN) is complete; only the build flag is missing.
batchKnnSearch(1M vectors, d=512, top-10, Intel Arc GPU): ≥ 3× CPU baseline throughput.
- Positive: initialized with a valid OpenCL device → batchKnnSearch returns correct KNN results (parity test vs CPU reference).
- Negative: no OpenCL device → isAvailable() false; graceful fallback to CPU.
Stub: src/acceleration/oneapi_backend.cpp — !THEMIS_ENABLE_ONEAPI: stub class compiled; isAvailable() false
Risk: Intel Arc / Xe / XPU (SYCL/DPC++) GPU acceleration unavailable.
- Install the Intel oneAPI Base Toolkit (DPC++ compiler + OpenCL runtime) and set -DTHEMIS_ENABLE_ONEAPI=1.
- The real SYCL OneAPIVectorBackend class (above the #else) is complete; only the build flag is missing.
batchKnnSearch(1M vectors, d=512, top-10, Intel Arc A770): ≥ 3× CPU baseline throughput.
- Positive: build with oneAPI → isAvailable() true → batchKnnSearch returns correct KNN.
- Negative: no Intel GPU → stub path → isAvailable() false; no crash.
Stub: src/acceleration/ai_hardware_dispatcher.cpp — dispatchAppleANE() inside #ifdef THEMIS_HAS_NPU_APPLE: always returns success = false; Core ML session not created
Risk: Apple Neural Engine inference unavailable; ANE workloads route to CPU/GPU fallback; ANE power efficiency and throughput benefits lost.
- Link metal_backend.mm with the Objective-C++ compiler (-x objective-c++).
- Set -DTHEMIS_HAS_NPU_APPLE=1 in CMake.
- Implement the full Core ML path in metal_backend.mm: create the MLModel session, prepare an MLMultiArray from req.input_data, run the prediction, extract the results.
- Remove the stub body and delegate dispatchAppleANE() to the Obj-C++ implementation.
- ANE inference (INT8, 7B model): ≥ 3× throughput vs CPU (tokens/s); ≤ 30 % CPU overhead.
- Latency first token: ≤ 200 ms for a 7B INT8 model on M2/M3 ANE.
- macOS: probeAppleANE() returns true → dispatchAppleANE() returns success = true and result.tokens non-empty.
- Linux/Windows: THEMIS_HAS_NPU_APPLE not defined → dispatchAppleANE() not called; dispatcher routes to CPU/CUDA.
Stub: src/acceleration/plugin_security.cpp Mach-O path in extractEmbeddedSignature() — Mach-O magic detected but LC_CODE_SIGNATURE load commands not parsed; returns std::nullopt.
Risk: macOS plugin code signatures are never extracted; Apple code-signing verification is skipped for all macOS dylib/bundle plugins.
- Walk the Mach-O load commands: iterate mach_header.ncmds.
- Locate LC_CODE_SIGNATURE (cmd == 0x1D) and read linkedit_data_command.dataoff + datasize.
- Return the blob bytes; use the downstream signature verifier to validate against the Apple codesign chain.
- Handle 32-bit (MH_MAGIC), 64-bit (MH_MAGIC_64), and fat binary (FAT_MAGIC) Mach-O formats.
- Do not use the codesign subprocess as the sole verification mechanism; parse the signature block directly to avoid TOCTOU and process-injection risks.
- Validate the signature chain against the Apple root CA (or an enterprise CA for internally signed bundles).