UPSTREAM PR #21597: SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations#1340
Open
UPSTREAM PR #21597: SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations#1340
Conversation
…ions Replace sycl::malloc_device with zeMemAllocDevice for GPU memory allocation in the SYCL backend. sycl::malloc_device triggers the xe kernel driver's DMA-buf/TTM path which mirrors every VRAM allocation 1:1 in system RAM. zeMemAllocDevice uses the SVM/P2P path with no host staging. On a dual Intel Arc Pro B70 system (64GB VRAM, 64GB RAM), a 15.6 GiB model consumed 60 GiB of system RAM via sycl::malloc_device, causing OOM crashes. With zeMemAllocDevice, the same workload uses ~6.7 GiB of system RAM with no performance regression. All Level Zero calls include automatic fallback to the original SYCL allocation path if Level Zero interop is unavailable.
|
No meaningful performance changes were detected across 125247 analyzed functions in the following binaries: build.bin.libmtmd.so, build.bin.llama-cvector-generator, build.bin.llama-tts, build.bin.llama-bench, build.bin.libllama.so, build.bin.llama-tokenize, build.bin.libggml.so, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli. 💬 Questions? Tag @loci-dev |
e800934 to
a024d9c
Compare
7638ab4 to
f1b46d5
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Note
Source pull request: ggml-org/llama.cpp#21597
Summary
sycl::malloc_devicewithzeMemAllocDevicefor GPU memory allocation in the SYCL backendsycl::freewithzeMemFreefor corresponding deallocationsdev2dev_memcpywith direct Level Zero cross-device copyze_loaderfor Level Zero API accessProblem
On Intel multi-GPU systems,
sycl::malloc_devicetriggers the xe kernel driver's DMA-buf/TTM export path (xe_gem_prime_export->ttm_pool_alloc_page), which creates a 1:1 mirror of every VRAM allocation in system RAM. This causes system RAM to scale linearly with total VRAM allocated across GPUs, leading to OOM crashes during multi-GPU inference even when models fit entirely in VRAM.Measured on dual Intel Arc Pro B70 (32GB each, 64GB total VRAM) with 64GB system RAM:
sycl::malloc_device4 GiB = +4,112 MiB system RAM (1:1 mirror)zeMemAllocDevice4 GiB = +8 MiB system RAM (no mirror)A 15.6 GiB Q4_K_M model consumed 60 GiB of system RAM during dual-GPU inference with
sycl::malloc_device, causing repeated OOM crashes.Solution
zeMemAllocDeviceallocates GPU memory through Level Zero's SVM/P2P path instead of the DMA-buf/TTM path, avoiding the host memory staging entirely. SYCL kernels can readzeMemAllocDevicepointers with full interop, no compatibility issues.Changes:
ggml_sycl_malloc_device()/ggml_sycl_free_device()helpers that try Level Zero first, fall back to SYCLdpct_mallochelper with same Level Zero pathrelease_extra_gpuwithzeMemFreedev2dev_memcpyto usezeCommandListAppendMemoryCopyfor direct cross-device transfersTest results
Dual Intel Arc Pro B70 (32GB each), AMD Ryzen 5 9600X, 64GB DDR5, Ubuntu 26.04, kernel 7.0, compute-runtime 26.09. Model: Qwen3.5-27B.
Q4_K_M, 48K context, dual GPU (
-sm layer):Q8_0, 32K context, dual GPU: 915 t/s, system RAM flat.
Single GPU: No regression. 467 t/s pp512, 17.12 tg128.
Correctness: Output is byte-for-byte identical between single and dual GPU with same seed (verified Q4_K_M, Q6_K).
Test plan