UPSTREAM PR #21597: SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations#1340

Open
loci-dev wants to merge 1 commit into main from loci/pr-21597-sycl-fix-multigpu-ram

Conversation


@loci-dev loci-dev commented Apr 8, 2026

Note

Source pull request: ggml-org/llama.cpp#21597

Summary

  • Replace sycl::malloc_device with zeMemAllocDevice for GPU memory allocation in the SYCL backend
  • Replace sycl::free with zeMemFree for corresponding deallocations
  • Replace host-staged dev2dev_memcpy with direct Level Zero cross-device copy
  • Link against ze_loader for Level Zero API access
  • All changes include automatic fallback to original SYCL path if Level Zero is unavailable

Problem

On Intel multi-GPU systems, sycl::malloc_device triggers the xe kernel driver's DMA-buf/TTM export path (xe_gem_prime_export -> ttm_pool_alloc_page), which creates a 1:1 mirror of every VRAM allocation in system RAM. This causes system RAM to scale linearly with total VRAM allocated across GPUs, leading to OOM crashes during multi-GPU inference even when models fit entirely in VRAM.

Measured on dual Intel Arc Pro B70 (32GB each, 64GB total VRAM) with 64GB system RAM:

  • sycl::malloc_device, 4 GiB allocation: +4,112 MiB system RAM (1:1 mirror)
  • zeMemAllocDevice, 4 GiB allocation: +8 MiB system RAM (no mirror)

A 15.6 GiB Q4_K_M model consumed 60 GiB of system RAM during dual-GPU inference with sycl::malloc_device, causing repeated OOM crashes.

Solution

zeMemAllocDevice allocates GPU memory through Level Zero's SVM/P2P path instead of the DMA-buf/TTM path, avoiding the host-RAM mirror entirely. Because SYCL on Intel GPUs interoperates with Level Zero, SYCL kernels can read zeMemAllocDevice pointers directly, so there are no compatibility issues.

Changes:

  • New ggml_sycl_malloc_device() / ggml_sycl_free_device() helpers that try Level Zero first, fall back to SYCL
  • Replaced 3 allocation sites: single-device buffer, split buffer, memory pool
  • Replaced 3 deallocation sites: buffer destructor, pool destructor, pool overflow
  • Updated dpct_malloc helper with same Level Zero path
  • Updated release_extra_gpu with zeMemFree
  • Updated dev2dev_memcpy to use zeCommandListAppendMemoryCopy for direct cross-device transfers

Test results

Dual Intel Arc Pro B70 (32GB each), AMD Ryzen 5 9600X, 64GB DDR5, Ubuntu 26.04, kernel 7.0, compute-runtime 26.09. Model: Qwen3.5-27B.

Q4_K_M, 48K context, dual GPU (-sm layer):

| Metric | Before | After |
| --- | --- | --- |
| Peak system RAM | 60,034 MiB (100%), OOM crash | ~6.7 GiB (10%), flat |
| pp48000 | OOM crash | 782 t/s |
| pp512 | 348 t/s | 359 t/s |
| tg128 | 17.92 t/s | 17.82 t/s |

Q8_0, 32K context, dual GPU: 915 t/s, system RAM flat.

Single GPU: no regression (467 t/s pp512, 17.12 t/s tg128).

Correctness: Output is byte-for-byte identical between single and dual GPU with same seed (verified Q4_K_M, Q6_K).

Test plan

  • Single GPU inference (no regression)
  • Dual GPU pp512/tg128 (Q4_K_M, Q6_K, Q8_0)
  • Dual GPU large context (48K Q4_K_M, 48K Q6_K, 32K Q8_0)
  • System RAM stays flat during all dual-GPU tests
  • Correctness: single vs dual GPU output matches with fixed seed
  • Clean exit (no crash during cleanup/teardown)
  • Fallback path: builds and works without Level Zero


Replace sycl::malloc_device with zeMemAllocDevice for GPU memory allocation
in the SYCL backend. sycl::malloc_device triggers the xe kernel driver's
DMA-buf/TTM path which mirrors every VRAM allocation 1:1 in system RAM.
zeMemAllocDevice uses the SVM/P2P path with no host staging.

On a dual Intel Arc Pro B70 system (64GB VRAM, 64GB RAM), a 15.6 GiB model
consumed 60 GiB of system RAM via sycl::malloc_device, causing OOM crashes.
With zeMemAllocDevice, the same workload uses ~6.7 GiB of system RAM with
no performance regression.

All Level Zero calls include automatic fallback to the original SYCL
allocation path if Level Zero interop is unavailable.

loci-review Bot commented Apr 8, 2026

No meaningful performance changes were detected across 125247 analyzed functions in the following binaries: build.bin.libmtmd.so, build.bin.llama-cvector-generator, build.bin.llama-tts, build.bin.llama-bench, build.bin.libllama.so, build.bin.llama-tokenize, build.bin.libggml.so, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli.

💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the main branch 7 times, most recently from e800934 to a024d9c Compare April 15, 2026 02:19
@loci-dev loci-dev force-pushed the main branch 6 times, most recently from 7638ab4 to f1b46d5 Compare April 20, 2026 02:19