UPSTREAM PR #21597: SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations#1340

Open
loci-dev wants to merge 1 commit into main from loci/pr-21597-sycl-fix-multigpu-ram

Conversation


@loci-dev loci-dev commented Apr 8, 2026

Note

Source pull request: ggml-org/llama.cpp#21597

Summary

  • Replace sycl::malloc_device with zeMemAllocDevice for GPU memory allocation in the SYCL backend
  • Replace sycl::free with zeMemFree for corresponding deallocations
  • Replace host-staged dev2dev_memcpy with direct Level Zero cross-device copy
  • Link against ze_loader for Level Zero API access
  • All changes include automatic fallback to original SYCL path if Level Zero is unavailable

Problem

On Intel multi-GPU systems, sycl::malloc_device triggers the xe kernel driver's DMA-buf/TTM export path (xe_gem_prime_export -> ttm_pool_alloc_page), which creates a 1:1 mirror of every VRAM allocation in system RAM. This causes system RAM to scale linearly with total VRAM allocated across GPUs, leading to OOM crashes during multi-GPU inference even when models fit entirely in VRAM.

Measured on dual Intel Arc Pro B70 (32GB each, 64GB total VRAM) with 64GB system RAM:

  • sycl::malloc_device, 4 GiB allocation: +4,112 MiB system RAM (1:1 mirror)
  • zeMemAllocDevice, 4 GiB allocation: +8 MiB system RAM (no mirror)

A 15.6 GiB Q4_K_M model consumed 60 GiB of system RAM during dual-GPU inference with sycl::malloc_device, causing repeated OOM crashes.

Solution

zeMemAllocDevice allocates GPU memory through Level Zero's SVM/P2P path instead of the DMA-buf/TTM path, avoiding the host-RAM mirror entirely. Because SYCL on Intel GPUs interoperates with Level Zero, SYCL kernels can read zeMemAllocDevice pointers directly, so there are no compatibility issues.

Changes:

  • New ggml_sycl_malloc_device() / ggml_sycl_free_device() helpers that try Level Zero first, fall back to SYCL
  • Replaced 3 allocation sites: single-device buffer, split buffer, memory pool
  • Replaced 3 deallocation sites: buffer destructor, pool destructor, pool overflow
  • Updated dpct_malloc helper with same Level Zero path
  • Updated release_extra_gpu with zeMemFree
  • Updated dev2dev_memcpy to use zeCommandListAppendMemoryCopy for direct cross-device transfers

Test results

Dual Intel Arc Pro B70 (32GB each), AMD Ryzen 5 9600X, 64GB DDR5, Ubuntu 26.04, kernel 7.0, compute-runtime 26.09. Model: Qwen3.5-27B.

Q4_K_M, 48K context, dual GPU (-sm layer):

| Metric | Before | After |
| --- | --- | --- |
| Peak system RAM | 60,034 MiB (100%), OOM crash | ~6.7 GiB (10%), flat |
| pp48000 | OOM crash | 782 t/s |
| pp512 | 348 t/s | 359 t/s |
| tg128 | 17.92 t/s | 17.82 t/s |

Q8_0, 32K context, dual GPU: 915 t/s, system RAM flat.

Single GPU: no regression (467 t/s pp512, 17.12 t/s tg128).

Correctness: Output is byte-for-byte identical between single and dual GPU with same seed (verified Q4_K_M, Q6_K).

Test plan

  • Single GPU inference (no regression)
  • Dual GPU pp512/tg128 (Q4_K_M, Q6_K, Q8_0)
  • Dual GPU large context (48K Q4_K_M, 48K Q6_K, 32K Q8_0)
  • System RAM stays flat during all dual-GPU tests
  • Correctness: single vs dual GPU output matches with fixed seed
  • Clean exit (no crash during cleanup/teardown)
  • Fallback path: builds and works without Level Zero


Replace sycl::malloc_device with zeMemAllocDevice for GPU memory allocation
in the SYCL backend. sycl::malloc_device triggers the xe kernel driver's
DMA-buf/TTM path which mirrors every VRAM allocation 1:1 in system RAM.
zeMemAllocDevice uses the SVM/P2P path with no host staging.

On a dual Intel Arc Pro B70 system (64GB VRAM, 64GB RAM), a 15.6 GiB model
consumed 60 GiB of system RAM via sycl::malloc_device, causing OOM crashes.
With zeMemAllocDevice, the same workload uses ~6.7 GiB of system RAM with
no performance regression.

All Level Zero calls include automatic fallback to the original SYCL
allocation path if Level Zero interop is unavailable.

loci-review Bot commented Apr 8, 2026

No meaningful performance changes were detected across 125247 analyzed functions in the following binaries: build.bin.libmtmd.so, build.bin.llama-cvector-generator, build.bin.llama-tts, build.bin.llama-bench, build.bin.libllama.so, build.bin.llama-tokenize, build.bin.libggml.so, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli.

💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the main branch 7 times, most recently from e800934 to a024d9c Compare April 15, 2026 02:19
@loci-dev loci-dev force-pushed the main branch 6 times, most recently from 7638ab4 to f1b46d5 Compare April 20, 2026 02:19