
Misc. bug: Higher VRAM usage after b8738 #21759

@EldarBorge

Description


Name and Version

b8738 and later

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

libllama (core library), llama-server

Command line

./llama-server \
    --alias local-ai \
    -m /models/google_gemma-4-31B-it-Q4_K_S.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -np 1 \
    -ngl 99 \
    -c 131072 \
    -n 16000 \
    --mmproj /models/mmproj-google_gemma-4-31B-it-bf16.gguf \
    --cache-type-k q4_0 \
    --cache-type-v q4_0 \
    --cache-ram 2048 \
    -ctxcp 2 \
    --flash-attn on \
    --jinja \
    --chat-template-file /templates/google-gemma-4-31B-it-interleaved.jinja \
    --no-prefill-assistant \
    --chat-template-kwargs '{"enable_thinking":true}' \
    --temp 1.0 \
    --top-p 0.95 \
    --top-k 64 \
    --min-p 0.0 \
    --presence-penalty 1.5 \
    --repeat-penalty 1.0 \
    --metrics

Problem description & steps to reproduce

On a single RTX 3090, builds before b8738 use less VRAM, both at idle and under load, than b8738 and later.
Idle VRAM before b8738 is 22726 MiB, while on b8738+ it increases to 23172 MiB.
A stress test with a tuned full context passes before b8738 and fails on b8738+.
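Since the failure shows up with a large quantized KV cache, a rough back-of-the-envelope estimator helps put the VRAM numbers in perspective. This is a sketch; the hyperparameters below are illustrative assumptions, not the actual model's configuration:

```python
def kv_cache_bytes(n_ctx: int, n_layer: int, n_head_kv: int,
                   head_dim: int, bytes_per_elem: float) -> int:
    """Approximate KV-cache size: K and V each store
    n_ctx * n_head_kv * head_dim elements per layer."""
    return int(2 * n_layer * n_ctx * n_head_kv * head_dim * bytes_per_elem)

# q4_0 packs 32 elements into an 18-byte block (16 B of 4-bit quants
# plus a 2-byte fp16 scale), i.e. 0.5625 bytes per element.
Q4_0_BYTES_PER_ELEM = 18 / 32

# Illustrative hyperparameters (assumptions, not the real Gemma config).
est = kv_cache_bytes(n_ctx=131072, n_layer=48, n_head_kv=8,
                     head_dim=128, bytes_per_elem=Q4_0_BYTES_PER_ELEM)
print(f"{est / 1024**2:.0f} MiB")  # → 6912 MiB
```

Even fully quantized, a 131072-token context consumes several GiB on top of the weights, so a few hundred extra MiB of baseline usage is enough to tip a tuned setup over the 24 GiB limit.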

According to an AI-assisted investigation, the suspected bad state is that ggml_cuda_init() initializes NCCL communicators even on single-GPU runs, which consumes unnecessary VRAM. The suggested fix is to initialize NCCL only when more than one GPU is present.

To reproduce: run a pre-b8738 build and measure idle VRAM, then repeat on b8738+. Observed at least with a single RTX 3090 on this setup.
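One way to script the idle-VRAM comparison is to poll nvidia-smi. A minimal sketch (the helper names here are my own; it assumes nvidia-smi is on PATH):

```python
import subprocess

def parse_mib(csv_cell: str) -> int:
    """Parse an nvidia-smi CSV cell such as '22726 MiB' into an integer."""
    return int(csv_cell.strip().split()[0])

def gpu_mem_used_mib(index: int = 0) -> int:
    """Query used VRAM for one GPU via nvidia-smi's CSV output."""
    out = subprocess.check_output(
        ["nvidia-smi", f"--id={index}",
         "--query-gpu=memory.used", "--format=csv,noheader"],
        text=True,
    )
    return parse_mib(out)

# Usage: call gpu_mem_used_mib() once with the pre-b8738 server idle,
# once with the b8738+ server idle, and compare the two readings.
```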

First Bad Commit

d6f3030047f85a98b009189e76f441fe818ea44d (b8738)

Relevant log output

Pre-b8738 idle VRAM: 22726 MiB; b8738+ idle VRAM: 23172 MiB.
