Name and Version
- llama.cpp commit: 2b2cd57
- Version: master branch (2026-04-11)
Operating systems
Modules affected
- libllama (core library)
- llama-server
Command line
llama-server -hf nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF
Problem description & steps to reproduce
- Run llama-server with NVIDIA Nemotron-3 Nano 4B (SSM-based model)
- Send a chat completion request — works fine
- Send a second chat completion request
- Server crashes during prompt cache serialization
Workaround: llama-server -hf nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF --cache-ram 0 (disabling the prompt cache avoids the crash)
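The steps above can be sketched as a minimal repro script. This assumes the server's default port 8080 and the OpenAI-compatible /v1/chat/completions endpoint; the request bodies are illustrative:

```shell
# Start the server with the prompt cache enabled (the default).
llama-server -hf nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF &
sleep 30  # wait for the model to download/load; adjust as needed

# First request: completes successfully.
curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Hello"}]}'

# Second request: the server crashes while serializing the prompt cache.
curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Hello again"}]}'
```

Passing --cache-ram 0 on the llama-server line above makes both requests succeed.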
Environment
- GPU: AMD Radeon RX 7900 XTX
- Vulkan: 1.4.341
- CPU: Ryzen 16-core
Relevant log output
The crash occurs in ggml-backend.cpp:348 during llama_kv_cache::state_write_data:

GGML_ASSERT(tensor->data != NULL && "tensor not allocated") failed

The backtrace shows the failure happens during the prompt cache save, after the first response has completed successfully.