Misc. bug: Prompt cache serialization crashes on second request with SSM model on Vulkan #21762

@task24

Description

@task24

Name and Version

  • llama.cpp commit: 2b2cd57
  • Version: master branch (2026-04-11)

Operating systems

  • Linux

Modules affected

  • libllama (core library)
  • llama-server

Command line

llama-server -hf nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF

Problem description & steps to reproduce

  1. Run llama-server with NVIDIA Nemotron-3 Nano 4B (SSM-based model)
  2. Send a chat completion request — works fine
  3. Send a second chat completion request
  4. Server crashes during prompt cache serialization

Workaround: disabling the prompt cache avoids the crash: llama-server -hf nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF --cache-ram 0

Environment

  • GPU: AMD Radeon RX 7900 XTX
  • Vulkan: 1.4.341
  • CPU: AMD Ryzen (16-core)

Relevant log output

The crash occurs at ggml-backend.cpp:348 during llama_kv_cache::state_write_data:

Error: GGML_ASSERT(tensor->data != NULL && "tensor not allocated") failed

Backtrace shows the failure during prompt cache save after first response completes successfully.
