A lightweight, cross-platform desktop utility for configuring GGUF models and estimating VRAM usage.
LLAMA.CPP Manager is an extremely lightweight desktop application that helps you:
- Select and analyze local GGUF models
- Configure inference parameters (context length, GPU offload, batch size, etc.)
- Estimate VRAM and RAM usage before loading models
- Detect available GPU resources
- Save and restore configuration profiles
Key Features:
- Ultra-lightweight: <50 MB VRAM footprint
- Fast startup and single-binary deployment
- Cross-platform: Windows, macOS, Linux
- Clean Dear ImGui interface
- Integrates with llama.cpp tools for accurate model analysis
- NEW: Functional file browser for easy model selection
- NEW: Export configurations to llama.cpp CLI commands
- NEW: Optional native file dialog support
- NEW: Recent models list with per-model settings persistence
- NEW: Hybrid configuration storage for better organization
- llama.cpp: This utility requires llama.cpp to be installed on your system
  - Download from: https://github.com/ggerganov/llama.cpp
  - Ensure `llama-cli` (or `main`) is in your system PATH
  - Optionally, `llama-gguf-dump` for detailed model inspection
- C++17 compatible compiler (GCC 8+, Clang 7+, MSVC 2019+)
- CMake 3.15 or higher
- SDL2
- OpenGL 3.0+
Ubuntu/Debian:

    sudo apt-get update
    sudo apt-get install build-essential cmake libsdl2-dev

macOS:

    brew install cmake sdl2

Windows: Install vcpkg and use it to install dependencies:

    vcpkg install sdl2:x64-windows

- Clone the repository:

      cd ~/Developer
      git clone https://github.com/takasurazeem/llama_cpp_manager.git
      cd llama_cpp_manager

- Download Dear ImGui (if not using as submodule):

      mkdir -p extern
      cd extern
      git clone https://github.com/ocornut/imgui.git
      cd ..

- Build the project:

      mkdir -p build
      cd build
      cmake ..
      # On macOS, replace $(nproc) with $(sysctl -n hw.ncpu)
      make -j$(nproc)

  On Windows with Visual Studio:

      mkdir build
      cd build
      cmake .. -G "Visual Studio 16 2019" -A x64
      cmake --build . --config Release

- Run the application:

      ./llama_cpp_manager

Installing llama.cpp on Linux:

    # Clone llama.cpp
    git clone https://github.com/ggerganov/llama.cpp.git
    cd llama.cpp
    # Build
    make
    # Add to PATH for the current session. To make this permanent, add the
    # absolute path of this directory to ~/.bashrc or ~/.zshrc instead of $(pwd).
    export PATH="$PATH:$(pwd)"

Installing llama.cpp on macOS:

    # Using Homebrew
    brew install llama.cpp
    # Or build from source
    git clone https://github.com/ggerganov/llama.cpp.git
    cd llama.cpp
    make

Installing llama.cpp on Windows:

- Download prebuilt binaries from llama.cpp releases
- Or build with CMake:

      git clone https://github.com/ggerganov/llama.cpp.git
      cd llama.cpp
      mkdir build
      cd build
      cmake ..
      cmake --build . --config Release

- Add the build directory to your system PATH

Run this command to verify llama.cpp is installed:

    llama-cli --version

If the utility can't find llama.cpp, you'll see a setup dialog where you can:
- Install llama.cpp and retry detection
- Specify a custom installation path
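
The detection step described above can be approximated outside the application. The following is a minimal sketch, not the utility's actual implementation: the function name `find_llama_cli` and the search order (custom directory first, then PATH) are assumptions made for illustration. It looks for `llama-cli` and then `main`, the two executable names this README mentions.

```python
import shutil

# Executable names this README mentions: `llama-cli`, or `main` as an alternative.
CANDIDATE_NAMES = ("llama-cli", "main")


def find_llama_cli(custom_dir=None):
    """Return the full path of a llama.cpp CLI binary, or None if not found.

    Hypothetical helper: searches an optional user-specified directory
    first, then the system PATH.
    """
    search_paths = [str(custom_dir)] if custom_dir else []
    search_paths.append(None)  # None tells shutil.which to search PATH
    for search_path in search_paths:
        for name in CANDIDATE_NAMES:
            found = shutil.which(name, path=search_path)
            if found:
                return found
    return None
```

Because `main` is a very generic name, a match on it may be an unrelated program; checking the output of `--version` afterwards would be a reasonable extra safeguard.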
Major Feature Update!
- Functional File Browser: Browse and select GGUF models with an intuitive UI
- CLI Command Export: Generate ready-to-use `llama-cli` commands with your settings
- Native File Dialogs (optional): Professional cross-platform file dialogs
See FEATURES.md for detailed information about new features.
- Launch the application
  - On first run, the app tries to detect your llama.cpp installation
  - If it is not found, you'll be prompted to install it or specify its path
- Select a GGUF model
  - Click "Browse GGUF Model..."
  - Navigate through directories to find your model
  - Select a `.gguf` file (e.g., `llama-2-7b-Q4_K_M.gguf`)
  - Model info loads automatically!
  - Or select from the Recently Selected Models list
- Configure parameters
  - Context Length: Maximum context window (tokens)
  - GPU Offload: Number of layers to offload to GPU
  - CPU Threads: Thread pool size for CPU inference
  - Batch Size: Evaluation batch size
- Monitor VRAM estimates
  - Real-time VRAM/RAM usage estimates
  - Warning indicators if the configuration exceeds available VRAM
- Export to CLI (NEW!)
  - Click "Export CLI Command" to generate a llama.cpp command
  - Copy it to the clipboard and run it in a terminal
- Save configuration
  - Check "Remember settings" to save per-model configurations
  - Settings are restored automatically when you select a model from the recent list
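
To illustrate the "Export to CLI" step, here is a rough sketch of how the four basic parameters could be mapped to a `llama-cli` command line. The function name and the exact output format are assumptions; the command the application generates may differ. The flags used (`--model`, `--ctx-size`, `--n-gpu-layers`, `--threads`, `--batch-size`) are standard llama.cpp options.

```python
import shlex


def build_llama_cli_command(model_path, context_length=2048, gpu_layers=0,
                            threads=8, batch_size=512, executable="llama-cli"):
    """Build a shell-quoted llama-cli command from basic settings.

    Hypothetical helper; defaults mirror the parameter table in this README.
    """
    args = [
        executable,
        "--model", str(model_path),
        "--ctx-size", str(context_length),
        "--n-gpu-layers", str(gpu_layers),
        "--threads", str(threads),
        "--batch-size", str(batch_size),
    ]
    # shlex.join applies POSIX shell quoting, so paths with spaces stay intact.
    return shlex.join(args)


print(build_llama_cli_command("/models/llama-2-7b-Q4_K_M.gguf",
                              context_length=4096, gpu_layers=32))
```

The quoting here follows POSIX shell rules; a command intended for the Windows command prompt would need different quoting.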
Enable "Show advanced settings" to access:
- RoPE Frequency Base/Scale: For extended context
- Flash Attention: Experimental faster attention (may not work with all models)
- KV Cache Quantization: Reduce memory usage
- MoE Settings: For Mixtral and other MoE models
- Expert CPU Offloading: Force expert weights to CPU
| Parameter | Description | Default | Range |
|---|---|---|---|
| Context Length | Maximum sequence length | 2048 | 512 - 262144 |
| GPU Offload | Layers offloaded to GPU | 0 | 0 - model layers |
| CPU Threads | Thread pool size | 8 | 1 - 32 |
| Batch Size | Evaluation batch size | 512 | 32 - 2048 |
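
The ranges in the table above can be enforced with a simple clamp. This is an illustrative sketch only: the setting keys are borrowed from the example config in this README, and `clamp_settings` is a hypothetical helper, not a function of the application.

```python
# Allowed ranges from the basic parameter table; the GPU offload upper
# bound depends on the model, so it is supplied at call time.
RANGES = {
    "context_length": (512, 262144),
    "threads": (1, 32),
    "batch_size": (32, 2048),
}


def clamp_settings(settings, model_layers):
    """Return a copy of `settings` with each known value forced into range."""
    limits = dict(RANGES, gpu_layers=(0, model_layers))
    clamped = dict(settings)
    for key, (low, high) in limits.items():
        if key in clamped:
            clamped[key] = max(low, min(high, int(clamped[key])))
    return clamped
```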
| Parameter | Description | Default |
|---|---|---|
| Offload KV Cache | Store KV cache on GPU | β |
| Keep Model in Memory | Don't unload between uses | β |
| Use mmap() | Memory-map model file | β |
- Flash Attention: Faster attention mechanism
- K/V Cache Quantization: Reduce cache memory
- Force Experts to CPU: For MoE models with limited VRAM
The utility estimates memory usage using:
Total VRAM = Model Weights + KV Cache + Input/Output Buffers + Overhead
Model Weights (GPU):
GPU Weights = Total Model Size × (GPU Layers / Total Layers)
KV Cache:
KV Cache = 2 × Layers × Context Length × Embedding Dim × Bytes Per Element
Overhead:
- CUDA: ~512 MB
- Metal: ~256 MB
- CPU-only: ~512 MB
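
The formulas above translate directly into code. The sketch below is an assumption-laden illustration rather than the utility's estimator: the `ModelInfo` type and function names are hypothetical, overhead values are treated as MiB, and the input/output buffer term is left as a caller-supplied parameter because this README does not give a formula for it.

```python
from dataclasses import dataclass

MIB = 1024 * 1024

# Fixed overhead per backend, using the approximate values listed above.
OVERHEAD_MIB = {"cuda": 512, "metal": 256, "cpu": 512}


@dataclass
class ModelInfo:
    file_size_bytes: int  # total size of the model weights
    n_layers: int         # number of transformer layers
    embedding_dim: int    # embedding (hidden) dimension


def kv_cache_bytes(model, context_length, bytes_per_element=2.0):
    # 2 (keys and values) x layers x context x embedding dim x element size
    return (2 * model.n_layers * context_length
            * model.embedding_dim * bytes_per_element)


def estimate_vram_bytes(model, gpu_layers, context_length, backend="cuda",
                        io_buffer_bytes=0, bytes_per_element=2.0):
    # Weights on the GPU scale with the fraction of layers offloaded.
    gpu_layers = max(0, min(gpu_layers, model.n_layers))
    weights = model.file_size_bytes * gpu_layers / model.n_layers
    kv = kv_cache_bytes(model, context_length, bytes_per_element)
    return weights + kv + io_buffer_bytes + OVERHEAD_MIB[backend] * MIB


model = ModelInfo(file_size_bytes=4 * 1024**3, n_layers=32, embedding_dim=4096)
print(estimate_vram_bytes(model, gpu_layers=16, context_length=4096) / 1024**3)
```

For this hypothetical 4 GiB, 32-layer model with half the layers offloaded and a 4096-token f16 cache, the three terms are 2 GiB of weights, 2 GiB of KV cache and 0.5 GiB of overhead, 4.5 GiB in total.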
Project structure:

    llama_cpp_manager/
    ├── CMakeLists.txt              # Build configuration
    ├── README.md                   # This file
    ├── LICENSE                     # MIT License
    ├── extern/                     # External dependencies
    │   └── imgui/                  # Dear ImGui (submodule)
    ├── include/                    # Header files
    │   ├── gguf_reader.h           # GGUF file parser
    │   ├── vram_estimator.h        # Memory calculation
    │   ├── gpu_detector.h          # GPU detection
    │   ├── llama_cpp_interface.h   # llama.cpp integration
    │   ├── config_manager.h        # Configuration I/O
    │   └── file_browser.h          # File selection
    ├── src/                        # Implementation files
    │   ├── main.cpp                # Application entry & UI
    │   ├── gguf_reader.cpp
    │   ├── vram_estimator.cpp
    │   ├── gpu_detector.cpp
    │   ├── llama_cpp_interface.cpp
    │   ├── config_manager.cpp
    │   └── file_browser.cpp
    └── build/                      # Build output (generated)

The application automatically detects GPUs using:
- NVIDIA: `nvidia-smi` command
- AMD: `rocm-smi` command (Linux)
- Apple: Metal framework (macOS)
- Fallback: Vulkan device enumeration
If GPU detection fails, the app will default to CPU-only mode.
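
As an illustration of the NVIDIA path, the sketch below queries `nvidia-smi` for GPU names and total memory and returns an empty list when the tool is missing or fails, which corresponds to the CPU-only fallback. The function names are hypothetical and the application's own detection code may work differently; the `--query-gpu` and `--format` options are standard `nvidia-smi` flags.

```python
import shutil
import subprocess


def parse_nvidia_smi_csv(output):
    """Parse 'name, memory.total' CSV rows into (name, total_mib) tuples."""
    gpus = []
    for line in output.strip().splitlines():
        # Split on the last comma so GPU names containing commas survive.
        name, _, mem = line.rpartition(",")
        if not name:
            continue
        try:
            gpus.append((name.strip(), int(mem.strip())))
        except ValueError:
            continue  # skip rows whose memory field is not a number
    return gpus


def detect_nvidia_gpus():
    """Return detected NVIDIA GPUs, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, timeout=5, check=True)
    except (subprocess.SubprocessError, OSError):
        return []
    return parse_nvidia_smi_csv(result.stdout)
```

With `nounits`, the memory column is reported in MiB, so a 24 GB card appears as `24576`.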
Settings are stored in a hybrid configuration system for better organization:
Global configuration file:
- Linux/macOS: `~/.config/llama_manager/config.json`
- Windows: `%USERPROFILE%\.config\llama_manager\config.json`

Contains:
- Global default settings
- Recent models list (up to 10 models)
- Last selected model path
- Remember settings preference

Per-model configuration files:
- Linux/macOS: `~/.config/llama_manager/models/<hash>.json`
- Windows: `%USERPROFILE%\.config\llama_manager\models\<hash>.json`

Each model gets its own configuration file (hash based on path):
- Model-specific inference settings
- Persists independently
- Automatically loaded when selecting from recent list
- Only saved when "Remember settings" is enabled
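
The per-model file naming can be sketched as follows. This README only states that the hash is based on the model path; the choice of SHA-256 truncated to 16 hex characters, and the name `model_config_path`, are assumptions for illustration and may not match what the application does.

```python
import hashlib
from pathlib import Path


def model_config_path(model_path, config_dir=None):
    """Map a model path to its per-model JSON settings file.

    Assumption: the hash is SHA-256 of the path string, truncated to 16
    hex characters. The real utility may use a different hash.
    """
    if config_dir is None:
        # ~/.config/llama_manager, or %USERPROFILE%\.config\llama_manager
        config_dir = Path.home() / ".config" / "llama_manager"
    digest = hashlib.sha256(str(model_path).encode("utf-8")).hexdigest()[:16]
    return Path(config_dir) / "models" / f"{digest}.json"


print(model_config_path("/path/to/model.gguf"))
```

Because the hash depends on the path rather than the file contents, moving or renaming a model produces a new settings file under this scheme.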
Example global config:

    {
      "context_length": 4096,
      "gpu_layers": 32,
      "batch_size": 512,
      "threads": 10,
      "offload_kv_cache": true,
      "use_mmap": true,
      "last_model_path": "/path/to/model.gguf",
      "remember_settings": true,
      "recent_models": [
        "/path/to/model1.gguf",
        "/path/to/model2.gguf"
      ]
    }

See MODEL_SETTINGS.md for detailed information about the configuration system.
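
A reader of this file might load it roughly as shown below. This is a sketch under stated assumptions: the helper names are hypothetical, missing keys fall back to the defaults from the parameter table, and the recent list is capped at the 10 entries mentioned above. It is not the application's `config_manager` code.

```python
import json

# Defaults taken from the basic parameter table in this README.
DEFAULTS = {
    "context_length": 2048,
    "gpu_layers": 0,
    "threads": 8,
    "batch_size": 512,
    "recent_models": [],
}
MAX_RECENT = 10  # the global config keeps up to 10 recent models


def load_config(text):
    """Parse config JSON, filling in defaults for missing or invalid input."""
    try:
        loaded = json.loads(text)
    except json.JSONDecodeError:
        loaded = {}
    if not isinstance(loaded, dict):
        loaded = {}
    config = {**DEFAULTS, **loaded}
    config["recent_models"] = list(config["recent_models"])[:MAX_RECENT]
    return config


def add_recent_model(config, path):
    """Move `path` to the front of the recent list, dropping the oldest."""
    recent = [p for p in config["recent_models"] if p != path]
    config["recent_models"] = ([path] + recent)[:MAX_RECENT]
```

Falling back to defaults on malformed JSON keeps a corrupted file from blocking startup, at the cost of silently ignoring the user's saved values.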
llama.cpp not found:
- Ensure llama.cpp is installed and `llama-cli` is in your PATH
- Use the "Specify llama.cpp Location" option in the setup dialog
- Verify installation: `llama-cli --version`

GPU not detected:
- Install GPU drivers (NVIDIA CUDA, AMD ROCm, etc.)
- Check GPU detection: `nvidia-smi` or `rocm-smi`
- If no GPU is found, the application falls back to CPU mode

Build errors:
- Ensure all dependencies are installed
- Check your CMake version with `cmake --version` (3.15+ is required) and update if needed
- Check that your compiler version supports C++17

Estimated VRAM exceeds available memory:
- Reduce context length
- Decrease GPU layer offload
- Enable KV cache quantization
- Use a more quantized model variant (e.g., Q4_K_M instead of Q8_0)
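
To give a sense of how much the first and third suggestions help, the sketch below applies the KV cache formula from this README with different cache element sizes. The per-element sizes for `q8_0` and `q4_0` follow the ggml block layouts (34 and 18 bytes per 32 values); the function name and the example model dimensions are hypothetical.

```python
# Approximate bytes per cached element for common KV cache types.
# q8_0 stores 32 values in 34 bytes, q4_0 stores 32 values in 18 bytes.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_mib(n_layers, context_length, embedding_dim, cache_type="f16"):
    """KV cache size in MiB: 2 x layers x context x embedding dim x bytes."""
    elements = 2 * n_layers * context_length * embedding_dim
    return elements * BYTES_PER_ELEMENT[cache_type] / (1024 * 1024)


for cache_type in ("f16", "q8_0", "q4_0"):
    print(cache_type, kv_cache_mib(32, 4096, 4096, cache_type))
```

For this example (32 layers, embedding dimension 4096, 4096-token context), the cache shrinks from 2048 MiB at f16 to 1088 MiB at q8_0 and 576 MiB at q4_0, and halving the context length halves each figure.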
- Model Selection
  - Q4_K_M generally offers a good quality/size balance
  - Q8_0 for maximum quality (about 2× the size of Q4_K_M)
  - IQ2_XXS/IQ3_XXS for extreme compression
- GPU Offload
  - Offload as many layers as VRAM allows
  - Leave 1-2 GB VRAM free for system/driver
- Context Length
  - Use the minimum needed for your task
  - KV cache grows linearly with context
- Batch Size
  - Larger = faster (more parallel processing)
  - Smaller = lower memory usage
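
The GPU offload tip can be turned into a quick calculation. The sketch below is a simplified illustration, not the application's logic: it assumes all layers are the same size, takes the KV cache and overhead as given numbers, and reserves 2 GiB of headroom by default. All names are hypothetical.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def max_gpu_layers(model_size_bytes, n_layers, free_vram_bytes,
                   kv_cache_bytes=0, overhead_bytes=512 * MIB,
                   headroom_bytes=2 * GIB):
    """Largest layer count whose weights fit in the remaining VRAM budget."""
    if model_size_bytes <= 0 or n_layers <= 0:
        return 0
    budget = free_vram_bytes - headroom_bytes - overhead_bytes - kv_cache_bytes
    if budget <= 0:
        return 0
    per_layer = model_size_bytes / n_layers  # assumes equally sized layers
    return min(n_layers, int(budget // per_layer))


# Hypothetical 4 GiB model with 32 layers, 8 GiB of free VRAM, 2 GiB KV cache.
print(max_gpu_layers(4 * GIB, 32, 8 * GIB, kv_cache_bytes=2 * GIB))
```

In this example the budget left for weights is 3.5 GiB and each layer takes 128 MiB, so 28 of the 32 layers fit.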
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
MIT License - see LICENSE file for details.
- llama.cpp - Inference engine
- Dear ImGui - Immediate mode GUI
- SDL2 - Cross-platform window/input
- Issues: https://github.com/takasurazeem/llama_cpp_manager/issues
- Discussions: https://github.com/takasurazeem/llama_cpp_manager/discussions
- llama.cpp: https://github.com/ggerganov/llama.cpp
Note: This utility does not load or run models. It only analyzes model files and estimates resource requirements. Use llama.cpp directly for inference.