geoffmunn/llama.cpp
llama.cpp – HIFI Quantisation Fork

License: MIT

This is a fork of the ggml-org/llama.cpp project, focused on developing custom quantisation types – currently the HIFI family of quantisation variants.

The HIFI quantisation types aim to deliver better quality at the same (or similar) model sizes compared to the standard quantisation options. This is an ongoing, actively developed project and public contributions are welcome.

Upstream llama.cpp provides LLM inference in C/C++. See the upstream project for:

  • Recent API changes
  • Hot topics


Quick start

To build and use HIFI quantised models, follow the detailed instructions in the HIFI Build Guide, which covers:

  • Cloning and building this fork
  • Downloading and converting base models
  • Creating imatrix files
  • Quantising models with the HIFI types
  • Running perplexity tests and benchmarks
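Assuming this fork keeps the standard llama.cpp tool names, the steps above look roughly like this (the model paths and the <HIFI_TYPE> identifier are placeholders – the HIFI Build Guide has the exact names and flags):

```shell
# Clone and build this fork
git clone https://github.com/geoffmunn/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Convert a downloaded Hugging Face model to GGUF
python convert_hf_to_gguf.py /path/to/model --outfile model-f16.gguf

# Create an importance matrix from a calibration text
./build/bin/llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix

# Quantise with a HIFI type, then measure perplexity
./build/bin/llama-quantize --imatrix model.imatrix model-f16.gguf model-hifi.gguf <HIFI_TYPE>
./build/bin/llama-perplexity -m model-hifi.gguf -f calibration.txt
```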

About llama.cpp

The upstream llama.cpp project enables LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware – locally and in the cloud.

  • Plain C/C++ implementation without any dependencies
  • Apple silicon is a first-class citizen – optimised via ARM NEON, Accelerate and Metal frameworks
  • AVX, AVX2, AVX512 and AMX support for x86 architectures
  • RVV, ZVFH, ZFH, ZICBOP and ZIHINTPAUSE support for RISC-V architectures
  • 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantisation for faster inference and reduced memory use
  • Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads GPUs via MUSA)
  • Vulkan and SYCL backend support
  • CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity
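The integer formats above share a common block-wise structure: weights are grouped into small blocks, each stored as low-bit integers plus a shared scale. Below is a minimal Python sketch of a symmetric 8-bit block quantiser in the spirit of Q8_0 – the block size and rounding here are illustrative only, not the HIFI scheme:

```python
# Symmetric 8-bit block quantisation sketch: each block of 32 weights is
# stored as signed int8 values plus one shared floating-point scale.
def quantize_q8(block):
    amax = max(abs(x) for x in block)            # per-block absolute maximum
    scale = amax / 127.0 if amax > 0 else 1.0    # map [-amax, amax] -> [-127, 127]
    q = [max(-127, min(127, round(x / scale))) for x in block]
    return scale, q

def dequantize_q8(scale, q):
    return [scale * v for v in q]

weights = [0.5, -1.25, 0.03125, 2.0] * 8         # one 32-weight block
scale, q = quantize_q8(weights)
restored = dequantize_q8(scale, q)
err = max(abs(a - b) for a, b in zip(weights, restored))
```

The reconstruction error is bounded by half the scale per weight, which is why quality depends so strongly on how scales (and importance matrices) are chosen.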

For the full upstream project, see ggml-org/llama.cpp.

Supported models

Finetunes of the base models below are typically supported as well.

Text-only

Multimodal

Bindings
UIs

(to have a project listed here, it should clearly state that it depends on llama.cpp)

Tools
  • akx/ggify – download PyTorch models from Hugging Face Hub and convert them to GGML
  • akx/ollama-dl – download models from the Ollama library to be used directly with llama.cpp
  • crashr/gppm – launch llama.cpp instances utilizing NVIDIA Tesla P40 or P100 GPUs with reduced idle power consumption
  • gpustack/gguf-parser – review/check a GGUF file and estimate its memory usage
  • Styled Lines (proprietary licensed, async wrapper of inference part for game development in Unity3d with pre-built Mobile and Web platform wrappers and a model example)
  • unslothai/unsloth – 🦥 exports/saves fine-tuned and trained models to GGUF (Apache-2.0)
Infrastructure
  • Paddler - Open-source LLMOps platform for hosting and scaling AI in your own infrastructure
  • GPUStack - Manage GPU clusters for running LLMs
  • llama_cpp_canister - llama.cpp as a smart contract on the Internet Computer, using WebAssembly
  • llama-swap - transparent proxy that adds automatic model switching with llama-server
  • Kalavai - Crowdsource end to end LLM deployment at any scale
  • llmaz - ☸️ Easy, advanced inference platform for large language models on Kubernetes.
  • LLMKube - Kubernetes operator for llama.cpp with multi-GPU and Apple Silicon Metal support
Games
  • Lucy's Labyrinth - A simple maze game where agents controlled by an AI model will try to trick you.

Supported backends

Backend      Target devices
Metal        Apple Silicon
BLAS         All
SYCL         Intel and Nvidia GPU
OpenVINO     Intel CPUs, GPUs, and NPUs [In Progress]
MUSA         Moore Threads GPU
CUDA         Nvidia GPU
HIP          AMD GPU
Vulkan       GPU
CANN         Ascend NPU
OpenCL       Adreno GPU
IBM zDNN     IBM Z & LinuxONE
WebGPU       All [In Progress]
RPC          All
Hexagon      Snapdragon [In Progress]
VirtGPU      VirtGPU APIR

Key tools

llama-cli

A CLI tool for accessing and experimenting with most of llama.cpp's functionality.

You can either download a GGUF file manually or use any llama.cpp-compatible model from Hugging Face (or another compatible model host) directly with the -hf <user>/<model>[:quant] argument. For example:

llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

By default, the CLI downloads from Hugging Face. To use another host, set the MODEL_ENDPOINT environment variable to a Hugging Face-compatible API endpoint.

With a local GGUF file:

llama-cli -m model.gguf

llama-server

A lightweight, OpenAI API-compatible HTTP server for serving LLMs.

llama-server -m model.gguf --port 8080
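Because the server speaks the OpenAI chat API, any OpenAI-style client can talk to it. The sketch below builds such a request by hand with the standard library; the port matches the command above, and the payload fields are the usual OpenAI chat-completion ones:

```python
import json
from urllib import request

# llama-server exposes an OpenAI-compatible API, so a chat request is a
# plain JSON POST to /v1/chat/completions on the configured port.
def chat_payload(prompt, model="model.gguf", temperature=0.7):
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

payload = chat_payload("Why is the sky blue?")
req = request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# with request.urlopen(req) as r:   # uncomment against a running server
#     print(json.load(r)["choices"][0]["message"]["content"])
```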

llama-perplexity

A tool for measuring the perplexity of a model over a given text – essential for evaluating quantisation quality.

llama-perplexity -m model.gguf -f file.txt
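Perplexity itself is just the exponential of the average negative log-likelihood over the evaluation tokens; a small Python illustration (the log-probabilities below are made up, not real model output):

```python
import math

# Perplexity is the exponential of the average negative log-likelihood
# the model assigns to each token of the evaluation text: lower is better.
def perplexity(token_logprobs):
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Illustrative per-token natural-log probabilities:
logprobs = [-1.2, -0.4, -2.3, -0.8]
ppl = perplexity(logprobs)           # exp(1.175), about 3.24
```

Comparing this number for an f16 model against its HIFI-quantised version shows how much quality the quantisation costs.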

llama-bench

A tool for benchmarking inference performance across various parameters.

llama-bench -m model.gguf

Contributing

This is an ongoing project and public contributions are welcome. Whether it's new quantisation types, performance improvements, bug fixes, or documentation – all contributions are appreciated.

  • Open a PR or issue on this repository
  • See CONTRIBUTING.md for general guidelines (inherited from upstream)
  • Read the HIFI Build Guide to get familiar with the project workflow

Upstream documentation

This fork inherits extensive documentation from the upstream project.

Dependencies

  • yhirose/cpp-httplib - Single-header HTTP server, used by llama-server - MIT license
  • stb-image - Single-header image format decoder, used by multimodal subsystem - Public domain
  • nlohmann/json - Single-header JSON library, used by various tools/examples - MIT License
  • miniaudio.h - Single-header audio format decoder, used by multimodal subsystem - Public domain
  • subprocess.h - Single-header process launching solution for C and C++ - Public domain
