llama.cpp — HIFI Quantisation Fork

This is a fork of the ggml-org/llama.cpp project, focused on developing custom quantisation types — currently the HIFI family of quantisation variants.

The HIFI quantisation types aim to deliver better quality at the same (or similar) model sizes compared to the standard quantisation options. This is an ongoing, actively developed project and public contributions are welcome.

Upstream llama.cpp is LLM inference in C/C++. Highlights from upstream:

Recent API changes

Hot topics

Hugging Face cache migration: models downloaded with -hf are now stored in the standard Hugging Face cache directory, enabling sharing with other HF tools.
guide : using the new WebUI of llama.cpp
guide : running gpt-oss with llama.cpp
[FEEDBACK] Better packaging for llama.cpp to support downstream consumers 🤗
Support for the gpt-oss model with native MXFP4 format has been added | PR | Collaboration with NVIDIA | Comment
Multimodal support arrived in llama-server: #12898 | documentation
VS Code extension for FIM completions: https://github.com/ggml-org/llama.vscode
Vim/Neovim plugin for FIM completions: https://github.com/ggml-org/llama.vim
Hugging Face Inference Endpoints now support GGUF out of the box! ggml-org#9669
Hugging Face GGUF editor: discussion | tool

Quick start

To build and use HIFI quantised models, follow the detailed instructions in the HIFI Build Guide, which covers:

Cloning and building this fork
Downloading and converting base models
Creating imatrix files
Quantising models with the HIFI types
Running perplexity tests and benchmarks

About llama.cpp

The upstream llama.cpp project enables LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware — locally and in the cloud.

Plain C/C++ implementation without any dependencies
Apple silicon is a first-class citizen — optimised via ARM NEON, Accelerate and Metal frameworks
AVX, AVX2, AVX512 and AMX support for x86 architectures
RVV, ZVFH, ZFH, ZICBOP and ZIHINTPAUSE support for RISC-V architectures
1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantisation for faster inference and reduced memory use
Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads GPUs via MUSA)
Vulkan and SYCL backend support
CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity

For the full upstream project, see ggml-org/llama.cpp.

Supported models

Typically finetunes of the base models below are supported as well.

Text-only

Multimodal

Bindings

Python: ddh0/easy-llama
Python: abetlen/llama-cpp-python
Go: go-skynet/go-llama.cpp
Node.js: withcatai/node-llama-cpp
JS/TS (llama.cpp server client): lgrammel/modelfusion
JS/TS (Programmable Prompt Engine CLI): offline-ai/cli
JavaScript/Wasm (works in browser): tangledgroup/llama-cpp-wasm
Typescript/Wasm (nicer API, available on npm): ngxson/wllama
Ruby: yoshoku/llama_cpp.rb
Rust (more features): edgenai/llama_cpp-rs
Rust (nicer API): mdrokz/rust-llama.cpp
Rust (more direct bindings): utilityai/llama-cpp-rs
Rust (automated build from crates.io): ShelbyJenkins/llm_client
C#/.NET: SciSharp/LLamaSharp
C#/VB.NET (more features - community license): LM-Kit.NET
Scala 3: donderom/llm4s
Clojure: phronmophobic/llama.clj
React Native: mybigday/llama.rn
Java: kherud/java-llama.cpp
Java: QuasarByte/llama-cpp-jna
Zig: deins/llama.cpp.zig
Flutter/Dart: netdur/llama_cpp_dart
Flutter: xuegao-tzx/Fllama
PHP (API bindings and features built on top of llama.cpp): distantmagic/resonance (more info)
Guile Scheme: guile_llama_cpp
Swift srgtuszy/llama-cpp-swift
Swift ShenghaiWang/SwiftLlama
Delphi Embarcadero/llama-cpp-delphi
Go (no CGo needed): hybridgroup/yzma
Android: llama.android

UIs

(to have a project listed here, it should clearly state that it depends on llama.cpp)

AI Sublime Text plugin (MIT)
BonzAI App (proprietary)
cztomsik/ava (MIT)
Dot (GPL)
eva (MIT)
iohub/collama (Apache-2.0)
janhq/jan (AGPL)
johnbean393/Sidekick (MIT)
KanTV (Apache-2.0)
KodiBot (GPL)
llama.vim (MIT)
LARS (AGPL)
Llama Assistant (GPL)
LlamaLib (Apache-2.0)
LLMFarm (MIT)
LLMUnity (MIT)
LMStudio (proprietary)
LocalAI (MIT)
LostRuins/koboldcpp (AGPL)
MindMac (proprietary)
MindWorkAI/AI-Studio (FSL-1.1-MIT)
Mobile-Artificial-Intelligence/maid (MIT)
Mozilla-Ocho/llamafile (Apache-2.0)
nat/openplayground (MIT)
nomic-ai/gpt4all (MIT)
ollama/ollama (MIT)
oobabooga/text-generation-webui (AGPL)
PocketPal AI (MIT)
psugihara/FreeChat (MIT)
ptsochantaris/emeltal (MIT)
pythops/tenere (AGPL)
ramalama (MIT)
semperai/amica (MIT)
withcatai/catai (MIT)
Autopen (GPL)

Tools

akx/ggify – download PyTorch models from Hugging Face Hub and convert them to GGML
akx/ollama-dl – download models from the Ollama library to be used directly with llama.cpp
crashr/gppm – launch llama.cpp instances utilizing NVIDIA Tesla P40 or P100 GPUs with reduced idle power consumption
gpustack/gguf-parser - review/check the GGUF file and estimate the memory usage
Styled Lines (proprietary licensed, async wrapper of inference part for game development in Unity3d with pre-built Mobile and Web platform wrappers and a model example)
unslothai/unsloth – 🦥 exports/saves fine-tuned and trained models to GGUF (Apache-2.0)

Infrastructure

Paddler - Open-source LLMOps platform for hosting and scaling AI in your own infrastructure
GPUStack - Manage GPU clusters for running LLMs
llama_cpp_canister - llama.cpp as a smart contract on the Internet Computer, using WebAssembly
llama-swap - transparent proxy that adds automatic model switching with llama-server
Kalavai - Crowdsource end to end LLM deployment at any scale
llmaz - ☸️ Easy, advanced inference platform for large language models on Kubernetes.
LLMKube - Kubernetes operator for llama.cpp with multi-GPU and Apple Silicon Metal support

Games

Lucy's Labyrinth - A simple maze game where agents controlled by an AI model will try to trick you.

Supported backends

Backend	Target devices
Metal	Apple Silicon
BLAS	All
SYCL	Intel and Nvidia GPU
OpenVINO [In Progress]	Intel CPUs, GPUs, and NPUs
MUSA	Moore Threads GPU
CUDA	Nvidia GPU
HIP	AMD GPU
Vulkan	GPU
CANN	Ascend NPU
OpenCL	Adreno GPU
IBM zDNN	IBM Z & LinuxONE
WebGPU [In Progress]	All
RPC	All
Hexagon [In Progress]	Snapdragon
VirtGPU	VirtGPU APIR

Key tools

`llama-cli`

A CLI tool for accessing and experimenting with most of llama.cpp's functionality.

You can either manually download the GGUF file or directly use any llama.cpp-compatible models from Hugging Face or other model hosting sites, by using this CLI argument: -hf <user>/<model>[:quant]. For example:

llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

By default, the CLI would download from Hugging Face, you can switch to other options with the environment variable MODEL_ENDPOINT. The MODEL_ENDPOINT must point to a Hugging Face compatible API endpoint.

With a local GGUF file:

llama-cli -m model.gguf

`llama-server`

A lightweight, OpenAI API compatible, HTTP server for serving LLMs.

llama-server -m model.gguf --port 8080

`llama-perplexity`

A tool for measuring the perplexity of a model over a given text — essential for evaluating quantisation quality.

llama-perplexity -m model.gguf -f file.txt

`llama-bench`

Benchmark the performance of inference for various parameters.

llama-bench -m model.gguf

Contributing

This is an ongoing project and public contributions are welcome. Whether it's new quantisation types, performance improvements, bug fixes, or documentation — all contributions are appreciated.

Open a PR or issue on this repository
See CONTRIBUTING.md for general guidelines (inherited from upstream)
Read the HIFI Build Guide to get familiar with the project workflow

Upstream documentation

This fork inherits extensive documentation from the upstream project:

Dependencies

yhirose/cpp-httplib - Single-header HTTP server, used by llama-server - MIT license
stb-image - Single-header image format decoder, used by multimodal subsystem - Public domain
nlohmann/json - Single-header JSON library, used by various tools/examples - MIT License
miniaudio.h - Single-header audio format decoder, used by multimodal subsystem - Public domain
subprocess.h - Single-header process launching solution for C and C++ - Public domain

Name		Name	Last commit message	Last commit date
Latest commit History 9,052 Commits
.devops		.devops
.gemini		.gemini
.github		.github
benches		benches
ci		ci
cmake		cmake
common		common
docs		docs
examples		examples
ggml		ggml
gguf-py		gguf-py
grammars		grammars
include		include
licenses		licenses
media		media
models		models
pocs		pocs
requirements		requirements
scripts		scripts
src		src
tests		tests
tools		tools
vendor		vendor
.clang-format		.clang-format
.clang-tidy		.clang-tidy
.dockerignore		.dockerignore
.ecrc		.ecrc
.editorconfig		.editorconfig
.flake8		.flake8
.gitattributes		.gitattributes
.gitignore		.gitignore
.gitmodules		.gitmodules
.pre-commit-config.yaml		.pre-commit-config.yaml
AUTHORS		AUTHORS
CLAUDE.md		CLAUDE.md
CMakeLists.txt		CMakeLists.txt
CMakePresets.json		CMakePresets.json
CODEOWNERS		CODEOWNERS
HIFI_BUILD_GUIDE.md		HIFI_BUILD_GUIDE.md
IMatrix_Guide.md		IMatrix_Guide.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
SECURITY.md		SECURITY.md
benchmark_speed_test.ps1		benchmark_speed_test.ps1
benchmark_speed_test.sh		benchmark_speed_test.sh
build-xcframework.sh		build-xcframework.sh
convert_hf_to_gguf.py		convert_hf_to_gguf.py
convert_hf_to_gguf_update.py		convert_hf_to_gguf_update.py
convert_llama_ggml_to_gguf.py		convert_llama_ggml_to_gguf.py
convert_lora_to_gguf.py		convert_lora_to_gguf.py
flake.lock		flake.lock
flake.nix		flake.nix
mypy.ini		mypy.ini
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml
pyrightconfig.json		pyrightconfig.json
requirements.txt		requirements.txt
ty.toml		ty.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

llama.cpp — HIFI Quantisation Fork

Recent API changes

Hot topics

Quick start

About llama.cpp

Text-only

Multimodal

Supported backends

Key tools

`llama-cli`

`llama-server`

`llama-perplexity`

`llama-bench`

Contributing

Upstream documentation

Dependencies

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

llama.cpp — HIFI Quantisation Fork

Recent API changes

Hot topics

Quick start

About llama.cpp

Text-only

Multimodal

Supported backends

Key tools

llama-cli

llama-server

llama-perplexity

llama-bench

Contributing

Upstream documentation

Dependencies

About

Resources

License

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`llama-cli`

`llama-server`

`llama-perplexity`

`llama-bench`

Packages