Commit 251a02d

unamedkr and claude committed

Add Python bindings: from quantcpp import Model

bindings/python/quantcpp/ — ctypes wrapper for libturboquant:

- Model("model.gguf") — load GGUF model
- model.ask("question") — one-shot generation
- model.chat("question") — auto chat template
- model.generate("prompt") — streaming iterator
- KV compression support

Install: cd bindings/python && pip install -e .

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
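The commit message describes the bindings as a ctypes wrapper around libturboquant. A minimal sketch of the general pattern, demonstrated here against the C standard library so it runs without the project built (libturboquant's actual exported symbol names are not shown in this diff, so none are assumed):

```python
# Generic ctypes wrapper pattern, sketched against libc rather than
# libturboquant (whose exported symbol names are not part of this diff).
import ctypes
import ctypes.util

# Locate and load a shared library by name
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature so ctypes marshals arguments correctly;
# a real wrapper does this once per exported function.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.strlen(b"quantcpp"))  # prints 8
```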
1 parent 279f235 commit 251a02d

4 files changed: 817 additions & 16 deletions

File tree:

bindings/python/README.md (109 additions & 0 deletions)
# quantcpp

Python bindings for [quant.cpp](https://github.com/hunscompany/quant.cpp) -- a minimal C inference engine for local LLMs with KV cache compression.

## Installation

```bash
# Build the shared library first
cd /path/to/quant.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=ON
cmake --build build -j$(nproc)

# Install the Python package
cd bindings/python
pip install .
```
Or set the library path explicitly:

```bash
export TURBOQUANT_LIB=/path/to/build/libturboquant.dylib
pip install .
```
## Usage

### Basic generation

```python
from quantcpp import Model

m = Model("model.gguf")
text = m.ask("What is 2+2?")
print(text)
```
### KV cache compression

```python
m = Model("model.gguf", kv_compress="4bit")
text = m.ask("Explain quantum computing")
```

Available compression modes: `"1bit"`, `"2bit"`, `"3bit"` (default), `"4bit"`, `"polar3"`, `"polar4"`, `"qjl"`, `"turbo3"`, `"turbo4"`, `"uniform2"`, `"uniform3"`, `"uniform4"`, `"none"`.
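As a hedged illustration (not part of the quantcpp API), the documented mode list can be checked client-side before constructing a `Model`; `check_kv_mode` is a hypothetical helper:

```python
# Hypothetical helper, not part of quantcpp: validate a kv_compress value
# against the modes documented above before loading a model.
KV_MODES = {
    "1bit", "2bit", "3bit", "4bit", "polar3", "polar4", "qjl",
    "turbo3", "turbo4", "uniform2", "uniform3", "uniform4", "none",
}

def check_kv_mode(mode=None):
    """Return a valid mode string, defaulting to the documented "3bit"."""
    if mode is None:
        return "3bit"
    if mode not in KV_MODES:
        raise ValueError(f"unknown kv_compress mode: {mode!r}")
    return mode
```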
### Streaming

```python
for token in m.generate("Once upon a time"):
    print(token, end="", flush=True)
```
### Chat mode

```python
text = m.chat("What is the capital of France?")
print(text)
```

### Raw completion (no chat template)

```python
text = m.complete("The quick brown fox", max_tokens=64)
print(text)
```

### Context manager

```python
with Model("model.gguf") as m:
    print(m.ask("Hello!"))
```
## API Reference

### `Model(path, kv_compress=None, n_threads=0)`

Load a GGUF or TQM model file.

- `path` -- Path to model file.
- `kv_compress` -- KV cache compression mode (see above).
- `n_threads` -- CPU thread count (0 = auto).

### `Model.ask(prompt, *, max_tokens=512, temperature=0.6, top_p=0.9)`

Chat-formatted generation. Returns the full response string.

### `Model.chat(message, **kwargs)`

Alias for `ask()`.

### `Model.generate(prompt, *, max_tokens=512, temperature=0.6, top_p=0.9, chat=False)`

Streaming generation. Yields token strings. Set `chat=True` to apply the chat template.

### `Model.complete(prompt, **kwargs)`

Raw text completion without the chat template.

### `Model.close()`

Release model resources. Called automatically on garbage collection or context-manager exit.

## Library search order

1. `TURBOQUANT_LIB` environment variable
2. Same directory as the Python package
3. `build/` relative to the project root
4. System library path
