new quantization implementation #23
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -67,27 +67,38 @@ When `USING MODEL` is omitted, the collection uses the **default embedding model | |
|
|
||
| ## Quantization — QUANTIZE clause | ||
|
|
||
| Quantization reduces the memory footprint of vector collections and speeds up search at the cost of a small, controllable accuracy loss. QQL supports all three Qdrant quantization strategies via an optional `QUANTIZE` clause appended to `CREATE COLLECTION`. | ||
| Quantization reduces the memory footprint of vector collections and speeds up search at the cost of a small, controllable accuracy loss. QQL supports all four Qdrant quantization strategies via an optional `QUANTIZE` clause appended to `CREATE COLLECTION`. | ||
|
|
||
| **Three strategies:** | ||
| **Four strategies:** | ||
|
|
||
| | Type | Compression | Accuracy Loss | Best For | | ||
| | Type | Compression | Accuracy | Best For | | ||
| |---|---|---|---| | ||
| | `SCALAR` | 4× (float32 → int8) | < 1% | Most collections — best balance | | ||
| | `BINARY` | 32× (float32 → 1-bit) | Higher | High-dimensional vectors (768+), speed priority | | ||
| | `SCALAR` | 4× (float32 → int8) | < 1% loss | Most collections — best balance | | ||
| | `TURBO` | 8–32× (4-bit to 1-bit) | Low–medium | Better recall than BINARY at same storage budget | | ||
| | `BINARY` | 32× (float32 → 1-bit) | Higher loss | Speed priority; centered distributions only | | ||
| | `PRODUCT` | 4× (configurable) | Variable | Memory-constrained deployments | | ||
|
|
||
| **Full syntax:** | ||
| ``` | ||
| CREATE COLLECTION <name> ... QUANTIZE SCALAR [QUANTILE <0.0–1.0>] [ALWAYS RAM] | ||
| CREATE COLLECTION <name> ... QUANTIZE TURBO [BITS <1|1.5|2|4>] [ALWAYS RAM] | ||
| CREATE COLLECTION <name> ... QUANTIZE BINARY [ALWAYS RAM] | ||
| CREATE COLLECTION <name> ... QUANTIZE PRODUCT [ALWAYS RAM] | ||
| ``` | ||
|
|
||
| - **`QUANTILE <float>`** — (scalar only) calibration quantile for the INT8 conversion; defaults to Qdrant's built-in default (0.99) when omitted. | ||
| - **`ALWAYS RAM`** — keep the **quantized** vectors in RAM at all times, regardless of the collection's `on_disk` setting. Improves search throughput at the cost of higher RAM usage for the compressed index. The original full-precision vectors are stored and managed independently of this flag. Supported by all three quantization types. | ||
| - **`QUANTILE <float>`** — (SCALAR only) calibration quantile for the INT8 conversion; defaults to Qdrant's built-in default (0.99) when omitted. | ||
| - **`BITS <depth>`** — (TURBO only) bit depth passed to the Qdrant SDK: | ||
| - `4` — 4-bit (default when `BITS` is omitted; server applies its own default) | ||
| - `2` — 2-bit | ||
| - `1.5` — 1.5-bit | ||
| - `1` — 1-bit | ||
| > Compression ratios (8×, 16×, 24×, 32×) and recall characteristics are | ||
| > Qdrant server-side behaviors. QQL maps the `BITS` value to the SDK model and | ||
| > passes it to Qdrant; actual results depend on your Qdrant server version. | ||
| - **`ALWAYS RAM`** — keep the **quantized** vectors in RAM at all times, regardless of the collection's `on_disk` setting. Improves search throughput at the cost of higher RAM usage for the compressed index. The original full-precision vectors are stored and managed independently of this flag. Supported by all four quantization types. | ||
| - **`QUANTIZE`** always appears **after** all other clauses (`HYBRID`, `USING MODEL`, etc.). | ||
| - For `PRODUCT`, the compression ratio is fixed at **4×** in this version. | ||
|
Collaborator
“For PRODUCT, the compression ratio is fixed at 4× in this version” is implementation-specific and matches the current executor, but for TURBO the docs are making stronger behavioral claims than the QQL layer actually enforces. QQL only maps the user input to the SDK model here; the runtime behavior still depends on Qdrant. I’d keep the docs precise about syntax/support and avoid overcommitting on engine behavior unless we intend to validate those guarantees end-to-end.
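To make the syntax-versus-engine distinction concrete, here is a minimal Python sketch of what a QQL-side check could look like. Everything in it is hypothetical (`QuantizeSpec`, `parse_quantize`, `ALLOWED_TURBO_BITS`); it is not the project's actual parser and it makes no Qdrant SDK calls. It only validates the clause shapes listed under "Full syntax" in this diff, and it leaves `bits` unset when `BITS` is omitted so that the server default applies. As a simplification, it accepts the options in any order.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical names for illustration only; not QQL's real parser or the Qdrant SDK.
QUANT_TYPES = {"SCALAR", "TURBO", "BINARY", "PRODUCT"}
ALLOWED_TURBO_BITS = {"1", "1.5", "2", "4"}  # values from the documented BITS syntax


@dataclass(frozen=True)
class QuantizeSpec:
    kind: str
    quantile: Optional[float] = None  # SCALAR only; None means "use the server default"
    bits: Optional[str] = None        # TURBO only; None means "use the server default"
    always_ram: bool = False


def parse_quantize(clause: str) -> QuantizeSpec:
    """Validate a QUANTIZE clause and return the options to forward unchanged."""
    tokens = clause.split()
    if len(tokens) < 2 or tokens[0].upper() != "QUANTIZE":
        raise ValueError("expected 'QUANTIZE <type>'")
    kind = tokens[1].upper()
    if kind not in QUANT_TYPES:
        raise ValueError(f"unknown quantization type: {tokens[1]}")

    rest = tokens[2:]
    quantile: Optional[float] = None
    bits: Optional[str] = None
    always_ram = False
    i = 0
    while i < len(rest):
        word = rest[i].upper()
        has_arg = i + 1 < len(rest)
        if word == "QUANTILE" and kind == "SCALAR" and has_arg:
            quantile = float(rest[i + 1])  # raises ValueError on non-numeric input
            if not 0.0 <= quantile <= 1.0:
                raise ValueError("QUANTILE must be between 0.0 and 1.0")
            i += 2
        elif word == "BITS" and kind == "TURBO" and has_arg:
            if rest[i + 1] not in ALLOWED_TURBO_BITS:
                raise ValueError(f"BITS must be one of {sorted(ALLOWED_TURBO_BITS)}")
            bits = rest[i + 1]
            i += 2
        elif word == "ALWAYS" and has_arg and rest[i + 1].upper() == "RAM":
            always_ram = True
            i += 2
        else:
            # Covers unknown words and options used with the wrong type,
            # e.g. BITS on BINARY or QUANTILE on TURBO.
            raise ValueError(f"unexpected token in QUANTIZE clause: {rest[i]}")
    return QuantizeSpec(kind, quantile, bits, always_ram)
```

For example, `parse_quantize("QUANTIZE TURBO BITS 1.5 ALWAYS RAM")` yields a spec with `bits="1.5"` and `always_ram=True`, while `parse_quantize("QUANTIZE BINARY BITS 2")` raises `ValueError`. Nothing in a check like this can say anything about compression ratio or recall, which is why those claims are better attributed to the Qdrant server than to QQL.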
||
| - For `TURBO`, Cosine, Dot, and Euclidean distance are supported by the Qdrant server when TurboQuant is enabled. | ||
| - When used with `HYBRID` collections, quantization applies only to the **dense** vector. | ||
|
|
||
| **Examples:** | ||
|
|
@@ -102,6 +113,26 @@ Scalar with explicit calibration and quantized vectors pinned to RAM: | |
| CREATE COLLECTION research_papers QUANTIZE SCALAR QUANTILE 0.95 ALWAYS RAM | ||
| ``` | ||
|
|
||
| TurboQuant — default 4-bit (8× compression, good recall): | ||
| ```sql | ||
| CREATE COLLECTION research_papers QUANTIZE TURBO | ||
| ``` | ||
|
|
||
| TurboQuant — 2-bit (16× compression): | ||
| ```sql | ||
| CREATE COLLECTION research_papers QUANTIZE TURBO BITS 2 | ||
| ``` | ||
|
|
||
| TurboQuant — 1.5-bit (24× compression) with quantized vectors pinned to RAM: | ||
| ```sql | ||
| CREATE COLLECTION research_papers QUANTIZE TURBO BITS 1.5 ALWAYS RAM | ||
| ``` | ||
|
|
||
| TurboQuant — 1-bit (32× compression, same ratio as BINARY but better recall): | ||
| ```sql | ||
| CREATE COLLECTION research_papers QUANTIZE TURBO BITS 1 | ||
| ``` | ||
|
|
||
| Binary quantization for large high-dimensional embeddings: | ||
| ```sql | ||
| CREATE COLLECTION research_papers QUANTIZE BINARY | ||
|
|
@@ -115,22 +146,29 @@ CREATE COLLECTION research_papers QUANTIZE PRODUCT ALWAYS RAM | |
| Combined with hybrid collection: | ||
| ```sql | ||
| CREATE COLLECTION research_papers HYBRID QUANTIZE SCALAR | ||
| CREATE COLLECTION research_papers HYBRID QUANTIZE TURBO BITS 2 | ||
| ``` | ||
|
|
||
| Combined with a pinned model: | ||
| ```sql | ||
| CREATE COLLECTION research_papers USING MODEL 'BAAI/bge-base-en-v1.5' QUANTIZE SCALAR QUANTILE 0.99 | ||
| CREATE COLLECTION research_papers USING MODEL 'BAAI/bge-base-en-v1.5' QUANTIZE TURBO BITS 2 | ||
| ``` | ||
|
|
||
| Combined with hybrid + dense model: | ||
| ```sql | ||
| CREATE COLLECTION research_papers USING HYBRID DENSE MODEL 'BAAI/bge-base-en-v1.5' QUANTIZE TURBO | ||
| ``` | ||
|
|
||
| **Valid combinations:** | ||
|
|
||
| | Base form | + QUANTIZE SCALAR | + QUANTIZE BINARY | + QUANTIZE PRODUCT | | ||
| |---|---|---|---| | ||
| | `CREATE COLLECTION name` | ✓ | ✓ | ✓ | | ||
| | `... HYBRID` | ✓ | ✓ | ✓ | | ||
| | `... USING MODEL 'x'` | ✓ | ✓ | ✓ | | ||
| | `... USING HYBRID` | ✓ | ✓ | ✓ | | ||
| | `... USING HYBRID DENSE MODEL 'x'` | ✓ | ✓ | ✓ | | ||
| | Base form | + SCALAR | + TURBO | + BINARY | + PRODUCT | | ||
| |---|---|---|---|---| | ||
| | `CREATE COLLECTION name` | ✓ | ✓ | ✓ | ✓ | | ||
| | `... HYBRID` | ✓ | ✓ | ✓ | ✓ | | ||
| | `... USING MODEL 'x'` | ✓ | ✓ | ✓ | ✓ | | ||
| | `... USING HYBRID` | ✓ | ✓ | ✓ | ✓ | | ||
| | `... USING HYBRID DENSE MODEL 'x'` | ✓ | ✓ | ✓ | ✓ | | ||
|
|
||
| > INSERT and SEARCH on quantized collections work exactly the same as on non-quantized ones — no changes to INSERT or SEARCH syntax are needed. | ||
|
|
||
|
|
||
|
Collaborator
Package metadata still advertises quantization support as scalar, binary, product. Since this PR adds TURBO, the published package description should be updated too so PyPI metadata stays aligned with the actual feature set.
Line 10: The README intro still says quantization support is scalar, binary, product, but this PR adds TURBO as a fourth option. The top-level feature summary should be updated so the repo landing page and package description reflect the actual supported set.
Line 87: The README still describes quantization as scalar/binary/product here. Since TURBO is now supported and documented elsewhere in the PR, this line should be updated to avoid an inconsistency between the README summary and the syntax and examples below it.