In this this work we implement Embedding Quantization. This is a great technique for significantly faster and cheaper retrieval. We go through the step by step procedure of qunatizing embedding along with conceptual explanation and implementation.
All the explanations and Codes are reproduced based on the Article from Hugging Face.
To implement a RAG System, we will require some kind of retrieval system. Thus there has to be some database to retrieve from. In LLM world this is a Vector Database. This database is different from normal SQL database in the sense that a Vector Databse store embedding of some dimension and for retrieval purpose we need to perform heavy computation on the databse to generate similarity score.
Vector Database are costly becasuse it is both Memory hungry and Computaion hungry. That means the embedding has to be present
on primary memory for computaion and computation is done on all the embeddings for similarity scoring.
We need a sophisticated technique to bring down computaion load and memory load while preserving the accuray. Here comes Embedding Quantization.
We will take our example implemetation to make comarison with normal embedding.
- retrieve_dataset.ipynb: Retrieval
CoNaLadataset. - create_mbedding-vecotor.ipynb: Create
f32embedding vectors. - binary_index.py: Perform Binary Quantization and create faiss vector database.
- save_int8_index.py: Perform Scalar Quantization and create USearch database.
- query_search.ipynb: Perform Query retrieval.
- app.py: Gradio App Hosted on Hugging Face Gradio Space. Embedding Quantization
- faiss: Store binary quantization embedding
- USearch: Store scalar (int8) quantization
- all-MiniLM-L6-v2: Embedding Model
- Embedding Dimension: 384
- Databse Size:
5,93,891
-
No of Index = 593891
-
Embedding dim = 384
-
dtype = f32, each diension is of 32 bits
Memory usage =
913 MB(Rounded)
To generate Binary Quantization from a float32 embedding we simply threshold quantize each dimension at 0. It simple means f(x)=0 if x<=0 else 1. We store the Binary Quantized Embedding in Vector DB. To perform retrieval we convert user query to binary quantized embedding.
Then we use Hamming Distance between qury embedding and Vector DB embeddings to perform similarity check on the Vector Database. Hamming distance
is the number of bits by with two embeddings differ. Lowe the Hamming Distance more is the document relevant and hence higher similarity. On average
Binary Quantization gives 24.76x speed up and exactly 32x memory saving.
-
No of Index = 593891
-
Embedding dim = 384
-
dtype = bit, each diension is of 1 bit
Memory usage =
29 MB(Rounded). It is32Xlower memory requirement thanfloat32
We have a way to retrieve similar documents with humming distance as similarity measure for Binary Quantization. Though it speed ups the
retrieval process but preserve roughly 92.5% retrieval performance.
We use a technique called rescoring intoduced in the paper to preserve alomost 96% retrieval performance.
In rescoring techinque we first retrive top_k*rescore multiplier document i.e. rescore multiplier x times more documents than require.
Then we perform dot prduct between their embeddings (binary) and query (f32) embedding to calculate similarity scores and return top_k according to
the new similarity score.
This is another type of quantization to improve retrieval performance. Here instead of binary quantization we convert the f32 embedding into
int8 embeddings or unit8 embeddings. Scalar quantization reduce memory requirement by 4x as compared to 8x in Binary Quantization but it has
a retrieval performance of more than 99% with a rescore multiplier of 10. On average Scalar Quantization gives 3.77x speed up and exactly 4x memory saving.
To benifit from the memory and computation requiremnt of Binary Quantization and retrieval performance of Scalar Quantization, in practice we
use both the technique together.
We use Binary Quantization for the actual Vetor DB for in memory compuation. Separately we store the Scalar Quantization in disk space. We store in
disk space bcause we do not perfrom any compuation on this database. Firt we perform retrieval with Binary Quantization which is computation heavy,
then we perform rescoring with Scalar Quantization and return top_k documents.
The App is Hosted on Hugging Face Gradio Space. Embedding Quantization
#Acknowledgement Thanks to Hugging Face Team for the in depth explanation on Embedding Quantization. Article
