The goal of this repository is to provide the tools needed to embed documents into, and retrieve them from, a vector database in Postgres.
The documents to be embedded must already be in the Postgres database. At the moment the table names cannot be changed; they follow the naming used in LegifranceAPI. It is advisable to run the embedding script on a schedule so that embeddings are continuously generated for new documents added to the database. Make sure that two embedding processes never run simultaneously (this usage is untested). Retrieval, by contrast, only requires importing and calling the function in retrieve.py.
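One way to keep scheduled runs from overlapping is to wrap the call to embed.py in a small launcher that takes an exclusive file lock first. The following is a minimal sketch and not part of this repository: the lock path and the names try_lock and run_embedding_once are hypothetical, and it relies on fcntl, so it only works on Unix-like systems.

```python
import fcntl
import os
import subprocess
import sys
import tempfile

# Hypothetical lock file location (an assumption, not defined by this repository).
LOCK_PATH = os.path.join(tempfile.gettempdir(), "embed.lock")


def try_lock(path=LOCK_PATH):
    """Return an open file holding an exclusive lock, or None if it is already taken."""
    handle = open(path, "w")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting for the lock.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


def run_embedding_once(command=(sys.executable, "embed.py")):
    """Run the embedding command unless another run currently holds the lock.

    Returns the command's exit code, or None if the run was skipped.
    """
    lock = try_lock()
    if lock is None:
        return None  # a previous run is still in progress; skip this one
    try:
        return subprocess.run(command, check=False).returncode
    finally:
        lock.close()  # closing the file releases the lock
```

A scheduler such as cron could then call run_embedding_once() from the directory containing embed.py. Because the lock is tied to the open file, it is released automatically if the process exits unexpectedly, so no stale lock file has to be cleaned up by hand.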
This repository contains the following files and directories:
- embed.py: Python script used to generate the embeddings for new documents in the database.
- retrieve.py: Python script providing a single function that can be called to query and retrieve from the vector database.
- embedding_and_retrieval: Directory collecting all additional Python scripts and custom modules needed to run the tools.
- .gitignore: Gitignore file for this repository.
- requirements.txt: PIP requirements file to install required packages.
- README.md: The Readme file you are currently reading.
The Python environment used for this repository was purposefully kept as simple as possible, with minimal dependencies.
An environment containing the required packages with compatible versions can be created as follows:
conda create -n embed_retrieve python==3.12.4
conda activate embed_retrieve
pip install -r requirements.txt

The following environment variables need to be set before the script can be run: POSTGRES_HOST, POSTGRES_DB_NAME, POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_PORT.
Do not put these variables in a file that gets pushed to an online repository.
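Before launching either tool, it can help to check that all five variables are actually set. The sketch below is an illustration only: the function name read_postgres_settings is hypothetical, and how the scripts in this repository read the variables internally is not shown here.

```python
import os

# Variable names as listed in this README.
REQUIRED_VARIABLES = (
    "POSTGRES_HOST",
    "POSTGRES_DB_NAME",
    "POSTGRES_USER",
    "POSTGRES_PASSWORD",
    "POSTGRES_PORT",
)


def read_postgres_settings(environ=None):
    """Return the required variables as a dict, or raise if any is missing or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARIABLES if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARIABLES}
```

Calling read_postgres_settings() with no argument checks the current process environment and fails early with a message naming every missing variable.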
To run the Python script that generates new embeddings, activate the correct conda environment and, from the same directory as the embed.py file, run:
python embed.py

Command-line arguments can be provided to the above command to change the behaviour of the script. An exhaustive list of them can be found by running python embed.py -h.
To run queries on the vector database, import the function retrieve from retrieve.py and call it with appropriate parameters.
from retrieve import retrieve
retrieve('The text of the query')

To get an explanation of the arguments the function accepts, you can run help(retrieve) after importing it.