EmbeddingAndRetrieval

Repository Overview

The goal of this repository is to provide all the tools needed to embed documents into a vector database in Postgres and to retrieve them from it.

The documents to be embedded must already be in the Postgres database. At the moment the table names cannot be changed; they follow the naming used in LegifranceAPI. It is advisable to run the embedding script on a schedule so that embeddings are generated continuously for new documents added to the database. Just make sure that two processes are not running simultaneously (this usage is untested). Retrieval, by contrast, only requires importing and calling the function in retrieve.py.
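One way to keep two scheduled runs from overlapping is a lock file. The following is only a minimal sketch and is not part of this repository: the name single_instance and the lock-file location are hypothetical, and it assumes all runs happen on the same machine.

```python
import os
import tempfile
from contextlib import contextmanager

# Hypothetical lock-file location; not defined anywhere in this repository.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "embed_py.lock")


@contextmanager
def single_instance(lock_path=LOCK_PATH):
    """Yield True if this process created the lock file, False if it already exists."""
    try:
        # O_CREAT | O_EXCL makes the call fail if the file is already there,
        # so only one process at a time can create it.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        # Another run holds the lock: report it and leave the file alone.
        yield False
        return
    try:
        # Record the owning process id to help with manual inspection.
        with os.fdopen(fd, "w") as handle:
            handle.write(str(os.getpid()))
        yield True
    finally:
        # Release the lock when the guarded block ends, even on error.
        os.remove(lock_path)
```

A scheduled wrapper could enter `with single_instance() as acquired:` and start embed.py only when acquired is true. Note that a process killed abruptly may leave a stale lock file behind, which would then have to be removed by hand.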

Repository Structure

This repository contains the following files and directories:

embed.py: Python script used to generate the embeddings for new documents in the database.
retrieve.py: Python script providing a single function that can be called to query and retrieve from the vector database.
embedding_and_retrieval: Directory collecting all additional Python scripts and custom modules needed to run the tools.
.gitignore: Gitignore file for this repository.
requirements.txt: PIP requirements file to install required packages.
README.md: The Readme file you are currently reading.

Getting Started

0) Python Environment

The Python environment used for this repository was purposefully kept as simple as possible, with minimal dependencies.

An environment containing the required packages with compatible versions can be created as follows:

conda create -n embed_retrieve python==3.12.4
conda activate embed_retrieve
pip install -r requirements.txt

1) Setting environment variables

The following environment variables need to be set before the scripts can be run: POSTGRES_HOST, POSTGRES_DB_NAME, POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_PORT.

Do not put these variables in a file that gets pushed to an online repository.
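Before launching either tool it can help to check that all five variables are actually set. The snippet below is a small sketch using only the standard library; the helper name missing_postgres_vars is hypothetical and is not provided by this repository.

```python
import os

# Variable names as listed in this README.
REQUIRED_VARS = (
    "POSTGRES_HOST",
    "POSTGRES_DB_NAME",
    "POSTGRES_USER",
    "POSTGRES_PASSWORD",
    "POSTGRES_PORT",
)


def missing_postgres_vars(environ=os.environ):
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_postgres_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required Postgres variables are set.")
```

The function only reports variable names and never prints their values, so the password does not end up in logs.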

2) Generate embeddings

To generate new embeddings, activate the conda environment and, from the directory containing the embed.py file, run:

python embed.py

Command-line arguments can be passed to the above command to change the behaviour of the script. An exhaustive list of them can be obtained by running python embed.py -h.

3) Run vector search

To run queries on the vector database, import the function retrieve from retrieve.py and call it with the appropriate parameters.

from retrieve import retrieve
retrieve('The text of the query')

For an explanation of the arguments the function accepts, run help(retrieve) after importing it.
