This repository was archived by the owner on Feb 10, 2026. It is now read-only.

jurisearch/EmbeddingAndRetrieval

EmbeddingAndRetrieval

Repository Overview

This repository provides the tools to embed documents and retrieve them from a vector database in Postgres.

The documents to be embedded must already be in the Postgres database. At the moment the table names cannot be changed; they follow the naming used in LegifranceAPI. It is advisable to run the embedding script on a schedule so that embeddings are generated continuously for new documents added to the database. Make sure that two embedding processes never run simultaneously, because concurrent runs are untested. Retrieval, by contrast, only requires importing and calling the function in retrieve.py.
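One way to keep scheduled runs from overlapping is a lock file, sketched below. This is not part of the repository: the lock path `embed.lock` and the helpers `single_instance` and `run_embedding` are hypothetical names introduced for illustration, and the sketch assumes the scheduler starts it from the directory containing embed.py.

```python
import os
import subprocess
import sys
from contextlib import contextmanager

LOCK_PATH = "embed.lock"  # hypothetical lock file location, not defined by the repository


@contextmanager
def single_instance(lock_path):
    """Hold an exclusive lock file for the duration of the with-block."""
    # O_CREAT | O_EXCL makes creation atomic: os.open raises FileExistsError
    # if another run has already created the file.
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)


def run_embedding(lock_path=LOCK_PATH, command=(sys.executable, "embed.py")):
    """Run the command unless the lock is held; return True if it ran."""
    try:
        with single_instance(lock_path):
            subprocess.run(list(command), check=True)
        return True
    except FileExistsError:
        # Another run holds the lock, so this run is skipped.
        return False
```

A scheduled job would call `run_embedding()` instead of invoking embed.py directly. If a run is killed before it can clean up, the lock file is left behind and has to be deleted manually.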

Repository Structure

This repository contains the following files and directories:

  • embed.py: Python script used to generate the embeddings for new documents in the database.
  • retrieve.py: Python script providing a single function that can be called to query the vector database and retrieve documents from it.
  • embedding_and_retrieval: Directory collecting all additional Python scripts and custom modules needed to run the tools.
  • .gitignore: Gitignore file for this repository.
  • requirements.txt: PIP requirements file to install required packages.
  • README.md: The Readme file you are currently reading.

Getting Started

0) Python Environment

The Python environment used for this repository was purposefully kept as simple as possible, with minimal dependencies.

An environment containing the required packages with compatible versions can be created as follows:

conda create -n embed_retrieve python==3.12.4
conda activate embed_retrieve
pip install -r requirements.txt

1) Setting environment variables

The following environment variables need to be set before the scripts can be run: POSTGRES_HOST, POSTGRES_DB_NAME, POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_PORT.

Do not put these variables in a file that gets pushed to an online repository.
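A quick check that all five variables are set can save a failed run. The sketch below is illustrative and not taken from the repository; `REQUIRED_VARS` and `load_postgres_settings` are hypothetical names, and how embed.py and retrieve.py actually read these variables is not shown here.

```python
import os

# The variable names listed above.
REQUIRED_VARS = (
    "POSTGRES_HOST",
    "POSTGRES_DB_NAME",
    "POSTGRES_USER",
    "POSTGRES_PASSWORD",
    "POSTGRES_PORT",
)


def load_postgres_settings(environ=os.environ):
    """Return the required settings as a dict, or raise if any is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}
```

Calling `load_postgres_settings()` before starting a run reports every missing variable at once instead of failing on the first one.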

2) Generate embeddings

To generate new embeddings, activate the conda environment created above and, from the directory containing embed.py, run:

python embed.py

Command-line arguments can be passed to the command above to change the behaviour of the script. An exhaustive list of them can be found by running python embed.py -h.

3) Run vector search

To run queries on the vector database, import the function retrieve from retrieve.py and call it with the appropriate parameters.

from retrieve import retrieve
retrieve('The text of the query')

For an explanation of the arguments the function accepts, run help(retrieve) after importing it.
