LidaDavydova/DLS_LowRetrieval

Lawyer bot

This is a Telegram bot that finds laws relevant to a situation the user describes in natural language. It can also explain the laws it finds in simpler language. Who needs it?

  • For citizens without legal education
  • For practicing lawyers for quick navigation
  • For internal customer support services in the government/fintech sectors

Performance

Using a law from the dataset as a query:

  • Precision@10: 0.643
  • Recall@10: 0.002
  • Hits@10: 0.947
  • MRR: 0.769
  • NDCG@10: 0.807
  • MAP@10: 0.745

Using test queries generated by ChatGPT and manually filtered:

  • Precision@10: 0.028
  • Recall@10: 0.005
  • Hits@10: 0.080
  • MRR: 0.035
  • NDCG@10: 0.045
  • MAP@10: 0.032

These metrics suggest that there is a lot of room for improvement, especially on natural-language queries. One possible option is to switch to a smaller dataset consisting only of the constitution and federal laws.
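For reference, the ranking metrics listed above can be computed per query roughly as follows. This is a minimal sketch assuming binary relevance (a document is either relevant or not); the function names and the exact relevance definition are assumptions, not the repository's evaluation code. Each value is then averaged over all queries.

```python
import math

def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k slots filled with relevant documents."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked, relevant, k=10):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def hit_at_k(ranked, relevant, k=10):
    """1.0 if at least one relevant document is in the top k, else 0.0."""
    return 1.0 if any(d in relevant for d in ranked[:k]) else 0.0

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document (0.0 if none is found)."""
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k=10):
    """DCG of the top k divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, d in enumerate(ranked[:k]) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0
```

Under these definitions, Recall@10 cannot exceed 10 divided by the number of relevant documents, so a query with many relevant documents can have a very low recall even when precision is high.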

How to run

  1. Clone the repository
  2. Run "pip install -r requirements.txt" in the downloaded folder
  3. Create a ".env" file in the tg_bot folder containing "TOKEN=<telegram_bot_token>". One way to obtain a token is to use @BotFather in Telegram
  4. Run bot.py and you will be able to interact with your Telegram bot
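Before starting the bot, it can help to check that the ".env" file from step 3 is in place. The sketch below is a hypothetical helper using only the standard library; it is not part of the repository, and how bot.py itself reads the token is not shown here. The default path "tg_bot/.env" assumes the script is run from the repository root.

```python
from pathlib import Path

def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values

def check_token(env_path="tg_bot/.env"):
    """Return the token, or raise a readable error if it is missing."""
    path = Path(env_path)
    if not path.is_file():
        raise FileNotFoundError(f"{env_path} not found; create it as in step 3")
    token = read_env_file(path).get("TOKEN", "")
    if not token:
        raise ValueError(f"TOKEN is missing or empty in {env_path}")
    return token
```

Calling check_token() before launching bot.py gives a clear error message instead of a failure inside the Telegram client.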

Demo

Technical details

Dataset used: https://github.com/irlcode/RusLawOD (currently cut to 50k samples)

We first compute an embedding for every law in the dataset (after lemmatizing its text).

Then, for each query, the following happens. The system retrieves the 50 nearest documents, with their indices and similarity scores, using FAISS. We use FAISS IndexFlatIP (exact inner-product search) on L2-normalized vectors, which makes the score equivalent to cosine similarity. Non-existent entries and documents without a classification are then filtered out, and the remaining results are grouped by category (document class).
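To see why normalizing the vectors turns an inner-product search into a cosine-similarity search, here is a small pure-Python illustration. It does not use FAISS and is not the bot's code; the function names are made up for the example, and the brute-force loop stands in for what an exact inner-product index does over the whole collection.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Textbook cosine similarity, for comparison."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return inner_product(a, b) / (na * nb)

def search(query, docs, k):
    """Exact top-k search by inner product over normalized vectors.

    Returns (doc_index, score) pairs, best first.
    """
    q = l2_normalize(query)
    scored = [(inner_product(q, l2_normalize(d)), i) for i, d in enumerate(docs)]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(i, s) for s, i in scored[:k]]
```

For any non-zero query and document, the score returned by search matches cosine(query, document) up to floating-point error, so vector length no longer influences the ranking.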

If there are 5 or more categories, the single most relevant document from each category is taken, and the 5 highest-scoring of these make it to the final selection. If there are fewer categories, the top 2 documents from each are selected, sorted by similarity, and the top 5 are kept. This approach balances relevance against answer diversity.
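The selection rule described above can be sketched as follows. This is one reading of the description, not the repository's implementation: the function name, the (doc_id, category, score) tuple layout, and the use of None for a missing category are all assumptions made for the example.

```python
from collections import defaultdict

def select_diverse(hits, max_results=5, per_category=2):
    """Pick up to max_results hits while spreading them across categories.

    hits: iterable of (doc_id, category, score); higher score = more similar.
    """
    groups = defaultdict(list)
    for doc_id, category, score in hits:
        if category is None:  # documents without classification are dropped
            continue
        groups[category].append((doc_id, category, score))

    # Order each category's documents from most to least similar.
    for docs in groups.values():
        docs.sort(key=lambda h: h[2], reverse=True)

    if len(groups) >= max_results:
        # Many categories: only the best document of each one competes.
        candidates = [docs[0] for docs in groups.values()]
    else:
        # Few categories: allow up to per_category documents from each.
        candidates = [h for docs in groups.values() for h in docs[:per_category]]

    candidates.sort(key=lambda h: h[2], reverse=True)
    return candidates[:max_results]
```

With this rule, a second document from a strong category is excluded whenever there are at least 5 categories, even if it scores higher than the best document of a weaker category. That is the trade-off between relevance and diversity mentioned above.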

  • Frontend: Telegram bot on aiogram
  • Backend: Built-in logic inside the Telegram bot
  • Retrieval: FAISS database + rubert-tiny2 model
  • LLM Generation: ollama with gemma3 / saiga model
  • Storage: Local Parquet files + FAISS index
