This repository provides the code for an automated pipeline to:
- Filter raw data
  Select consultations/statements by page count, table density, language (DE), and federal level.
- PDF → Markdown (DE only)
  Extract text from consultation and statement PDFs (with OCR fallback where needed) into Markdown. Detect the language of each page and keep only pages in German.
- Statement letter detection and splitting
  Split multi-letter statement PDFs into individual letters (documents) using page markers, regex rules, and structural cues.
- Duplicate and table removal
  Detect near-duplicate documents and omit them from the dataset. Also detect and exclude documents with a large share of tables.
- Chunking
  Segment consultation and statement documents into semantically coherent, token-bounded chunks.
- Embeddings
  Encode all chunks with BAAI/bge-m3; build FAISS indices for efficient retrieval/matching.
- Sentiment classification (local LLM, function calling)
  For each statement chunk, classify
  supports | rejects | neutral | no_comment | agrees_with_others
  using Mistral-7B-Instruct v0.3 (local, GGUF via llama.cpp) with strict JSON/function calling. A rule-based approach then derives the overall sentiment per document from the per-chunk sentiments.
- Article/keyword extraction (local LLM, function calling)
  Extract referenced Artikel / Absätze / Schlüsselwörter (articles / paragraphs / keywords) from both consultations and statements to inform the mapping.
- Statement → Consultation mapping
  Match statement chunks to the most relevant consultation chunks via a hybrid scorer:
  - semantic similarity (bge-m3 cosine),
  - cross-encoder reranker (BAAI/bge-reranker-v2-m3),
  - article/keyword alignment signals,
  - aggregated into a final score.
- Evaluation
  - Sampled test sets (classification and mapping) with manually curated ground truth → report precision/recall/F1 & mapping metrics.
  - LLM-as-Judge (local Mistral, function calling) for scalable qualitative checks over the full dataset.
We use the public Hugging Face dataset demokratis/consultation-documents:
- Consultation Documents Features (https://huggingface.co/datasets/demokratis/consultation-documents/blob/main/consultation-documents-features.parquet)
- Consultation Documents Preprocessed (https://huggingface.co/datasets/demokratis/consultation-documents/blob/main/consultation-documents-preprocessed.parquet)
The Python environment is managed with pipenv. You can set up your environment with the following steps:

- Run `pipenv lock` to generate the `Pipfile.lock`, which pins the versions of your Python packages.
- Run `pipenv install --dev` to create a virtual environment and install the Python packages. The `--dev` flag also installs the development packages.
- Run `pipenv shell` to activate your Python environment.
- Run `pre-commit install` to install the pre-commit hooks and make sure they run at every commit.
- Embeddings: BAAI/bge-m3 (downloaded on first use)
- Reranker: BAAI/bge-reranker-v2-m3 (downloaded on first use)
- LLM (OpenAI-compatible server via llama.cpp):
Get a local GGUF of Mistral-7B-Instruct v0.3 (e.g. from MaziyarPanahi):
https://huggingface.co/MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF
To run the LLM, set up llama.cpp:
llama.cpp is a lightweight, high-performance runtime for LLaMA-compatible models (e.g., Mistral, Llama 3, Qwen) using GGUF model files.
It supports CPU, Apple Metal, CUDA, and OpenCL.
macOS (Homebrew):

```
brew update
brew install llama.cpp
```

Verify:

```
llama-cli --help
llama-server --help
```

macOS (build from source):

```
xcode-select --install
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_METAL=1   # GPU acceleration on Apple Silicon
```

Linux (prebuilt release): download a release from https://github.com/ggerganov/llama.cpp/releases. Example:

```
wget https://github.com/ggerganov/llama.cpp/releases/latest/download/llama-linux-x64.zip
unzip llama-linux-x64.zip
sudo mv llama* /usr/local/bin/
```

Linux (build from source):

```
sudo apt update
sudo apt install -y build-essential cmake
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build .
cmake --build build --config Release
```

CUDA build:

```
cmake -B build -DLLAMA_CUDA=1 .
cmake --build build --config Release
```

Windows (prebuilt release): download a ZIP from https://github.com/ggerganov/llama.cpp/releases, unzip, and run:

```
llama-cli.exe --help
```

Windows (build from source):

```
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -A x64 .
cmake --build build --config Release
```

Executables will appear under:

```
build\bin\Release\
```
To run a model, download a .gguf file into a folder, for example:

```
models/Mistral-7B-Instruct-v0.3.Q6_K.gguf
```

Then launch the local OpenAI-compatible server. This step is necessary to execute the model pipeline.

```
llama-server \
  -m /Path/to/your/model/Mistral-7B-Instruct-v0.3.Q6_K.gguf \
  -c 3072 \
  -t 8 \
  --n-gpu-layers 100 \
  --host 127.0.0.1 \
  --port 8000 \
  --jinja
```

- `-m`: path to your GGUF model
- `-c 3072`: context length
- `-t 8`: number of CPU threads
- `--n-gpu-layers 100`: offload the first N layers to the GPU (Mac Metal / CUDA)
- `--jinja`: enables Jinja-based templating for chat prompts

The API endpoint becomes:

```
http://127.0.0.1:8000/v1
```
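The pipeline's classification and extraction steps call this endpoint with strict JSON/function calling. Below is a minimal sketch of what such a request payload looks like; the function name `classify_sentiment`, the system prompt, and the exact schema layout are illustrative assumptions, not the pipeline's actual schema (the five sentiment labels are from this README):

```python
# Hypothetical function-calling payload for the local OpenAI-compatible
# llama-server endpoint; schema name and prompt are illustrative.
def build_classification_request(chunk_text: str) -> dict:
    """Build a chat-completions request that forces a structured tool call."""
    return {
        "model": "Mistral-7B-Instruct-v0.3",
        "messages": [
            {"role": "system",
             "content": "Classify the sentiment of the statement chunk."},
            {"role": "user", "content": chunk_text},
        ],
        "tools": [{
            "type": "function",
            "function": {
                "name": "classify_sentiment",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "label": {
                            "type": "string",
                            "enum": ["supports", "rejects", "neutral",
                                     "no_comment", "agrees_with_others"],
                        },
                    },
                    "required": ["label"],
                },
            },
        }],
        # Forcing the tool call yields strict, machine-parseable JSON.
        "tool_choice": {"type": "function",
                        "function": {"name": "classify_sentiment"}},
    }

payload = build_classification_request("Wir unterstützen die Vorlage.")
print(payload["tool_choice"]["function"]["name"])  # → classify_sentiment
```

POSTing this payload to `http://127.0.0.1:8000/v1/chat/completions` returns the label inside the tool-call arguments rather than free text.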
Edit `src/config.py`:

```
TESSERACT_PATH = "/your/local/path/to/tesseract"
```
From the src directory:
```
# 1) Filtering of raw data
pipenv run python -m src.pipeline.filter.filtering
# writes: data/filter/*-documents-filtered.parquet
```
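The filtering step above can be sketched as a simple predicate over document records; the field names and thresholds below are assumptions for illustration, not the values used by the actual filter module:

```python
# Illustrative filter predicate; field names and thresholds are assumed.
def keep_document(doc: dict,
                  max_pages: int = 100,
                  max_table_density: float = 0.5) -> bool:
    """True if a document passes the page-count, table-density,
    language, and federal-level filters described in this README."""
    return (
        doc["page_count"] <= max_pages
        and doc["table_density"] <= max_table_density
        and doc["language"] == "de"
        and doc["level"] == "federal"
    )

docs = [
    {"page_count": 12, "table_density": 0.1, "language": "de", "level": "federal"},
    {"page_count": 500, "table_density": 0.1, "language": "de", "level": "federal"},
    {"page_count": 8, "table_density": 0.1, "language": "fr", "level": "federal"},
]
kept = [d for d in docs if keep_document(d)]
print(len(kept))  # → 1
```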
```
# 2) PDF text extraction → Markdown (DE)
pipenv run python -m src.pipeline.extract_text.extracting_text
# writes: data/extract_text/*-documents-md.parquet
```
```
# 3) Split multi-letter PDFs into letters (statements)
pipenv run python -m src.pipeline.split.splitting
# writes: data/split/statement-documents-md-split.parquet
```
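The splitting step uses page markers, regex rules, and structural cues. A heavily simplified sketch, assuming a hypothetical `<!-- page N -->` marker format and using a salutation regex as the sole structural cue (the real module combines several signals):

```python
import re

# Assumed page-marker format and salutation cue; both are illustrative.
PAGE_MARKER = re.compile(r"<!-- page \d+ -->")
LETTER_START = re.compile(r"^Sehr geehrte", re.MULTILINE)

def split_letters(markdown: str) -> list[str]:
    """Split a multi-letter markdown document at pages that open a new letter."""
    pages = [p for p in PAGE_MARKER.split(markdown) if p.strip()]
    letters: list[str] = []
    for page in pages:
        # A salutation near the top of a page signals a new letter.
        if LETTER_START.search(page[:200]) or not letters:
            letters.append(page)
        else:
            letters[-1] += page
    return letters

doc = (
    "<!-- page 1 -->\nSehr geehrte Damen und Herren\nText A\n"
    "<!-- page 2 -->\nFortsetzung von A\n"
    "<!-- page 3 -->\nSehr geehrte Frau Bundesrätin\nText B\n"
)
print(len(split_letters(doc)))  # → 2
```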
```
# 4) Deduplicate and detect tables
pipenv run python -m src.pipeline.deduplicate.deduplicating
# writes: data/deduplicate/*-documents-unique.parquet
```
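One common way to detect near-duplicates, shown here as a sketch, is shingle-set Jaccard similarity with a high threshold; the deduplication module's actual method and threshold may differ:

```python
# Sketch: near-duplicate removal via character-shingle Jaccard similarity.
def shingles(text: str, k: int = 5) -> set[str]:
    """Character k-grams of a whitespace-normalized, lowercased text."""
    t = " ".join(text.lower().split())
    return {t[i:i + k] for i in range(max(1, len(t) - k + 1))}

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def deduplicate(texts: list[str], threshold: float = 0.9) -> list[str]:
    """Keep the first occurrence of each group of near-duplicates."""
    kept: list[str] = []
    for text in texts:
        sig = shingles(text)
        if all(jaccard(sig, shingles(k)) < threshold for k in kept):
            kept.append(text)
    return kept

docs = ["Wir lehnen die Vorlage ab.",
        "Wir  lehnen die Vorlage ab.",   # near-duplicate (extra whitespace)
        "Wir stimmen zu."]
print(len(deduplicate(docs)))  # → 2
```

The pairwise loop is quadratic; at scale one would bucket candidates first (e.g. with MinHash), but the similarity test itself is the same idea.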
```
# 5) Chunk letters into model-friendly segments
pipenv run python -m src.pipeline.chunk.chunking
# writes: data/chunk/*-documents-chunked.parquet
```
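Token-bounded chunking, as described in the overview, can be sketched as greedy packing of paragraphs; here whitespace tokens stand in for the real tokenizer, and paragraph boundaries approximate semantic coherence:

```python
# Sketch: greedily pack paragraphs into chunks of at most max_tokens tokens.
def chunk_document(markdown: str, max_tokens: int = 512) -> list[str]:
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in (p for p in markdown.split("\n\n") if p.strip()):
        n = len(para.split())
        # Flush the current chunk if adding this paragraph would overflow it.
        if current and count + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Four paragraphs of ~202 tokens each → two chunks under a 512-token bound.
doc = "\n\n".join(f"Absatz {i} " + "wort " * 200 for i in range(4))
print(len(chunk_document(doc, max_tokens=512)))  # → 2
```

A paragraph longer than `max_tokens` still becomes its own (oversized) chunk; a production chunker would additionally split such paragraphs.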
```
# 6) Sentiment classification (local Mistral with function calling)
pipenv run python -m src.pipeline.classify.classifying
# writes: data/classify/statement-documents-classified.parquet
pipenv run python -m src.pipeline.classify.classifying_general_sentiment
# writes: data/classify/statement-documents-classified.parquet
```
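The second command derives a document-level sentiment from the per-chunk labels with rules. The rules below are a plausible sketch only; the actual logic lives in `classifying_general_sentiment` and may differ:

```python
from collections import Counter

# Hedged sketch of rule-based document-level aggregation (rules assumed).
def document_sentiment(chunk_labels: list[str]) -> str:
    """Aggregate per-chunk sentiment labels into one document-level label."""
    substantive = [l for l in chunk_labels
                   if l not in ("neutral", "no_comment")]
    if not substantive:
        return "no_comment" if "no_comment" in chunk_labels else "neutral"
    counts = Counter(substantive)
    top, freq = counts.most_common(1)[0]
    # Mixed signals with no clear majority read as neutral overall.
    if freq * 2 <= len(substantive):
        return "neutral"
    return top

print(document_sentiment(["supports", "supports", "neutral", "rejects"]))  # → supports
```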
```
# 7) Mapping (statement → consultation) using embedding sim + cross-encoder + article match signals
pipenv run python -m src.pipeline.map.mapping_prep
# writes: data/map/*-documents-mapping-prepped.parquet
pipenv run python -m src.pipeline.map.mapping
# writes: data/map/*-documents-mapped.parquet
```
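The hybrid scorer combines semantic similarity (bge-m3 cosine), the cross-encoder reranker score, and article/keyword alignment into a final score. A sketch of the aggregation step, assuming the component scores are precomputed; the weights are illustrative, not the pipeline's:

```python
# Sketch of the hybrid-score aggregation; weights are illustrative.
def keyword_overlap(a: set[str], b: set[str]) -> float:
    """Jaccard overlap between extracted article/keyword sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def hybrid_score(cosine: float, reranker: float, overlap: float,
                 weights: tuple[float, float, float] = (0.3, 0.5, 0.2)) -> float:
    """Weighted combination of the three matching signals."""
    w_cos, w_rr, w_kw = weights
    return w_cos * cosine + w_rr * reranker + w_kw * overlap

# Example: a statement chunk citing "Art. 5" scored against a consultation chunk.
overlap = keyword_overlap({"Art. 5", "Abs. 2"}, {"Art. 5"})   # 0.5
score = hybrid_score(cosine=0.82, reranker=0.91, overlap=overlap)
print(round(score, 3))  # → 0.801
```

For each statement chunk, the candidate consultation chunk with the highest final score is chosen as the match.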
```
# 8) Evaluation
## 8a) LLM-as-Judge at scale for classification (local Mistral, function calling)
pipenv run python -m src.pipeline.judge.judging_classification
# writes: data/judge/statement-documents-judged.parquet
pipenv run python -m src.pipeline.evaluate.evaluating_judging
# prints and writes classification report: data/evaluate/statements-documents-classified-judge-eval.txt
## 8b) LLM-as-Judge at scale for mapping (local Mistral, function calling)
pipenv run python -m src.pipeline.judge.judging_mapping
# writes: data/judge/stmt_to_cons_mapping_judged.parquet
pipenv run python -m src.pipeline.evaluate.evaluating_judging
# prints and writes classification report: data/evaluate/stmt_to_cons_mapping-judge-results.txt
```

This project relies exclusively on open-source models and tools, all of which are executed locally. Model weights are not redistributed as part of this repository. Users must obtain model weights separately and comply with the respective model licenses.
- Mistral-7B-Instruct v0.3 (https://github.com/mistralai/mistral-inference), License: Apache 2.0
- BAAI/bge-m3 (bi-encoder embeddings) (https://huggingface.co/BAAI/bge-m3), License: MIT
- BAAI/bge-reranker-v2-m3 (cross-encoder reranker) (https://huggingface.co/BAAI/bge-reranker-v2-m3), License: MIT
- Docling (PDF → structured Markdown extraction) (https://github.com/DS4SD/docling), License: Apache 2.0
- Tesseract OCR (https://github.com/tesseract-ocr/tesseract), License: Apache 2.0
The source code in this repository is licensed under the MIT License.
This project also uses third-party libraries and models that remain under their
respective licenses. See THIRD_PARTY_LICENSES.md for details.
- Julia Netzel (Visium) — Core NLP pipeline, data processing, modeling, evaluation
- Vita Midori (Demokratis) — Dataset generation, UI, frontend support