This repository provides the code for an automated pipeline to:
- Filter raw data
  Select consultations/statements by page count, table density, language (DE), and federal level.
- PDF → Markdown (DE only)
  Extract text from consultation and statement PDFs (with OCR fallback where needed) into Markdown. Detect the language of each page and keep only pages in German.
- Statement letter detection and splitting
  Split multi-letter statement PDFs into individual letters (documents) using page markers, regex rules, and structural cues.
- Duplicate and table removal
  Detect near-duplicate documents and omit them from the dataset. Also detect and exclude documents with a large share of tables.
- Chunking
  Segment consultation and statement documents into semantically coherent, token-bounded chunks.
- Embeddings
  Encode all chunks with BAAI/bge-m3; build FAISS indices for efficient retrieval/matching.
- Sentiment classification (local LLM, function calling)
  For each statement chunk, classify
  supports | rejects | neutral | no_comment | agrees_with_others
  using Mistral-7B-Instruct v0.3 (local, GGUF via llama.cpp) with strict JSON/function calling. A rule-based approach then derives the overall sentiment per document from the per-chunk sentiments.
- Article/keyword extraction (local LLM, function calling)
  Extract referenced Artikel / Absätze / Schlüsselwörter (articles / paragraphs / keywords) from both consultations and statements to inform the mapping.
- Statement → Consultation mapping
  Match statement chunks to the most relevant consultation chunks via a hybrid scorer:
  - semantic similarity (bge-m3 cosine),
  - cross-encoder reranker (BAAI/bge-reranker-v2-m3),
  - article/keyword alignment signals,
  - aggregated into a final score.
- Evaluation
  - Sampled test sets (classification and mapping) with manually curated ground truth → report precision/recall/F1 & mapping metrics.
  - LLM-as-Judge (local Mistral, function calling) for scalable qualitative checks over the full dataset.
We use the public Hugging Face dataset demokratis/consultation-documents:
- Consultation Documents Features (https://huggingface.co/datasets/demokratis/consultation-documents/blob/main/consultation-documents-features.parquet)
- Consultation Documents Preprocessed (https://huggingface.co/datasets/demokratis/consultation-documents/blob/main/consultation-documents-preprocessed.parquet)
The Python environment is managed with pipenv. You can set up your environment with the following steps:

- Run `pipenv lock` to generate the `Pipfile.lock`, which pins the versions of your Python packages.
- Run `pipenv install --dev` to create a virtual environment and install the Python packages. The `--dev` flag also installs the development packages.
- Run `pipenv shell` to activate your Python environment.
- Run `pre-commit install` to install the pre-commit hooks and make sure they run at every commit.
- Embeddings: BAAI/bge-m3 (downloaded on first use)
- Reranker: BAAI/bge-reranker-v2-m3 (downloaded on first use)
- LLM (OpenAI-compatible server via llama.cpp):
Get a local GGUF of Mistral-7B-Instruct v0.3 (e.g. from MaziyarPanahi):
https://huggingface.co/MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF
To run the LLM, set up llama.cpp:
llama.cpp is a lightweight, high-performance runtime for LLaMA-compatible models (e.g., Mistral, Llama 3, Qwen) using GGUF model files.
It supports CPU, Apple Metal, CUDA, and OpenCL.
macOS (Homebrew):

```
brew update
brew install llama.cpp
```

Verify:

```
llama-cli --help
llama-server --help
```

macOS (build from source):

```
xcode-select --install
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_METAL=1   # GPU acceleration on Apple Silicon
```

Linux (prebuilt release): download a release from https://github.com/ggerganov/llama.cpp/releases. Example:

```
wget https://github.com/ggerganov/llama.cpp/releases/latest/download/llama-linux-x64.zip
unzip llama-linux-x64.zip
sudo mv llama* /usr/local/bin/
```

Linux (build from source):

```
sudo apt update
sudo apt install -y build-essential cmake
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build .
cmake --build build --config Release
```

CUDA build:

```
cmake -B build -DLLAMA_CUDA=1 .
cmake --build build --config Release
```

Windows (prebuilt release): download a ZIP from https://github.com/ggerganov/llama.cpp/releases, unzip, and run:

```
llama-cli.exe --help
```

Windows (build from source):

```
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -A x64 .
cmake --build build --config Release
```

Executables will appear under:

```
build\bin\Release\
```
To run a model, download a .gguf file into a folder, for example:

```
models/Mistral-7B-Instruct-v0.3.Q6_K.gguf
```

Then launch the local OpenAI-compatible server. This step is necessary to execute the model pipeline.

```
llama-server \
  -m /Path/to/your/model/Mistral-7B-Instruct-v0.3.Q6_K.gguf \
  -c 3072 \
  -t 8 \
  --n-gpu-layers 100 \
  --host 127.0.0.1 \
  --port 8000 \
  --jinja
```

- `-m`: path to your GGUF model
- `-c 3072`: context length
- `-t 8`: number of CPU threads
- `--n-gpu-layers 100`: offload the first N layers to the GPU (Mac Metal / CUDA)
- `--jinja`: enables Jinja-based templating for chat prompts

The API endpoint becomes:

```
http://127.0.0.1:8000/v1
```
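The pipeline's classification and extraction steps call this endpoint with strict JSON/function calling. Below is a minimal sketch of what such a request payload looks like; the function name `classify_sentiment`, the system prompt, and the exact schema layout are illustrative assumptions, not the pipeline's actual schema (the five sentiment labels are from this README):

```python
# Hypothetical function-calling payload for the local OpenAI-compatible
# llama-server endpoint; schema name and prompt are illustrative.
def build_classification_request(chunk_text: str) -> dict:
    """Build a chat-completions request that forces a structured tool call."""
    return {
        "model": "Mistral-7B-Instruct-v0.3",
        "messages": [
            {"role": "system",
             "content": "Classify the sentiment of the statement chunk."},
            {"role": "user", "content": chunk_text},
        ],
        "tools": [{
            "type": "function",
            "function": {
                "name": "classify_sentiment",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "label": {
                            "type": "string",
                            "enum": ["supports", "rejects", "neutral",
                                     "no_comment", "agrees_with_others"],
                        },
                    },
                    "required": ["label"],
                },
            },
        }],
        # Forcing the tool call yields strict, machine-parseable JSON.
        "tool_choice": {"type": "function",
                        "function": {"name": "classify_sentiment"}},
    }

payload = build_classification_request("Wir unterstützen die Vorlage.")
print(payload["tool_choice"]["function"]["name"])  # → classify_sentiment
```

POSTing this payload to `http://127.0.0.1:8000/v1/chat/completions` returns the label inside the tool-call arguments rather than free text.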
Edit `src/config.py`:

```
TESSERACT_PATH = "/your/local/path/to/tesseract"
```
From the src directory:
```
# 1) Filtering of raw data
pipenv run python -m src.pipeline.filter.filtering
# writes: data/filter/*-documents-filtered.parquet
```
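The filtering step above can be sketched as a simple predicate over document records; the field names and thresholds below are assumptions for illustration, not the values used by the actual filter module:

```python
# Illustrative filter predicate; field names and thresholds are assumed.
def keep_document(doc: dict,
                  max_pages: int = 100,
                  max_table_density: float = 0.5) -> bool:
    """True if a document passes the page-count, table-density,
    language, and federal-level filters described in this README."""
    return (
        doc["page_count"] <= max_pages
        and doc["table_density"] <= max_table_density
        and doc["language"] == "de"
        and doc["level"] == "federal"
    )

docs = [
    {"page_count": 12, "table_density": 0.1, "language": "de", "level": "federal"},
    {"page_count": 500, "table_density": 0.1, "language": "de", "level": "federal"},
    {"page_count": 8, "table_density": 0.1, "language": "fr", "level": "federal"},
]
kept = [d for d in docs if keep_document(d)]
print(len(kept))  # → 1
```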
```
# 2) PDF text extraction → Markdown (DE)
pipenv run python -m src.pipeline.extract_text.extracting_text
# writes: data/extract_text/*-documents-md.parquet
```
```
# 3) Split multi-letter PDFs into letters (statements)
pipenv run python -m src.pipeline.split.splitting
# writes: data/split/statement-documents-md-split.parquet
```
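The splitting step uses page markers, regex rules, and structural cues. A heavily simplified sketch, assuming a hypothetical `<!-- page N -->` marker format and using a salutation regex as the sole structural cue (the real module combines several signals):

```python
import re

# Assumed page-marker format and salutation cue; both are illustrative.
PAGE_MARKER = re.compile(r"<!-- page \d+ -->")
LETTER_START = re.compile(r"^Sehr geehrte", re.MULTILINE)

def split_letters(markdown: str) -> list[str]:
    """Split a multi-letter markdown document at pages that open a new letter."""
    pages = [p for p in PAGE_MARKER.split(markdown) if p.strip()]
    letters: list[str] = []
    for page in pages:
        # A salutation near the top of a page signals a new letter.
        if LETTER_START.search(page[:200]) or not letters:
            letters.append(page)
        else:
            letters[-1] += page
    return letters

doc = (
    "<!-- page 1 -->\nSehr geehrte Damen und Herren\nText A\n"
    "<!-- page 2 -->\nFortsetzung von A\n"
    "<!-- page 3 -->\nSehr geehrte Frau Bundesrätin\nText B\n"
)
print(len(split_letters(doc)))  # → 2
```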
```
# 4) Deduplicate and detect tables
pipenv run python -m src.pipeline.deduplicate.deduplicating
# writes: data/deduplicate/*-documents-unique.parquet
```
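One common way to detect near-duplicates, shown here as a sketch, is shingle-set Jaccard similarity with a high threshold; the deduplication module's actual method and threshold may differ:

```python
# Sketch: near-duplicate removal via character-shingle Jaccard similarity.
def shingles(text: str, k: int = 5) -> set[str]:
    """Character k-grams of a whitespace-normalized, lowercased text."""
    t = " ".join(text.lower().split())
    return {t[i:i + k] for i in range(max(1, len(t) - k + 1))}

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def deduplicate(texts: list[str], threshold: float = 0.9) -> list[str]:
    """Keep the first occurrence of each group of near-duplicates."""
    kept: list[str] = []
    for text in texts:
        sig = shingles(text)
        if all(jaccard(sig, shingles(k)) < threshold for k in kept):
            kept.append(text)
    return kept

docs = ["Wir lehnen die Vorlage ab.",
        "Wir  lehnen die Vorlage ab.",   # near-duplicate (extra whitespace)
        "Wir stimmen zu."]
print(len(deduplicate(docs)))  # → 2
```

The pairwise loop is quadratic; at scale one would bucket candidates first (e.g. with MinHash), but the similarity test itself is the same idea.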
```
# 5) Chunk letters into model-friendly segments
pipenv run python -m src.pipeline.chunk.chunking
# writes: data/chunk/*-documents-chunked.parquet
```
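Token-bounded chunking, as described in the overview, can be sketched as greedy packing of paragraphs; here whitespace tokens stand in for the real tokenizer, and paragraph boundaries approximate semantic coherence:

```python
# Sketch: greedily pack paragraphs into chunks of at most max_tokens tokens.
def chunk_document(markdown: str, max_tokens: int = 512) -> list[str]:
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in (p for p in markdown.split("\n\n") if p.strip()):
        n = len(para.split())
        # Flush the current chunk if adding this paragraph would overflow it.
        if current and count + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Four paragraphs of ~202 tokens each → two chunks under a 512-token bound.
doc = "\n\n".join(f"Absatz {i} " + "wort " * 200 for i in range(4))
print(len(chunk_document(doc, max_tokens=512)))  # → 2
```

A paragraph longer than `max_tokens` still becomes its own (oversized) chunk; a production chunker would additionally split such paragraphs.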
```
# 6) Sentiment classification (local Mistral with function calling)
pipenv run python -m src.pipeline.classify.classifying
# writes: data/classify/statement-documents-classified.parquet
pipenv run python -m src.pipeline.classify.classifying_general_sentiment
# writes: data/classify/statement-documents-classified.parquet
```
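The second command derives a document-level sentiment from the per-chunk labels with rules. The rules below are a plausible sketch only; the actual logic lives in `classifying_general_sentiment` and may differ:

```python
from collections import Counter

# Hedged sketch of rule-based document-level aggregation (rules assumed).
def document_sentiment(chunk_labels: list[str]) -> str:
    """Aggregate per-chunk sentiment labels into one document-level label."""
    substantive = [l for l in chunk_labels
                   if l not in ("neutral", "no_comment")]
    if not substantive:
        return "no_comment" if "no_comment" in chunk_labels else "neutral"
    counts = Counter(substantive)
    top, freq = counts.most_common(1)[0]
    # Mixed signals with no clear majority read as neutral overall.
    if freq * 2 <= len(substantive):
        return "neutral"
    return top

print(document_sentiment(["supports", "supports", "neutral", "rejects"]))  # → supports
```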
```
# 7) Mapping (statement → consultation) using embedding sim + cross-encoder + article match signals
pipenv run python -m src.pipeline.map.mapping_prep
# writes: data/map/*-documents-mapping-prepped.parquet
pipenv run python -m src.pipeline.map.mapping
# writes: data/map/*-documents-mapped.parquet
```
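The hybrid scorer combines semantic similarity (bge-m3 cosine), the cross-encoder reranker score, and article/keyword alignment into a final score. A sketch of the aggregation step, assuming the component scores are precomputed; the weights are illustrative, not the pipeline's:

```python
# Sketch of the hybrid-score aggregation; weights are illustrative.
def keyword_overlap(a: set[str], b: set[str]) -> float:
    """Jaccard overlap between extracted article/keyword sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def hybrid_score(cosine: float, reranker: float, overlap: float,
                 weights: tuple[float, float, float] = (0.3, 0.5, 0.2)) -> float:
    """Weighted combination of the three matching signals."""
    w_cos, w_rr, w_kw = weights
    return w_cos * cosine + w_rr * reranker + w_kw * overlap

# Example: a statement chunk citing "Art. 5" scored against a consultation chunk.
overlap = keyword_overlap({"Art. 5", "Abs. 2"}, {"Art. 5"})   # 0.5
score = hybrid_score(cosine=0.82, reranker=0.91, overlap=overlap)
print(round(score, 3))  # → 0.801
```

For each statement chunk, the candidate consultation chunk with the highest final score is chosen as the match.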
```
# 8) Evaluation
## 8a) LLM-as-Judge at scale for classification (local Mistral, function calling)
pipenv run python -m src.pipeline.judge.judging_classification
# writes: data/judge/statement-documents-judged.parquet
pipenv run python -m src.pipeline.evaluate.evaluating_judging
# prints and writes classification report: data/evaluate/statements-documents-classified-judge-eval.txt
## 8b) LLM-as-Judge at scale for mapping (local Mistral, function calling)
pipenv run python -m src.pipeline.judge.judging_mapping
# writes: data/judge/stmt_to_cons_mapping_judged.parquet
pipenv run python -m src.pipeline.evaluate.evaluating_judging
# prints and writes classification report: data/evaluate/stmt_to_cons_mapping-judge-results.txt
```

This project relies exclusively on open-source models and tools, all of which are executed locally. Model weights are not redistributed as part of this repository. Users must obtain model weights separately and comply with the respective model licenses.
- Mistral-7B-Instruct v0.3 (https://github.com/mistralai/mistral-inference), License: Apache 2.0
- BAAI/bge-m3 (bi-encoder embeddings) (https://huggingface.co/BAAI/bge-m3), License: MIT
- BAAI/bge-reranker-v2-m3 (cross-encoder reranker) (https://huggingface.co/BAAI/bge-reranker-v2-m3), License: MIT
- Docling (PDF → structured Markdown extraction) (https://github.com/DS4SD/docling), License: Apache 2.0
- Tesseract OCR (https://github.com/tesseract-ocr/tesseract), License: Apache 2.0
The source code in this repository is licensed under the MIT License.
This project also uses third-party libraries and models that remain under their
respective licenses. See THIRD_PARTY_LICENSES.md for details.
- Julia Netzel (Visium) — Core NLP pipeline, data processing, modeling, evaluation
- Vita Midori (Demokratis) — Dataset generation, UI, frontend support