Consultation Feedback Automation (PoC)

This repository provides the code for an automated pipeline to:

  1. Filter raw data
    Select consultations/statements by page count, table density, language (DE), and federal level.
  2. PDF → Markdown (DE only)
    Extract text from consultation and statement PDFs into Markdown (with OCR fallback where needed). Detect the language of each page and keep only pages in German.
  3. Statement letter detection and splitting
    Split multi-letter statement PDFs into individual letters (documents) using page markers, regex rules and structural cues.
  4. Duplicate and table removal
    Detect near-duplicate documents and remove them from the dataset; likewise exclude documents containing a large number of tables.
  5. Chunking
    Segment consultation and statement documents into semantically coherent chunks (token-bounded).
  6. Embeddings
    Encode all chunks with BAAI/bge-m3; build FAISS indices for efficient retrieval/matching.
  7. Sentiment classification (local LLM, function calling)
    For each statement chunk, classify
    supports | rejects | neutral | no_comment | agrees_with_others
    using Mistral-7B-Instruct v0.3 (local, GGUF via llama.cpp) with strict JSON/function calling. A rule-based aggregation then derives the overall per-document sentiment from the chunk-level labels.
  8. Article/keyword extraction (local LLM, function calling)
    Extract referenced articles, paragraphs, and keywords (Artikel / Absätze / Schlüsselwörter) from both consultations and statements to inform the mapping.
  9. Statement → Consultation mapping
    Match statement chunks to the most relevant consultation chunks via a hybrid scorer:
  • semantic similarity (bge-m3 cosine),
  • cross-encoder reranker (BAAI/bge-reranker-v2-m3),
  • article/keyword alignment signals,
  • aggregate to a final score.
  10. Evaluation
  • Sampled test sets (classification and mapping) with manually curated ground truth → report precision/recall/F1 & mapping metrics.
  • LLM-as-Judge (local Mistral, function calling) for scalable qualitative checks over the full dataset.
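The token-bounded chunking of step 5 can be sketched as follows. This is a simplified illustration, not the repository's implementation: it approximates token counts by whitespace splitting, whereas the real pipeline presumably uses a proper tokenizer, and `chunk_paragraphs` is a hypothetical name.

```python
def chunk_paragraphs(paragraphs: list, max_tokens: int = 256) -> list:
    """Greedily pack paragraphs into token-bounded chunks.

    Token counting is approximated by whitespace splitting here; a real
    pipeline would count tokens with the embedding model's tokenizer.
    """
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Flush the current chunk if adding this paragraph would exceed the bound.
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Keeping paragraph boundaries intact (rather than splitting mid-sentence) is what makes the resulting chunks semantically coherent for embedding.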
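The hybrid scorer of step 9 can be sketched as a weighted combination of the three signals. The weights and the `hybrid_score` helper are illustrative assumptions; the repository's actual aggregation may differ.

```python
def hybrid_score(cos_sim: float, rerank_score: float,
                 stmt_articles: set, cons_articles: set,
                 w_sem: float = 0.4, w_rerank: float = 0.4,
                 w_art: float = 0.2) -> float:
    """Aggregate semantic, reranker, and article-alignment signals.

    cos_sim: bge-m3 cosine similarity between the chunk embeddings.
    rerank_score: cross-encoder (bge-reranker-v2-m3) relevance score.
    The article signal is modeled as Jaccard overlap of referenced articles.
    """
    union = stmt_articles | cons_articles
    art_overlap = len(stmt_articles & cons_articles) / len(union) if union else 0.0
    return w_sem * cos_sim + w_rerank * rerank_score + w_art * art_overlap
```

Each statement chunk would then be matched to the consultation chunk maximizing this score.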

Data

We use the public Hugging Face dataset demokratis/consultation-documents.

Environment setup (Pipenv)

The Python environment is managed with Pipenv. Set it up with the following steps:

  • Run pipenv lock to generate Pipfile.lock, which pins the versions of the Python packages.
  • Run pipenv install --dev to create the virtual environment and install the packages; the --dev flag also installs the development packages.
  • Run pipenv shell to activate the environment.
  • Run pre-commit install to install the pre-commit hooks so they run at every commit.

Local models

To run the LLM, set up llama.cpp:

llama.cpp Setup

llama.cpp is a lightweight, high-performance runtime for LLaMA-compatible models (e.g., Mistral, Llama 3, Qwen) using GGUF model files. It supports CPU, Apple Metal, CUDA, and OpenCL.


Installation

macOS

Option 1: Install via Homebrew
brew update
brew install llama.cpp

Verify:

llama-cli --help
llama-server --help

Option 2: Build from source

xcode-select --install
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_METAL=1        # GPU acceleration on Apple Silicon

Linux

Option 1 — Prebuilt binaries

Download a release from: https://github.com/ggerganov/llama.cpp/releases

Example:

wget https://github.com/ggerganov/llama.cpp/releases/latest/download/llama-linux-x64.zip
unzip llama-linux-x64.zip
sudo mv llama* /usr/local/bin/

Option 2 — Build from source

sudo apt update
sudo apt install -y build-essential cmake

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build .
cmake --build build --config Release

CUDA build:

cmake -B build -DLLAMA_CUDA=1 .
cmake --build build --config Release

Windows

Option 1: Prebuilt binaries

Download ZIP from: https://github.com/ggerganov/llama.cpp/releases

Unzip and run:

llama-cli.exe --help

Option 2: Build using Visual Studio

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -A x64 .
cmake --build build --config Release

Executables will appear under:

build\bin\Release\

Running a GGUF Model

To run a model, download a .gguf file into a folder, for example:

models/Mistral-7B-Instruct-v0.3.Q6_K.gguf

Then launch the local OpenAI-compatible server; the pipeline's LLM steps require it to be running.

llama-server \
  -m /Path/to/your/model/Mistral-7B-Instruct-v0.3.Q6_K.gguf \
  -c 3072 \
  -t 8 \
  --n-gpu-layers 100 \
  --host 127.0.0.1 \
  --port 8000 \
  --jinja

What this does

  • -m: path to your GGUF model
  • -c 3072: context length
  • -t 8: number of CPU threads
  • --n-gpu-layers 100: offload up to 100 layers to the GPU (Apple Metal / CUDA); in practice all layers of a 7B model
  • --jinja: enables Jinja-based templating for chat prompts
  • The API endpoint becomes:
http://127.0.0.1:8000/v1
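With the server running, the function-calling classification of step 7 can be exercised against the OpenAI-compatible endpoint. This is a minimal sketch using only the standard library; the tool schema and the names `classify_chunk`, `TOOLS`, and `SENTIMENT_LABELS` are illustrative assumptions, not the repository's exact code.

```python
import json
import urllib.request

# The five labels used in step 7 of the pipeline.
SENTIMENT_LABELS = ["supports", "rejects", "neutral",
                    "no_comment", "agrees_with_others"]

# Illustrative tool schema; the repository's actual schema may differ.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "classify_sentiment",
        "description": "Classify the sentiment of a statement chunk.",
        "parameters": {
            "type": "object",
            "properties": {"label": {"type": "string", "enum": SENTIMENT_LABELS}},
            "required": ["label"],
        },
    },
}]

def classify_chunk(text: str, base_url: str = "http://127.0.0.1:8000/v1") -> str:
    """Ask the local llama-server to classify one chunk via function calling."""
    payload = {
        "model": "mistral-7b-instruct",  # llama-server serves whichever model it loaded
        "messages": [{"role": "user",
                      "content": f"Classify this statement chunk:\n{text}"}],
        "tools": TOOLS,
        # Forcing the tool call makes the model answer in strict JSON.
        "tool_choice": {"type": "function",
                        "function": {"name": "classify_sentiment"}},
    }
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    call = body["choices"][0]["message"]["tool_calls"][0]
    return json.loads(call["function"]["arguments"])["label"]
```

Constraining the output to an enum via the tool schema is what makes the classification robust enough to parse programmatically.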

Configure Paths

Edit src/config.py:

TESSERACT_PATH = "/your/local/path/to/tesseract"

Run the pipeline (manual — without DVC)

From the repository root:

# 1) Filtering of raw data
pipenv run python -m src.pipeline.filter.filtering
# writes: data/filter/*-documents-filtered.parquet

# 2) PDF text extraction → Markdown (DE)
pipenv run python -m src.pipeline.extract_text.extracting_text
# writes: data/extract_text/*-documents-md.parquet

# 3) Split multi-letter PDFs into letters (statements)
pipenv run python -m src.pipeline.split.splitting
# writes: data/split/statement-documents-md-split.parquet

# 4) Deduplicate and detect tables
pipenv run python -m src.pipeline.deduplicate.deduplicating
# writes: data/deduplicate/*-documents-unique.parquet

# 5) Chunk letters into model-friendly segments
pipenv run python -m src.pipeline.chunk.chunking
# writes: data/chunk/*-documents-chunked.parquet

# 6) Sentiment classification (local Mistral with function calling)
pipenv run python -m src.pipeline.classify.classifying
# writes: data/classify/statement-documents-classified.parquet
pipenv run python -m src.pipeline.classify.classifying_general_sentiment
# updates: data/classify/statement-documents-classified.parquet with per-document sentiment

# 7) Mapping (statement → consultation) using embedding sim + cross-encoder + article match signals
pipenv run python -m src.pipeline.map.mapping_prep
# writes: data/map/*-documents-mapping-prepped.parquet
pipenv run python -m src.pipeline.map.mapping
# writes: data/map/*-documents-mapped.parquet

# 8) Evaluation
## 8a) LLM-as-Judge at scale for classification (local Mistral, function calling)
pipenv run python -m src.pipeline.judge.judging_classification
# writes: data/judge/statement-documents-judged.parquet
pipenv run python -m src.pipeline.evaluate.evaluating_judging
# prints and writes classification report: data/evaluate/statements-documents-classified-judge-eval.txt

## 8b) LLM-as-Judge at scale for mapping (local Mistral, function calling)
pipenv run python -m src.pipeline.judge.judging_mapping
# writes: data/judge/stmt_to_cons_mapping_judged.parquet
pipenv run python -m src.pipeline.evaluate.evaluating_judging
# prints and writes classification report: data/evaluate/stmt_to_cons_mapping-judge-results.txt
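The classification reports in step 8 boil down to per-label precision, recall, and F1 over the curated ground truth. A minimal pure-Python sketch of that computation (the pipeline itself presumably uses a library such as scikit-learn; `per_label_prf` is a hypothetical name):

```python
from collections import Counter

def per_label_prf(y_true: list, y_pred: list) -> dict:
    """Return {label: (precision, recall, f1)} for each label seen."""
    labels = set(y_true) | set(y_pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but it was wrong
            fn[t] += 1  # true label t was missed
    report = {}
    for lab in labels:
        prec = tp[lab] / (tp[lab] + fp[lab]) if tp[lab] + fp[lab] else 0.0
        rec = tp[lab] / (tp[lab] + fn[lab]) if tp[lab] + fn[lab] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        report[lab] = (prec, rec, f1)
    return report
```

The same metrics apply whether the reference labels come from the manually curated test sets or from the LLM-as-Judge pass.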

Models and Tools

This project relies exclusively on open-source models and tools, all of which are executed locally. Model weights are not redistributed as part of this repository. Users must obtain model weights separately and comply with the respective model licenses.

Language Models

  • Mistral-7B-Instruct v0.3 (GGUF, run locally via llama.cpp)

Embedding & Reranking Models

  • BAAI/bge-m3 (embeddings)
  • BAAI/bge-reranker-v2-m3 (cross-encoder reranking)

Document Processing

  • Tesseract (OCR fallback for scanned PDFs)

License

The source code in this repository is licensed under the MIT License.

This project also uses third-party libraries and models that remain under their respective licenses. See THIRD_PARTY_LICENSES.md for details.

Contributors

  • Julia Netzel (Visium) — Core NLP pipeline, data processing, modeling, evaluation
  • Vita Midori (Demokratis) — Dataset generation, UI, frontend support
