DIGIT-X Lab · LMU Munich
Turn unstructured medical documents into validated, machine-readable JSON.
Runs locally — no PHI leaves your machine.
flowchart LR
A["PDF / Image / Text"] --> B["Chandra OCR"]
B --> C["LLM Extraction"]
C --> D["Structured JSON"]
style A fill:#B5A89A,stroke:#8a7e72,color:#fff
style B fill:#E87461,stroke:#c25a49,color:#fff
style C fill:#E87461,stroke:#c25a49,color:#fff
style D fill:#B5A89A,stroke:#8a7e72,color:#fff
MOSAICX converts medical documents (radiology reports, pathology summaries, clinical notes) into structured JSON. Define what to extract with a YAML template, point it at your documents, get clean data back. Every field comes with an excerpt citing the source text.
Why MOSAICX? Fully local (no PHI leaves your machine), schema-driven (you define exactly what to extract), VLM-powered OCR via Chandra (handles scans, handwriting, tables), and HIPAA-conformant de-identification built in.
MOSAICX needs two servers running: an LLM for extraction and Chandra for OCR.
We recommend Gemma 4 31B via vLLM:
NVIDIA GPU:
pip install vllm
vllm serve google/gemma-4-31B-it --port 8000 --max-num-seqs 16
Adjust --max-num-seqs based on your GPU: 16 for 96GB (A6000 Pro), 8 for 80GB (A100), 4 for 24GB (4090).
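The sizing guidance above can be written as a small lookup. This is only a sketch: the three tiers come from the sentence above, while the function name `suggest_max_num_seqs`, the "largest tier that fits" rule for in-between cards, and the fallback of 1 below 24GB are assumptions, not MOSAICX or vLLM behavior.

```python
# Tiers taken from the guidance above: (GPU memory in GB, --max-num-seqs),
# ordered from largest to smallest.
TIERS = [(96, 16), (80, 8), (24, 4)]


def suggest_max_num_seqs(vram_gb: float) -> int:
    """Return the value of the largest listed tier that fits in vram_gb."""
    for tier_gb, max_num_seqs in TIERS:
        if vram_gb >= tier_gb:
            return max_num_seqs
    # Assumption: below the smallest listed tier, fall back to one sequence.
    return 1


if __name__ == "__main__":
    for gb in (24, 48, 80, 96):
        print(f"{gb} GB -> --max-num-seqs {suggest_max_num_seqs(gb)}")
```

Treat the result as a starting point and tune it against actual memory use on your hardware.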
Apple Silicon (Mac M1/M2/M3/M4):
pip install vllm-mlx
vllm-mlx serve mlx-community/gemma-4-31b-it-bf16 --port 8000
Chandra is a VLM-based OCR that handles handwriting, tables, and complex layouts. Run it as a vLLM server on a GPU:
Option A -- Docker (easiest):
pip install chandra-ocr
VLLM_API_BASE=http://localhost:8001/v1 chandra_vllm
Option B -- bare vLLM:
vllm serve datalab-to/chandra-ocr-2 --port 8001
Note: Chandra is only needed for PDF/image documents. If you're extracting from .txt or .md files, you can skip this. Without Chandra, MOSAICX falls back to PaddleOCR automatically.
curl -s http://localhost:8000/v1/models # LLM server
Tip: Any OpenAI-compatible LLM server works (Ollama, llama.cpp, SGLang). vLLM + Gemma 4 31B is what we test against.
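If you prefer to script the check, the same `/v1/models` request can be made from Python for both servers. This is a minimal sketch using only the standard library. It assumes the ports used above (8000 for the LLM, 8001 for Chandra) and the standard OpenAI-style response shape (`{"data": [{"id": ...}]}`); the helper names `parse_model_ids` and `list_models` are illustrative and not part of MOSAICX.

```python
import json
import urllib.error
import urllib.request


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [m["id"] for m in payload.get("data", []) if "id" in m]


def list_models(base_url: str, timeout: float = 5.0) -> list[str]:
    """GET {base_url}/models; return model IDs, or [] if the request fails."""
    url = base_url.rstrip("/") + "/models"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return parse_model_ids(json.load(resp))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return []


if __name__ == "__main__":
    servers = {
        "LLM": "http://localhost:8000/v1",
        "Chandra OCR": "http://localhost:8001/v1",
    }
    for name, base in servers.items():
        models = list_models(base)
        print(f"{name}: {', '.join(models) if models else 'not reachable'}")
```

An empty list only tells you the request failed or returned no models; check the server logs for the reason.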
python -m venv ~/.mosaicx-venv
source ~/.mosaicx-venv/bin/activate
pip install git+https://github.com/DIGIT-X-Lab/MOSAICX.git
With uv (faster):
uv venv ~/.mosaicx-venv
source ~/.mosaicx-venv/bin/activate
uv pip install git+https://github.com/DIGIT-X-Lab/MOSAICX.git
Then create a .env file in your working directory:
MOSAICX_LM=openai/google/gemma-4-31B-it
MOSAICX_API_BASE=http://localhost:8000/v1
MOSAICX_API_KEY=not-needed
MOSAICX_CHANDRA_SERVER_URL=http://localhost:8001/v1
MOSAICX reads this automatically. Check everything works:
mosaicx doctor
Tell MOSAICX what to extract using natural language:
mosaicx template create --describe "chest CT with nodules, lung-rads score, and impression"
This generates a YAML template with typed fields (strings, numbers, enums, nested objects, lists). MOSAICX also ships with built-in templates:
mosaicx template list
Single document:
mosaicx extract --document report.pdf --template chest_ct
Batch (parallel):
mosaicx extract --dir ./reports/ --template chest_ct --workers 8 --output-dir ./results/
Output is clean JSON with {value, excerpt} for every field:
{
  "indication": {
    "value": "Follow-up pulmonary nodule",
    "excerpt": "Indication: Follow-up of incidentally detected pulmonary nodule"
  },
  "impression": {
    "value": "Stable 6mm nodule, recommend 12-month follow-up",
    "excerpt": "Impression: Stable 6mm solid nodule in right lower lobe"
  }
}
Remove PHI with HIPAA conformance (LLM + regex safety net):
mosaicx deidentify --document note.pdf
mosaicx deidentify --document note.pdf -o redacted.json
Batch:
mosaicx deidentify --dir ./notes/ --workers 4 --output-dir ./cleaned/
Output:
{
  "conformance": "hipaa",
  "redacted_text": "Patient [REDACTED] presented with...",
  "phi": [
    {"value": "John Doe", "type": "NAME", "excerpt": "Patient John Doe presented"},
    {"value": "01/15/1990", "type": "DATE", "excerpt": "DOB: 01/15/1990"}
  ]
}
Important: Data stays on your machine. MOSAICX runs against a local LLM server -- no external API calls, no cloud uploads. De-identification follows HIPAA Safe Harbor rules by default.
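Because both outputs carry excerpts, you can run simple sanity checks on them downstream. The sketch below is not part of MOSAICX. It assumes the two JSON shapes shown above, handles only flat (non-nested) extraction fields, and uses hypothetical helper names: `ungrounded_fields` flags fields whose excerpt is missing or not found verbatim in the source text, and `leaked_phi` flags PHI values that still appear in `redacted_text`.

```python
def ungrounded_fields(extraction: dict, source_text: str) -> list[str]:
    """Names of fields whose excerpt is missing or not verbatim in source_text."""
    missing = []
    for name, field in extraction.items():
        excerpt = field.get("excerpt") if isinstance(field, dict) else None
        if not excerpt or excerpt not in source_text:
            missing.append(name)
    return missing


def leaked_phi(deid_result: dict) -> list[str]:
    """PHI values that still occur verbatim in the redacted text."""
    redacted = deid_result.get("redacted_text", "")
    return [
        item["value"]
        for item in deid_result.get("phi", [])
        if item.get("value") and item["value"] in redacted
    ]


if __name__ == "__main__":
    source = (
        "Indication: Follow-up of incidentally detected pulmonary nodule\n"
        "Impression: Stable 6mm solid nodule in right lower lobe"
    )
    extraction = {
        "indication": {
            "value": "Follow-up pulmonary nodule",
            "excerpt": "Indication: Follow-up of incidentally detected pulmonary nodule",
        },
        "impression": {
            "value": "Stable 6mm nodule, recommend 12-month follow-up",
            "excerpt": "Impression: Stable 6mm solid nodule in right lower lobe",
        },
    }
    deid = {
        "redacted_text": "Patient [REDACTED] presented with...",
        "phi": [{"value": "John Doe", "type": "NAME"}],
    }
    print("ungrounded:", ungrounded_fields(extraction, source))
    print("leaked:", leaked_phi(deid))
```

Exact substring matching is deliberately strict: differences in whitespace or line breaks between the OCR text and the excerpt will show up as ungrounded, so you may want to normalize both sides first.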
All settings live in a .env file (recommended) or environment variables with the MOSAICX_ prefix:
MOSAICX_LM=openai/google/gemma-4-31B-it
MOSAICX_API_BASE=http://localhost:8000/v1
MOSAICX_API_KEY=not-needed
MOSAICX_OCR_ENGINE=chandra
MOSAICX_CHANDRA_SERVER_URL=http://localhost:8001/v1
# View active config
mosaicx config show
See docs/configuration.md for the full reference.
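To illustrate the MOSAICX_ prefix convention, here is a rough sketch of reading such a .env file into a settings dictionary. It is not MOSAICX's actual loader: the function name `parse_env`, the lower-casing of keys, and the quote stripping are assumptions made for the example. `mosaicx config show` remains the authoritative view of the active configuration.

```python
def parse_env(text: str, prefix: str = "MOSAICX_") -> dict[str, str]:
    """Collect KEY=VALUE lines that start with prefix, keyed without the prefix."""
    settings: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key.startswith(prefix):
            # Assumption: surrounding quotes are optional and can be dropped.
            settings[key[len(prefix):].lower()] = value.strip().strip("\"'")
    return settings


if __name__ == "__main__":
    sample = """\
MOSAICX_LM=openai/google/gemma-4-31B-it
MOSAICX_API_BASE=http://localhost:8000/v1
MOSAICX_API_KEY=not-needed
MOSAICX_OCR_ENGINE=chandra
MOSAICX_CHANDRA_SERVER_URL=http://localhost:8001/v1
"""
    for name, value in parse_env(sample).items():
        print(f"{name} = {value}")
```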
| Guide | Description |
|---|---|
| Quickstart | First successful run in ~10 minutes |
| Getting Started | Install, first extraction, basics |
| CLI Reference | Every command, every flag, examples |
| Schemas & Templates | Create and manage extraction templates |
| Configuration | Env vars, backends, OCR, export formats |
| Developer Guide | Custom pipelines, Python SDK, MCP server |
git clone https://github.com/DIGIT-X-Lab/MOSAICX.git
cd MOSAICX
pip install -e ".[dev]"
pytest tests/ -q
@software{mosaicx2025,
  title = {MOSAICX: Medical cOmputational Suite for Advanced Intelligent eXtraction},
  author = {Sundar, Lalith Kumar Shiyam and DIGIT-X Lab},
  year = {2025},
  url = {https://github.com/DIGIT-X-Lab/MOSAICX},
  doi = {10.5281/zenodo.17601890}
}
Apache 2.0 -- see LICENSE.
Research: lalith.shiyam@med.uni-muenchen.de | Commercial: lalith@zenta.solutions | Issues: github.com/DIGIT-X-Lab/MOSAICX/issues
