MOSAICX

DIGIT-X Lab · LMU Munich
Turn unstructured medical documents into validated, machine-readable JSON.
Runs locally — no PHI leaves your machine.


How It Works

flowchart LR
    A["PDF / Image / Text"] --> B["Chandra OCR"]
    B --> C["LLM Extraction"]
    C --> D["Structured JSON"]

    style A fill:#B5A89A,stroke:#8a7e72,color:#fff
    style B fill:#E87461,stroke:#c25a49,color:#fff
    style C fill:#E87461,stroke:#c25a49,color:#fff
    style D fill:#B5A89A,stroke:#8a7e72,color:#fff

MOSAICX converts medical documents (radiology reports, pathology summaries, clinical notes) into structured JSON. Define what to extract with a YAML template, point it at your documents, get clean data back. Every field comes with an excerpt citing the source text.

Why MOSAICX? Fully local (no PHI leaves your machine), schema-driven (you define exactly what to extract), VLM-powered OCR via Chandra (handles scans, handwriting, tables), and HIPAA-conformant de-identification built in.

Prerequisites

MOSAICX uses two servers: an LLM for extraction and, for PDFs and images, Chandra for OCR.

1. LLM Server

We recommend Gemma 4 31B via vLLM:

NVIDIA GPU:

pip install vllm
vllm serve google/gemma-4-31B-it --port 8000 --max-num-seqs 16

Adjust --max-num-seqs to your GPU memory: 16 for 96 GB (RTX PRO 6000), 8 for 80 GB (A100), 4 for 24 GB (4090).

Apple Silicon (Mac M1/M2/M3/M4):

pip install vllm-mlx
vllm-mlx serve mlx-community/gemma-4-31b-it-bf16 --port 8000

2. OCR Server (for PDFs and images)

Chandra is a VLM-based OCR engine that handles handwriting, tables, and complex layouts. Run it as a vLLM server on a GPU:

Option A -- Docker (easiest):

pip install chandra-ocr
VLLM_API_BASE=http://localhost:8001/v1 chandra_vllm

Option B -- bare vLLM:

vllm serve datalab-to/chandra-ocr-2 --port 8001

Note

Chandra is only needed for PDF/image documents. If you're extracting from .txt or .md files, you can skip this. Without Chandra, MOSAICX falls back to PaddleOCR automatically.

Verify

curl -s http://localhost:8000/v1/models    # LLM server
curl -s http://localhost:8001/v1/models    # OCR server (only if Chandra is running)
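The same check can be scripted. The sketch below is not part of MOSAICX; it assumes both servers expose the OpenAI-style `/v1/models` listing on the ports used above, and the helper names `models_endpoint` and `list_models` are made up for illustration.

```python
import http.client
import json
import urllib.request
from typing import List, Optional


def models_endpoint(api_base: str) -> str:
    """Return the OpenAI-style model listing URL for an API base URL."""
    return api_base.rstrip("/") + "/models"


def list_models(api_base: str, timeout: float = 3.0) -> Optional[List[str]]:
    """Return the model ids served at api_base, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(models_endpoint(api_base), timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers refused connections, timeouts, and HTTP errors;
        # ValueError covers a response body that is not valid JSON.
        return None
    if not isinstance(payload, dict):
        return None
    return [str(m.get("id", "?")) for m in payload.get("data", []) if isinstance(m, dict)]


if __name__ == "__main__":
    # Ports follow the commands above; change them if you serve elsewhere.
    servers = {
        "LLM": "http://localhost:8000/v1",
        "OCR (Chandra)": "http://localhost:8001/v1",
    }
    for name, base in servers.items():
        models = list_models(base)
        status = "unreachable" if models is None else ", ".join(models) or "no models listed"
        print(f"{name:15s} {base:28s} {status}")
```

An unreachable OCR server is only a problem if you plan to process PDFs or images.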

Tip

Any OpenAI-compatible LLM server works (Ollama, llama.cpp, SGLang). vLLM + Gemma 4 31B is what we test against.

Install

python -m venv ~/.mosaicx-venv
source ~/.mosaicx-venv/bin/activate
pip install git+https://github.com/DIGIT-X-Lab/MOSAICX.git

With uv (faster):

uv venv ~/.mosaicx-venv
source ~/.mosaicx-venv/bin/activate
uv pip install git+https://github.com/DIGIT-X-Lab/MOSAICX.git

Then create a .env file in your working directory:

MOSAICX_LM=openai/google/gemma-4-31B-it
MOSAICX_API_BASE=http://localhost:8000/v1
MOSAICX_API_KEY=not-needed
MOSAICX_CHANDRA_SERVER_URL=http://localhost:8001/v1

MOSAICX reads this automatically. Check everything works:

mosaicx doctor
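If `mosaicx doctor` reports a configuration problem, it can help to confirm what the .env file actually contains. The following is a minimal sketch of such a check, not MOSAICX's own loader: the parser handles only simple `KEY=value` lines, and `parse_env` and `missing_keys` are hypothetical helper names.

```python
from pathlib import Path
from typing import Dict, Iterable, List

# The four keys shown in the .env example above.
EXPECTED_KEYS = (
    "MOSAICX_LM",
    "MOSAICX_API_BASE",
    "MOSAICX_API_KEY",
    "MOSAICX_CHANDRA_SERVER_URL",
)


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def missing_keys(settings: Dict[str, str], expected: Iterable[str] = EXPECTED_KEYS) -> List[str]:
    """Return the expected keys that are absent or empty."""
    return [key for key in expected if not settings.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    if not env_path.exists():
        print("No .env file in the current directory")
    else:
        missing = missing_keys(parse_env(env_path.read_text(encoding="utf-8")))
        print("Missing keys:", ", ".join(missing) if missing else "none")
```

A missing `MOSAICX_CHANDRA_SERVER_URL` only matters when you process PDFs or images.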

Three Things You Can Do

1. Create a Template

Tell MOSAICX what to extract using natural language:

mosaicx template create --describe "chest CT with nodules, lung-rads score, and impression"

This generates a YAML template with typed fields (strings, numbers, enums, nested objects, lists). MOSAICX also ships with built-in templates:

mosaicx template list

2. Extract Structured Data

Single document:

mosaicx extract --document report.pdf --template chest_ct

Batch (parallel):

mosaicx extract --dir ./reports/ --template chest_ct --workers 8 --output-dir ./results/

Output is clean JSON with {value, excerpt} for every field:

{
  "indication": {
    "value": "Follow-up pulmonary nodule",
    "excerpt": "Indication: Follow-up of incidentally detected pulmonary nodule"
  },
  "impression": {
    "value": "Stable 6mm nodule, recommend 12-month follow-up",
    "excerpt": "Impression: Stable 6mm solid nodule in right lower lobe"
  }
}
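Because every field carries its own excerpt, the output is easy to turn into a review table. The sketch below assumes the flat shape shown above (one `{value, excerpt}` object per top-level field); templates with nested objects or lists may produce deeper structures, which this example simply sets aside. `split_fields` is a hypothetical helper, not part of MOSAICX.

```python
import json
from typing import Any, Dict, List, Tuple

# The example output from above, inlined so the sketch runs on its own.
SAMPLE = json.loads("""
{
  "indication": {
    "value": "Follow-up pulmonary nodule",
    "excerpt": "Indication: Follow-up of incidentally detected pulmonary nodule"
  },
  "impression": {
    "value": "Stable 6mm nodule, recommend 12-month follow-up",
    "excerpt": "Impression: Stable 6mm solid nodule in right lower lobe"
  }
}
""")


def split_fields(record: Dict[str, Any]) -> Tuple[List[Tuple[str, Any, str]], List[str]]:
    """Separate fields that have both value and excerpt from everything else."""
    cited: List[Tuple[str, Any, str]] = []
    other: List[str] = []
    for name, field in record.items():
        if isinstance(field, dict) and "value" in field and "excerpt" in field:
            cited.append((name, field["value"], field["excerpt"]))
        else:
            other.append(name)  # nested or differently shaped field
    return cited, other


if __name__ == "__main__":
    cited, other = split_fields(SAMPLE)
    for name, value, excerpt in cited:
        print(f"{name}: {value!r}\n    source: {excerpt!r}")
    if other:
        print("Fields needing separate handling:", ", ".join(other))
```

To use it on real output, replace `SAMPLE` with `json.load()` on a file written by `mosaicx extract`.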

3. De-identify Documents

Remove PHI under HIPAA conformance, using an LLM pass backed by a regex safety net:

mosaicx deidentify --document note.pdf
mosaicx deidentify --document note.pdf -o redacted.json

Batch:

mosaicx deidentify --dir ./notes/ --workers 4 --output-dir ./cleaned/

Output:

{
  "conformance": "hipaa",
  "redacted_text": "Patient [REDACTED] presented with...",
  "phi": [
    {"value": "John Doe", "type": "NAME", "excerpt": "Patient John Doe presented"},
    {"value": "01/15/1990", "type": "DATE", "excerpt": "DOB: 01/15/1990"}
  ]
}
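A simple downstream check on this output is to confirm that none of the detected PHI values still appears in `redacted_text`. The sketch below assumes the JSON shape shown above; `leaked_phi` and `phi_counts` are hypothetical helpers, and a clean result here does not prove that the document is free of PHI, since it only sees what was detected.

```python
from collections import Counter
from typing import Any, Dict, List

# The example output from above, inlined so the sketch runs on its own.
SAMPLE: Dict[str, Any] = {
    "conformance": "hipaa",
    "redacted_text": "Patient [REDACTED] presented with...",
    "phi": [
        {"value": "John Doe", "type": "NAME", "excerpt": "Patient John Doe presented"},
        {"value": "01/15/1990", "type": "DATE", "excerpt": "DOB: 01/15/1990"},
    ],
}


def leaked_phi(result: Dict[str, Any]) -> List[str]:
    """Return detected PHI values that still occur verbatim in redacted_text."""
    text = result.get("redacted_text", "")
    return [
        item["value"]
        for item in result.get("phi", [])
        if item.get("value") and item["value"] in text
    ]


def phi_counts(result: Dict[str, Any]) -> Counter:
    """Count detected PHI items by their type label."""
    return Counter(item.get("type", "UNKNOWN") for item in result.get("phi", []))


if __name__ == "__main__":
    print("PHI by type:", dict(phi_counts(SAMPLE)))
    leaks = leaked_phi(SAMPLE)
    print("Leaked values:", leaks if leaks else "none")
```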

Privacy

Important

Data stays on your machine. MOSAICX runs against a local LLM server -- no external API calls, no cloud uploads. De-identification follows HIPAA Safe Harbor rules by default.

Configuration

All settings live in a .env file (recommended) or environment variables with the MOSAICX_ prefix:

MOSAICX_LM=openai/google/gemma-4-31B-it
MOSAICX_API_BASE=http://localhost:8000/v1
MOSAICX_API_KEY=not-needed
MOSAICX_OCR_ENGINE=chandra
MOSAICX_CHANDRA_SERVER_URL=http://localhost:8001/v1

View the active configuration:

mosaicx config show

See docs/configuration.md for the full reference.

Documentation

| Guide | Description |
| --- | --- |
| Quickstart | First successful run in ~10 minutes |
| Getting Started | Install, first extraction, basics |
| CLI Reference | Every command, every flag, examples |
| Schemas & Templates | Create and manage extraction templates |
| Configuration | Env vars, backends, OCR, export formats |
| Developer Guide | Custom pipelines, Python SDK, MCP server |

Development

git clone https://github.com/DIGIT-X-Lab/MOSAICX.git
cd MOSAICX
pip install -e ".[dev]"
pytest tests/ -q

Citation

@software{mosaicx2025,
  title   = {MOSAICX: Medical cOmputational Suite for Advanced Intelligent eXtraction},
  author  = {Sundar, Lalith Kumar Shiyam and DIGIT-X Lab},
  year    = {2025},
  url     = {https://github.com/DIGIT-X-Lab/MOSAICX},
  doi     = {10.5281/zenodo.17601890}
}

License

Apache 2.0 -- see LICENSE.

Contact

Research: lalith.shiyam@med.uni-muenchen.de | Commercial: lalith@zenta.solutions | Issues: github.com/DIGIT-X-Lab/MOSAICX/issues
