PechaBridge is a library for Tibetan document understanding with a focus on training OCR and Line Segmentation models for text retrieval in Tibetan script.
The primary entrypoint for end-to-end usage is the OCR Workbench (ui_ocr_workbench.py).
The figure below shows an example line segmentation result for a Tibetan pecha page from the Staatsbibliothek zu Berlin (SBB). Each detected line is passed through the OCR model to extract the Tibetan text.
Sample OCR output for the page shown above:
༄༅། །བྱས་ལ། །སྔགས་དྲུག་ཕྱག་རྒྱ་དྲུག་གིས་སྦྱངས། སྤེལ་རྒྱུན་ཤཚམ།
སྐྱིད་དམིགས་དང་ཉིད་ཨོ་རྒྱན་རིགས་འཛིན་བསམ་དགྱེས་པའི་དབྱངས་ནི་འདི་སྐད་
དོ། །སྟོང་གསུམ་ཀུན་ནི་མཆོག་འབུན་འགྱར་བའི། །སྤོས་དང་
སྨན་སྦུར་ཆུ་གཙང་དག་བཞི་ཁྲུས། །ཡ དི་བཞིན་སྲིན་ཕུད་མཁན་
⚠ The transcript above is current model output (April 2026) and may contain recognition errors. Accuracy improves with more training data.
- End-to-end Tibetan Pecha OCR pipeline: Automatically segments lines on digitised Pecha pages (e.g. from the Staatsbibliothek zu Berlin) using a YOLO-based line segmentation model, then transcribes each line with a fine-tuned Donut VLM. A single `batch-ocr` CLI command covers download → line detection → OCR → transcript export.
- Synthetic multi-class dataset generation: Creates YOLO-ready pages for Tibetan number words, Tibetan text blocks, and Chinese number words.
- Standalone DONUT/TrOCR OCR training: Trains OCR directly on OpenPecha/BDRC line manifests (`train-donut-ocr`) with `none|pb|gray|bdrc|rgb` image preprocessing and CER evaluation.
- Retrieval encoder training + eval: Trains ViT/DINOv2 patch encoders with mp-InfoNCE and exports FAISS-ready embeddings plus cross-page evaluation.
- Dual vision-text encoder (CLIP-style) training: Trains DINOv2 + text encoder (e.g. ByT5) on line image/text manifests (`line_clip`) for text-to-line and line-to-text retrieval.
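
The CER (character error rate) mentioned in the list above is conventionally the Levenshtein edit distance between the predicted and reference text, divided by the reference length. The following is a minimal sketch of that convention, assuming comparison at the Unicode code-point level with no text normalization; `train-donut-ocr` may normalize or count characters differently, and the function names here are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute each cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cer(prediction: str, reference: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        # Assumed convention: an empty reference is only matched by an empty prediction.
        return 0.0 if not prediction else 1.0
    return levenshtein(prediction, reference) / len(reference)
```

Because Tibetan stacks are encoded as several code points, a single misread stack can count as more than one error under this assumption.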
pip install -r requirements.txt
`requirements.txt` is now the unified dependency file for the repository.
Legacy files requirements-ui.txt, requirements-vlm.txt, and requirements-lora.txt remain as compatibility wrappers.
Pretrained PechaBridge models are hosted on HuggingFace:
| Model | HuggingFace Repo | Description |
|---|---|---|
| DONUT OCR | TibetanCodexAITeam/PechaBridgeOCR | VisionEncoderDecoder OCR for Tibetan line images |
| Line Segmentation | TibetanCodexAITeam/PechaBridgeLineSegmentation | YOLO segmentation model for Tibetan text lines |
# Download both models into models/ (auto-detected by UI and CLI):
python cli.py download-models
# Download only the OCR model:
python cli.py download-models --models ocr
# Download only the line segmentation model:
python cli.py download-models --models line
# Force re-download:
python cli.py download-models --force
After download, the directory layout is:
models/
  ocr/
    PechaBridgeOCR/                    ← DONUT checkpoint (auto-detected by ui_ocr_workbench.py)
      config.json
      model.safetensors
      tokenizer_config.json
      preprocessor_config.json
      repro/
        image_preprocess.json
        generate_config.json
  line_segmentation/
    PechaBridgeLineSegmentation.pt     ← YOLO .pt (auto-detected by ui_workbench.py)
Both UI workbenches (ui_ocr_workbench.py, ui_workbench.py) scan these directories on startup and populate their model dropdowns automatically — no manual path configuration needed.
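
To illustrate what such a startup scan could look like, here is a minimal sketch. It assumes an OCR checkpoint is any direct subdirectory of `models/ocr/` that contains a `config.json`, and a line model is any `.pt` file in `models/line_segmentation/`. The function name `scan_models` is hypothetical, and the workbenches' actual detection rules may differ.

```python
from pathlib import Path


def scan_models(models_dir: Path) -> dict:
    """Collect candidate OCR checkpoints and line segmentation weights.

    Assumed rules (not confirmed by the project):
    - OCR checkpoint: a direct child of <models_dir>/ocr with a config.json
    - line model: a *.pt file directly in <models_dir>/line_segmentation
    """
    ocr_root = models_dir / "ocr"
    line_root = models_dir / "line_segmentation"
    ocr = []
    if ocr_root.is_dir():
        ocr = [p for p in sorted(ocr_root.iterdir()) if (p / "config.json").is_file()]
    line = sorted(line_root.glob("*.pt")) if line_root.is_dir() else []
    return {"ocr": ocr, "line": line}
```

Calling `scan_models(Path("models"))` after `download-models` would, under these assumptions, list `PechaBridgeOCR` and `PechaBridgeLineSegmentation.pt` as dropdown candidates.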
python cli.py batch-ocr \
--ocr-model models/ocr/PechaBridgeOCR \
--line-model models/line_segmentation/PechaBridgeLineSegmentation.pt \
--layout-engine yolo_line \
--ocr-engine donut \
--input-dir /path/to/pecha/images
Each image produces a `.txt` transcript and an `*_overlay.jpg` with the detected line boxes drawn on the source image (pass `--no-save-overlay` to skip the overlay images).
PechaBridge also integrates the excellent OCR pipeline from the BDRC Tibetan OCR App by Buddhist Digital Resource Center (BDRC). Many thanks to the BDRC team for their outstanding work on open Tibetan OCR tooling and for making their models freely available! 🙏
The BDRC assets (line/layout ONNX models + OCR model bundle) are downloaded from the BDRC GitHub release — no HuggingFace account needed.
# Download all BDRC assets (line model, layout model, OCR bundle):
python cli.py download-bdrc-models
# Download only the line/layout ONNX models:
python cli.py download-bdrc-models --assets line,layout
# Download only the OCR model bundle:
python cli.py download-bdrc-models --assets ocr
# Force re-download:
python cli.py download-bdrc-models --force
After download, the directory layout is:
models/
  bdrc/
    Lines/
      PhotiLines.onnx     ← BDRC line segmentation ONNX model
      config.json
    Layout/
      photi.onnx          ← BDRC layout ONNX model
      config.json
    OCRModels/
      Woodblock/          ← BDRC OCR model (auto-selected by default)
        ...
# BDRC line segmentation + BDRC OCR:
python cli.py batch-ocr \
--layout-engine bdrc_line \
--bdrc-line-model models/bdrc/Layout \
--ocr-engine bdrc_ocr \
--bdrc-ocr-model models/bdrc/OCRModels/Woodblock \
--input-dir /path/to/pecha/images
# BDRC line segmentation + PechaBridge DONUT OCR:
python cli.py batch-ocr \
--layout-engine bdrc_line \
--bdrc-line-model models/bdrc/Layout \
--ocr-engine donut \
--ocr-model models/ocr/PechaBridgeOCR \
--input-dir /path/to/pecha/images
Tip: `--bdrc-line-model` and `--bdrc-ocr-model` are optional. When omitted, the CLI auto-downloads the models into `models/bdrc/` on first use. The auto-selected model is `models/bdrc/Layout/` (multi-class layout model), which matches the OCR Workbench UI default.
The Staatsbibliothek zu Berlin (SBB) provides free access to digitised Tibetan manuscripts and block prints via their IIIF API. You can download full-resolution page images for any digitised work using its PPN (Pica Production Number).
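
For readers unfamiliar with IIIF, the sketch below shows how page image URLs are typically listed in a IIIF Presentation API 2.x manifest (`sequences` → `canvases` → `images` → `resource`). It assumes the manifest has already been fetched and parsed into a dictionary. Whether the PechaBridge downloader uses this manifest version or a different route is not stated here, so treat the function as an illustration only.

```python
def page_image_urls(manifest: dict) -> list:
    """Return image resource URLs in page order from a IIIF Presentation 2.x manifest.

    Assumption: the manifest follows the 2.x layout
    sequences[*].canvases[*].images[*].resource["@id"].
    """
    urls = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                resource = image.get("resource", {})
                if "@id" in resource:
                    urls.append(resource["@id"])
    return urls
```

The canvas order in the manifest is the page order, which is why keeping the ordered URL list makes it possible to map transcripts back to pages later.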
Pass --ppn directly to batch-ocr to run the full pipeline in a single command.
Images are downloaded automatically before OCR starts:
# Full pipeline with BDRC layout + DONUT OCR (recommended for pecha pages):
python cli.py batch-ocr \
--ppn 337138764X \
--layout-engine bdrc_line \
--ocr-engine donut \
--ocr-model models/ocr/PechaBridgeOCR
# Full pipeline with BDRC layout + BDRC OCR (no --ocr-model needed):
python cli.py batch-ocr \
--ppn 337138764X \
--layout-engine bdrc_line \
--ocr-engine bdrc_ocr
# Limit to the first 10 pages and save images to a custom folder:
python cli.py batch-ocr \
--ppn 337138764X \
--layout-engine bdrc_line \
--ocr-engine bdrc_ocr \
--sbb-max-pages 10 \
--sbb-output-dir sbb_images/PPN337138764X
Downloaded images are saved to `sbb_images/<PPN>/` by default (override with `--sbb-output-dir`).
Use download-sbb-images to fetch images without running OCR.
A metadata.json file is always written alongside the images; it contains the full
document metadata (title, author, date, publisher, language, identifiers, subjects, …)
plus the ordered list of source URLs so transcripts can be matched back to the correct
page later.
# Download all pages of PPN337138764X into sbb_images/337138764X/ (default):
python cli.py download-sbb-images --ppn 337138764X
# Download into a custom output directory:
python cli.py download-sbb-images --ppn 337138764X --output-dir sbb_images/PPN337138764X
# Limit to the first 10 pages:
python cli.py download-sbb-images --ppn 337138764X --max-pages 10
# Print document metadata to the terminal as well:
python cli.py download-sbb-images --ppn 337138764X --show-metadata
# Reduce parallel workers (default: 8) or disable SSL verification:
python cli.py download-sbb-images --ppn 337138764X --workers 4 --no-verify-ssl
The output directory will contain:
- `*.jpg` / `*.png`: full-resolution page images (one file per page)
- `metadata.json`: full document metadata plus per-page `{index, filename, source_url}` entries

Note: The SBB IIIF API is publicly accessible, so no login or API key is required. Replace `337138764X` with any other SBB PPN to download a different work. You can find PPNs in the SBB catalogue or the SBB digital collections. The UI workbench (`ui_workbench.py`) also has a built-in PPN Downloader tab.
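
The per-page entries in `metadata.json` are what allow a transcript to be traced back to its source page. The sketch below pairs each entry with a transcript file. The entry fields `index`, `filename`, and `source_url` come from the description above; the top-level key `pages` and the `<image stem>.txt` transcript naming are assumptions made for illustration, so check them against a real `metadata.json` before relying on this.

```python
import json
from pathlib import Path


def match_transcripts(metadata_path: Path, transcript_dir: Path,
                      pages_key: str = "pages") -> list:
    """Pair each page entry in metadata.json with its transcript text.

    pages_key is a hypothetical name for the list of per-page entries.
    Transcripts are assumed to be named <image stem>.txt in transcript_dir.
    """
    metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
    matched = []
    for page in sorted(metadata.get(pages_key, []), key=lambda p: p["index"]):
        transcript = transcript_dir / (Path(page["filename"]).stem + ".txt")
        matched.append({
            "index": page["index"],
            "source_url": page["source_url"],
            # None marks pages that have no transcript yet.
            "text": transcript.read_text(encoding="utf-8") if transcript.is_file() else None,
        })
    return matched
```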
The OCR Workbench (ui_ocr_workbench.py) is a dedicated Gradio UI for interactive Tibetan OCR on pecha page images.
python ui_ocr_workbench.py
- Download the pretrained models (once):
  python cli.py download-models
- Start the workbench. Models are auto-detected from `models/ocr/` and `models/line_segmentation/`.
- Upload a pecha page image and click Run OCR.
| Mode | Description |
|---|---|
| Fully Automatic OCR | Segments all lines on the page, runs OCR on each, and returns the full transcript. |
| Manual Mode | Click a line on the page or draw a bounding box with two clicks to OCR a single region. |
| Backend | When to use |
|---|---|
| Classical CV | Fast, no GPU needed. Works well on clean woodblock prints. Requires a YOLO layout model (models/layout/). |
| Pretrained YOLO Model | Best accuracy for complex or degraded pages. Uses models/line_segmentation/PechaBridgeLineSegmentation.pt. |
| BDRC Line Model | Alternative ONNX-based segmentation from the BDRC Tibetan OCR app. Auto-downloaded on first use. |
| Engine | Description |
|---|---|
| DONUT | Default. VisionEncoderDecoder model from models/ocr/PechaBridgeOCR/. Preprocessing pipeline is auto-detected from the checkpoint's repro bundle. |
| BDRC OCR | ONNX-based CTC OCR from the BDRC Tibetan OCR app. Auto-downloaded on first use. |
Upload page image
→ Select line segmentation backend (YOLO recommended)
→ Select OCR engine (DONUT)
→ Click "Run OCR"
→ Inspect annotated page + transcript
→ Save results
- The DONUT model and YOLO line segmentation model are loaded once and cached in memory for the session.
- The preprocessing pipeline (`bdrc`, `gray`, `rgb`) is read automatically from `repro/image_preprocess.json` inside the checkpoint, so no manual selection is needed when using downloaded models.
- For remote server usage, use SSH port forwarding and keep `UI_SHARE=false`.
Both ui_ocr_workbench.py and ui_workbench.py accept optional runtime flags via environment variables:
export UI_HOST=127.0.0.1 # use 0.0.0.0 for remote server binding
export UI_PORT=7860
export UI_SHARE=false # set true only if you explicitly want a public Gradio link
python ui_ocr_workbench.py # or ui_workbench.py
For remote server usage, keep UI_SHARE=false and use SSH port forwarding:
ssh -L 7860:127.0.0.1:7860 <user>@<server>
Then open http://127.0.0.1:7860 on your laptop.
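
As a rough illustration of how a Gradio script might consume these variables, the sketch below parses them into keyword arguments matching Gradio's `launch(server_name=..., server_port=..., share=...)`. The helper name `read_ui_settings`, the fallback defaults, and the accepted truthy strings are assumptions; the workbench scripts may parse the variables differently.

```python
import os
from typing import Mapping


def read_ui_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Translate UI_HOST / UI_PORT / UI_SHARE into Gradio launch arguments.

    Defaults mirror the example values shown above (an assumption).
    """
    share = env.get("UI_SHARE", "false").strip().lower() in {"1", "true", "yes"}
    return {
        "server_name": env.get("UI_HOST", "127.0.0.1"),
        "server_port": int(env.get("UI_PORT", "7860")),
        "share": share,
    }
```

A script could then call `demo.launch(**read_ui_settings())` on its Gradio app. Keeping `share` false by default means a public link is only created when explicitly requested.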
The project includes a unified CLI entrypoint:
python cli.py -h
Key commands:
# Texture LoRA dataset prep
python cli.py prepare-texture-lora-dataset --input_dir ./sbb_images --output_dir ./datasets/texture-lora-dataset
# Train texture LoRA (SDXL or SD2.1 via --model_family)
python cli.py train-texture-lora --dataset_dir ./datasets/texture-lora-dataset --output_dir ./models/texture-lora-sdxl
# Texture augmentation inference
python cli.py texture-augment --input_dir ./datasets/tibetan-yolo-ui/train/images --output_dir ./datasets/tibetan-yolo-ui-textured
# Train image encoder (self-supervised)
python cli.py train-image-encoder --input_dir ./sbb_images --output_dir ./models/image-encoder
# Train text encoder (unsupervised, Unicode-normalized)
python cli.py train-text-encoder --input_dir ./data/corpora --output_dir ./models/text-encoder
# Generate patch retrieval dataset (YOLO textbox -> lines -> multi-scale patches)
python cli.py gen-patches \
--model ./models/layoutModels/layout_model.pt \
--input-dir ./sbb_images \
--output-dir ./datasets/text_patches \
--no-samples 20 \
--debug-dump 5
# Generate weak OCR labels for patch crops (optional retrieval weak positives)
python cli.py weak-ocr-label \
--dataset ./datasets/text_patches \
--meta ./datasets/text_patches/meta/patches.parquet \
--out ./datasets/text_patches/meta/weak_ocr.parquet
# Mine cross-page MNN positives for retrieval training
python cli.py mine-mnn-pairs \
--dataset ./datasets/text_patches \
--meta ./datasets/text_patches/meta/patches.parquet \
--out ./datasets/text_patches/meta/mnn_pairs.parquet \
--config ./configs/mnn_mining.yaml \
--num-workers 8
# Train patch retrieval encoder with mp-InfoNCE (MNN/OCR/both)
python cli.py train-text-hierarchy-vit \
--dataset-dir ./datasets/text_patches \
--output-dir ./models/text_hierarchy_vit_mpnce \
--model-name-or-path facebook/dinov2-base \
--train-mode patch_mpnce \
--positive-sources both \
--pairs-parquet ./datasets/text_patches/meta/mnn_pairs.parquet \
--weak-ocr-parquet ./datasets/text_patches/meta/weak_ocr.parquet
# Train line-level dual vision-text encoder (CLIP-style) on OCR manifests
python cli.py train-text-hierarchy-vit \
--dataset-dir ./datasets/openpecha_ocr_lines \
--output-dir ./models/line_clip_openpecha_bdrc_dinov2_byt5 \
--train-mode line_clip \
--train-manifest ./datasets/openpecha_ocr_lines/train/meta/lines.jsonl \
--val-manifest ./datasets/openpecha_ocr_lines/eval/meta/lines.jsonl \
--model-name-or-path facebook/dinov2-base \
--text-encoder-name-or-path google/byt5-small \
--image-preprocess-pipeline bdrc
# Warm line_clip workbench corpus cache (best model auto-selected)
python cli.py warm-line-clip-workbench-cache \
--models-dir ./models \
--dataset-root ./datasets/openpecha_ocr_lines \
--splits eval,test \
--only both \
--device cpu
# Probe best line_clip model on random samples (in-split and/or cross-split)
python cli.py probe-line-clip-workbench-random-samples \
--dataset-root ./datasets/openpecha_ocr_lines \
--cross-split eval:test \
--samples-per-split 200 \
--summary-only
# Train Donut/TrOCR OCR directly on line manifests (with CER on eval split)
python cli.py train-donut-ocr \
--train_manifest ./datasets/openpecha_ocr_lines/train/meta/lines.jsonl \
--val_manifest ./datasets/openpecha_ocr_lines/eval/meta/lines.jsonl \
--output_dir ./models/donut_openpecha_rgb \
--model_name_or_path microsoft/trocr-base-stage1 \
--tokenizer_path openpecha/BoSentencePiece \
--image_preprocess_pipeline rgb
# Cross-page FAISS evaluation from exported embeddings
python cli.py eval-faiss-crosspage \
--embeddings-npy ./models/text_hierarchy_vit_mpnce/faiss_embeddings.npy \
--embeddings-meta ./models/text_hierarchy_vit_mpnce/faiss_embeddings_meta.parquet \
--mnn-pairs ./datasets/text_patches/meta/mnn_pairs.parquet \
--output-dir ./models/text_hierarchy_vit_mpnce/eval_crosspage
# Full label-1 OCR workflow (generate -> prepare -> train)
python cli.py run-donut-ocr-workflow \
--dataset_name tibetan-donut-ocr-label1 \
--dataset_output_dir ./datasets \
--font_path_tibetan "ext/Microsoft Himalaya.ttf" \
--font_path_chinese ext/simkai.ttf \
--model_output_dir ./models/donut-ocr-label1
Typical outputs in --output_dir:
- `checkpoint-*` (step-based HF checkpoints)
- `checkpoint-epoch-<N>-cer-<X>` symlink aliases (if eval happened before save)
- `model/` (final `VisionEncoderDecoderModel`)
- `tokenizer/`
- `image_processor/`
- `train_summary.json`

Each checkpoint also contains a repro/ bundle with:
- `repro/image_preprocess.json`: preprocessing pipeline name (`bdrc`, `gray`, `rgb`)
- `repro/generate_config.json`: generation parameters (`max_length`, `decoder_start_token_id`, …)
- `repro/tokenizer/` + `repro/image_processor/`: self-contained copies for reproducibility
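
The sketch below shows one way inference code could pick up the preprocessing pipeline name from the `repro/` bundle. The JSON key `pipeline` is hypothetical (the actual schema of `image_preprocess.json` is not documented here), as are the function name and the fallback value; the set of known pipeline names is taken from the `none|pb|gray|bdrc|rgb` option list.

```python
import json
from pathlib import Path

# Pipeline names accepted by the image preprocessing option.
KNOWN_PIPELINES = {"none", "pb", "gray", "bdrc", "rgb"}


def read_preprocess_pipeline(checkpoint_dir, key: str = "pipeline",
                             default: str = "none") -> str:
    """Read the pipeline name from <checkpoint>/repro/image_preprocess.json.

    `key` is a hypothetical field name; adjust it to the real file contents.
    Falls back to `default` when the bundle or the field is missing.
    """
    path = Path(checkpoint_dir) / "repro" / "image_preprocess.json"
    if not path.is_file():
        return default
    data = json.loads(path.read_text(encoding="utf-8"))
    name = data.get(key, default)
    if name not in KNOWN_PIPELINES:
        raise ValueError(f"Unknown preprocessing pipeline: {name!r}")
    return name
```

Storing the pipeline name next to the weights avoids a mismatch between training-time and inference-time preprocessing, which would otherwise degrade OCR quality silently.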
Current Workbench support:
- The Workbench supports the DONUT OCR workflow runner (`run-donut-ocr-workflow`) and monitors training logs/output dirs.
- `ui_ocr_workbench.py` auto-scans `models/ocr/` for checkpoints and exposes them in the DONUT dropdown.
- Training and evaluation are fully supported via CLI (`cli.py train-donut-ocr`).
Typical outputs in --output_dir:
- `text_hierarchy_vit_backbone/` (image backbone + image processor)
- `text_hierarchy_projection_head.pt` (image projection head)
- `text_hierarchy_clip_text_encoder/` (HF text encoder + tokenizer)
- `text_hierarchy_clip_text_projection_head.pt` (text projection head)
- `faiss_embeddings.npy`, `faiss_embeddings_meta.parquet`
- `training_config.json`
- optional `checkpoint_step_*` snapshots (image backbone/head checkpoints)

Current Workbench support:
- The Workbench can scan and use the image backbone + image projection head for line/block encoding previews and FAISS-related UI flows.
- The text encoder part of `line_clip` is currently not yet consumed by the Workbench UI (text query encoding remains a CLI / future UI extension topic).
- Training artifacts are therefore usable in the UI for image-side encoding, while full dual-encoder evaluation is primarily CLI-driven.
For local file serving in Label Studio, set:
export LABEL_STUDIO_LOCAL_FILES_SERVING_ENABLED=true
export LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT=/absolute/path/to/your/dataset/root
Then use the Workbench export actions.
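
With local file serving enabled, Label Studio resolves image references of the form `/data/local-files/?d=<path relative to the document root>`. The sketch below builds such a reference and wraps it in a minimal import task. The helper names and the `image` data key are illustrative assumptions (the key must match the variable used in your labeling config), and the Workbench export may produce a different task layout.

```python
from pathlib import PurePosixPath
from urllib.parse import quote


def local_files_url(image_path: str, document_root: str) -> str:
    """Build a Label Studio local-files URL for an image under document_root.

    Raises ValueError if image_path is not inside document_root.
    """
    relative = PurePosixPath(image_path).relative_to(PurePosixPath(document_root))
    return "/data/local-files/?d=" + quote(str(relative))


def make_task(image_path: str, document_root: str, data_key: str = "image") -> dict:
    """Wrap the URL in a minimal task dict; data_key is an assumed config variable."""
    return {"data": {data_key: local_files_url(image_path, document_root)}}
```

Paths outside the document root raise an error here on purpose, since Label Studio would not be able to serve them.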
- NEW: Full DONUT OCR training playbook (Tiny-Pretraining, Anti-Collapse, Full-Run recipes): docs/donut_training_guide.md
- CLI command reference and end-to-end examples: README_CLI.md
- Pseudo-labeling and Label Studio workflow: README_PSEUDO_LABELING_LABEL_STUDIO.md
- Patch dataset generation (YOLO textbox -> lines -> sub-patches): docs/dataset_generation.md
- Robust MNN mining for cross-page positives: docs/mnn_mining.md
- Retrieval training with mp-InfoNCE (MNN/OCR weak positives): docs/retrieval_mpnce_training.md
- DONUT/TrOCR OCR training (OpenPecha/BDRC manifests, CER, checkpoints): README_DONUT_OCR.md
- Line-CLIP dual vision-text encoder training (DINOv2 + text encoder): README_LINE_CLIP_DUAL_ENCODER.md
- line_clip cache warmup + in-split/cross-split probing & evaluation guide: docs/line_clip_dual_encoder_probe_guide.md
- Weak OCR labeling for patch datasets: docs/weak_ocr.md
- Diffusion + LoRA details: docs/texture_augmentation.md
- Retrieval roadmap: docs/tibetan_ngram_retrieval_plan.md
- Chinese number corpus note: data/corpora/Chinese Number Words/README.md
MIT, see LICENSE.

