End-to-end pipeline that extracts Arabic text from scanned PDFs, diacritizes it, translates it, and generates a new translated PDF, fully offline using open-source tools.
Built for translating Arabic Islamic scholarly texts (kitab, hadith, tafsir, fiqh) to English and Indonesian, but supports 200 languages via NLLB-200.
Tarjim automates the process of handling Arabic documents by integrating:
- PDF rendering — PyMuPDF converts PDF pages to high-resolution images
- Arabic OCR — Surya OCR detects and recognizes Arabic text with bounding boxes
- Tashkeel (optional) — CATT restores harakat (short vowel marks) to undiacritized kitab text before translation, improving accuracy
- Offline translation — NLLB-200 1.3B via CTranslate2 INT8 provides direct translation across 200 languages with no pivot needed; Argos Translate available as a lightweight fallback
- Text overlay — Translated (and optionally diacritized) text replaces original Arabic on the PDF with proper font sizing and word-wrapping
- PDF generation — Modified pages are saved as a new translated PDF
Designed for researchers, students, and anyone who needs an offline, private, and flexible document translation pipeline.
- Extract text from scanned or non-searchable Arabic PDFs (OCR via Surya)
- Tashkeel: restore Arabic harakat (حركات) before translation using CATT — 2024 SOTA, trained on 75M words from 97 classical Shamela books
- Translate to 200 languages directly using NLLB-200 (no English pivot needed for ar→id)
- Lightweight fallback: Argos Translate for CPU-only or low-memory setups
- Two overlay modes: replace (overlay on original) or clean (white background)
- Optional bilingual output: show diacritized Arabic + translation stacked in the PDF
- Fully offline — no API keys, no internet required after first-run download
- Modular Python code (OCR / tashkeel / translation / overlay / PDF I/O separated)
- Web UI via FastAPI for browser-based usage
- CLI for batch/scripted processing
graph TD
A[Input PDF - Scanned Arabic] --> B[Convert to Images - PyMuPDF 300 DPI]
B --> C[Arabic OCR - Surya OCR]
C --> D{Tashkeel enabled?}
D -->|Yes| E[Diacritize - CATT EncoderDecoder]
D -->|No| F[Raw OCR text]
E --> G[Translate - NLLB-200 or Argos]
F --> G
G --> H[Text Overlay - PIL with RTL shaping]
H --> I[Save as PDF]
I --> J[Translated PDF Output]
| Component | Library | Notes |
|---|---|---|
| PDF handling | PyMuPDF | Read/write PDF, render pages to images |
| OCR | Surya OCR 0.17.x | Arabic text detection + recognition |
| Tashkeel | CATT (catt-tashkeel) | 2024 SOTA harakat restoration for classical Arabic |
| Translation (primary) | NLLB-200 1.3B via CTranslate2 INT8 | 200 languages, direct pairs, ~2.8 GB VRAM |
| Translation (fallback) | Argos Translate | Lightweight, CPU-friendly, lower quality |
| Arabic text rendering | arabic-reshaper + python-bidi | Correct RTL shaping for PIL overlay |
| Image processing | Pillow, OpenCV | Image manipulation, text overlay |
| Web API | FastAPI + Uvicorn | Browser-based upload/translate UI |
| Progress | tqdm | Progress bars for CLI |
- Python 3.10+
- GPU recommended: RTX 3050 6GB or better (NLLB-200 uses ~2.8 GB VRAM + ~2.5 GB for Surya OCR)
- CPU-only mode works but is significantly slower
- Windows, Linux, or macOS
Note on transformers version: surya-ocr 0.17.x requires transformers<5.0.0. This is pinned in requirements.txt. Do not upgrade transformers to 5.x.
# Clone the repo
git clone https://github.com/scrowten/tarjim.git
cd tarjim
# Create environment
python -m venv venv
source venv/bin/activate # or venv\Scripts\activate (Windows)
# Install dependencies
pip install -r requirements.txt
On first use, the following models are downloaded automatically and cached locally:
| Model | Size | What for |
|---|---|---|
| Surya OCR (text_recognition) | ~300 MB | Arabic text recognition |
| Surya OCR (text_detection) | ~73 MB | Text region detection |
| NLLB-200 CT2-INT8 | ~2.6 GB | Translation (primary) |
| NLLB-200 tokenizer | ~600 MB | Tokenizer for NLLB |
| CATT tashkeel | ~86 MB | Arabic diacritization (if enabled) |
After the first run, everything works fully offline.
Required for --show-tashkeel and --tashkeel-only to render harakat correctly. The Amiri font is recommended (OFL license, designed for classical Arabic):
# Download from https://github.com/aliftype/amiri/releases
# Place at: fonts/amiri-regular.ttf
Without Amiri, the tool falls back to system fonts: Segoe UI / Tahoma (Windows), Noto Naskh Arabic (Linux), Geeza Pro (macOS). Latin fonts (Times New Roman, Arial) have no Arabic glyph coverage and will produce a blank overlay, so always use an Arabic-capable font.
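The fallback order described above can be sketched as a first-match search over candidate font paths. This is a minimal illustration, not Tarjim's actual resolver: `pick_arabic_font` and the `CANDIDATES` list are hypothetical names, and the system font path shown is an assumed example that varies by distribution.

```python
from pathlib import Path
from typing import Iterable, Optional


def pick_arabic_font(candidates: Iterable[str]) -> Optional[Path]:
    """Return the first candidate font path that exists on disk, or None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    return None


# Hypothetical priority list: the bundled Amiri first, then a system font.
# The system path is an illustrative guess; adjust it for your platform.
CANDIDATES = [
    "fonts/amiri-regular.ttf",
    "/usr/share/fonts/truetype/noto/NotoNaskhArabic-Regular.ttf",
]
```

Calling `pick_arabic_font(CANDIDATES)` returns `None` when no candidate exists, which a caller could treat as a reason to stop rather than silently draw with a Latin-only font.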
# Arabic to English (NLLB-200, default)
python -m src.cli -i kitab.pdf -o kitab_en.pdf --lang en
# Arabic to Indonesian — direct, no pivot (NLLB-200)
python -m src.cli -i kitab.pdf -o kitab_id.pdf --lang id
# Enable tashkeel for better translation of undiacritized kitab text
python -m src.cli -i kitab.pdf -o kitab_en.pdf --lang en --tashkeel
# Show diacritized Arabic + translation stacked in output PDF
python -m src.cli -i kitab.pdf -o kitab_en.pdf --lang en --tashkeel --show-tashkeel
# Tashkeel-only overlay — diacritize Arabic and overlay it back, no translation
python -m src.cli -i kitab.pdf -o kitab_tashkeel.pdf --tashkeel-only
# Use Argos Translate as lightweight fallback (no GPU required)
python -m src.cli -i kitab.pdf -o kitab_en.pdf --lang en --translator argos
# Clean mode (white background, translated text only)
python -m src.cli -i kitab.pdf -o kitab_en.pdf --overlay-mode clean
# Verbose logging
python -m src.cli -i kitab.pdf -o kitab_id.pdf --lang id --verbose
| Option | Default | Description |
|---|---|---|
| --input, -i | (required) | Path to input Arabic PDF |
| --output, -o | (required) | Path to save translated PDF |
| --lang, -l | en | Target language code |
| --source-lang | ar | Source language code |
| --translator | nllb | nllb (high quality) or argos (lightweight) |
| --tashkeel | off | Diacritize Arabic before translation (CATT) |
| --show-tashkeel | off | Show diacritized Arabic in output PDF (bilingual) |
| --tashkeel-only | off | Diacritize and overlay Arabic only, no translation |
| --dpi | 300 | Rendering DPI |
| --overlay-mode | replace | replace or clean |
| --font | (auto) | Path to .ttf font file |
| --verbose, -v | off | Enable debug logging |
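For batch or scripted use, the options above can be assembled into one command per PDF. The sketch below only uses flags listed in the table; `build_command`, `translate_folder`, and the `<name>_<lang>.pdf` output naming are hypothetical choices made for illustration.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def build_command(pdf: Path, out_dir: Path, lang: str = "en",
                  tashkeel: bool = False) -> List[str]:
    """Build the argv list for one `python -m src.cli` invocation."""
    output = out_dir / f"{pdf.stem}_{lang}.pdf"  # assumed naming scheme
    cmd = [sys.executable, "-m", "src.cli",
           "-i", str(pdf), "-o", str(output), "--lang", lang]
    if tashkeel:
        cmd.append("--tashkeel")
    return cmd


def translate_folder(in_dir: Path, out_dir: Path, lang: str = "en") -> None:
    """Run the CLI once per PDF in in_dir; stops at the first failure."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(in_dir.glob("*.pdf")):
        subprocess.run(build_command(pdf, out_dir, lang), check=True)
```

Running one process per file reloads the models each time, so this trades speed for simplicity; calling `process_pdf` in a loop from Python would avoid the restart cost.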
# Start the web server
uvicorn src.api:app --host 0.0.0.0 --port 8000
# Open http://localhost:8000 in your browser
The web UI lets you upload a PDF, choose translator (NLLB-200 or Argos), target language, overlay mode, and tashkeel options (including tashkeel-only), then download the translated PDF.
docker build -t tarjim .
docker run -p 8000:8000 tarjim
from src.core.pdf_handler import process_pdf
# Arabic to English (NLLB-200, default)
process_pdf(
input_path="kitab.pdf",
output_path="kitab_en.pdf",
target_lang="en",
overlay_mode="replace",
)
# Arabic to Indonesian (NLLB-200, direct — no English pivot)
process_pdf(
input_path="kitab.pdf",
output_path="kitab_id.pdf",
target_lang="id",
)
# Best quality: tashkeel + NLLB
process_pdf(
input_path="kitab.pdf",
output_path="kitab_en.pdf",
target_lang="en",
tashkeel=True, # diacritize Arabic before translation
show_tashkeel=False, # set True to show harakat in PDF
)
# Lightweight: Argos Translate fallback (CPU, no ~3 GB download)
process_pdf(
input_path="kitab.pdf",
output_path="kitab_en.pdf",
target_lang="en",
translator="argos",
)
Meta's No Language Left Behind model, quantized to INT8 via CTranslate2.
| Property | Value |
|---|---|
| Parameters | 1.3B |
| Download size | ~3.2 GB (one-time) |
| VRAM usage | ~2.8 GB (int8_float16) |
| Languages | 200+ |
| ar→id | Direct (no pivot) |
| Quality vs Argos | ~3–5× better |
| License | CC-BY-NC 4.0 |
All language pairs are direct — NLLB-200 translates Arabic → any target in one pass.
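NLLB-200 identifies languages by FLORES-200 codes rather than two-letter codes, so a direct ar→id call needs a small mapping first. The sketch below follows the usage pattern from the CTranslate2 documentation; it is an assumption about how such a call could look, not Tarjim's `translator_nllb.py`. The names `NLLB_CODES`, `to_nllb_code`, `translate_line`, and `model_dir` are hypothetical, and the tokenizer repository `facebook/nllb-200-1.3B` is assumed.

```python
from typing import Dict

# FLORES-200 codes used by NLLB-200 for the languages this README mentions.
NLLB_CODES: Dict[str, str] = {
    "ar": "arb_Arab",  # Modern Standard Arabic
    "en": "eng_Latn",
    "id": "ind_Latn",
}


def to_nllb_code(lang: str) -> str:
    """Map a short language code to its NLLB code, failing loudly if unknown."""
    try:
        return NLLB_CODES[lang]
    except KeyError:
        raise ValueError(f"No NLLB code mapped for {lang!r}") from None


def translate_line(text: str, model_dir: str, src: str = "ar", tgt: str = "id") -> str:
    """Translate one line in a single pass (no pivot language)."""
    # Imported lazily so the mapping helpers work without the heavy dependencies.
    import ctranslate2
    import transformers

    translator = ctranslate2.Translator(
        model_dir, device="auto", compute_type="int8_float16"
    )
    tokenizer = transformers.AutoTokenizer.from_pretrained(
        "facebook/nllb-200-1.3B", src_lang=to_nllb_code(src)
    )
    source = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
    results = translator.translate_batch(
        [source], target_prefix=[[to_nllb_code(tgt)]]
    )
    target = results[0].hypotheses[0][1:]  # drop the target-language prefix token
    return tokenizer.decode(tokenizer.convert_tokens_to_ids(target))
```

A real wrapper would build the translator and tokenizer once and reuse them across lines; they are created inside the function here only to keep the example short.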
Small OpenNMT-based models (~100M params). Useful for CPU-only environments or when the NLLB download is impractical.
| Property | Value |
|---|---|
| Parameters | ~74–100M |
| Download size | ~100–200 MB per language pair |
| VRAM usage | <0.5 GB |
| ar→id | Via English pivot (ar→en→id) |
| Quality | Lower than NLLB |
| License | MIT |
Most classical Arabic kitab are printed without harakat (short vowel marks). This causes ambiguity that hurts translation quality. Tarjim uses CATT (Contextual Arabic Tashkeel Transformer) to restore harakat.
CATT details:
- 2024 SOTA for Arabic diacritization
- Trained on 75M words from 97 classical books in the Shamela Digital Library (Tashkeela corpus)
- Specifically suited for classical Islamic Arabic (Quran, hadith, fiqh, tafsir)
- ~86 MB ONNX model, runs on CPU
Modes:
| Flag | Behavior |
|---|---|
| (none) | No diacritization. OCR output goes directly to translator. |
| --tashkeel | Diacritize silently before translation. Better translation quality. PDF unchanged. |
| --tashkeel --show-tashkeel | Diacritize + render diacritized Arabic in output PDF above translation (bilingual). |
| --tashkeel-only | No translation. Diacritize Arabic and overlay it back onto the PDF. Produces a fully-vowelized Arabic PDF. |
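The four modes can be read as a rule for what text reaches the translator and what gets drawn. The sketch below expresses that rule for a single OCR line with injected `diacritize` and `translate` callables; `process_line` is a hypothetical helper, and treating `show_tashkeel` without `tashkeel` as "no Arabic drawn" is an assumption, since the table does not cover that combination.

```python
from typing import Callable, Optional, Tuple


def process_line(
    ocr_text: str,
    diacritize: Callable[[str], str],
    translate: Callable[[str], str],
    tashkeel: bool = False,
    show_tashkeel: bool = False,
    tashkeel_only: bool = False,
) -> Tuple[Optional[str], Optional[str]]:
    """Return (arabic_to_draw, translation_to_draw) for one OCR line."""
    if tashkeel_only:
        # No translation: only the vowelized Arabic is overlaid.
        return diacritize(ocr_text), None
    # With tashkeel on, the translator sees diacritized text instead of raw OCR.
    source = diacritize(ocr_text) if tashkeel else ocr_text
    translation = translate(source)
    arabic = source if (tashkeel and show_tashkeel) else None
    return arabic, translation
```

Passing the two stages as callables keeps the rule testable without loading CATT or a translation model.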
Arabic font for tashkeel rendering:
For --show-tashkeel and --tashkeel-only to render harakat correctly, an Arabic-capable font is needed. Priority order:
- fonts/amiri-regular.ttf: recommended (best harakat rendering for classical kitab, OFL license)
- System fonts (Windows: Segoe UI, Tahoma, Calibri, Arabic Typesetting; Linux: Noto Naskh Arabic; macOS: Geeza Pro)
Without an Arabic font, harakat may not display correctly (Latin-only fonts like Times New Roman have no Arabic glyph coverage). Download Amiri Regular:
# Download from https://github.com/aliftype/amiri/releases
# Place at: fonts/amiri-regular.ttf
Step 1 — PDF to Images: Each PDF page is rendered at 300 DPI using PyMuPDF, producing high-resolution PIL Images.
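To make Step 1 concrete: PDF coordinates are in points (72 per inch), so rendering at 300 DPI scales each page by 300/72. The sketch below shows that arithmetic and one way to render pages with PyMuPDF's `get_pixmap(dpi=...)`. The helper names are hypothetical and this is not necessarily how `pdf_handler.py` does it.

```python
from typing import List, Tuple

PDF_POINTS_PER_INCH = 72  # PDF user space is defined at 72 points per inch


def zoom_for_dpi(dpi: int) -> float:
    """Scale factor from PDF points to pixels at the given DPI."""
    return dpi / PDF_POINTS_PER_INCH


def pixel_size(width_pt: float, height_pt: float, dpi: int) -> Tuple[int, int]:
    """Approximate rendered size in pixels for a page measured in points."""
    zoom = zoom_for_dpi(dpi)
    return round(width_pt * zoom), round(height_pt * zoom)


def render_pages(pdf_path: str, dpi: int = 300) -> List["Image.Image"]:
    """Render every page of a PDF to a PIL image."""
    # Imported lazily so the helpers above work without PyMuPDF or Pillow.
    import fitz  # PyMuPDF
    from PIL import Image

    images = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=dpi)  # RGB, no alpha by default
            images.append(
                Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
            )
    return images
```

An A4 page (595 × 842 pt) comes out at roughly 2479 × 3508 pixels at 300 DPI, which is why raising `--dpi` to 400 noticeably increases memory use and OCR time.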
Step 2 — OCR with Surya: Surya OCR detects text regions and recognizes Arabic text, returning text lines with precise bounding boxes.
Step 3 — Tashkeel (optional): CATT EncoderDecoder restores harakat to each detected Arabic text line. The diacritized text feeds into the translator instead of raw OCR output.
Step 4 — Translation: Each text line is translated using the selected backend:
- NLLB-200: Direct translation to any of 200 languages. Uses CTranslate2 INT8 on GPU (auto-detected).
- Argos Translate: Direct ar→en, or auto-pivot ar→en→target for other languages.
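The difference between the two backends comes down to whether a direct model exists for the language pair. The sketch below illustrates pivoting as plain function composition. It is a conceptual model with hypothetical names (`make_translator`, `pairs`), not Argos Translate's internal implementation.

```python
from typing import Callable, Dict, Tuple

Translator = Callable[[str], str]


def make_translator(
    pairs: Dict[Tuple[str, str], Translator],
    src: str,
    tgt: str,
    pivot: str = "en",
) -> Translator:
    """Return a direct translator if one exists, else chain src→pivot→tgt."""
    if (src, tgt) in pairs:
        return pairs[(src, tgt)]
    first = pairs.get((src, pivot))
    second = pairs.get((pivot, tgt))
    if first is None or second is None:
        raise LookupError(f"No route from {src!r} to {tgt!r} via {pivot!r}")
    # Errors in the first hop are carried into the second, which is one
    # reason pivoted output tends to be weaker than a direct pair.
    return lambda text: second(first(text))
```

With only `("ar", "en")` and `("en", "id")` installed, `make_translator(pairs, "ar", "id")` returns the chained function, while `("ar", "en")` is served directly.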
Step 5 — Text Overlay: In replace mode, each original text region is covered with a white rectangle, then the translated text is drawn with dynamically-sized fonts and word-wrapping. If show_tashkeel is enabled, the diacritized Arabic is also rendered above the translation (RTL, Amiri font). In clean mode, a fresh white page is used.
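One common way to get dynamically sized, word-wrapped text into a bounding box is to try font sizes from large to small and keep the first size whose wrapped lines fit. The sketch below shows that approach with an injected width-measuring function. It is an assumed illustration, not the code in `utils.py`, and `wrap`, `fit_text`, and the default size range are hypothetical.

```python
from typing import Callable, List, Tuple

Measure = Callable[[str, int], float]  # (text, font_size) -> width in pixels


def wrap(text: str, max_width: float, size: int, measure: Measure) -> List[str]:
    """Greedy word-wrap: start a new line when the next word would overflow."""
    lines: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if current and measure(candidate, size) > max_width:
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


def fit_text(
    text: str,
    box_w: float,
    box_h: float,
    measure: Measure,
    max_size: int = 48,
    min_size: int = 6,
    line_spacing: float = 1.2,
) -> Tuple[int, List[str]]:
    """Return the largest font size (and its wrapped lines) that fits the box."""
    for size in range(max_size, min_size - 1, -1):
        lines = wrap(text, box_w, size, measure)
        fits_height = len(lines) * size * line_spacing <= box_h
        fits_width = all(measure(line, size) <= box_w for line in lines)
        if fits_height and fits_width:
            return size, lines
    # Nothing fits: fall back to the smallest size and accept overflow.
    return min_size, wrap(text, box_w, min_size, measure)
```

With Pillow, `measure` could be backed by `ImageFont.truetype(path, size).getlength(text)`; caching the font objects per size avoids reloading the file on every call.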
Step 6 — Save PDF: All modified page images are combined into a new multi-page PDF.
tarjim/
├── README.md
├── requirements.txt
├── Dockerfile
├── src/
│   ├── cli.py                  # CLI entry point
│   ├── api.py                  # FastAPI web server
│   └── core/
│       ├── pdf_handler.py      # Pipeline orchestrator + PDF I/O
│       ├── ocr_surya.py        # Surya OCR wrapper (with transformers 4.x compat patch)
│       ├── tashkeel.py         # CATT tashkeel wrapper
│       ├── translator_nllb.py  # NLLB-200 translator (CTranslate2 INT8)
│       ├── translator_argos.py # Argos Translate wrapper (fallback)
│       └── utils.py            # Overlay, font, RTL drawing helpers
├── static/
│   ├── index.html              # Web UI
│   └── styles.css
├── fonts/
│   ├── times.ttf               # Bundled Latin font
│   └── amiri-regular.ttf       # (optional) Arabic font for tashkeel display
├── examples/
│   └── ...                     # Sample Arabic kitab PDFs
├── tests/
│   └── ...
└── docs/
    └── logo.png
| Hardware | Surya OCR | NLLB Translation | Per page |
|---|---|---|---|
| RTX 3050 6GB | ~2.5 GB VRAM | ~2.8 GB VRAM | ~5–15 s |
| CPU only | Slow | ~30–60 s/line | Minutes |
VRAM budget (RTX 3050 6GB):
- Surya OCR: ~2.5 GB
- NLLB-200 CT2-INT8: ~2.8 GB
- CATT tashkeel: ~0.1 GB (CPU, ONNX)
- GPU total: ~5.3 GB (Surya OCR + NLLB-200; CATT runs on CPU), which fits within 6 GB when the models are loaded sequentially
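The budget above is simple addition over the GPU-resident models. As a quick check, the sketch below sums the figures from the list and reports the headroom on a 6 GB card; `COMPONENTS`, `gpu_total`, and `headroom` are hypothetical names, and the numbers are the approximate ones quoted above.

```python
from typing import List, Tuple

# (name, device, approximate memory in GB), figures taken from the list above.
COMPONENTS: List[Tuple[str, str, float]] = [
    ("Surya OCR", "gpu", 2.5),
    ("NLLB-200 CT2-INT8", "gpu", 2.8),
    ("CATT tashkeel", "cpu", 0.1),  # ONNX on CPU, does not use VRAM
]


def gpu_total(components: List[Tuple[str, str, float]]) -> float:
    """Sum memory for the components that live on the GPU."""
    return sum(gb for _, device, gb in components if device == "gpu")


def headroom(components: List[Tuple[str, str, float]], vram_gb: float) -> float:
    """VRAM left over once every GPU component is resident."""
    return vram_gb - gpu_total(components)
```

On a 6 GB card this leaves about 0.7 GB for activations and the display, so large pages or a higher DPI can still push usage over the limit.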
Tips:
- Higher DPI (400) improves OCR accuracy on small print
- --tashkeel adds ~2–5 s per page but meaningfully improves translation of undiacritized text
- First run is slow (model downloads); subsequent runs are fully offline
- Phase 2: I'rab (Arabic morphosyntactic parsing) for language learners
- Phase 3: Desktop GUI (pywebview wrapper)
- Phase 4: Local server + Cloudflare Tunnel for remote access
Risky Agung Dwi Putranto
MIT License — free to use, modify, and share.
- Surya OCR — State-of-the-art document OCR
- CATT Tashkeel — Arabic diacritization (2024 SOTA)
- NLLB-200 — Meta's 200-language translation model
- CTranslate2 — Fast INT8 inference for seq2seq models
- Argos Translate — Offline lightweight translation
- PyMuPDF — PDF rendering and manipulation
- Pillow — Image processing
- arabic-reshaper + python-bidi — Arabic RTL text shaping
