An automated PDF remediation tool from the CUNY AI Lab that converts uploaded PDFs into accessible, PDF/UA-1 compliant documents.
Upload a PDF and the app automatically remediates it through a multi-step pipeline: classification, OCR, structure extraction, semantic analysis, accessible tagging, and validation. Output is gated by veraPDF compliance checks and fidelity analysis to ensure quality. Documents that can't be fully remediated are flagged for manual review.
- Classify — Determine whether the PDF is digital, mixed, or scanned
- OCR — Add searchable text to scanned pages (OCRmyPDF) with automatic language detection
- Structure — Extract document structure via Docling, with LLM-assisted TOC enhancement
- Alt Text — Generate alt text for figures and reclassify misidentified elements using a vision LLM
- Tag — Resolve ambiguous semantics (tables, forms, reading order, grounded text) via LLM, then write PDF/UA structure tags deterministically with pikepdf
- Validate — Check PDF/UA-1 compliance with veraPDF
- Fidelity — Verify output faithfulness (text drift, reading order, table coverage, form labels)
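The pipeline's control flow can be sketched as follows. This is an illustrative sketch only: the stage functions below are simplified stand-ins (the real steps shell out to OCRmyPDF, call Docling, write tags with pikepdf, and run veraPDF), and none of these names are the app's actual API.

```python
# Simplified stand-ins for the real pipeline stages (hypothetical names).

def classify_pdf(path):            # real step inspects per-page text layers
    return "scanned"

def run_ocr(path):                 # real step shells out to OCRmyPDF
    return path

def extract_structure(path):       # real step calls Docling (+ LLM TOC help)
    return {"elements": []}

def write_tags(path, structure):   # real step writes PDF/UA tags with pikepdf
    return path

def passes_gates(src, out):        # real step runs veraPDF + fidelity checks
    return True

def run_pipeline(path):
    """Run the stages in order; flag for manual review when gates fail."""
    kind = classify_pdf(path)
    if kind in ("scanned", "mixed"):
        path = run_ocr(path)       # add a searchable text layer first
    structure = extract_structure(path)
    tagged = write_tags(path, structure)
    return {"output": tagged, "needs_review": not passes_gates(path, tagged)}
```

The key design point mirrored here is that output is gated: a document only counts as remediated when both the compliance check and the fidelity check pass.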
| Layer | Technology |
|---|---|
| Backend | Python 3.12, FastAPI, SQLAlchemy (async SQLite) |
| Frontend | React, TypeScript, Vite, Tailwind CSS 4, TanStack Query |
| PDF Processing | pikepdf, OCRmyPDF, Ghostscript, Poppler, QPDF |
| Structure Extraction | Docling (local or docling-serve) |
| Semantic Analysis | Gemini Developer API (gemini-3-flash-preview) |
| OCR | OCRmyPDF, Tesseract |
| Validation | veraPDF |
- Python 3.12+ and uv
- Bun
- Ghostscript
- OCRmyPDF
- Tesseract (used by OCRmyPDF and for local crop OCR)
- veraPDF (requires Java runtime)
- Poppler (pdftoppm)
On macOS: install via Homebrew. On Ubuntu/Debian: ghostscript, poppler-utils, tesseract-ocr, plus a Java runtime for veraPDF.
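A quick way to confirm the system binaries are on your PATH is a small stdlib check, independent of the app itself (veraPDF additionally needs a Java runtime, which this does not verify):

```python
import shutil

# External executables the pipeline shells out to.
REQUIRED = ["gs", "ocrmypdf", "tesseract", "verapdf", "pdftoppm"]

def missing_binaries(required=REQUIRED):
    """Return the subset of required executables not found on PATH."""
    return [name for name in required if shutil.which(name) is None]

if __name__ == "__main__":
    gaps = missing_binaries()
    print("all prerequisites found" if not gaps else f"missing: {gaps}")
```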
```bash
cp .env.example .env
# Edit .env — at minimum, set GEMINI_API_KEY
```

Key environment variables:
| Variable | Description | Default |
|---|---|---|
| GEMINI_API_KEY | Google Gemini API key for direct PDF understanding and fallback chat-completions calls | — |
| LLM_BASE_URL | Gemini Developer API chat-completions base URL | https://generativelanguage.googleapis.com/v1beta/openai |
| LLM_API_KEY | Optional override for the chat-completions client; falls back to GEMINI_API_KEY when unset | — |
| LLM_MODEL | Model identifier | google/gemini-3-flash-preview |
| GEMINI_MODEL | Direct Gemini model identifier for native PDF lanes | gemini-3-flash-preview |
| GEMINI_DIRECT_THINKING_LEVEL | Default Gemini thinking level for direct PDF semantic lanes | low |
| GEMINI_DIRECT_ALT_TEXT_THINKING_LEVEL | Gemini thinking level override for figure semantics and alt text | medium |
| ALT_TEXT_MAX_CONCURRENCY | Maximum concurrent page-level figure/alt-text LLM requests per PDF | 8 |
| ALT_TEXT_GLOBAL_MAX_CONCURRENCY | Process-wide cap for concurrent figure/alt-text provider work across PDFs | 12 |
| DOCLING_SERVE_URL | Local or remote docling-serve URL for structure extraction | — |
| DOCLING_SERVE_TOKEN | Optional bearer token for a protected docling-serve proxy | — |
| OCR_LANGUAGE | Default Tesseract language code | eng |
| JOB_TTL_HOURS | Hours before jobs expire | 12 |
| VERAPDF_PATH | Path to veraPDF binary | verapdf |
| GHOSTSCRIPT_PATH | Path to Ghostscript binary | gs |
```bash
cd backend && uv sync
cd ../frontend && bun install
```

```bash
# Terminal 1 — backend
cd backend
uv run uvicorn app.main:app --reload --port 8001

# Terminal 2 — frontend
cd frontend
bun dev
```

- Frontend: http://localhost:5173
- Backend API: http://localhost:8001
The frontend proxies /api and /health to the backend via Vite config.
For the main app on this machine, the intended setup is:
- LLM semantics through the Gemini Developer API
- structure extraction through local docling-serve
- Apple GPU acceleration through MPS on the docling-serve process when available
Set DOCLING_SERVE_URL=http://localhost:5001 in .env, and start docling-serve with DOCLING_DEVICE=mps. The structure step will use that server. The later PDF tagging/writing step is still local CPU work.
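The structure step's backend choice amounts to: use docling-serve when a URL is configured, otherwise run Docling locally. A simplified stdlib sketch (the real selection logic lives in the app's structure pipeline and is not shown here):

```python
import os

def structure_backend(env=os.environ):
    """Prefer a configured docling-serve endpoint; otherwise run Docling locally."""
    url = env.get("DOCLING_SERVE_URL")
    if url:
        # The remote server may run with DOCLING_DEVICE=mps for Apple GPUs.
        return ("docling-serve", url)
    return ("local-docling", None)
```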
You can verify the effective runtime with:
```bash
cd backend
PYTHONPATH=. uv run python scripts/runtime_diagnostics.py
```

A single-container deployment bundles all dependencies (Ghostscript, OCRmyPDF, Tesseract, Poppler, QPDF, Java, veraPDF) with the built frontend served by FastAPI.
```bash
cp .env.example .env
# Edit .env with your GEMINI_API_KEY
# Leave LLM_API_KEY empty unless you intentionally want a different
# chat-completions credential than the Gemini Developer API key.
docker compose up -d --build
```

Open http://localhost:8080. Health check at /health.
If port 8080 is in use, set APP_PORT in .env.
You can also run the image directly without Compose:
```bash
docker build -t pdf-accessibility-app .
docker run -d \
  --name pdf-accessibility-app \
  --env-file .env \
  -p 8080:8001 \
  -v pdf_accessibility_data:/app/data \
  -v pdf_accessibility_cache:/home/app/.cache \
  pdf-accessibility-app
```

Notes:
- The image preloads Docling models so there are no first-run downloads.
- The intended Gemini-first deployment shape is: GEMINI_API_KEY=&lt;key&gt;, LLM_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai, LLM_API_KEY left blank, LLM_MODEL=google/gemini-3-flash-preview, GEMINI_MODEL=gemini-3-flash-preview, USE_DIRECT_GEMINI_PDF=true, GEMINI_DIRECT_THINKING_LEVEL=low, GEMINI_DIRECT_ALT_TEXT_THINKING_LEVEL=medium, ALT_TEXT_MAX_CONCURRENCY=8, and ALT_TEXT_GLOBAL_MAX_CONCURRENCY=12.
- For subpath deployments, set VITE_APP_BASE_PATH before building (e.g., /pdf-accessibility/).
- Tesseract language packs included: English, Spanish, French, German, Chinese (Simplified + Traditional), Russian, Arabic, Korean, Bengali, Polish, Hebrew, Yiddish, Haitian Creole, Hindi, Italian, Portuguese, Japanese. Add others by extending the Dockerfile.
```
backend/
  app/
    api/          # FastAPI route handlers
    pipeline/     # classify, ocr, structure, tag, validate, fidelity
    services/     # semantic adjudication, storage, LLM client
    models.py     # SQLAlchemy ORM models
    config.py     # App settings
  tests/          # Backend test suite
frontend/
  src/
    pages/        # Upload, Dashboard, JobDetail, Review
    components/   # UI components
    api/          # TanStack Query hooks
    types/        # Shared TypeScript types
data/             # Runtime storage (git-ignored)
```
```bash
# Backend
cd backend
PYTHONPATH=. uv run pytest tests -q

# Frontend
cd frontend
bun run lint
bun run build
```

The app auto-detects the document language during classification and selects the appropriate Tesseract language pack for OCR. For digital and mixed PDFs, it extracts the existing text and identifies the language with lingua-py. For scanned PDFs, it runs a quick probe OCR on page 1 with all installed language packs, then identifies the language from the result.
Language priority: the auto-detected language first, falling back to the OCR_LANGUAGE default.
For local development, install Tesseract language packs via your package manager. On macOS, brew install tesseract-lang installs all languages. On Debian/Ubuntu, install individual packs (e.g., apt install tesseract-ocr-spa). If a language pack is missing, probe OCR falls back gracefully to the OCR_LANGUAGE default.
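The fallback described above reduces to a small decision: use the detected language when its pack is installed, otherwise the OCR_LANGUAGE default. A hedged sketch (pack discovery is simplified to a set; the app queries Tesseract for its installed languages):

```python
def pick_ocr_language(detected, installed_packs, default="eng"):
    """Prefer the auto-detected language; fall back to the default pack.

    detected: a Tesseract language code like "spa" (or None if detection failed)
    installed_packs: set of language codes Tesseract reports as installed
    """
    if detected and detected in installed_packs:
        return detected
    return default
```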
The app uses anonymous browser sessions — no login required. Each browser gets an HTTP-only session cookie, and all jobs are scoped to that session. Jobs expire after JOB_TTL_HOURS (default: 12 hours).
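The expiry rule can be pictured with stdlib datetimes. This is illustrative only: the app tracks jobs in SQLite and scopes them to the session cookie, neither of which is modeled here.

```python
from datetime import datetime, timedelta, timezone

def is_expired(created_at, ttl_hours=12, now=None):
    """True once a job is older than JOB_TTL_HOURS (default 12)."""
    now = now or datetime.now(timezone.utc)
    return now - created_at > timedelta(hours=ttl_hours)
```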