CUNY-AI-Lab/pdf-accessibility-app

PDF Accessibility App

An automated PDF remediation tool from the CUNY AI Lab that converts uploaded PDFs into accessible, PDF/UA-1 compliant documents.

Overview

Upload a PDF and the app automatically remediates it through a multi-step pipeline: classification, OCR, structure extraction, semantic analysis, accessible tagging, and validation. Output is gated by veraPDF compliance checks and fidelity analysis to ensure quality. Documents that can't be fully remediated are flagged for manual review.

Pipeline

  1. Classify — Determine whether the PDF is digital, mixed, or scanned
  2. OCR — Add searchable text to scanned pages (OCRmyPDF) with automatic language detection
  3. Structure — Extract document structure via Docling, with LLM-assisted TOC enhancement
  4. Alt Text — Generate alt text for figures and reclassify misidentified elements using a vision LLM
  5. Tag — Resolve ambiguous semantics (tables, forms, reading order, grounded text) via LLM, then write PDF/UA structure tags deterministically with pikepdf
  6. Validate — Check PDF/UA-1 compliance with veraPDF
  7. Fidelity — Verify output faithfulness (text drift, reading order, table coverage, form labels)
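Conceptually, the stages run in sequence with compliance gates between them. A minimal sketch of that control flow (the function names and dict-based document shape here are hypothetical, not the app's actual API; the real implementations live in backend/app/pipeline/):

```python
# Hypothetical sketch of the gated pipeline: each stage returns the
# (possibly updated) document plus a pass/fail flag, and any failing
# gate routes the job to manual review instead of producing output.
def run_pipeline(doc, stages):
    """Run stages in order; flag the document when a gate fails."""
    for name, stage in stages:
        doc, ok = stage(doc)
        if not ok:
            return doc, f"manual_review:{name}"
    return doc, "compliant"


# Toy stages standing in for classify/validate:
stages = [
    ("classify", lambda d: (d | {"kind": "scanned"}, True)),
    ("validate", lambda d: (d, d.get("kind") == "scanned")),
]
```

The point is the gating shape: every stage can stop the run, so a non-compliant document is never silently emitted.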

Tech Stack

| Layer | Technology |
| --- | --- |
| Backend | Python 3.12, FastAPI, SQLAlchemy (async SQLite) |
| Frontend | React, TypeScript, Vite, Tailwind CSS 4, TanStack Query |
| PDF Processing | pikepdf, OCRmyPDF, Ghostscript, Poppler, QPDF |
| Structure Extraction | Docling (local or docling-serve) |
| Semantic Analysis | Gemini Developer API (gemini-3-flash-preview) |
| OCR | OCRmyPDF, Tesseract |
| Validation | veraPDF |

Prerequisites

On macOS, install the system dependencies via Homebrew. On Ubuntu/Debian, install ghostscript, poppler-utils, and tesseract-ocr, plus a Java runtime for veraPDF.
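As a sketch, the install might look like this (the qpdf and default-jre package names are assumptions based on the tech stack; veraPDF itself is typically installed from its own distribution):

```shell
# Ubuntu/Debian system dependencies (veraPDF needs a Java runtime;
# install veraPDF separately from verapdf.org).
sudo apt-get update
sudo apt-get install -y ghostscript poppler-utils tesseract-ocr qpdf default-jre

# macOS equivalents via Homebrew:
# brew install ghostscript poppler tesseract qpdf
```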

Getting Started

1. Configure environment

cp .env.example .env
# Edit .env — at minimum, set GEMINI_API_KEY

Key environment variables:

| Variable | Description | Default |
| --- | --- | --- |
| GEMINI_API_KEY | Google Gemini API key for direct PDF understanding and fallback chat-completions calls | |
| LLM_BASE_URL | Gemini Developer API chat-completions base URL | https://generativelanguage.googleapis.com/v1beta/openai |
| LLM_API_KEY | Optional override for the chat-completions client; falls back to GEMINI_API_KEY when unset | |
| LLM_MODEL | Model identifier | google/gemini-3-flash-preview |
| GEMINI_MODEL | Direct Gemini model identifier for native PDF lanes | gemini-3-flash-preview |
| GEMINI_DIRECT_THINKING_LEVEL | Default Gemini thinking level for direct PDF semantic lanes | low |
| GEMINI_DIRECT_ALT_TEXT_THINKING_LEVEL | Gemini thinking level override for figure semantics and alt text | medium |
| ALT_TEXT_MAX_CONCURRENCY | Maximum concurrent page-level figure/alt-text LLM requests per PDF | 8 |
| ALT_TEXT_GLOBAL_MAX_CONCURRENCY | Process-wide cap for concurrent figure/alt-text provider work across PDFs | 12 |
| DOCLING_SERVE_URL | Local or remote docling-serve URL for structure extraction | |
| DOCLING_SERVE_TOKEN | Optional bearer token for a protected docling-serve proxy | |
| OCR_LANGUAGE | Default Tesseract language code | eng |
| JOB_TTL_HOURS | Hours before jobs expire | 12 |
| VERAPDF_PATH | Path to veraPDF binary | verapdf |
| GHOSTSCRIPT_PATH | Path to Ghostscript binary | gs |

2. Install dependencies

cd backend && uv sync
cd ../frontend && bun install

3. Run locally

# Terminal 1 — backend
cd backend
uv run uvicorn app.main:app --reload --port 8001

# Terminal 2 — frontend
cd frontend
bun dev

The frontend proxies /api and /health to the backend via Vite config.

Recommended Mac Runtime

For the main app on a Mac, the intended setup is:

  • LLM semantics through the Gemini Developer API
  • structure extraction through local docling-serve
  • Apple GPU acceleration through MPS on the docling-serve process when available

Set DOCLING_SERVE_URL=http://localhost:5001 in .env, and start docling-serve with DOCLING_DEVICE=mps. The structure step will use that server. The later PDF tagging/writing step is still local CPU work.
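A sketch of that setup, assuming docling-serve is installed and its CLI accepts these flags (check `docling-serve --help` for your version):

```shell
# In the app's .env:
#   DOCLING_SERVE_URL=http://localhost:5001

# Start docling-serve on the Apple GPU via MPS (command and flag names
# may differ across docling-serve versions).
DOCLING_DEVICE=mps docling-serve run --port 5001
```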

You can verify the effective runtime with:

cd backend
PYTHONPATH=. uv run python scripts/runtime_diagnostics.py

Docker

A single-container deployment bundles all dependencies (Ghostscript, OCRmyPDF, Tesseract, Poppler, QPDF, Java, veraPDF) with the built frontend served by FastAPI.

cp .env.example .env
# Edit .env with your GEMINI_API_KEY
# Leave LLM_API_KEY empty unless you intentionally want a different
# chat-completions credential than the Gemini Developer API key.

docker compose up -d --build

Open http://localhost:8080. Health check at /health.

If port 8080 is in use, set APP_PORT in .env.
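For example, to confirm the container is serving:

```shell
# Exits non-zero if the app's health endpoint is not responding.
curl -fsS http://localhost:8080/health
```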

You can also run the image directly without Compose:

docker build -t pdf-accessibility-app .
docker run -d \
  --name pdf-accessibility-app \
  --env-file .env \
  -p 8080:8001 \
  -v pdf_accessibility_data:/app/data \
  -v pdf_accessibility_cache:/home/app/.cache \
  pdf-accessibility-app

Notes:

  • The image preloads Docling models so there are no first-run downloads.
  • The intended Gemini-first deployment shape (leave LLM_API_KEY blank):

    GEMINI_API_KEY=<key>
    LLM_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai
    LLM_API_KEY=
    LLM_MODEL=google/gemini-3-flash-preview
    GEMINI_MODEL=gemini-3-flash-preview
    USE_DIRECT_GEMINI_PDF=true
    GEMINI_DIRECT_THINKING_LEVEL=low
    GEMINI_DIRECT_ALT_TEXT_THINKING_LEVEL=medium
    ALT_TEXT_MAX_CONCURRENCY=8
    ALT_TEXT_GLOBAL_MAX_CONCURRENCY=12
  • For subpath deployments, set VITE_APP_BASE_PATH before building (e.g., /pdf-accessibility/).
  • Tesseract language packs included: English, Spanish, French, German, Chinese (Simplified + Traditional), Russian, Arabic, Korean, Bengali, Polish, Hebrew, Yiddish, Haitian Creole, Hindi, Italian, Portuguese, Japanese. Add others by extending the Dockerfile.

Project Structure

backend/
  app/
    api/              # FastAPI route handlers
    pipeline/         # classify, ocr, structure, tag, validate, fidelity
    services/         # semantic adjudication, storage, LLM client
    models.py         # SQLAlchemy ORM models
    config.py         # App settings
  tests/              # Backend test suite

frontend/
  src/
    pages/            # Upload, Dashboard, JobDetail, Review
    components/       # UI components
    api/              # TanStack Query hooks
    types/            # Shared TypeScript types

data/                 # Runtime storage (git-ignored)

Testing

# Backend
cd backend
PYTHONPATH=. uv run pytest tests -q

# Frontend
cd frontend
bun run lint
bun run build

OCR Language Support

The app auto-detects the document language during classification and selects the appropriate Tesseract language pack for OCR. For digital/mixed PDFs, it extracts existing text and identifies the language with lingua-py. For scanned PDFs, it runs a quick probe OCR on page 1 with all installed language packs, then identifies the language from the result.

Language priority: the auto-detected language takes precedence over the OCR_LANGUAGE default.

For local development, install Tesseract language packs via your package manager. On macOS, brew install tesseract-lang installs all languages. On Debian/Ubuntu, install individual packs (e.g., apt install tesseract-ocr-spa). If a language pack is missing, probe OCR falls back gracefully to the OCR_LANGUAGE default.

Session Model

The app uses anonymous browser sessions — no login required. Each browser gets an HTTP-only session cookie, and all jobs are scoped to that session. Jobs expire after JOB_TTL_HOURS (default: 12 hours).
