Add Mistral OCR for PDF parsing with LaTeX math preservation#33

Open
r-uben wants to merge 1 commit into ChicagoHAI:main from r-uben:feat/mistral-ocr

Conversation

r-uben (Contributor) commented Mar 10, 2026

Summary

The current PDF parser (pymupdf) loses all mathematical notation — equations become garbled Unicode, subscripts disappear, and LaTeX structure is destroyed. This makes the reviewer unable to check items 1-4 in the review criteria (math errors, notation inconsistencies, text-vs-equation mismatches, parameter inconsistencies) on PDF inputs.

This PR integrates Mistral OCR as the primary PDF extraction engine. It converts PDFs to structured markdown that preserves LaTeX equations, tables, headers, and document structure at ~$0.001/page. The implementation follows the approach in r-uben/mistral-ocr-cli.

What's included

  • Mistral OCR parser (_parse_pdf_mistral): sends the PDF as base64, returns concatenated markdown with LaTeX math, tables, and extracted figures
  • Fallback chain: Mistral OCR → Marker → PyMuPDF (graceful degradation, no breaking change)
  • extract subcommand: two-stage workflow — run OCR once (openaireview extract paper.pdf), then review the markdown (openaireview review paper.md). This lets users inspect/fix OCR output before spending on review API calls
  • OCR post-processor: auto-corrects visually confusable symbols (e.g., rewriting \hat{t} to \hat{i} when t appears only once but i appears many times under the same accent command)
  • OCR caveat prompt: when reviewing OCR'd content, a prompt caveat tells the model to distinguish OCR artifacts from author errors
  • YAML frontmatter: extracted files include metadata (ocr_engine, title, source, extract_date) so downstream review detects OCR provenance automatically
  • Figure extraction: saves embedded images to figures/ and rewrites markdown references
  • --ocr flag on both review and extract subcommands to force a specific engine
  • 8 new tests for the OCR postprocessor and frontmatter detection
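A minimal sketch of how the fallback chain might fit together (helper names, signatures, and the stub engine bodies are assumptions inferred from the list above, not the merged implementation):

```python
import os

# Stub engines standing in for the real parsers; each returns (title, text)
# or raises when its backend is unavailable. All names here are assumptions.
def _parse_pdf_mistral(path):
    raise RuntimeError("no Mistral backend in this sketch")

def _parse_pdf_marker(path):
    raise RuntimeError("marker not installed in this sketch")

def _parse_pdf_pymupdf(path):
    return "Untitled", f"plain text of {path}"

def parse_document(path):
    """Return (title, text, was_ocr), trying engines best-first."""
    if os.environ.get("MISTRAL_API_KEY"):
        try:
            title, text = _parse_pdf_mistral(path)  # base64 upload -> markdown
            return title, text, True
        except Exception:
            pass  # graceful degradation: fall through to the next engine
    try:
        title, text = _parse_pdf_marker(path)  # treated as OCR-style output here
        return title, text, True
    except Exception:
        pass
    title, text = _parse_pdf_pymupdf(path)  # last resort: lossy extraction
    return title, text, False
```

Each engine failure is swallowed so a missing API key or uninstalled dependency never breaks extraction, matching the "no breaking change" goal above.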

Concrete improvement

On a 48-page economics paper (Nakamura & Steinsson 2018, QJE):

|                  | pymupdf                | Mistral OCR                 |
|------------------|------------------------|-----------------------------|
| Equation quality | Garbled Unicode        | Clean LaTeX (`$\hat{i}_t$`) |
| Title extraction | Wrong (grabbed header) | Correct                     |
| Tables           | Raw text dump          | Markdown tables             |
| Figures          | Lost                   | 9 images extracted          |
| Cost             | Free                   | ~$0.05                      |
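The confusable-symbol correction (e.g., recovering `$\hat{i}_t$` when OCR misreads the accented letter) can be sketched as a frequency heuristic; the function name, lookalike map, and thresholds below are all assumptions, not the merged code:

```python
import re
from collections import Counter

def fix_confusable_accents(md, accents=("hat", "bar", "tilde")):
    """If an accented letter (e.g. \\hat{t}) occurs exactly once while a
    visually similar letter (\\hat{i}) dominates under the same accent
    command, rewrite the rare form to the common one. Heuristic sketch."""
    confusable = {"t": "i", "l": "i", "1": "i"}  # assumed lookalike map
    for accent in accents:
        counts = Counter(re.findall(r"\\%s\{(\w)\}" % accent, md))
        for rare, common in confusable.items():
            if counts.get(rare, 0) == 1 and counts.get(common, 0) >= 5:
                md = md.replace(r"\%s{%s}" % (accent, rare),
                                r"\%s{%s}" % (accent, common))
    return md
```

The single-occurrence condition keeps the rewrite conservative: a symbol the author uses repeatedly is never touched, only a one-off outlier among many lookalikes.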

Breaking change

parse_document() now returns (title, text, was_ocr) instead of (title, text). All internal callers are updated. External callers (if any) need to unpack the third value.
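For external callers the migration is a one-line unpacking change; a minimal sketch (the stub body is illustrative, not the real parser):

```python
def parse_document(path):
    # Stand-in with the new three-tuple signature; the real function
    # runs the OCR fallback chain described in this PR.
    return "Example Title", "body text", True

# Old call sites did: title, text = parse_document(path)
# That now raises "too many values to unpack"; add the third value:
title, text, was_ocr = parse_document("paper.pdf")
caveat_needed = was_ocr  # e.g. enable the OCR caveat prompt downstream
```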

Test plan

  • All 30 existing tests pass (including 2 integration tests)
  • 8 new OCR-specific tests pass
  • Tested end-to-end on 48-page PDF with figure extraction
  • Verify fallback works without MISTRAL_API_KEY set (should fall through to Marker/pymupdf)

Add Mistral OCR for PDF parsing with LaTeX math preservation

Integrates Mistral OCR as the primary PDF extraction engine, replacing
pymupdf's lossy text extraction with structured markdown that preserves
LaTeX equations, tables, and document structure (~$0.001/page).

Changes:
- Add _parse_pdf_mistral() with fallback chain: Mistral -> Marker -> PyMuPDF
- Add `extract` subcommand for two-stage workflow (OCR once, review many)
- Add OCR post-processor that auto-corrects visually confusable symbols
- Add OCR caveat prompt so the reviewer distinguishes OCR artifacts from errors
- parse_document() now returns (title, text, was_ocr) tuple
- Extract command saves figures to disk and rewrites markdown image refs
- YAML frontmatter on extracted files flags OCR source for downstream use
- Add --ocr flag to both review and extract subcommands
- Add 8 tests for OCR postprocessing and frontmatter detection
chenhaot (Contributor) commented:
Thanks for the PR! We also support Marker (https://github.com/datalab-to/marker); is Mistral OCR better?

r-uben (Contributor, Author) commented Mar 11, 2026

Thanks! Honestly, Mistral and Marker are within about 1.5 points of each other on OmniDocBench formula accuracy. Mistral is faster and extracts figures; Marker is free and runs locally. PaddleOCR-VL outperforms both by a wide margin.

The idea here is just to start offering options and let users pick what works for their papers. Future PRs should add more providers (PaddleOCR-VL, Gemini, DeepSeek). Longer term I'm working on a robust multi-engine OCR pipeline at r-uben/smart-ocr with per-page quality checks and automatic fallback.
