Add Mistral OCR for PDF parsing with LaTeX math preservation#33

Open
r-uben wants to merge 1 commit into ChicagoHAI:main from r-uben:feat/mistral-ocr

Conversation

r-uben (Contributor) commented Mar 10, 2026

Summary

The current PDF parser (pymupdf) loses all mathematical notation — equations become garbled Unicode, subscripts disappear, and LaTeX structure is destroyed. This makes the reviewer unable to check items 1-4 in the review criteria (math errors, notation inconsistencies, text-vs-equation mismatches, parameter inconsistencies) on PDF inputs.

This PR integrates Mistral OCR as the primary PDF extraction engine. It converts PDFs to structured markdown that preserves LaTeX equations, tables, headers, and document structure at ~$0.001/page. The implementation follows the approach in r-uben/mistral-ocr-cli.

What's included

  • Mistral OCR parser (_parse_pdf_mistral): sends the PDF as base64, returns concatenated markdown with LaTeX math, tables, and extracted figures
  • Fallback chain: Mistral OCR → Marker → PyMuPDF (graceful degradation, no breaking change)
  • extract subcommand: two-stage workflow — run OCR once (openaireview extract paper.pdf), then review the markdown (openaireview review paper.md). This lets users inspect/fix OCR output before spending on review API calls
  • OCR post-processor: auto-corrects visually confusable symbols (e.g., rewriting \hat{t} to \hat{i} when t appears only once but i appears many times under the same accent command)
  • OCR caveat prompt: when reviewing OCR'd content, a prompt caveat tells the model to distinguish OCR artifacts from author errors
  • YAML frontmatter: extracted files include metadata (ocr_engine, title, source, extract_date) so downstream review detects OCR provenance automatically
  • Figure extraction: saves embedded images to figures/ and rewrites markdown references
  • --ocr flag on both review and extract subcommands to force a specific engine
  • 8 new tests for the OCR postprocessor and frontmatter detection
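A minimal sketch of how the fallback chain might fit together (helper names, signatures, and the stub engine bodies are assumptions inferred from the list above, not the merged implementation):

```python
import os

# Stub engines standing in for the real parsers; each returns (title, text)
# or raises when its backend is unavailable. All names here are assumptions.
def _parse_pdf_mistral(path):
    raise RuntimeError("no Mistral backend in this sketch")

def _parse_pdf_marker(path):
    raise RuntimeError("marker not installed in this sketch")

def _parse_pdf_pymupdf(path):
    return "Untitled", f"plain text of {path}"

def parse_document(path):
    """Return (title, text, was_ocr), trying engines best-first."""
    if os.environ.get("MISTRAL_API_KEY"):
        try:
            title, text = _parse_pdf_mistral(path)  # base64 upload -> markdown
            return title, text, True
        except Exception:
            pass  # graceful degradation: fall through to the next engine
    try:
        title, text = _parse_pdf_marker(path)  # treated as OCR-style output here
        return title, text, True
    except Exception:
        pass
    title, text = _parse_pdf_pymupdf(path)  # last resort: lossy extraction
    return title, text, False
```

Each engine failure is swallowed so a missing API key or uninstalled dependency never breaks extraction, matching the "no breaking change" goal above.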

Concrete improvement

On a 48-page economics paper (Nakamura & Steinsson 2018, QJE):

|                  | pymupdf                | Mistral OCR                 |
|------------------|------------------------|-----------------------------|
| Equation quality | Garbled Unicode        | Clean LaTeX (`$\hat{i}_t$`) |
| Title extraction | Wrong (grabbed header) | Correct                     |
| Tables           | Raw text dump          | Markdown tables             |
| Figures          | Lost                   | 9 images extracted          |
| Cost             | Free                   | ~$0.05                      |
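The confusable-symbol correction (e.g., recovering `$\hat{i}_t$` when OCR misreads the accented letter) can be sketched as a frequency heuristic; the function name, lookalike map, and thresholds below are all assumptions, not the merged code:

```python
import re
from collections import Counter

def fix_confusable_accents(md, accents=("hat", "bar", "tilde")):
    """If an accented letter (e.g. \\hat{t}) occurs exactly once while a
    visually similar letter (\\hat{i}) dominates under the same accent
    command, rewrite the rare form to the common one. Heuristic sketch."""
    confusable = {"t": "i", "l": "i", "1": "i"}  # assumed lookalike map
    for accent in accents:
        counts = Counter(re.findall(r"\\%s\{(\w)\}" % accent, md))
        for rare, common in confusable.items():
            if counts.get(rare, 0) == 1 and counts.get(common, 0) >= 5:
                md = md.replace(r"\%s{%s}" % (accent, rare),
                                r"\%s{%s}" % (accent, common))
    return md
```

The single-occurrence condition keeps the rewrite conservative: a symbol the author uses repeatedly is never touched, only a one-off outlier among many lookalikes.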

Breaking change

parse_document() now returns (title, text, was_ocr) instead of (title, text). All internal callers are updated. External callers (if any) need to unpack the third value.
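For external callers the migration is a one-line unpacking change; a minimal sketch (the stub body is illustrative, not the real parser):

```python
def parse_document(path):
    # Stand-in with the new three-tuple signature; the real function
    # runs the OCR fallback chain described in this PR.
    return "Example Title", "body text", True

# Old call sites did: title, text = parse_document(path)
# That now raises "too many values to unpack"; add the third value:
title, text, was_ocr = parse_document("paper.pdf")
caveat_needed = was_ocr  # e.g. enable the OCR caveat prompt downstream
```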

Test plan

  • All 30 existing tests pass (including 2 integration tests)
  • 8 new OCR-specific tests pass
  • Tested end-to-end on 48-page PDF with figure extraction
  • Verify fallback works without MISTRAL_API_KEY set (should fall through to Marker/pymupdf)

Add Mistral OCR for PDF parsing with LaTeX math preservation

Integrates Mistral OCR as the primary PDF extraction engine, replacing
pymupdf's lossy text extraction with structured markdown that preserves
LaTeX equations, tables, and document structure (~$0.001/page).

Changes:
- Add _parse_pdf_mistral() with fallback chain: Mistral -> Marker -> PyMuPDF
- Add `extract` subcommand for two-stage workflow (OCR once, review many)
- Add OCR post-processor that auto-corrects visually confusable symbols
- Add OCR caveat prompt so the reviewer distinguishes OCR artifacts from errors
- parse_document() now returns (title, text, was_ocr) tuple
- Extract command saves figures to disk and rewrites markdown image refs
- YAML frontmatter on extracted files flags OCR source for downstream use
- Add --ocr flag to both review and extract subcommands
- Add 8 tests for OCR postprocessing and frontmatter detection
chenhaot (Contributor) commented:
Thanks for the PR! We also support Marker (https://github.com/datalab-to/marker); is Mistral OCR better?

r-uben (Contributor, Author) commented Mar 11, 2026

Thanks! Honestly, Mistral and Marker are within about 1.5 points of each other on OmniDocBench formula accuracy. Mistral is faster and extracts figures; Marker is free and runs locally. PaddleOCR-VL outperforms both by a wide margin.

The idea here is just to start offering options and let users pick what works for their papers. Future PRs should add more providers (PaddleOCR-VL, Gemini, DeepSeek). Longer term I'm working on a robust multi-engine OCR pipeline at r-uben/smart-ocr with per-page quality checks and automatic fallback.
