Add Mistral OCR for PDF parsing with LaTeX math preservation #33
Open
r-uben wants to merge 1 commit into ChicagoHAI:main from …
…tion
Integrates Mistral OCR as the primary PDF extraction engine, replacing pymupdf's lossy text extraction with structured markdown that preserves LaTeX equations, tables, and document structure (~$0.001/page).
Changes:
- Add _parse_pdf_mistral() with fallback chain: Mistral -> Marker -> PyMuPDF
- Add `extract` subcommand for two-stage workflow (OCR once, review many)
- Add OCR post-processor that auto-corrects visually confusable symbols
- Add OCR caveat prompt so the reviewer distinguishes OCR artifacts from errors
- parse_document() now returns (title, text, was_ocr) tuple
- Extract command saves figures to disk and rewrites markdown image refs
- YAML frontmatter on extracted files flags OCR source for downstream use
- Add --ocr flag to both review and extract subcommands
- Add 8 tests for OCR postprocessing and frontmatter detection
Contributor
Thanks for the PR! We also support Marker (https://github.com/datalab-to/marker). Is Mistral OCR better?
Contributor
Author
Thanks! Honestly, Mistral and Marker are pretty close on formula accuracy (within ~1.5 points of each other on OmniDocBench). Mistral is faster and extracts figures; Marker is free and local. PaddleOCR-VL beats both by a lot. The idea here is just to start offering options and let users pick what works for their papers. Future PRs should add more providers (PaddleOCR-VL, Gemini, DeepSeek). Longer term I'm working on a robust multi-engine OCR pipeline at …
Summary
The current PDF parser (pymupdf) loses all mathematical notation: equations become garbled Unicode, subscripts disappear, and LaTeX structure is destroyed. As a result, the reviewer cannot check items 1-4 in the review criteria (math errors, notation inconsistencies, text-vs-equation mismatches, parameter inconsistencies) on PDF inputs.
This PR integrates Mistral OCR as the primary PDF extraction engine. It converts PDFs to structured markdown that preserves LaTeX equations, tables, headers, and document structure at ~$0.001/page. The implementation follows the approach in r-uben/mistral-ocr-cli.
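Because Mistral is only the primary engine, with Marker and PyMuPDF behind it in the order given in the commit message, the fall-through behaviour can be sketched roughly as below. This is an illustration and not the PR's code: `parse_pdf_with_fallback`, the `Parser` alias, and the `forced` parameter are hypothetical names, and whether forcing an engine via `--ocr` disables fallbacks in the real implementation is an assumption.

```python
from typing import Callable, Optional

# Hypothetical parser signature: takes a PDF path, returns extracted text/markdown.
Parser = Callable[[str], str]


def parse_pdf_with_fallback(
    path: str,
    engines: list[tuple[str, Parser]],
    forced: Optional[str] = None,
) -> tuple[str, str]:
    """Try each engine in order and return (engine_name, text) from the first success.

    If `forced` is given (assumed to mirror an --ocr style flag), only that
    engine is tried and no fallback happens.
    """
    if forced is not None:
        engines = [(name, fn) for name, fn in engines if name == forced]
        if not engines:
            raise ValueError(f"unknown engine: {forced}")

    errors = []
    for name, parser in engines:
        try:
            return name, parser(path)
        except Exception as exc:  # e.g. missing API key, missing package, network error
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all PDF engines failed: " + "; ".join(errors))
```

Passing the engines as an ordered list keeps the priority explicit and makes it easy to add further providers later without touching the loop.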
What's included
- Mistral OCR parser (`_parse_pdf_mistral`): sends the PDF as base64, returns concatenated markdown with LaTeX math, tables, and extracted figures
- `extract` subcommand: a two-stage workflow. Run OCR once (`openaireview extract paper.pdf`), then review the markdown (`openaireview review paper.md`). This lets users inspect and fix OCR output before spending on review API calls
- OCR post-processor that auto-corrects visually confusable symbols (e.g. `\hat{t}` → `\hat{i}` when `t` appears once but `i` appears many times with the same accent command)
- YAML frontmatter on extracted files (`ocr_engine`, `title`, `source`, `extract_date`) so downstream review detects OCR provenance automatically
- Extract command saves figures to `figures/` and rewrites markdown references
- `--ocr` flag on both `review` and `extract` subcommands to force a specific engine
Concrete improvement
On a 48-page economics paper (Nakamura & Steinsson 2018, QJE):
… $\hat{i}_t$)
Breaking change
`parse_document()` now returns `(title, text, was_ocr)` instead of `(title, text)`. All internal callers are updated. External callers (if any) need to unpack the third value.
Test plan
- Without `MISTRAL_API_KEY` set (should fall through to Marker/pymupdf)
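The symbol-correction rule from the post-processor (a lone `\hat{t}` rewritten to `\hat{i}` when `\hat{i}` is common) is easy to exercise in isolation. The sketch below is one possible reading of that rule, not the PR's implementation: the helper name `fix_confusable_accents`, the `CONFUSABLE` table, the set of accent commands, and the `min_count` threshold are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed map of OCR look-alikes: suspicious symbol -> likely intended symbol.
CONFUSABLE = {"t": "i", "l": "i"}

# Matches single-letter accented symbols such as \hat{i}, \tilde{t}, \bar{l}.
ACCENT_RE = re.compile(r"\\(hat|tilde|bar)\{([A-Za-z])\}")


def fix_confusable_accents(text: str, min_count: int = 5) -> str:
    """Rewrite a symbol that appears exactly once under an accent command when
    its look-alike appears at least `min_count` times under the same command."""
    # findall returns (accent, symbol) tuples, so the Counter is keyed by both.
    counts = Counter(ACCENT_RE.findall(text))

    def replace(match: re.Match) -> str:
        accent, symbol = match.group(1), match.group(2)
        target = CONFUSABLE.get(symbol)
        if (
            target is not None
            and counts[(accent, symbol)] == 1
            and counts[(accent, target)] >= min_count
        ):
            return f"\\{accent}{{{target}}}"
        return match.group(0)  # leave everything else untouched

    return ACCENT_RE.sub(replace, text)
```

Keying the counts on the accent command as well as the symbol means a single `\tilde{t}` is left alone unless `\tilde{i}` is itself frequent, which keeps the correction conservative.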