Yueru He, Xueqing Peng*, Yupeng Cao, Yan Wang, Lingfei Qian, Haohang Li, Yi Han, Shuyao Wang, Ruoyu Xiang, Fan Zhang, Zhuohan Xie, Mingquan Lin, Prayag Tiwari, Jimin Huang, Guojun Xiong, Sophia Ananiadou
*Corresponding author, xueqing.peng2024@gmail.com
📖 Paper • 🤗 Dataset • 🌏 WebPage
Recent progress in multimodal large language models (MLLMs) has substantially improved document understanding, yet strong OCR performance on surface metrics does not necessarily imply faithful preservation of decision-critical evidence. This limitation is especially consequential in financial documents, where small visual errors, such as a missing negative marker, a shifted decimal point, an incorrect unit scale, or a misaligned reporting date, can induce materially different interpretations.
To study this gap, we introduce FinCriticalED (Financial Critical Error Detection), a fact-centric visual benchmark for evaluating OCR and vision-language systems through the lens of evidence fidelity in high-stakes document understanding. FinCriticalED contains 859 real-world financial document pages paired with ground-truth HTML, with 9,481 expert-annotated facts spanning five financially critical field types: Numbers, Monetary Units, Temporal Data, Reporting Entities, and Financial Concepts.
We further develop an evaluation suite, including critical-field-aware metrics and a context-aware protocol, to assess whether model outputs preserve financially critical facts beyond lexical similarity. We benchmark 13 OCR pipelines, OCR-native models, open-source VLMs, and proprietary MLLMs on FinCriticalED. Results show that conventional OCR metrics can substantially overestimate factual reliability, and that OCR-specialized systems may outperform much larger general-purpose MLLMs in preserving critical financial evidence under complex layouts. FinCriticalED provides a rigorous benchmark for trustworthy financial OCR and a broader testbed for high-stakes multimodal document understanding.
Model performance on the FinCriticalED benchmark:
| Model | Size | R1 | RL | E↓ | Rank | N-FFA | T-FFA | M-FFA | R-FFA | FC-FFA | FFA | Rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **OCR Pipelines** | | | | | | | | | | | | |
| MinerU2.5 | 1.2B | - | - | - | - | - | - | - | - | - | - | - |
| PP-OCRv5 | 0.07B | 97.54 | 96.55 | 3.10 | - | - | - | - | - | - | - | - |
| **Specialized OCR VLMs** | | | | | | | | | | | | |
| DeepSeek-OCR | 6B | - | - | - | - | - | - | - | - | - | - | - |
| DeepSeek-OCR-2 | 3B | - | - | - | - | - | - | - | - | - | - | - |
| GLM-OCR | 0.9B | - | - | - | - | - | - | - | - | - | - | - |
| **Open-source MLLMs** | | | | | | | | | | | | |
| Gemma-3n-E4B-it | 4B | 83.49 | 79.59 | 23.82 | - | - | - | - | - | - | - | - |
| Qwen3-VL-8B-Instruct | 8B | - | - | - | - | - | - | - | - | - | - | - |
| Llama-4-Maverick | 17B | 98.00 | 97.62 | 3.70 | - | - | - | - | - | - | - | - |
| Qwen3.5-397B-A17B | 397B | - | - | - | - | - | - | - | - | - | - | - |
| **Proprietary MLLMs** | | | | | | | | | | | | |
| GPT-4o | - | - | - | - | - | - | - | - | - | - | - | - |
| GPT-5 | - | - | - | - | - | - | - | - | - | - | - | - |
| Claude-Sonnet-4.6 | - | 98.84 | 98.73 | 1.69 | - | - | - | - | - | - | - | - |
| Gemini-2.5-Pro | - | - | - | - | - | - | - | - | - | - | - | - |
R1 = ROUGE-1, RL = ROUGE-L, E↓ = Edit Distance (lower is better); these and the first Rank column are general, surface-level metrics. FFA = Fact-level Financial Accuracy; the N-, T-, M-, R-, and FC- prefixes denote the Numbers, Temporal Data, Monetary Units, Reporting Entities, and Financial Concepts field types. "-" = results pending.
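The gap between surface metrics and fact-level accuracy is easy to reproduce: a prediction that drops a single negative marker still scores near-perfectly on token overlap and character similarity. A minimal illustration, using `difflib` from the standard library as a stand-in for the benchmark's actual scorers:

```python
from difflib import SequenceMatcher

gt   = "Net income (loss) for 2023 was (1,234) million USD"
pred = "Net income (loss) for 2023 was 1,234 million USD"  # negative marker dropped

# Token-level ROUGE-1 recall: share of ground-truth tokens recovered.
gt_toks, pred_toks = gt.split(), pred.split()
overlap = sum(min(gt_toks.count(t), pred_toks.count(t)) for t in set(gt_toks))
rouge1_recall = overlap / len(gt_toks)

# Character-level similarity, a proxy for (inverse) edit distance.
char_sim = SequenceMatcher(None, gt, pred).ratio()

print(f"ROUGE-1 recall {rouge1_recall:.2f}, char similarity {char_sim:.2f}")
# Both scores stay high, yet the sign of the reported figure is flipped.
```

A fact-level check on the Numbers field would score this prediction as wrong, which is exactly the distinction the FFA columns capture.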
1. Before running models, configure the `MODELS` list in `main.py` by uncommenting the model(s) you want to run.
2. Set the API keys for any cloud models you enabled:

```
OPENAI_API_KEY=...
ANTHROPIC_API_KEY=...
GOOGLE_API_KEY=...
TOGETHER_API_KEY=...
ZAI_API_KEY=...
```

3. Run the benchmark:

```shell
cd model_eval
python main.py
```
Results are saved to `results/{model_tag}_zero-shot/pred_{i}.txt`. Already-completed samples are skipped on re-runs.
To limit the number of samples for a quick test, set `max_samples` in `main()`:

```python
max_samples = 10
```
Supported models:

- Cloud VLMs (API key required, no local setup):
| Model key in MODELS | Provider | API Key |
|---|---|---|
| gpt-4o | OpenAI | OPENAI_API_KEY |
| gpt-5 | OpenAI | OPENAI_API_KEY |
| claude-sonnet-4-6 | Anthropic | ANTHROPIC_API_KEY |
| gemini-2.5-pro | Google | GOOGLE_API_KEY |
| meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 | Together AI | TOGETHER_API_KEY |
| Qwen/Qwen2.5-VL-72B-Instruct | Together AI | TOGETHER_API_KEY |
| glm-ocr | ZAI | ZAI_API_KEY |
- Local OCR Pipelines (no API key, local setup required):
| Model key in MODELS | Description |
|---|---|
| paddleocrv5 | PP-OCRv5 plain text OCR — outputs raw text lines |
| monkeyocr | MonkeyOCR — requires a running local server |
| mineru | MinerU2.5 — runs locally via HuggingFace |
| deepseekocr | DeepSeek-OCR — runs locally via HuggingFace |
| deepseekocr2 | DeepSeek-OCR-2 — runs locally via HuggingFace |
- Setting up paddleocr and paddleocrv5-ppstructure

  Linux or WSL is required; PaddlePaddle has limited Windows support.

  Install dependencies:

  ```shell
  pip install paddleocr paddlepaddle
  # For paddleocrv5-ppstructure, also install:
  pip install shapely pyclipper scikit-image imutils lmdb
  ```

  Model weights (~200 MB) are downloaded automatically on first run.
paddleocrv5 runs PP-OCRv5 locally on CPU and outputs plain text. Some samples may raise a PaddlePaddle oneDNN/PIR compatibility error — these are caught and skipped automatically. No configuration needed beyond the install above.
paddleocrv5-table uses TableRecognitionPipelineV2 to detect tables and output them as HTML (`<table>`/`<tr>`/`<td>`), with all other page text wrapped in plain HTML tags, producing a full HTML document per page. Note: PPStructureV3 requires CUDA and is not supported in CPU-only environments.
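Because the table output is plain `<table>`/`<tr>`/`<td>` HTML, downstream fact checks can pull cell text with the standard library alone. A minimal extractor (illustrative, not part of this repository):

```python
from html.parser import HTMLParser

class TableCellExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell from page-level HTML output."""
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []
    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self.in_cell = True
            self.cells.append("")
    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False
    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data

p = TableCellExtractor()
p.feed("<table><tr><td>Revenue</td><td>(1,234)</td></tr></table>")
print(p.cells)  # ['Revenue', '(1,234)']
```

This kind of cell-level access is what makes fact-level comparison against the ground-truth HTML possible in the first place.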
- Setting up monkeyocr
MonkeyOCR requires a running HTTP server. Start the server separately before running main.py (see the MonkeyOCR repository for server setup instructions).
  Set the server URL via environment variable (defaults to `http://localhost:8000`):

  ```shell
  MONKEYOCR_API_URL=http://your-server:8000
  ```
After running `main.py`, run `evaluation.py` to compute ROUGE-1, ROUGE-L, and Edit Distance metrics:

```shell
python evaluation.py
```

Results are saved as `results/{model_tag}_zero-shot_rouge1_eval.csv`.
In `llm-as-a-judge-prompt.py`, GPT-4o serves as the evaluator: it extracts financial entities (Numbers, Dates, Monetary Units, etc.) from the ground-truth HTML and verifies their presence in the model-generated HTML. The judge performs normalization, contextual matching, and fine-grained fact checking under a structured evaluation prompt.
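Fact matching requires normalizing surface variants before comparison, since the same monetary value can appear as "(1,234)", "-1234", or "$1.2M" across documents. A simplified sketch of that normalization step (a hypothetical helper, not the judge's actual prompt logic):

```python
def normalize_number(s: str) -> float:
    """Normalize a financial number string: strip currency symbols and
    thousands separators, treat parentheses as negation, and expand
    K/M/B unit suffixes. Hypothetical helper for illustration only."""
    s = s.strip()
    neg = s.startswith("(") and s.endswith(")")
    if neg:
        s = s[1:-1]
    s = s.replace("$", "").replace(",", "").strip()
    scale = 1
    if s and s[-1].upper() in "KMB":
        scale = {"K": 1e3, "M": 1e6, "B": 1e9}[s[-1].upper()]
        s = s[:-1]
    value = float(s) * scale
    return -value if neg else value

# Accounting-style and plain negatives normalize to the same value,
# and unit suffixes expand before comparison.
assert normalize_number("(1,234)") == normalize_number("-1234") == -1234.0
assert normalize_number("$1.2M") == 1_200_000.0
```

In the actual benchmark this kind of equivalence reasoning is delegated to the LLM judge, which also handles contextual cues (e.g. a "in millions" header) that a purely lexical rule cannot.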
If you find this work useful, please cite:
@misc{he2025fincriticaledvisualbenchmarkfinancial,
title={FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR Evaluation},
author={Yueru He and Xueqing Peng and Yupeng Cao and Yan Wang and Lingfei Qian and Haohang Li and Yi Han and Ruoyu Xiang and Mingquan Lin and Prayag Tiwari and Jimin Huang and Guojun Xiong and Sophia Ananiadou},
year={2025},
eprint={2511.14998},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.14998},
}