Yueru He, Xueqing Peng*, Yupeng Cao, Yan Wang, Lingfei Qian, Haohang Li, Yi Han, Shuyao Wang, Ruoyu Xiang, Fan Zhang, Zhuohan Xie, Mingquan Lin, Prayag Tiwari, Jimin Huang, Guojun Xiong, Sophia Ananiadou
*Corresponding author, xueqing.peng2024@gmail.com
📖 Paper • 🤗 Dataset • 🌏 WebPage
Recent progress in multimodal large language models (MLLMs) has substantially improved document understanding, yet strong OCR performance on surface metrics does not necessarily imply faithful preservation of decision-critical evidence. This limitation is especially consequential in financial documents, where small visual errors, such as a missing negative marker, a shifted decimal point, an incorrect unit scale, or a misaligned reporting date, can induce materially different interpretations.
To study this gap, we introduce FinCriticalED (Financial Critical Error Detection), a fact-centric visual benchmark for evaluating OCR and vision-language systems through the lens of evidence fidelity in high-stakes document understanding. FinCriticalED contains 859 real-world financial document pages paired with ground-truth HTML, with 9,481 expert-annotated facts spanning five financially critical field types: Numbers, Monetary Units, Temporal Data, Reporting Entities, and Financial Concepts.
We further develop an evaluation suite, including critical-field-aware metrics and a context-aware protocol, to assess whether model outputs preserve financially critical facts beyond lexical similarity. We benchmark 13 OCR pipelines, OCR-native models, open-source VLMs, and proprietary MLLMs on FinCriticalED. Results show that conventional OCR metrics can substantially overestimate factual reliability, and that OCR-specialized systems may outperform much larger general-purpose MLLMs in preserving critical financial evidence under complex layouts. FinCriticalED provides a rigorous benchmark for trustworthy financial OCR and a broader testbed for high-stakes multimodal document understanding.
Model performance on the FinCriticalED benchmark:
| Model | Size | R1 | RL | E↓ | Rank | N-FFA | T-FFA | M-FFA | R-FFA | FC-FFA | FFA | Rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **OCR Pipelines** | | | | | | | | | | | | |
| MinerU2.5 | 1.2B | - | - | - | - | - | - | - | - | - | - | - |
| PP-OCRv5 | 0.07B | 97.54 | 96.55 | 3.10 | - | - | - | - | - | - | - | - |
| **Specialized OCR VLMs** | | | | | | | | | | | | |
| DeepSeek-OCR | 6B | - | - | - | - | - | - | - | - | - | - | - |
| DeepSeek-OCR-2 | 3B | - | - | - | - | - | - | - | - | - | - | - |
| GLM-OCR | 0.9B | - | - | - | - | - | - | - | - | - | - | - |
| **Open-source MLLMs** | | | | | | | | | | | | |
| Gemma-3n-E4B-it | 4B | 83.49 | 79.59 | 23.82 | - | - | - | - | - | - | - | - |
| Qwen3-VL-8B-Instruct | 8B | - | - | - | - | - | - | - | - | - | - | - |
| Llama-4-Maverick | 17B | 98.00 | 97.62 | 3.70 | - | - | - | - | - | - | - | - |
| Qwen3.5-397B-A17B | 397B | - | - | - | - | - | - | - | - | - | - | - |
| **Proprietary MLLMs** | | | | | | | | | | | | |
| GPT-4o | - | - | - | - | - | - | - | - | - | - | - | - |
| GPT-5 | - | - | - | - | - | - | - | - | - | - | - | - |
| Claude-Sonnet-4.6 | - | 98.84 | 98.73 | 1.69 | - | - | - | - | - | - | - | - |
| Gemini-2.5-Pro | - | - | - | - | - | - | - | - | - | - | - | - |
R1 = ROUGE-1, RL = ROUGE-L, E↓ = Edit Distance (lower is better); these and the first Rank column are general, surface-level metrics. FFA = Fact-level Financial Accuracy; the N-, T-, M-, R-, and FC- prefixes denote the Numbers, Temporal Data, Monetary Units, Reporting Entities, and Financial Concepts field types. "-" = results pending.
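The gap between surface metrics and fact-level accuracy is easy to reproduce: a prediction that drops a single negative marker still scores near-perfectly on token overlap and character similarity. A minimal illustration, using `difflib` from the standard library as a stand-in for the benchmark's actual scorers:

```python
from difflib import SequenceMatcher

gt   = "Net income (loss) for 2023 was (1,234) million USD"
pred = "Net income (loss) for 2023 was 1,234 million USD"  # negative marker dropped

# Token-level ROUGE-1 recall: share of ground-truth tokens recovered.
gt_toks, pred_toks = gt.split(), pred.split()
overlap = sum(min(gt_toks.count(t), pred_toks.count(t)) for t in set(gt_toks))
rouge1_recall = overlap / len(gt_toks)

# Character-level similarity, a proxy for (inverse) edit distance.
char_sim = SequenceMatcher(None, gt, pred).ratio()

print(f"ROUGE-1 recall {rouge1_recall:.2f}, char similarity {char_sim:.2f}")
# Both scores stay high, yet the sign of the reported figure is flipped.
```

A fact-level check on the Numbers field would score this prediction as wrong, which is exactly the distinction the FFA columns capture.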
1. Before running models, configure the `MODELS` list in `main.py` by uncommenting the model(s) you want to run.
2. Set the API keys for any cloud models you enabled:

```
OPENAI_API_KEY=...
ANTHROPIC_API_KEY=...
GOOGLE_API_KEY=...
TOGETHER_API_KEY=...
ZAI_API_KEY=...
```

3. Run the benchmark:

```shell
cd model_eval
python main.py
```
Results are saved to `results/{model_tag}_zero-shot/pred_{i}.txt`. Already-completed samples are skipped on re-runs.
To limit the number of samples for a quick test, set `max_samples` in `main()`:

```python
max_samples = 10
```
Supported models:

- Cloud VLMs (API key required, no local setup):
| Model key in MODELS | Provider | API Key |
|---|---|---|
| gpt-4o | OpenAI | OPENAI_API_KEY |
| gpt-5 | OpenAI | OPENAI_API_KEY |
| claude-sonnet-4-6 | Anthropic | ANTHROPIC_API_KEY |
| gemini-2.5-pro | Google | GOOGLE_API_KEY |
| meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 | Together AI | TOGETHER_API_KEY |
| Qwen/Qwen2.5-VL-72B-Instruct | Together AI | TOGETHER_API_KEY |
| glm-ocr | ZAI | ZAI_API_KEY |
- Local OCR Pipelines (no API key, local setup required):
| Model key in MODELS | Description |
|---|---|
| paddleocrv5 | PP-OCRv5 plain text OCR — outputs raw text lines |
| monkeyocr | MonkeyOCR — requires a running local server |
| mineru | MinerU2.5 — runs locally via HuggingFace |
| deepseekocr | DeepSeek-OCR — runs locally via HuggingFace |
| deepseekocr2 | DeepSeek-OCR-2 — runs locally via HuggingFace |
- Setting up paddleocr and paddleocrv5-ppstructure

  Linux or WSL is required; PaddlePaddle has limited Windows support.

  Install dependencies:

  ```shell
  pip install paddleocr paddlepaddle
  # For paddleocrv5-ppstructure, also install:
  pip install shapely pyclipper scikit-image imutils lmdb
  ```

  Model weights (~200 MB) are downloaded automatically on first run.
paddleocrv5 runs PP-OCRv5 locally on CPU and outputs plain text. Some samples may raise a PaddlePaddle oneDNN/PIR compatibility error — these are caught and skipped automatically. No configuration needed beyond the install above.
paddleocrv5-table uses TableRecognitionPipelineV2 to detect tables and output them as HTML (`<table>`/`<tr>`/`<td>`), with all other page text wrapped in plain HTML tags, producing a full HTML document per page. Note: PPStructureV3 requires CUDA and is not supported in CPU-only environments.
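Because the table output is plain `<table>`/`<tr>`/`<td>` HTML, downstream fact checks can pull cell text with the standard library alone. A minimal extractor (illustrative, not part of this repository):

```python
from html.parser import HTMLParser

class TableCellExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell from page-level HTML output."""
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []
    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self.in_cell = True
            self.cells.append("")
    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False
    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data

p = TableCellExtractor()
p.feed("<table><tr><td>Revenue</td><td>(1,234)</td></tr></table>")
print(p.cells)  # ['Revenue', '(1,234)']
```

This kind of cell-level access is what makes fact-level comparison against the ground-truth HTML possible in the first place.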
- Setting up monkeyocr
MonkeyOCR requires a running HTTP server. Start the server separately before running main.py (see the MonkeyOCR repository for server setup instructions).
  Set the server URL via environment variable (defaults to `http://localhost:8000`):

  ```shell
  MONKEYOCR_API_URL=http://your-server:8000
  ```
After running `main.py`, run `evaluation.py` to compute ROUGE-1, ROUGE-L, and Edit Distance metrics:

```shell
python evaluation.py
```

Results are saved as `results/{model_tag}_zero-shot_rouge1_eval.csv`.
In `llm-as-a-judge-prompt.py`, GPT-4o serves as the evaluator: it extracts financial entities (Numbers, Dates, Monetary Units, etc.) from the ground-truth HTML and verifies their presence in the model-generated HTML. The judge performs normalization, contextual matching, and fine-grained fact checking under a structured evaluation prompt.
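Fact matching requires normalizing surface variants before comparison, since the same monetary value can appear as "(1,234)", "-1234", or "$1.2M" across documents. A simplified sketch of that normalization step (a hypothetical helper, not the judge's actual prompt logic):

```python
def normalize_number(s: str) -> float:
    """Normalize a financial number string: strip currency symbols and
    thousands separators, treat parentheses as negation, and expand
    K/M/B unit suffixes. Hypothetical helper for illustration only."""
    s = s.strip()
    neg = s.startswith("(") and s.endswith(")")
    if neg:
        s = s[1:-1]
    s = s.replace("$", "").replace(",", "").strip()
    scale = 1
    if s and s[-1].upper() in "KMB":
        scale = {"K": 1e3, "M": 1e6, "B": 1e9}[s[-1].upper()]
        s = s[:-1]
    value = float(s) * scale
    return -value if neg else value

# Accounting-style and plain negatives normalize to the same value,
# and unit suffixes expand before comparison.
assert normalize_number("(1,234)") == normalize_number("-1234") == -1234.0
assert normalize_number("$1.2M") == 1_200_000.0
```

In the actual benchmark this kind of equivalence reasoning is delegated to the LLM judge, which also handles contextual cues (e.g. a "in millions" header) that a purely lexical rule cannot.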
If you find this work useful, please cite:
@misc{he2025fincriticaledvisualbenchmarkfinancial,
title={FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR Evaluation},
author={Yueru He and Xueqing Peng and Yupeng Cao and Yan Wang and Lingfei Qian and Haohang Li and Yi Han and Ruoyu Xiang and Mingquan Lin and Prayag Tiwari and Jimin Huang and Guojun Xiong and Sophia Ananiadou},
year={2025},
eprint={2511.14998},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.14998},
}