FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR Evaluation

Yueru He, Xueqing Peng*, Yupeng Cao, Yan Wang, Lingfei Qian, Haohang Li, Yi Han, Shuyao Wang, Ruoyu Xiang, Fan Zhang, Zhuohan Xie, Mingquan Lin, Prayag Tiwari, Jimin Huang, Guojun Xiong, Sophia Ananiadou

*Corresponding author, xueqing.peng2024@gmail.com

📖 Paper · 🤗 Dataset · 🌏 WebPage


📜Abstract

Recent progress in multimodal large language models (MLLMs) has substantially improved document understanding, yet strong OCR performance on surface metrics does not necessarily imply faithful preservation of decision-critical evidence. This limitation is especially consequential in financial documents, where small visual errors, such as a missing negative marker, a shifted decimal point, an incorrect unit scale, or a misaligned reporting date, can induce materially different interpretations.

To study this gap, we introduce FinCriticalED (Financial Critical Error Detection), a fact-centric visual benchmark for evaluating OCR and vision-language systems through the lens of evidence fidelity in high-stakes document understanding. FinCriticalED contains 859 real-world financial document pages paired with ground-truth HTML, with 9,481 expert-annotated facts spanning five financially critical field types: Numbers, Monetary Units, Temporal Data, Reporting Entities, and Financial Concepts.

We further develop an evaluation suite, including critical-field-aware metrics and a context-aware protocol, to assess whether model outputs preserve financially critical facts beyond lexical similarity. We benchmark 13 OCR pipelines, OCR-native models, open-source VLMs, and proprietary MLLMs on FinCriticalED. Results show that conventional OCR metrics can substantially overestimate factual reliability, and that OCR-specialized systems may outperform much larger general-purpose MLLMs in preserving critical financial evidence under complex layouts. FinCriticalED provides a rigorous benchmark for trustworthy financial OCR and a broader testbed for high-stakes multimodal document understanding.

🏆Results

Model performance on the FinCriticalED benchmark (General metrics: R1, RL, E↓; Fact-Level metrics: FFA by field type):

| Model | Size | R1 | RL | E↓ | Rank | N-FFA | T-FFA | M-FFA | R-FFA | FC-FFA | FFA | Rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **OCR Pipelines** | | | | | | | | | | | | |
| MinerU2.5 | 1.2B | - | - | - | - | - | - | - | - | - | - | - |
| PP-OCRv5 | 0.07B | 97.54 | 96.55 | 3.10 | - | - | - | - | - | - | - | - |
| **Specialized OCR VLMs** | | | | | | | | | | | | |
| DeepSeek-OCR | 6B | - | - | - | - | - | - | - | - | - | - | - |
| DeepSeek-OCR-2 | 3B | - | - | - | - | - | - | - | - | - | - | - |
| GLM-OCR | 0.9B | - | - | - | - | - | - | - | - | - | - | - |
| **Open-source MLLMs** | | | | | | | | | | | | |
| Gemma-3n-E4B-it | 4B | 83.49 | 79.59 | 23.82 | - | - | - | - | - | - | - | - |
| Qwen3-VL-8B-Instruct | 8B | - | - | - | - | - | - | - | - | - | - | - |
| Llama-4-Maverick | 17B | 98.00 | 97.62 | 3.70 | - | - | - | - | - | - | - | - |
| Qwen3.5-397B-A17B | 397B | - | - | - | - | - | - | - | - | - | - | - |
| **Proprietary MLLMs** | | | | | | | | | | | | |
| GPT-4o | - | - | - | - | - | - | - | - | - | - | - | - |
| GPT-5 | - | - | - | - | - | - | - | - | - | - | - | - |
| Claude-Sonnet-4.6 | - | 98.84 | 98.73 | 1.69 | - | - | - | - | - | - | - | - |
| Gemini-2.5-Pro | - | - | - | - | - | - | - | - | - | - | - | - |
R1 = ROUGE-1, RL = ROUGE-L, E↓ = Edit Distance (lower is better), FFA = Fact-level Financial Accuracy. - = results pending.

⚙️Usage

1. Running Models

1. Before running models, configure the `MODELS` list in `main.py` by uncommenting the model(s) you want to run.

2. Create a `.env` file in `model_eval/` with the relevant API keys:

```bash
OPENAI_API_KEY=...
ANTHROPIC_API_KEY=...
GOOGLE_API_KEY=...
TOGETHER_API_KEY=...
ZAI_API_KEY=...
```

3. Run `main.py` to generate model OCR output:

```bash
cd model_eval
python main.py
```
  • Results are saved to `results/{model_tag}_zero-shot/pred_{i}.txt`. Already-completed samples are skipped on re-runs.

  • To limit the number of samples for a quick test, set `max_samples` in `main()`:

```python
max_samples = 10
```

  • Supported Models

Cloud VLMs (API key required, no local setup):

| Model key in `MODELS` | Provider | API key |
| --- | --- | --- |
| `gpt-4o` | OpenAI | `OPENAI_API_KEY` |
| `gpt-5` | OpenAI | `OPENAI_API_KEY` |
| `claude-sonnet-4-6` | Anthropic | `ANTHROPIC_API_KEY` |
| `gemini-2.5-pro` | Google | `GOOGLE_API_KEY` |
| `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8` | Together AI | `TOGETHER_API_KEY` |
| `Qwen/Qwen2.5-VL-72B-Instruct` | Together AI | `TOGETHER_API_KEY` |
| `glm-ocr` | ZAI | `ZAI_API_KEY` |
  • Local OCR Pipelines (no API key, local setup required):

| Model key in `MODELS` | Description |
| --- | --- |
| `paddleocrv5` | PP-OCRv5 plain-text OCR; outputs raw text lines |
| `monkeyocr` | MonkeyOCR; requires a running local server |
| `mineru` | MinerU2.5; runs locally via HuggingFace |
| `deepseekocr` | DeepSeek-OCR; runs locally via HuggingFace |
| `deepseekocr2` | DeepSeek-OCR; runs locally via HuggingFace |
  • Setting up `paddleocr` and `paddleocrv5-ppstructure`

Linux or WSL is required; PaddlePaddle has limited Windows support.

Install dependencies:

```bash
pip install paddleocr paddlepaddle
# For paddleocrv5-ppstructure, also install:
pip install shapely pyclipper scikit-image imutils lmdb
```

Model weights (~200 MB) are downloaded automatically on first run.

paddleocrv5 runs PP-OCRv5 locally on CPU and outputs plain text. Some samples may raise a PaddlePaddle oneDNN/PIR compatibility error — these are caught and skipped automatically. No configuration needed beyond the install above.

`paddleocrv5-table` uses TableRecognitionPipelineV2 to detect tables and output them as HTML (`<table>`/`<tr>`/`<td>`), with all other text wrapped in `<p>` tags, producing a full HTML document per page. Note: PPStructureV3 requires CUDA and is not supported in CPU-only environments.

  • Setting up `monkeyocr`

MonkeyOCR requires a running HTTP server. Start the server separately before running `main.py` (see the MonkeyOCR repository for server setup instructions).

Set the server URL via environment variable (defaults to `http://localhost:8000`):

```bash
MONKEYOCR_API_URL=http://your-server:8000
```
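The resume behavior described above (already-completed samples are skipped on re-runs) can be sketched as follows. This is a minimal illustration, not the actual `main.py` internals; the helper name `pending_samples` is hypothetical:

```python
from pathlib import Path


def pending_samples(results_dir: str, n_samples: int):
    """Yield sample indices whose prediction file does not exist yet,
    so an interrupted run resumes where it stopped."""
    out = Path(results_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i in range(n_samples):
        if not (out / f"pred_{i}.txt").exists():
            yield i
```

With `pred_0.txt` already on disk and three samples total, only indices 1 and 2 would be processed on the next run.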

2. Running Evaluation

Traditional OCR Metrics

After running `main.py`, run `evaluation.py` to compute ROUGE-1, ROUGE-L, and Edit Distance:

```bash
python evaluation.py
```

Results are saved as `results/{model_tag}_zero-shot_rouge1_eval.csv`.
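For reference, both metric families are simple to compute. The sketch below shows a standard Levenshtein edit distance and a simplified unigram-overlap ROUGE-1 F1; `evaluation.py`'s exact tokenization and normalization may differ:

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]


def rouge1_f1(ref: str, hyp: str) -> float:
    """Unigram-overlap F1 (a simplified ROUGE-1)."""
    r, h = Counter(ref.split()), Counter(hyp.split())
    overlap = sum((r & h).values())
    if not overlap:
        return 0.0
    p, rec = overlap / sum(h.values()), overlap / sum(r.values())
    return 2 * p * rec / (p + rec)
```

Note that ROUGE rewards surface overlap, which is exactly why the paper pairs it with fact-level metrics: a hypothesis can score high on ROUGE while flipping a sign or a unit.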

LLM-as-Judge

In `llm-as-a-judge-prompt.py`, GPT-4o serves as the evaluator: it extracts financial entities (Numbers, Dates, Monetary Units, etc.) from the ground-truth HTML and verifies their presence in the model-generated HTML. The LLM judge performs normalization, contextual matching, and fine-grained fact checking under a structured evaluation prompt.
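To make the normalization step concrete, the sketch below parses financial figures into comparable numeric values, handling currency symbols, thousands separators, parenthesized negatives, and unit-scale suffixes. The rules are illustrative only; the actual judge delegates this to the LLM prompt rather than regexes:

```python
import re

_SCALE = {"k": 1e3, "m": 1e6, "b": 1e9,
          "thousand": 1e3, "million": 1e6, "billion": 1e9}


def normalize_amount(text: str):
    """Parse a financial figure into a plain float, or None if unparseable."""
    t = text.strip().lower()
    neg = t.startswith("(") and t.endswith(")")   # accounting-style negative
    t = t.strip("()").lstrip("$€£").replace(",", "").strip()
    m = re.fullmatch(r"(-?\d+(?:\.\d+)?)\s*(k|m|b|thousand|million|billion)?", t)
    if not m:
        return None
    value = float(m.group(1)) * _SCALE.get(m.group(2) or "", 1.0)
    return -value if neg else value
```

For example, `"(1,234)"` normalizes to `-1234.0` and `"$1.2M"` to `1200000.0`, so two surface forms of the same fact compare equal, while a missing negative marker or wrong unit scale is caught as a mismatch.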

🪶Citation

If you find this work useful, please cite:

@misc{he2025fincriticaledvisualbenchmarkfinancial,
      title={FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR Evaluation}, 
      author={Yueru He and Xueqing Peng and Yupeng Cao and Yan Wang and Lingfei Qian and Haohang Li and Yi Han and Ruoyu Xiang and Mingquan Lin and Prayag Tiwari and Jimin Huang and Guojun Xiong and Sophia Ananiadou},
      year={2025},
      eprint={2511.14998},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.14998}, 
}
