OCR Comparison: Tesseract vs Amazon Textract

This repository contains scripts, configuration, and sample results from testing OCR quality across a variety of document types in the New York Philharmonic Archives.

Objective

Evaluate whether open-source Tesseract is sufficient for most documents, or whether Amazon Textract is needed for complex ones (e.g., overlaid stamps, handwriting).

Summary of Findings

✅ Tesseract (with tuned settings) performs well on clean, typed materials.
❌ Tesseract struggles with:
- Overlaid stamps (e.g., "COPY")
- Handwriting
✅ Textract produced much better results in those cases.

Recommended Strategy

Use Tesseract for most documents with:
- Options --psm 3 --oem 1 (fully automatic page segmentation, LSTM engine only)
- Grayscale JPEGs converted from JP2 with enhanced contrast
Use Amazon Textract for:
- Stamped letters
- Handwritten annotations
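
A minimal sketch of the strategy above, for illustration only; it is not taken from ocr_comparison.py. The function names are hypothetical. The preprocessing step assumes Pillow (built with OpenJPEG support) is installed, and it uses autocontrast as a stand-in for the "enhanced contrast" step, whose exact method is not specified here.

```python
from __future__ import annotations

import subprocess
from pathlib import Path

# Flags from the recommended strategy.
TESSERACT_FLAGS = ["--psm", "3", "--oem", "1"]


def choose_engine(has_stamp: bool, has_handwriting: bool) -> str:
    """Route stamped or handwritten pages to Textract, all others to Tesseract."""
    return "textract" if (has_stamp or has_handwriting) else "tesseract"


def preprocess_jp2(src: Path, dst: Path) -> Path:
    """Convert a JP2 scan to a grayscale JPEG with stretched contrast.

    Assumption: Pillow can read JP2 on this machine, and autocontrast
    approximates the contrast enhancement used in the tests.
    """
    from PIL import Image, ImageOps  # lazy import; Pillow is an assumed extra

    with Image.open(src) as img:
        gray = ImageOps.grayscale(img)
        enhanced = ImageOps.autocontrast(gray, cutoff=1)
        enhanced.save(dst, format="JPEG", quality=95)
    return dst


def build_tesseract_command(image: Path, output_base: Path) -> list[str]:
    """Build the CLI call: tesseract <image> <output base> --psm 3 --oem 1."""
    return ["tesseract", str(image), str(output_base), *TESSERACT_FLAGS]


def run_tesseract(image: Path, output_base: Path) -> Path:
    """Run Tesseract and return the path of the text file it writes."""
    subprocess.run(build_tesseract_command(image, output_base), check=True)
    # Tesseract appends ".txt" to the output base for plain-text output.
    return Path(str(output_base) + ".txt")
```

Typical use would be preprocess_jp2(...) followed by run_tesseract(...) for pages where choose_engine(...) returns "tesseract". The tesseract binary must be on the PATH for run_tesseract to work.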

How to Run

Install the Python dependency:

pip install boto3

Configure your AWS credentials (for Textract):

aws configure --profile textract-test

Run the script:

python ocr_comparison.py
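
For reference, the textract-test profile configured above could be used from Python roughly as follows. This is a hedged sketch, not necessarily how ocr_comparison.py calls the service: the function names are hypothetical, and it assumes the profile also sets a default region and that the image is a JPEG or PNG within Textract's synchronous size limits.

```python
from __future__ import annotations


def lines_from_response(response: dict) -> list[str]:
    """Collect the text of every LINE block in a Textract response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def textract_lines(image_path: str, profile: str = "textract-test") -> list[str]:
    """Send one image to Textract and return its detected lines of text."""
    import boto3  # lazy import so the parsing helper works without boto3

    session = boto3.Session(profile_name=profile)
    client = session.client("textract")
    with open(image_path, "rb") as f:
        response = client.detect_document_text(Document={"Bytes": f.read()})
    return lines_from_response(response)
```

Keeping the response parsing in its own function makes it possible to check the line extraction against a saved response without calling AWS.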

