This repository contains scripts, configuration, and sample results from testing OCR quality across a variety of document types in the New York Philharmonic Archives.
The goal is to evaluate whether we can rely primarily on open-source Tesseract, or whether Amazon Textract is needed for complex documents (e.g., overlaid stamps, handwriting).
- ✅ Tesseract (with tuned settings) performs well on clean, typed materials.
- ❌ Tesseract struggles with:
  - Overlaid stamps (e.g., "COPY")
  - Handwriting
- ✅ Textract produced much better results in those cases.

Use Tesseract for most documents with:
- `--psm 3 --oem 1`
- Grayscale JPEGs converted from JP2 with enhanced contrast

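The Tesseract settings above can be sketched in Python as follows. This is a minimal illustration and not necessarily how `ocr_comparison.py` does it: the function names are hypothetical, Pillow is an assumed extra dependency, and `ImageOps.autocontrast` is only one possible reading of "enhanced contrast". Opening JP2 files also assumes a Pillow build with JPEG 2000 support.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base):
    """Return the argv list for a Tesseract run with the settings named above."""
    return [
        "tesseract",
        str(image_path),
        str(output_base),  # Tesseract appends ".txt" to this base name
        "--psm", "3",      # fully automatic page segmentation, no OSD
        "--oem", "1",      # LSTM engine only
    ]


def prepare_image(src_path, dst_path):
    """Convert an image to a contrast-stretched grayscale JPEG.

    Assumption: Pillow is installed; autocontrast stands in for whatever
    contrast enhancement the project actually applies.
    """
    from PIL import Image, ImageOps  # lazy import: Pillow is an assumed dependency

    with Image.open(src_path) as img:
        gray = ImageOps.grayscale(img)
        enhanced = ImageOps.autocontrast(gray)
        enhanced.save(dst_path, format="JPEG")
    return Path(dst_path)


def run_tesseract(image_path, output_base):
    """Run Tesseract (must be on PATH) and return the path of the text output."""
    subprocess.run(build_tesseract_command(image_path, output_base), check=True)
    return Path(f"{output_base}.txt")
```

A typical call would be `run_tesseract(prepare_image("page.jp2", "page.jpg"), "page")`, where the file names are placeholders.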
Use Amazon Textract for:
- Stamped letters
- Handwritten annotations
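For the Textract cases, a call through boto3 might look like the sketch below. It assumes the `textract-test` profile has a default region configured and that the image is a JPEG or PNG small enough to send as inline bytes. The function names are hypothetical and are not taken from `ocr_comparison.py`.

```python
def extract_lines(response):
    """Collect the text of every LINE block from a Textract response dict."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]


def textract_image(image_path, profile_name="textract-test"):
    """Send one image to Textract and return its detected lines of text."""
    import boto3  # lazy import so extract_lines can be used without AWS set up

    session = boto3.Session(profile_name=profile_name)
    client = session.client("textract")
    with open(image_path, "rb") as f:
        response = client.detect_document_text(Document={"Bytes": f.read()})
    return extract_lines(response)
```

Keeping `extract_lines` separate from the network call makes it possible to test the parsing against saved responses without incurring Textract charges.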
- `ocr_comparison.py`: main OCR test harness
- `test_images/`: location for input images (JP2s, JPGs, etc.)
- `results/`: sample outputs for Tesseract and Textract
- Install the Python dependency: `pip install boto3`
- Configure your AWS credentials (for Textract): `aws configure --profile textract-test`
- Run the script: `python ocr_comparison.py`