A comprehensive Python toolkit for generating synthetic OCR training data with support for multiple languages, fonts, and advanced data augmentation techniques.
An OCR training data generation toolkit with multi-language support and advanced data augmentation.
- Modular Architecture: Refactored from prototype to production-ready toolkit
- Multi-language Support: English, Chinese (Simplified/Traditional), Digits
- High Performance: Multi-process parallel processing, 100+ samples/sec
- Data Augmentation: 10+ image enhancement techniques (perspective, noise, blur, etc.)
- Production Ready: Complete configuration management, quality control, error handling
Intelligent composition of individual character images into line-level handwriting:
Individual Character Images + Corpus Text → Smart Composition → Line Images + Annotations
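The composition step can be pictured as a layout problem: given the size of each character crop, decide where it lands on the line canvas. The following is a minimal sketch under stated assumptions, not the toolkit's implementation. `Glyph`, `layout_line`, `spacing`, and `padding` are hypothetical names, and a simple shared bottom edge stands in for whatever baseline logic `HandwritingLineGenerator` actually uses.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (char, x0, y0, x1, y1) bounding box of one character on the line canvas
Box = Tuple[str, int, int, int, int]


@dataclass
class Glyph:
    """Hypothetical stand-in for one cropped character image."""
    char: str
    width: int
    height: int


def layout_line(glyphs: List[Glyph], spacing: int = 4,
                padding: int = 8) -> Tuple[Tuple[int, int], List[Box]]:
    """Place glyphs left to right so that their bottom edges share one baseline.

    Returns the canvas size as (width, height) and one box per glyph.
    The boxes double as character-level annotations for the line image.
    """
    if not glyphs:
        return (2 * padding, 2 * padding), []

    tallest = max(g.height for g in glyphs)
    baseline = padding + tallest  # y coordinate of the shared bottom edge

    boxes: List[Box] = []
    x = padding
    for g in glyphs:
        y0 = baseline - g.height  # shorter glyphs start lower on the canvas
        boxes.append((g.char, x, y0, x + g.width, baseline))
        x += g.width + spacing

    canvas_width = x - spacing + padding  # drop the trailing gap, add right padding
    canvas_height = baseline + padding
    return (canvas_width, canvas_height), boxes
```

Pasting each crop at its `(x0, y0)` position and joining the characters in order would then give the line image and its text label.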
- DigitGenerator: Digit sequences (6-12 digits, separators supported)
- TextGenerator: English text (names, addresses, sentences)
- ChineseGenerator: Chinese text (simplified/traditional support)
- HandwritingLineGenerator: Character composition based handwriting generation
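To illustrate how generators like these can share one interface, here is a small sketch of an abstract base class with a digit-sequence subclass that produces only the label text. All class and parameter names (`SketchGenerator`, `SketchDigitGenerator`, `separator`, `group`, `seed`) are assumptions made for illustration and are not the toolkit's API. Rendering the label into an image is left out.

```python
import random
from abc import ABC, abstractmethod
from typing import List, Optional


class SketchGenerator(ABC):
    """Hypothetical base class: subclasses only decide what text to produce."""

    def __init__(self, min_length: int, max_length: int,
                 seed: Optional[int] = None):
        if not 1 <= min_length <= max_length:
            raise ValueError("expected 1 <= min_length <= max_length")
        self.min_length = min_length
        self.max_length = max_length
        self.rng = random.Random(seed)  # private RNG keeps runs reproducible

    @abstractmethod
    def make_label(self) -> str:
        """Return the ground-truth text for one sample."""

    def make_labels(self, count: int) -> List[str]:
        return [self.make_label() for _ in range(count)]


class SketchDigitGenerator(SketchGenerator):
    """Digit sequences with an optional separator every `group` digits."""

    def __init__(self, min_length: int = 6, max_length: int = 12,
                 separator: str = "", group: int = 4,
                 seed: Optional[int] = None):
        super().__init__(min_length, max_length, seed)
        if group <= 0:
            raise ValueError("group must be positive")
        self.separator = separator
        self.group = group

    def make_label(self) -> str:
        n = self.rng.randint(self.min_length, self.max_length)  # inclusive bounds
        digits = "".join(self.rng.choice("0123456789") for _ in range(n))
        if not self.separator:
            return digits
        chunks = [digits[i:i + self.group] for i in range(0, n, self.group)]
        return self.separator.join(chunks)
```

For example, `SketchDigitGenerator(separator="-", seed=0).make_labels(3)` returns three digit strings of 6 to 12 digits each, split into groups of at most four.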
pip install -r requirements.txt

# Generate digit samples
python quick_start.py digits --samples 1000 --output ./digit_data
# Generate English text samples
python quick_start.py english --samples 500 --font-size 28
# Generate Chinese text samples
python quick_start.py chinese --samples 300 --output ./chinese_data
# Generate mixed dataset
python quick_start.py mixed --output ./mixed_data

from ocr_data_generator import DigitGenerator
from ocr_data_generator.config.settings import GeneratorConfig
# Configure generator
config = GeneratorConfig(
    language="digits",
    font_size=32,
    min_length=6,
    max_length=12,
    augmentation=True
)
# Create generator and generate samples
generator = DigitGenerator(config)
samples = generator.generate_batch(
    batch_size=1000,
    output_dir="./output",
    save_labels=True
)

# examples/example_digits.py
python3 examples/example_digits.py
Generates digit sequences such as student IDs and phone numbers.

# examples/example_mixed_dataset.py
python3 examples/example_mixed_dataset.py
Creates a comprehensive dataset with English, Chinese, and digits.

# examples/example_config_based.py
python3 examples/example_config_based.py
Uses JSON configuration files for complex dataset specifications.
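As a rough idea of what configuration-driven generation involves, the sketch below reads a JSON specification shaped like the configuration example in this README and turns it into a per-generator plan. `plan_from_config` is a hypothetical helper, not part of the toolkit. It only assumes the `generators`, `enabled`, `samples`, and `config` keys shown in that example.

```python
import json
from typing import Dict


def plan_from_config(config_text: str) -> Dict[str, dict]:
    """Return {generator_name: {"samples": int, "config": dict}} for enabled generators."""
    spec = json.loads(config_text)
    plan: Dict[str, dict] = {}
    for name, entry in spec.get("generators", {}).items():
        if not entry.get("enabled", False):
            continue  # disabled generators contribute nothing to the dataset
        samples = int(entry.get("samples", 0))
        if samples <= 0:
            raise ValueError(f"generator {name!r} is enabled but requests no samples")
        plan[name] = {"samples": samples, "config": dict(entry.get("config", {}))}
    return plan


# Small inline specification used only for this demonstration.
EXAMPLE = """
{
  "dataset_name": "demo",
  "generators": {
    "digits": {"enabled": true, "samples": 400,
               "config": {"min_length": 6, "max_length": 12}},
    "english": {"enabled": false, "samples": 400, "config": {}}
  }
}
"""

plan = plan_from_config(EXAMPLE)
print(plan)
```

Each entry of the resulting plan could then be handed to the matching generator class, with `config` supplying its keyword arguments.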
ocr_data_generator/
├── core/
│   └── base_generator.py          # Abstract base generator class
├── generators/
│   ├── digit_generator.py         # Digit sequence generation
│   ├── text_generator.py          # English text generation
│   └── chinese_generator.py       # Chinese text generation
├── utils/
│   ├── image_processor.py         # Image processing utilities
│   ├── data_augmentation.py       # Data augmentation techniques
│   └── helpers.py                 # Utility functions
├── config/
│   └── settings.py                # Configuration management
└── examples/
    ├── example_digits.py          # Basic digit generation
    ├── example_mixed_dataset.py   # Multi-language dataset
    └── example_config_based.py    # Configuration-driven generation
- BaseGenerator: Abstract base class with common functionality
- DigitGenerator: Specialized for numerical sequences (IDs, phone numbers)
- TextGenerator: English text and mixed alphanumeric content
- ChineseGenerator: Chinese characters with punctuation support
- Geometric Transformations: Rotation, perspective distortion
- Photometric Effects: Brightness, contrast adjustment
- Noise Addition: Gaussian noise, blur effects
- Stroke Modifications: Thickness adjustment, gap insertion
- Background Integration: Texture blending, copy effects
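Two of the effects listed above, brightness adjustment and Gaussian noise, are simple enough to sketch with the standard library alone. The code below is an illustration under assumptions and not the toolkit's `data_augmentation.py`. It treats an image as a nested list of grayscale values, and the helper names are hypothetical. The `noise_prob` and `brightness_range` parameters mirror the names used in this README's augmentation configuration, but their exact semantics there are not confirmed.

```python
import random
from typing import List, Tuple

Image = List[List[int]]  # grayscale pixels in 0..255, row-major


def clamp(value: float) -> int:
    """Round to the nearest integer and keep the result inside 0..255."""
    return max(0, min(255, int(round(value))))


def adjust_brightness(img: Image, factor: float) -> Image:
    """Scale every pixel by `factor` (values above 1 brighten, below 1 darken)."""
    return [[clamp(p * factor) for p in row] for row in img]


def add_gaussian_noise(img: Image, sigma: float, rng: random.Random) -> Image:
    """Add zero-mean Gaussian noise with standard deviation `sigma` to each pixel."""
    return [[clamp(p + rng.gauss(0.0, sigma)) for p in row] for row in img]


def augment(img: Image, rng: random.Random, noise_prob: float = 0.2,
            brightness_range: Tuple[float, float] = (0.8, 1.2)) -> Image:
    """Always vary brightness, then add noise with probability `noise_prob`."""
    out = adjust_brightness(img, rng.uniform(*brightness_range))
    if rng.random() < noise_prob:
        out = add_gaussian_noise(out, sigma=8.0, rng=rng)
    return out
```

Passing a seeded `random.Random` instance makes a given augmentation run repeatable, which helps when comparing models trained on the same synthetic data.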
config = GeneratorConfig(
    language="digits",             # "digits", "en", "zh"
    font_size=32,                  # Font size in pixels
    image_size=(64, 400),          # (height, width)
    output_format="jpg",           # "jpg", "png"
    augmentation=True,             # Enable data augmentation
    # Language-specific parameters
    min_length=6,                  # Minimum text length
    max_length=12,                 # Maximum text length
    include_punctuation=True,      # Include punctuation marks
    # Digit-specific parameters
    cell_width=40,                 # Width per digit cell
    add_separators=True,           # Add separators like "-", " "
    # Custom parameters
    custom_param="value"           # Additional parameters
)

{
  "dataset_name": "my_ocr_dataset",
  "output_directory": "./output",
  "total_samples": 1000,
  "generators": {
    "digits": {
      "enabled": true,
      "samples": 400,
      "config": {
        "min_length": 6,
        "max_length": 12,
        "font_size": 32
      }
    },
    "english": {
      "enabled": true,
      "samples": 400,
      "config": {
        "min_length": 3,
        "max_length": 20,
        "font_size": 28
      }
    }
  },
  "augmentation": {
    "enabled": true,
    "probability": 0.7
  }
}

config = GeneratorConfig(
    augmentation=True,
    perspective_prob=0.3,          # Perspective transform probability
    noise_prob=0.2,                # Noise addition probability
    blur_prob=0.1,                 # Blur effect probability
    brightness_range=(0.8, 1.2)    # Brightness variation range
)

from ocr_data_generator import HandwritingLineGenerator
# Configure character composition generator
config = {
    'char_dict_path': './merged_dict.txt',
    'char_image_directory': './chinese_data/',
    'corpus_directory': './corpus',
    'corpus_files': ['text_corpus.txt']
}
generator = HandwritingLineGenerator(config)
samples = generator.generate_line_samples(1000, './output')

- Generate large amounts of annotated data, reducing manual annotation costs
- Diverse fonts and styles improve model generalization
- Controllable data distribution for specific scenario optimization
- Expansion and enhancement of existing datasets
- Synthetic generation of rare samples
- Data simulation under different conditions
- Standard datasets for algorithm performance testing
- Test samples of different difficulty levels
- Reproducible experimental data
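For the benchmarking use case, reproducibility largely comes down to fixing random seeds. As one illustration, the sketch below performs a seeded train/val/test split using the same 0.7/0.15/0.15 ratios as the `DatasetManager.split_dataset` example in this README. `seeded_split` is a hypothetical stand-alone helper and makes no claim about how `DatasetManager` works internally.

```python
import random
from typing import Dict, List, Sequence


def seeded_split(items: Sequence[str], train_ratio: float = 0.7,
                 val_ratio: float = 0.15, test_ratio: float = 0.15,
                 seed: int = 42) -> Dict[str, List[str]]:
    """Shuffle deterministically, then cut into train/val/test partitions."""
    if abs(train_ratio + val_ratio + test_ratio - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")

    shuffled = sorted(items)  # sort first so the input order cannot change the result
    random.Random(seed).shuffle(shuffled)

    n_train = int(len(shuffled) * train_ratio)
    n_val = int(len(shuffled) * val_ratio)
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],  # remainder, so no item is dropped
    }
```

Calling it twice with the same file names and seed yields identical partitions, so experiments run on different machines can be compared on the same test set.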
config = GeneratorConfig(
    language="digits",
    min_length=10,
    max_length=10,
    cell_width=35,
    add_separators=False
)

config = GeneratorConfig(
    language="en",
    min_length=6,
    max_length=8,
    include_numbers=True,
    font_size=36
)

config = GeneratorConfig(
    language="zh",
    min_length=8,
    max_length=20,
    include_punctuation=True,
    font_size=28
)

# Mixed content generator
from ocr_data_generator.generators.text_generator import FormTextGenerator
generator = FormTextGenerator()
# Generates names, addresses, phone numbers, etc.

- Character Composition: Intelligent combination of individual character images into line-level images
- Smart Alignment: Automatic adjustment of character spacing and baseline alignment
- Style Preservation: Maintains original handwriting character style features
- Modular Refactoring: From messy code to clean architecture
- Object-Oriented: Abstract base classes and inheritance system
- Plugin-based: Easy to extend new generators and enhancement techniques
- Parallel Processing: Multi-process batch generation
- Memory Management: Smart caching and memory optimization
- Quality Control: Automatic quality detection and filtering
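Multi-process batch generation usually follows one pattern: split the total into independent jobs, give each job its own seed, and map the jobs over a process pool. The sketch below shows that pattern with the standard library. It is an assumption about the general approach rather than the toolkit's `generate_parallel` code. All function names are hypothetical, and random digit strings stand in for real image rendering.

```python
import random
from concurrent.futures import ProcessPoolExecutor
from typing import List, Tuple

Job = Tuple[int, int, int]  # (start_index, sample_count, seed)


def plan_batches(total_samples: int, batch_size: int, base_seed: int = 0) -> List[Job]:
    """Split `total_samples` into jobs of at most `batch_size` samples each."""
    if total_samples < 0 or batch_size <= 0:
        raise ValueError("total_samples must be >= 0 and batch_size > 0")
    jobs: List[Job] = []
    for start in range(0, total_samples, batch_size):
        count = min(batch_size, total_samples - start)
        jobs.append((start, count, base_seed + len(jobs)))  # distinct seed per job
    return jobs


def run_batch(job: Job) -> List[str]:
    """Worker: produce `count` labelled samples (placeholder for image rendering)."""
    start, count, seed = job
    rng = random.Random(seed)
    return [
        f"{start + i:06d}:" + "".join(rng.choice("0123456789") for _ in range(8))
        for i in range(count)
    ]


def generate_parallel_sketch(total_samples: int, batch_size: int,
                             max_workers: int = 4) -> List[str]:
    """Run all jobs in a process pool and flatten the results in job order."""
    jobs = plan_batches(total_samples, batch_size)
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(run_batch, jobs)  # map preserves the order of `jobs`
        return [label for batch in results for label in batch]
```

When used in a script, the call to `generate_parallel_sketch` should sit under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Because each job carries its own seed, the output does not depend on which worker happens to run it.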
python -m pytest tests/

# Format code
black .
# Code linting
flake8 .

- Project Summary - Detailed technical documentation
- Resume Format - Resume description templates
- API Documentation - Complete API reference
- Usage Examples - Code examples
from ocr_data_generator.utils.helpers import FontManager
# Scan for fonts
fonts = FontManager.scan_fonts(["/path/to/fonts"])
# Filter by language
chinese_fonts = FontManager.filter_fonts_by_language(fonts, "zh")

from ocr_data_generator.utils.helpers import DatasetManager
manager = DatasetManager("./datasets")
# Create dataset structure
dataset_path = manager.create_dataset_structure("my_dataset")
# Split into train/val/test
splits = manager.split_dataset(
    image_dir="./images",
    output_dir="./splits",
    train_ratio=0.7,
    val_ratio=0.15,
    test_ratio=0.15
)

# Generate large datasets with parallel processing
samples = generator.generate_parallel(
    total_samples=10000,
    output_dir="./large_dataset",
    batch_size=1000,
    max_workers=8
)

import logging
logging.basicConfig(level=logging.DEBUG)
# Generators will output detailed debug information

- Fork the repository
- Create feature branch (git checkout -b feature/amazing-feature)
- Commit changes (git commit -m 'Add amazing feature')
- Push to branch (git push origin feature/amazing-feature)
- Open Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
For questions, issues, or feature requests:
- Check the examples directory
- Review the troubleshooting section
- Open an issue on GitHub
Author: Thomas Zhang
Date: 2024-09-19
Version: 1.0.0