Document Processing & OCR Pipeline (Async Backend)

A production-style, fully async backend service built using open-source technologies only.
This project demonstrates how to design and implement a scalable document processing system with background jobs, caching, rate limiting, pagination, structured logging, and proper testing.

This repository is intentionally focused on backend engineering practices (no UI, no paid services).


🚀 Project Overview

The system allows users to:

  1. Upload documents (images / simple PDFs)
  2. Process them asynchronously using a background worker
  3. Extract text using OCR
  4. Track job status in real time
  5. Fetch results efficiently using Redis caching

All APIs are non-blocking and built using async Python.
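The upload-then-poll flow above can be sketched in a few lines. This is a minimal, self-contained illustration with in-memory stand-ins (`JOBS`, `QUEUE`) for the real MySQL table and Celery queue; the names are illustrative, not the repository's actual identifiers.

```python
import asyncio
import uuid

# In-memory stand-ins for the real MySQL table and Celery queue,
# used here only to illustrate the non-blocking upload flow.
JOBS: dict[str, str] = {}
QUEUE: list[str] = []

async def upload_document(filename: str, data: bytes) -> str:
    """Persist the file, create a job record, and enqueue OCR work.

    Returns a job_id the client can poll. The handler never blocks on
    OCR itself; the background worker picks the job up from the queue.
    """
    job_id = uuid.uuid4().hex
    JOBS[job_id] = "pending"   # in the app: an INSERT via async SQLAlchemy
    QUEUE.append(job_id)       # in the app: a Celery .delay(...) call
    return job_id

async def worker_step() -> None:
    """One iteration of the background worker: run OCR, store the result."""
    job_id = QUEUE.pop(0)
    JOBS[job_id] = "processing"
    await asyncio.sleep(0)     # stands in for the actual Tesseract OCR call
    JOBS[job_id] = "completed"

async def demo() -> str:
    job_id = await upload_document("invoice.png", b"...")
    assert JOBS[job_id] == "pending"
    await worker_step()
    return JOBS[job_id]

print(asyncio.run(demo()))     # completed
```

The key property is that `upload_document` returns immediately with a `job_id`; OCR happens later in `worker_step`, so the API thread is never tied up by slow processing.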


🧱 Architecture (High Level)

  • FastAPI (Async) — REST API layer
  • SQLAlchemy Async ORM — Database access
  • MySQL — Primary persistent storage
  • Celery + Redis — Background job processing
  • Redis — Caching + rate limiting
  • Tesseract OCR — Text extraction (open source)
  • Docker + Docker Compose — Local orchestration
  • pytest — Async unit testing

✨ Implemented Features

✅ Core Functionality

  • Document upload API
  • Background OCR processing
  • Job status tracking
  • OCR result storage
  • Async I/O across the API and database layers

✅ Performance & Scalability

  • Redis caching for job status and results
  • Token-bucket rate limiting using Redis (per client)
  • Pagination and filtering for document listing APIs

✅ Reliability

  • Structured JSON logging for API and worker
  • Error handling for failed OCR jobs
  • Separation of concerns (API, services, workers)
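The error-handling pattern for failed OCR jobs can be sketched as follows. Names (`run_ocr`, `process_job`, `JOBS`) are hypothetical stand-ins: a failing job is recorded as `failed` with its error message rather than crashing the worker.

```python
# Illustrative worker-side error handling: a job that raises during OCR
# is marked "failed" with the error; the worker process keeps running.
JOBS: dict[str, dict] = {}

def run_ocr(data: bytes) -> str:
    """Stand-in for the real Tesseract call."""
    if not data:
        raise ValueError("empty document")
    return "extracted text"

def process_job(job_id: str, data: bytes) -> None:
    JOBS[job_id] = {"status": "processing"}
    try:
        text = run_ocr(data)
    except Exception as exc:
        # Persist the failure so GET /v1/jobs/{job_id} can report it.
        JOBS[job_id] = {"status": "failed", "error": str(exc)}
    else:
        JOBS[job_id] = {"status": "completed", "text": text}

process_job("a1", b"scan bytes")
process_job("a2", b"")
print(JOBS["a1"]["status"], JOBS["a2"]["status"])  # completed failed
```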

✅ Testing

  • Async unit tests using pytest and pytest-asyncio
  • API tests for upload and job status
  • Redis behavior validation
  • Celery task enqueue mocked during tests

🗂 Project Structure

document-processor/
├── app/
│   ├── api/
│   │   └── v1/
│   │       ├── upload.py
│   │       ├── jobs.py
│   │       └── documents.py
│   ├── core/
│   │   ├── config.py
│   │   ├── redis_client.py
│   │   ├── ratelimit.py
│   │   └── logging_config.py
│   ├── db/
│   │   ├── base.py
│   │   ├── session.py
│   │   └── models.py
│   ├── services/
│   │   ├── storage.py
│   │   └── ocr.py
│   ├── workers/
│   │   ├── celery_app.py
│   │   └── tasks.py
│   └── main.py
├── tests/
│   ├── conftest.py
│   ├── test_upload_api.py
│   └── test_job_status_api.py
├── Dockerfile
├── worker.Dockerfile
├── docker-compose.yml
├── requirements.txt
├── .env.example
└── README.md

🔌 API Endpoints

Upload Document

POST /v1/documents/upload-document
  • Upload a document for OCR processing
  • Returns a job_id for tracking

Get Job Status

GET /v1/jobs/{job_id}
  • Returns job status (pending, processing, completed, failed)
  • Uses Redis cache for fast reads

List Documents (Paginated & Filtered)

GET /v1/documents/list-document

Query Parameters:

  • limit (default: 10)
  • offset (default: 0)
  • filename
  • content_type
  • date_from
  • date_to
  • sort (created_at_asc / created_at_desc)
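The listing semantics can be expressed as a pure function over rows; the real endpoint builds an equivalent SQLAlchemy query, and only the filter shown (`content_type`) is implemented here for brevity.

```python
# Pure-Python sketch of pagination, filtering, and sorting. Field names
# mirror the query parameters above; sample data is invented.
from datetime import date

DOCS = [
    {"filename": "a.png", "content_type": "image/png", "created_at": date(2024, 1, 1)},
    {"filename": "b.pdf", "content_type": "application/pdf", "created_at": date(2024, 2, 1)},
    {"filename": "c.png", "content_type": "image/png", "created_at": date(2024, 3, 1)},
]

def list_documents(limit=10, offset=0, content_type=None, sort="created_at_desc"):
    rows = [d for d in DOCS if content_type is None or d["content_type"] == content_type]
    rows.sort(key=lambda d: d["created_at"], reverse=(sort == "created_at_desc"))
    return rows[offset:offset + limit]  # OFFSET/LIMIT applied after filter + sort

page = list_documents(limit=2, content_type="image/png")
print([d["filename"] for d in page])  # ['c.png', 'a.png']
```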

⚡ Rate Limiting

  • Implemented using Redis Token Bucket algorithm
  • Per-client (IP-based) throttling
  • Configurable capacity and refill rate via environment variables
  • Returns HTTP 429 Too Many Requests when the limit is exceeded
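The token-bucket logic can be sketched in pure Python. In the actual service this state lives in Redis keyed per client IP (typically updated atomically, e.g. via a Lua script), with `capacity` and `refill_rate` read from environment variables:

```python
import time

# Token-bucket sketch: each client gets a bucket that refills at a
# steady rate; a request consumes one token or is rejected.
class TokenBucket:
    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate       # tokens added per second
        self.tokens = capacity               # start full
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.refill_rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                         # caller responds with HTTP 429

bucket = TokenBucket(capacity=3, refill_rate=1.0)
print([bucket.allow() for _ in range(4)])    # [True, True, True, False]
```

The bucket shape means clients can burst up to `capacity` requests, then sustain `refill_rate` requests per second, which is gentler than a fixed-window counter.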

📊 Logging

  • Structured JSON logs
  • Separate service identifiers for:
    • API (service: api)
    • Worker (service: worker)
  • Request-level logging middleware
  • Logs are ready for centralized log systems (ELK, Loki, etc.)
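A minimal JSON formatter using only the standard library illustrates the idea; the project may well use a dedicated library instead, and the field set here is an assumption. The `service` field is what distinguishes `api` from `worker` log streams:

```python
import json
import logging

# Minimal structured-logging sketch: every record is emitted as one
# JSON object, ready for ELK/Loki-style ingestion.
class JsonFormatter(logging.Formatter):
    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "service": self.service,       # "api" or "worker"
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="api"))
log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("upload accepted")   # emits: {"service": "api", "level": "INFO", ...}
```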

🧪 Testing

Run tests locally:

pytest -v

What is tested:

  • Upload API behavior
  • Job status API (Redis cache + DB fallback)
  • Celery task enqueue (mocked)
  • Redis isolation per test
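The Celery-mocking strategy can be sketched like this. All names here are illustrative (the repo's actual handlers and task names differ): the enqueue call is replaced with a mock, so tests assert that a job was queued without needing a running broker.

```python
import asyncio
from unittest.mock import MagicMock

# Sketch of the test strategy: inject the enqueue function so tests can
# swap in a mock instead of a real Celery .delay(...) call.
def make_upload_handler(enqueue):
    async def upload(filename: str) -> dict:
        job_id = "job-123"            # in the app: generated per upload
        enqueue(job_id)               # in the app: process_document.delay(job_id)
        return {"job_id": job_id, "status": "pending"}
    return upload

def test_upload_enqueues_job():
    enqueue = MagicMock()
    upload = make_upload_handler(enqueue)
    resp = asyncio.run(upload("scan.png"))
    assert resp["status"] == "pending"
    enqueue.assert_called_once_with("job-123")

test_upload_enqueues_job()
print("ok")
```

With pytest-asyncio the same assertions live in `async def` test functions and the event loop plumbing is handled by the plugin rather than `asyncio.run`.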

🐳 Running Locally

1. Create environment file

cp .env.example .env

2. Build and start services

docker-compose up --build

3. The API will be available at

http://localhost:8000

Swagger UI:

http://localhost:8000/docs

🧠 Key Backend Concepts Demonstrated

  • Async API design
  • Background job orchestration
  • Cache-first read strategy
  • Token-bucket rate limiting
  • Structured logging
  • Clean architecture separation
  • Realistic testing strategy

📌 Notes

  • This project uses only free and open-source technologies
  • Designed as a backend-focused portfolio project
  • Authentication, database migrations, and CI are intentionally out of scope