Document Processing & OCR Pipeline (Async Backend)

A production-style, fully async backend service built using open-source technologies only.
This project demonstrates how to design and implement a scalable document processing system with background jobs, caching, rate limiting, pagination, structured logging, and proper testing.

This repository is intentionally focused on backend engineering practices (no UI, no paid services).


🚀 Project Overview

The system allows users to:

  1. Upload documents (images / simple PDFs)
  2. Process them asynchronously using a background worker
  3. Extract text using OCR
  4. Track job status in real time
  5. Fetch results efficiently using Redis caching

All APIs are non-blocking and built using async Python.
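The upload-then-poll flow above can be sketched in a few lines. This is a minimal, self-contained illustration with in-memory stand-ins (`JOBS`, `QUEUE`) for the real MySQL table and Celery queue; the names are illustrative, not the repository's actual identifiers.

```python
import asyncio
import uuid

# In-memory stand-ins for the real MySQL table and Celery queue,
# used here only to illustrate the non-blocking upload flow.
JOBS: dict[str, str] = {}
QUEUE: list[str] = []

async def upload_document(filename: str, data: bytes) -> str:
    """Persist the file, create a job record, and enqueue OCR work.

    Returns a job_id the client can poll. The handler never blocks on
    OCR itself; the background worker picks the job up from the queue.
    """
    job_id = uuid.uuid4().hex
    JOBS[job_id] = "pending"   # in the app: an INSERT via async SQLAlchemy
    QUEUE.append(job_id)       # in the app: a Celery .delay(...) call
    return job_id

async def worker_step() -> None:
    """One iteration of the background worker: run OCR, store the result."""
    job_id = QUEUE.pop(0)
    JOBS[job_id] = "processing"
    await asyncio.sleep(0)     # stands in for the actual Tesseract OCR call
    JOBS[job_id] = "completed"

async def demo() -> str:
    job_id = await upload_document("invoice.png", b"...")
    assert JOBS[job_id] == "pending"
    await worker_step()
    return JOBS[job_id]

print(asyncio.run(demo()))     # completed
```

The key property is that `upload_document` returns immediately with a `job_id`; OCR happens later in `worker_step`, so the API thread is never tied up by slow processing.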


🧱 Architecture (High Level)

  • FastAPI (Async) — REST API layer
  • SQLAlchemy Async ORM — Database access
  • MySQL — Primary persistent storage
  • Celery + Redis — Background job processing
  • Redis — Caching + rate limiting
  • Tesseract OCR — Text extraction (open source)
  • Docker + Docker Compose — Local orchestration
  • pytest — Async unit testing

✨ Implemented Features

✅ Core Functionality

  • Document upload API
  • Background OCR processing
  • Job status tracking
  • OCR result storage
  • Async I/O across the API and database layers

✅ Performance & Scalability

  • Redis caching for job status and results
  • Token-bucket rate limiting using Redis (per client)
  • Pagination and filtering for document listing APIs

✅ Reliability

  • Structured JSON logging for API and worker
  • Error handling for failed OCR jobs
  • Separation of concerns (API, services, workers)
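The error-handling pattern for failed OCR jobs can be sketched as follows. Names (`run_ocr`, `process_job`, `JOBS`) are hypothetical stand-ins: a failing job is recorded as `failed` with its error message rather than crashing the worker.

```python
# Illustrative worker-side error handling: a job that raises during OCR
# is marked "failed" with the error; the worker process keeps running.
JOBS: dict[str, dict] = {}

def run_ocr(data: bytes) -> str:
    """Stand-in for the real Tesseract call."""
    if not data:
        raise ValueError("empty document")
    return "extracted text"

def process_job(job_id: str, data: bytes) -> None:
    JOBS[job_id] = {"status": "processing"}
    try:
        text = run_ocr(data)
    except Exception as exc:
        # Persist the failure so GET /v1/jobs/{job_id} can report it.
        JOBS[job_id] = {"status": "failed", "error": str(exc)}
    else:
        JOBS[job_id] = {"status": "completed", "text": text}

process_job("a1", b"scan bytes")
process_job("a2", b"")
print(JOBS["a1"]["status"], JOBS["a2"]["status"])  # completed failed
```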

✅ Testing

  • Async unit tests using pytest and pytest-asyncio
  • API tests for upload and job status
  • Redis behavior validation
  • Celery task enqueue mocked during tests

🗂 Project Structure

document-processor/
├── app/
│   ├── api/
│   │   └── v1/
│   │       ├── upload.py
│   │       ├── jobs.py
│   │       └── documents.py
│   ├── core/
│   │   ├── config.py
│   │   ├── redis_client.py
│   │   ├── ratelimit.py
│   │   └── logging_config.py
│   ├── db/
│   │   ├── base.py
│   │   ├── session.py
│   │   └── models.py
│   ├── services/
│   │   ├── storage.py
│   │   └── ocr.py
│   ├── workers/
│   │   ├── celery_app.py
│   │   └── tasks.py
│   └── main.py
├── tests/
│   ├── conftest.py
│   ├── test_upload_api.py
│   └── test_job_status_api.py
├── Dockerfile
├── worker.Dockerfile
├── docker-compose.yml
├── requirements.txt
├── .env.example
└── README.md

🔌 API Endpoints

Upload Document

POST /v1/documents/upload-document
  • Upload a document for OCR processing
  • Returns a job_id for tracking

Get Job Status

GET /v1/jobs/{job_id}
  • Returns job status (pending, processing, completed, failed)
  • Uses Redis cache for fast reads

List Documents (Paginated & Filtered)

GET /v1/documents/list-document

Query Parameters:

  • limit (default: 10)
  • offset (default: 0)
  • filename
  • content_type
  • date_from
  • date_to
  • sort (created_at_asc / created_at_desc)
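The listing semantics can be expressed as a pure function over rows; the real endpoint builds an equivalent SQLAlchemy query, and only the filter shown (`content_type`) is implemented here for brevity.

```python
# Pure-Python sketch of pagination, filtering, and sorting. Field names
# mirror the query parameters above; sample data is invented.
from datetime import date

DOCS = [
    {"filename": "a.png", "content_type": "image/png", "created_at": date(2024, 1, 1)},
    {"filename": "b.pdf", "content_type": "application/pdf", "created_at": date(2024, 2, 1)},
    {"filename": "c.png", "content_type": "image/png", "created_at": date(2024, 3, 1)},
]

def list_documents(limit=10, offset=0, content_type=None, sort="created_at_desc"):
    rows = [d for d in DOCS if content_type is None or d["content_type"] == content_type]
    rows.sort(key=lambda d: d["created_at"], reverse=(sort == "created_at_desc"))
    return rows[offset:offset + limit]  # OFFSET/LIMIT applied after filter + sort

page = list_documents(limit=2, content_type="image/png")
print([d["filename"] for d in page])  # ['c.png', 'a.png']
```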

⚡ Rate Limiting

  • Implemented using Redis Token Bucket algorithm
  • Per-client (IP-based) throttling
  • Configurable capacity and refill rate via environment variables
  • Returns HTTP 429 Too Many Requests when the limit is exceeded
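The token-bucket logic can be sketched in pure Python. In the actual service this state lives in Redis keyed per client IP (typically updated atomically, e.g. via a Lua script), with `capacity` and `refill_rate` read from environment variables:

```python
import time

# Token-bucket sketch: each client gets a bucket that refills at a
# steady rate; a request consumes one token or is rejected.
class TokenBucket:
    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate       # tokens added per second
        self.tokens = capacity               # start full
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.refill_rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                         # caller responds with HTTP 429

bucket = TokenBucket(capacity=3, refill_rate=1.0)
print([bucket.allow() for _ in range(4)])    # [True, True, True, False]
```

The bucket shape means clients can burst up to `capacity` requests, then sustain `refill_rate` requests per second, which is gentler than a fixed-window counter.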

📊 Logging

  • Structured JSON logs
  • Separate service identifiers for:
    • API (service: api)
    • Worker (service: worker)
  • Request-level logging middleware
  • Logs are ready for centralized log systems (ELK, Loki, etc.)
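A minimal JSON formatter using only the standard library illustrates the idea; the project may well use a dedicated library instead, and the field set here is an assumption. The `service` field is what distinguishes `api` from `worker` log streams:

```python
import json
import logging

# Minimal structured-logging sketch: every record is emitted as one
# JSON object, ready for ELK/Loki-style ingestion.
class JsonFormatter(logging.Formatter):
    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "service": self.service,       # "api" or "worker"
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="api"))
log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("upload accepted")   # emits: {"service": "api", "level": "INFO", ...}
```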

🧪 Testing

Run tests locally:

pytest -v

What is tested:

  • Upload API behavior
  • Job status API (Redis cache + DB fallback)
  • Celery task enqueue (mocked)
  • Redis isolation per test
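The Celery-mocking strategy can be sketched like this. All names here are illustrative (the repo's actual handlers and task names differ): the enqueue call is replaced with a mock, so tests assert that a job was queued without needing a running broker.

```python
import asyncio
from unittest.mock import MagicMock

# Sketch of the test strategy: inject the enqueue function so tests can
# swap in a mock instead of a real Celery .delay(...) call.
def make_upload_handler(enqueue):
    async def upload(filename: str) -> dict:
        job_id = "job-123"            # in the app: generated per upload
        enqueue(job_id)               # in the app: process_document.delay(job_id)
        return {"job_id": job_id, "status": "pending"}
    return upload

def test_upload_enqueues_job():
    enqueue = MagicMock()
    upload = make_upload_handler(enqueue)
    resp = asyncio.run(upload("scan.png"))
    assert resp["status"] == "pending"
    enqueue.assert_called_once_with("job-123")

test_upload_enqueues_job()
print("ok")
```

With pytest-asyncio the same assertions live in `async def` test functions and the event loop plumbing is handled by the plugin rather than `asyncio.run`.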

🐳 Running Locally

1. Create environment file

cp .env.example .env

2. Build and start services

docker-compose up --build

3. The API will be available at

http://localhost:8000

Swagger UI:

http://localhost:8000/docs

🧠 Key Backend Concepts Demonstrated

  • Async API design
  • Background job orchestration
  • Cache-first read strategy
  • Token-bucket rate limiting
  • Structured logging
  • Clean architecture separation
  • Realistic testing strategy

📌 Notes

  • This project uses only free and open-source technologies
  • Designed as a backend-focused portfolio project
  • Authentication, database migrations, and CI are intentionally out of scope