Add async parallel processing for batch operations #12

@ahmed-sekka

Description

Implement parallel processing for batch operations so that large document sets are processed faster on multi-core machines.

Current behavior

  • Documents are processed sequentially, one at a time
  • Processing is single-threaded

Proposed behavior

# Process with 4 parallel workers
ragctl batch ./documents --workers 4 --output ./chunks/
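
A minimal sketch of how the proposed option could be parsed. It assumes an argparse-based command line purely for illustration; the CLI framework ragctl actually uses is not stated in this issue, and `build_parser` and `positive_int` are hypothetical names.

```python
import argparse


def positive_int(value: str) -> int:
    """Parse a worker count and reject values below 1."""
    n = int(value)
    if n < 1:
        raise argparse.ArgumentTypeError("worker count must be >= 1")
    return n


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser layout; only the flags shown in this issue are modelled.
    parser = argparse.ArgumentParser(prog="ragctl")
    sub = parser.add_subparsers(dest="command", required=True)
    batch = sub.add_parser("batch", help="Process a directory of documents")
    batch.add_argument("input_dir")
    batch.add_argument("--output", required=True)
    # Default of 1 keeps today's sequential behavior unless the user opts in.
    batch.add_argument("--workers", "-j", type=positive_int, default=1)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"{args.command}: {args.input_dir} -> {args.output} with {args.workers} worker(s)")
```

Keeping the default at 1 means existing invocations behave exactly as before.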

Expected improvements

  • 3-5x speedup on multi-core systems
  • Better CPU utilization
  • Configurable worker count

Tasks

  • Add --workers / -j option (default: 1)
  • Implement process pool or thread pool executor
  • Handle errors gracefully per-worker
  • Aggregate results correctly
  • Show per-worker progress
  • Add worker count to history/logs
  • Benchmark and document performance gains
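
The pool, error-handling, and aggregation tasks above could be sketched roughly as follows. This is not ragctl's implementation: `run_batch`, `BatchResult`, and the `process_one` callable are hypothetical names, and the progress line is a simple overall counter rather than true per-worker progress.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
from dataclasses import dataclass, field


@dataclass
class BatchResult:
    succeeded: dict = field(default_factory=dict)  # path -> value returned by process_one
    failed: dict = field(default_factory=dict)     # path -> error description


def run_batch(paths, process_one, workers=1, executor_cls=ProcessPoolExecutor):
    """Run process_one over paths in a pool, isolating failures per document.

    executor_cls can be swapped for ThreadPoolExecutor when the work is
    I/O-bound. With a process pool, process_one must be picklable
    (for example, a module-level function).
    """
    result = BatchResult()
    with executor_cls(max_workers=workers) as pool:
        futures = {pool.submit(process_one, p): p for p in paths}
        for done, future in enumerate(as_completed(futures), start=1):
            path = futures[future]
            try:
                result.succeeded[path] = future.result()
            except Exception as exc:
                # One bad document is recorded and the rest of the batch continues.
                result.failed[path] = f"{type(exc).__name__}: {exc}"
            print(f"[{done}/{len(futures)}] {path}")
    return result
```

Because results are collected in the parent as each future completes, aggregation does not need any locking of its own.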

Technical considerations

  • Use concurrent.futures.ProcessPoolExecutor for CPU-bound OCR
  • Use ThreadPoolExecutor for I/O-bound operations
  • Ensure thread-safe history writing
  • Handle keyboard interrupt gracefully

Metadata

Labels

enhancement (New feature or request)
