Google Drive as Data Source (Replace Azure Blob) #20

@W1ndrunn3rr

Description

Replace Azure Blob Storage in data_acquisition.py with Google Drive as the pipeline data source.

  • Authenticate via Google Drive API (service account JSON, stored as secret)
  • List files from Drive folder (support PDF, DOCX, TXT)
  • Download to temp dir, then pass paths to ocr_extraction
  • Add GOOGLE_DRIVE_FOLDER_ID, GOOGLE_SERVICE_ACCOUNT_JSON to .env.example
  • Handle pagination for folders exceeding 100 files
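
The steps above could be sketched roughly as follows, using google-api-python-client and google-auth. This is a minimal sketch under stated assumptions, not the project's implementation: the function names (build_drive_service, list_drive_files, download_files) are hypothetical, GOOGLE_SERVICE_ACCOUNT_JSON is assumed to hold the JSON key contents rather than a file path, and the read-only Drive scope is assumed to be sufficient.

```python
import json
import os
from pathlib import Path

# MIME types for the three formats named in the issue (PDF, DOCX, TXT).
SUPPORTED_MIME_TYPES = (
    "application/pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "text/plain",
)
SCOPES = ["https://www.googleapis.com/auth/drive.readonly"]


def build_drive_service():
    """Build a Drive v3 client from the service account JSON in the environment."""
    # Imported lazily so the listing logic can be exercised without the Google libraries.
    from google.oauth2 import service_account
    from googleapiclient.discovery import build

    # Assumption: the secret holds the JSON key contents, not a path to a key file.
    info = json.loads(os.environ["GOOGLE_SERVICE_ACCOUNT_JSON"])
    creds = service_account.Credentials.from_service_account_info(info, scopes=SCOPES)
    return build("drive", "v3", credentials=creds)


def list_drive_files(service, folder_id, page_size=100):
    """Return metadata for every supported file in the folder, following nextPageToken."""
    mime_filter = " or ".join(f"mimeType = '{m}'" for m in SUPPORTED_MIME_TYPES)
    query = f"'{folder_id}' in parents and trashed = false and ({mime_filter})"
    files, page_token = [], None
    while True:
        response = service.files().list(
            q=query,
            pageSize=page_size,
            fields="nextPageToken, files(id, name, mimeType)",
            pageToken=page_token,
        ).execute()
        files.extend(response.get("files", []))
        page_token = response.get("nextPageToken")
        if not page_token:  # no further pages
            return files


def download_files(service, files, dest_dir):
    """Download each listed file into dest_dir and return the local paths."""
    from googleapiclient.http import MediaIoBaseDownload

    paths = []
    for f in files:
        # Prefix with the file ID because Drive allows duplicate names in one folder.
        target = Path(dest_dir) / f"{f['id']}_{Path(f['name']).name}"
        with open(target, "wb") as fh:
            request = service.files().get_media(fileId=f["id"])
            downloader = MediaIoBaseDownload(fh, request)
            done = False
            while not done:
                _, done = downloader.next_chunk()
        paths.append(target)
    return paths
```

A caller in data_acquisition.py would presumably read GOOGLE_DRIVE_FOLDER_ID from the environment, create a directory with tempfile.TemporaryDirectory(), run list_drive_files and download_files against it, and hand the returned paths to ocr_extraction before the directory is cleaned up. The exact interface of ocr_extraction is not described here, so that hand-off is left out of the sketch.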

Metadata

Assignees

Labels

Data (Data related task)

Type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
