AI-powered Gmail processor that finds relevant emails and attachments, extracts structured trip/billing details, and exports matched results to auditable PDFs.

rokernel/gmail-ai-processor

Gmail AI Processor

Search Gmail, read email and attachment content, extract trip details, and export only matching results as PDFs.

This tool is built for workflows like travel bills, invoices, and receipts where you need a clear filter rule and auditable output.

⚠️ Vibe Coding Disclaimer: This project was built using AI-assisted development ("vibe coding"). While functional, it may contain rough edges, unconventional patterns, or areas that could benefit from refinement. Use at your own discretion and feel free to contribute improvements.

Quick Start

git clone https://github.com/rokernel/gmail-ai-processor.git
cd gmail-ai-processor
uv sync
uv run playwright install chromium
cp .env.example .env

Then add your keys to .env, place credentials.json in the project root, and run:

uv run python -m gmail_processor \
  -s 2026-01-01 -e 2026-01-31 \
  -t "receipts" \
  --has-attachments --dry-run --verbose

What It Does

  • Connects to Gmail with OAuth2 and searches by date range, subject, and attachment presence
  • Downloads full message content and full attachment payloads
  • Parses attachment text (PDF, DOCX, XLSX, images via OCR, nested .eml)
  • Extracts structured fields like Ticket-ID, Valid: ranges, amount, and order/reference numbers
  • Uses MiniMax AI to summarize content and classify matches against your command
  • Exports only matched artifacts to PDF and writes an analysis report

Matching Architecture

The pipeline is hybrid by default for reliability and speed:

  1. Deterministic extraction and rule checks (weekday/time windows, ticket ranges)
  2. AI summary and classification for attachments that need semantic understanding
  3. Final export only for matched items

If you want AI on every unique attachment, use --ai-all-attachments.
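To make the deterministic stage concrete, here is a minimal sketch of a weekday/time-window rule check. It assumes the start of validity arrives as a string in the `DD.MM.YYYY HH:MM` form shown in the analysis.json sample; `in_weekday_window` is a hypothetical name, not a function from this project.

```python
from datetime import datetime, time

WEEKDAYS = range(0, 5)  # Monday=0 ... Friday=4


def in_weekday_window(valid_from: str, start: time = time(8, 0),
                      end: time = time(12, 0)) -> bool:
    """Return True when the ticket starts on Mon-Fri inside [start, end]."""
    # Assumed input format, e.g. "15.01.2026 08:30"
    moment = datetime.strptime(valid_from, "%d.%m.%Y %H:%M")
    return moment.weekday() in WEEKDAYS and start <= moment.time() <= end


print(in_weekday_window("15.01.2026 08:30"))  # Thursday morning -> True
print(in_weekday_window("17.01.2026 08:30"))  # Saturday -> False
```

A check like this needs no model call, which is one plausible reason the default mode is faster than --ai-all-attachments.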


Prerequisites

  • Python 3.10+
  • Google Cloud project with Gmail API enabled
  • MiniMax API key
  • Playwright Chromium runtime

Optional:

  • Tesseract OCR for better image/scanned PDF extraction

Setup

1) Clone and Install

git clone https://github.com/rokernel/gmail-ai-processor.git
cd gmail-ai-processor
uv sync
uv run playwright install chromium

2) Google OAuth Setup

  1. Go to Google Cloud Console
  2. Create or select a project
  3. Enable the Gmail API
  4. Create OAuth client credentials for a Desktop app
  5. Download the credentials file
  6. Rename it to credentials.json and place it in the project root

Security Note: The credentials.json file is in .gitignore and will NOT be committed to GitHub.

3) Configure Environment Variables

Copy the example environment file and add your API keys:

cp .env.example .env

Edit .env with your actual values:

MINIMAX_API_KEY=your_minimax_api_key_here
MINIMAX_BASE_URL=https://api.minimax.io/v1
MINIMAX_MODEL=MiniMax-M2.5

# Optional: OpenAI fallback
OPENAI_API_KEY=your_openai_key_here
OPENAI_MODEL=gpt-4o-mini

# Optional: Override default paths
GMAIL_CREDENTIALS_PATH=./credentials.json
GMAIL_TOKEN_PATH=./token.json
DEFAULT_OUTPUT_DIR=./exports
AI_TIMEOUT=120

Security Note: The .env file is in .gitignore and will NOT be committed to GitHub.
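For orientation, the sketch below shows one way these variables could be read into a typed settings object. `Settings` and `load_settings` are hypothetical names; the fallback values simply mirror the example file above and are not confirmed defaults of the tool. The sketch also assumes the variables are already in the process environment (for example, exported by your shell or loaded by a dotenv helper).

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    minimax_api_key: str
    minimax_base_url: str
    minimax_model: str
    output_dir: str
    ai_timeout: int


def load_settings(env=os.environ) -> Settings:
    """Read configuration from a mapping of environment variables."""
    key = env.get("MINIMAX_API_KEY", "")
    if not key:
        raise RuntimeError("MINIMAX_API_KEY is not set")
    return Settings(
        minimax_api_key=key,
        # Fallbacks copied from the example .env; assumed, not confirmed defaults.
        minimax_base_url=env.get("MINIMAX_BASE_URL", "https://api.minimax.io/v1"),
        minimax_model=env.get("MINIMAX_MODEL", "MiniMax-M2.5"),
        output_dir=env.get("DEFAULT_OUTPUT_DIR", "./exports"),
        ai_timeout=int(env.get("AI_TIMEOUT", "120")),
    )
```

Failing fast on a missing key gives a clearer error than a rejected API request later in the run.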

4) First Run (OAuth Authorization)

Run once to trigger the OAuth flow and create your token:

uv run python -m gmail_processor --help

Or start processing:

uv run python -m gmail_processor -s 2026-01-01 -e 2026-01-31 -t "test"

A browser window will open asking you to authenticate with Google. After authorization, a token.json file will be created for future runs.

Security Note: The token.json file is in .gitignore and will NOT be committed to GitHub.


CLI Options

| Option | Short | Description |
|---|---|---|
| --start-date | -s | Start date (YYYY-MM-DD), required |
| --end-date | -e | End date (YYYY-MM-DD), required |
| --topic | -t | Topic string for AI matching, required |
| --subject | | Filter by Gmail subject line |
| --ai-command | | Custom natural-language extraction instruction |
| --has-attachments | | Only process emails with attachments |
| --ai-all-attachments | | Run AI on every attachment (slower, more thorough) |
| --max-results | | Maximum emails to process (default: 100) |
| --output-dir | -o | Output directory for exports |
| --dry-run | | Analyze only, do not save files |
| --verbose | -v | Enable verbose logging |

Usage Examples

1) Basic Search

uv run python -m gmail_processor \
  -s 2026-01-01 -e 2026-01-31 \
  -t "invoice" \
  --subject "billing"

2) Travel Receipts with Attachments

uv run python -m gmail_processor \
  -s 2026-01-01 -e 2026-02-26 \
  -t "receipts" \
  --has-attachments

3) SBB/CFF Trip Bills (Weekday Morning Filter)

uv run python -m gmail_processor \
  -s 2026-01-01 -e 2026-02-26 \
  -t "bills" \
  --subject "emails" \
  --has-attachments \
  --ai-command "open for SBB CFF bills and export the ones that are from Monday to Friday from 8AM to 12AM" \
  --max-results 300

4) Deep AI Analysis on All Attachments

uv run python -m gmail_processor \
  -s 2026-01-01 -e 2026-02-26 \
  -t "bills" \
  --subject "emails" \
  --has-attachments \
  --ai-all-attachments \
  --ai-command "find all SBB CFF trips and summarize each ticket"

5) Dry Run (Preview Only)

uv run python -m gmail_processor \
  -s 2026-01-01 -e 2026-02-26 \
  -t "bills" \
  --has-attachments \
  --ai-command "extract trip details and list ticket ids" \
  --dry-run --verbose

6) High-Volume Processing

uv run python -m gmail_processor \
  -s 2025-01-01 -e 2026-02-26 \
  -t "travel" \
  --subject "SBB" \
  --has-attachments \
  --max-results 1000

Demo

Terminal run (--dry-run)

See a full sample run output in docs/demo/cli-dry-run.txt.

uv run python -m gmail_processor \
  -s 2026-01-01 -e 2026-01-31 \
  -t "receipts" \
  --has-attachments \
  --dry-run --verbose

Export structure preview

See a sample export tree in docs/demo/output-tree.txt.


Output Structure

When matches are found, the following structure is created:

exports/
  <topic>/
    <date>_<subject>_<id>/
      email.eml          # Original email
      email.pdf          # Email rendered as PDF
      analysis.json      # AI analysis results
      <attachment-1>.pdf # Converted/saved attachment
      <attachment-2>.pdf

Analysis JSON Format

The analysis.json file contains:

{
  "command": "the AI command used",
  "email_subject": "Email subject",
  "email_sender": "sender@example.com",
  "email_date": "2026-01-15",
  "attachments_analyzed": [
    {
      "index": 1,
      "filename": "ticket.pdf",
      "matched": true,
      "confidence": 0.95,
      "summary": "Trip from Zurich to Geneva",
      "trip_hours": ["08:30", "09:15"],
      "trip_details": {
        "ticket_id": "12345",
        "valid_from": "15.01.2026 08:30",
        "valid_to": "15.01.2026 09:15",
        "amount": "25.00 CHF"
      },
      "reasoning": "Matches weekday morning criteria",
      "needs_review": false
    }
  ]
}
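Because the report is plain JSON, it is easy to post-process. The sketch below walks an export directory laid out as described above and lists attachments that were matched, are not flagged for review, and meet a confidence threshold. The function names and the 0.8 threshold are illustrative assumptions, not part of the tool.

```python
import json
from pathlib import Path


def matched_attachments(analysis: dict, min_confidence: float = 0.8) -> list[dict]:
    """Keep attachments that matched, need no review, and meet the threshold."""
    return [
        item for item in analysis.get("attachments_analyzed", [])
        if item.get("matched")
        and item.get("confidence", 0.0) >= min_confidence
        and not item.get("needs_review", False)
    ]


def collect_matches(root: str = "exports") -> dict[str, list[str]]:
    """Map each export folder name to the filenames of its confident matches."""
    results = {}
    # Layout assumed from the tree above: <root>/<topic>/<folder>/analysis.json
    for path in sorted(Path(root).glob("*/*/analysis.json")):
        analysis = json.loads(path.read_text(encoding="utf-8"))
        names = [item["filename"] for item in matched_attachments(analysis)]
        if names:
            results[path.parent.name] = names
    return results
```

Calling `collect_matches("exports")` after a run gives a quick overview of which folders contain matches worth a manual check.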

Time Window Interpretation

The AI understands natural language time constraints:

  • from 8AM to 12AM → Morning window (08:00 to 12:00)
  • from 8AM to 6PM → Full workday (08:00 to 18:00)
  • weekdays → Monday through Friday
  • weekends → Saturday and Sunday

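Note: Strictly speaking, 12AM is midnight, but the tool reads "to 12AM" as the end of a morning window (noon), as in the first example above. If you mean midnight, explicitly say "to midnight" or "to 11:59PM".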
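The convention above can be written down as a small normalization rule. The following is only a sketch of that rule, not the project's actual parser (the interpretation is done by the AI model); `parse_clock` and its `is_window_end` flag are hypothetical.

```python
import re
from datetime import time

_TOKEN = re.compile(r"^(\d{1,2})(?::(\d{2}))?\s*(AM|PM)$", re.IGNORECASE)


def parse_clock(token: str, *, is_window_end: bool = False) -> time:
    """Convert a 12-hour token such as '8AM' or '11:59PM' to a 24-hour time."""
    match = _TOKEN.match(token.strip())
    if not match:
        raise ValueError(f"unrecognized time: {token!r}")
    hour, minute, suffix = int(match[1]), int(match[2] or 0), match[3].upper()
    if not 1 <= hour <= 12 or minute > 59:
        raise ValueError(f"out-of-range time: {token!r}")
    if hour == 12:
        # Convention from the list above: "to 12AM" closes a morning window at noon.
        hour = 12 if (suffix == "PM" or is_window_end) else 0
    elif suffix == "PM":
        hour += 12
    return time(hour, minute)


print(parse_clock("8AM"), parse_clock("12AM", is_window_end=True))
# 08:00:00 12:00:00
```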


Troubleshooting

No Messages Found

  • Run with --verbose to see the Gmail query being used
  • Try adding or removing --subject filter
  • Verify your date range is correct

Wrong Gmail Account

If you authenticated with the wrong account:

rm token.json
# Re-run and authenticate with the correct account

Playwright Browser Missing

uv run playwright install chromium

AI Model/Authentication Issues

  • Verify MINIMAX_API_KEY is set correctly in .env
  • Check MINIMAX_BASE_URL matches your provider
  • Try setting a different MINIMAX_MODEL if the default isn't available

Slow Processing

  • Use --subject to filter more specifically
  • Lower --max-results for smaller batches
  • Avoid --ai-all-attachments unless necessary (default mode is faster)

Development

Running Tests

uv run pytest

Code Quality Checks

uv run ruff check .
uv run black .
uv run mypy src/

Building

uv build

Security Considerations

⚠️ Important Security Notes:

  1. Never commit credential files: credentials.json, token.json, and .env are in .gitignore by default
  2. Rotate exposed credentials: If you accidentally committed credentials, revoke them immediately in Google Cloud Console
  3. Token security: The token.json file contains OAuth tokens - treat it like a password
  4. API keys: Keep your MiniMax/OpenAI API keys in .env only
  5. Scope limitation: This app only requests gmail.readonly scope - it cannot send emails or modify your inbox

Credential Files Explained

| File | Purpose | Should Commit? |
|---|---|---|
| credentials.json | Google OAuth client credentials | ❌ NO |
| token.json | OAuth access/refresh tokens | ❌ NO |
| .env | API keys and configuration | ❌ NO |
| .env.example | Template for .env | ✅ YES |
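If you fork or restructure the repository, a quick check that the three secret files are still ignored can prevent accidents. The sketch below is a hypothetical helper that only looks for exact entries in the .gitignore text; it does not evaluate glob patterns or negations, so treat a clean result as a hint rather than a guarantee.

```python
SECRET_FILES = ("credentials.json", "token.json", ".env")


def missing_from_gitignore(gitignore_text: str) -> list[str]:
    """Return secret file names that have no exact entry in .gitignore."""
    patterns = {
        line.strip().lstrip("/")
        for line in gitignore_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    }
    return [name for name in SECRET_FILES if name not in patterns]


# Example: the commented-out ".env" line does not count as an entry.
print(missing_from_gitignore("credentials.json\n/token.json\n# .env\n"))
# ['.env']
```

In practice you would pass the contents of the real file, for example `missing_from_gitignore(open(".gitignore").read())`.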

Architecture Overview

┌─────────────┐     ┌──────────────┐     ┌─────────────────┐
│   CLI Input │────▶│ Gmail Client │────▶│  Search Emails  │
└─────────────┘     └──────────────┘     └─────────────────┘
                                                  │
                    ┌──────────────┐              ▼
                    │ Export PDFs  │◀────┌──────────────────┐
                    └──────────────┘     │ Process Attachm. │
                           ▲             └──────────────────┘
                           │                      │
                    ┌──────┴──────┐              ▼
                    │   Storage   │     ┌──────────────────┐
                    └─────────────┘◀────│   AI Analyzer    │
                                        └──────────────────┘

License

MIT License - See LICENSE file for details


Contributing

This is a vibe-coded project - contributions are welcome! Feel free to:

  • Open issues for bugs or feature requests
  • Submit pull requests with improvements
  • Suggest better patterns or refactoring

Acknowledgments

Built with:
