Merged
1 change: 1 addition & 0 deletions .idea/FastFetchBot.iml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

12 changes: 9 additions & 3 deletions CLAUDE.md
@@ -2,7 +2,7 @@

## Project Overview

-FastFetchBot is a social media content fetching service built as a **UV workspace monorepo** with three microservices: a FastAPI server (API), a Telegram Bot client, and a Celery worker for file operations. It scrapes and archives content from various social media platforms including Twitter, Weibo, Xiaohongshu, Reddit, Bluesky, Instagram, Zhihu, Douban, YouTube, and Bilibili.
+FastFetchBot is a social media content fetching service built as a **UV workspace monorepo** with four microservices: a FastAPI server (API), a Telegram Bot client, a Celery worker for file operations, and an ARQ-based async worker for off-path scraping. It scrapes and archives content from various social media platforms including Twitter, Weibo, Xiaohongshu, Reddit, Bluesky, Instagram, Zhihu, Douban, YouTube, and Bilibili.

## Architecture

@@ -23,11 +23,13 @@ FastFetchBot/
│ │ ├── twitter/ bluesky/ weibo/ xiaohongshu/ reddit/
│ │ ├── instagram/ zhihu/ douban/ threads/ wechat/
│ │ └── general/ # Firecrawl + Zyte generic scraping
+│ │ ├── file_export/ # Async Celery task wrappers (PDF, video, audio transcription)
│ │ └── telegraph/ # Telegraph content publishing
-├── packages/file-export/ # fastfetchbot-file-export: video download, PDF export, transcription
+├── packages/file-export/ # fastfetchbot-file-export: synchronous Celery worker jobs (yt-dlp, WeasyPrint, OpenAI)
├── apps/api/ # FastAPI server: enriched service, routing, storage
├── apps/telegram-bot/ # Telegram Bot: webhook/polling, message handling
-├── apps/worker/ # Celery worker: async file operations (video, PDF, audio)
+├── apps/worker/ # Celery worker: sync file operations (video, PDF, audio)
+├── apps/async-worker/ # ARQ async worker: off-path scraping + enrichment
├── pyproject.toml # Root workspace configuration
└── uv.lock # Lockfile for the entire workspace
```
@@ -37,6 +39,7 @@ FastFetchBot/
| **API Server** (`apps/api/src/`) | `fastfetchbot-api` | 10450 | `gunicorn -k uvicorn.workers.UvicornWorker src.main:app --preload` |
| **Telegram Bot** (`apps/telegram-bot/core/`) | `fastfetchbot-telegram-bot` | 10451 | `python -m core.main` |
| **Worker** (`apps/worker/worker_core/`) | `fastfetchbot-worker` | — | `celery -A worker_core.main:app worker --loglevel=info --concurrency=2` |
+| **Async Worker** (`apps/async-worker/async_worker/`) | `fastfetchbot-async-worker` | — | `arq async_worker.main.WorkerSettings` |
| **Shared Library** (`packages/shared/fastfetchbot_shared/`) | `fastfetchbot-shared` | — | — |
| **File Export Library** (`packages/file-export/fastfetchbot_file_export/`) | `fastfetchbot-file-export` | — | — |

@@ -74,6 +77,7 @@ The Telegram Bot communicates with the API server over HTTP (`API_SERVER_URL`).
- **`templates/`** — 13 Jinja2 templates for platform-specific output formatting (bundled via `__file__`-relative paths)
- **Platform modules**: `twitter/`, `bluesky/`, `weibo/`, `xiaohongshu/`, `reddit/`, `instagram/`, `zhihu/`, `douban/`, `threads/`, `wechat/`, `general/` (Firecrawl + Zyte)
- **`services/telegraph/`** — Telegraph content publishing (creates telegra.ph pages from scraped content)
+- **`services/file_export/`** — Async Celery task wrappers for PDF export, video download, and audio transcription. These accept `celery_app` and `timeout` as constructor parameters (dependency injection) so any app can use them with its own Celery client

The shared scrapers library can be used standalone without the API server:
```python
@@ -179,6 +183,8 @@ GitHub Actions (`.github/workflows/ci.yml`) builds and pushes all three images o
7. Add any new pip dependencies to `packages/shared/pyproject.toml` under `[project.optional-dependencies] scrapers`

### Key Conventions
+- **`packages/shared/` (`fastfetchbot-shared`)** is for shared async logic — scrapers, templates, Telegraph, and async Celery task wrappers (file_export). Most code here is async and reusable across apps
+- **`packages/file-export/` (`fastfetchbot-file-export`)** is exclusively for synchronous Celery worker jobs — the heavy I/O operations that run inside the Celery worker process (yt-dlp video download, WeasyPrint PDF generation, OpenAI audio transcription). Apps never import this package directly; they use the async wrappers in `fastfetchbot_shared.services.file_export` which submit tasks to the Celery worker
- **Scrapers, templates, and Telegraph live in `packages/shared/`** — they are framework-agnostic and reusable
- Scraper config (platform credentials, Firecrawl/Zyte settings) lives in `fastfetchbot_shared.services.scrapers.config`, **not** in `apps/api/src/config.py`
- API-only config (BASE_URL, MongoDB, Celery, AWS, Inoreader) stays in `apps/api/src/config.py`
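
The wrapper convention described in the Key Conventions above (async wrappers that receive `celery_app` and `timeout` through the constructor) can be sketched roughly as follows. This is a minimal illustration, not the code in `fastfetchbot_shared`: `FakeCeleryApp` and `FakeResult` are hypothetical stand-ins so the snippet runs without a broker, and the default timeout of 600 is an assumption. The task name `file_export.transcribe` and the `transcript` response key come from the API code shown in this PR.

```python
import asyncio


class FakeResult:
    """Hypothetical stand-in for a Celery AsyncResult; only get() is modelled."""

    def __init__(self, payload: dict):
        self._payload = payload

    def get(self, timeout=None):
        # A real AsyncResult.get blocks until the worker finishes or the timeout expires.
        return self._payload


class FakeCeleryApp:
    """Hypothetical stand-in for a Celery client; records what was submitted."""

    def __init__(self):
        self.sent = []

    def send_task(self, name, kwargs=None):
        self.sent.append((name, kwargs))
        return FakeResult({"transcript": f"text of {kwargs['audio_file']}"})


class AudioTranscribe:
    """Async wrapper that gets its Celery client and timeout injected."""

    def __init__(self, audio_file: str, celery_app, timeout: int = 600):
        self.audio_file = audio_file
        self.celery_app = celery_app
        self.timeout = timeout  # 600 is an illustrative default, not the project's

    async def transcribe(self) -> str:
        result = self.celery_app.send_task(
            "file_export.transcribe", kwargs={"audio_file": self.audio_file}
        )
        # result.get blocks, so run it in a worker thread to keep the event loop free.
        response = await asyncio.to_thread(result.get, timeout=int(self.timeout))
        return response["transcript"]


app = FakeCeleryApp()
transcript = asyncio.run(
    AudioTranscribe("clip.mp3", celery_app=app, timeout=5).transcribe()
)
print(transcript)
```

Because the wrapper never imports a concrete Celery client, each app can pass in its own client and timeout, and tests can pass a fake like the one above.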
35 changes: 11 additions & 24 deletions apps/api/src/services/file_export/audio_transcribe/__init__.py
@@ -1,29 +1,16 @@
-import asyncio
+"""API-layer audio transcription — wraps the shared AudioTranscribe with API config."""

-from src.config import DOWNLOAD_VIDEO_TIMEOUT
+from fastfetchbot_shared.services.file_export.audio_transcribe import AudioTranscribe as BaseAudioTranscribe
 from src.services.celery_client import celery_app
-from fastfetchbot_shared.utils.logger import logger
+from src.config import DOWNLOAD_VIDEO_TIMEOUT


-class AudioTranscribe:
-    def __init__(self, audio_file: str):
-        self.audio_file = audio_file
+class AudioTranscribe(BaseAudioTranscribe):
+    """API AudioTranscribe that injects the API's Celery app and timeout."""

-    async def transcribe(self):
-        return await self._get_audio_text(self.audio_file)
-
-    @staticmethod
-    async def _get_audio_text(audio_file: str):
-        logger.info(f"submitting transcribe task: {audio_file}")
-        result = celery_app.send_task("file_export.transcribe", kwargs={
-            "audio_file": audio_file,
-        })
-        try:
-            response = await asyncio.to_thread(result.get, timeout=int(DOWNLOAD_VIDEO_TIMEOUT))
-            return response["transcript"]
-        except Exception:
-            logger.exception(
-                f"file_export.transcribe task failed: audio_file={audio_file}, "
-                f"timeout={DOWNLOAD_VIDEO_TIMEOUT}"
-            )
-            raise
+    def __init__(self, audio_file: str):
+        super().__init__(
+            audio_file=audio_file,
+            celery_app=celery_app,
+            timeout=DOWNLOAD_VIDEO_TIMEOUT,
+        )
53 changes: 14 additions & 39 deletions apps/api/src/services/file_export/document_export/pdf_export.py
@@ -1,10 +1,10 @@
-import asyncio
-import uuid
+"""API-layer PDF export — extends the shared PdfExport with S3 upload support."""
+
 from pathlib import Path

 import aiofiles.os
-from bs4 import BeautifulSoup

+from fastfetchbot_shared.services.file_export.pdf_export import PdfExport as BasePdfExport, wrap_html_string
 from src.config import DOWNLOAD_VIDEO_TIMEOUT, AWS_STORAGE_ON
 from src.services.celery_client import celery_app
 from src.services.amazon.s3 import upload as upload_to_s3
@@ -19,48 +19,23 @@ async def upload_file_to_s3(output_filename):
)


-class PdfExport:
+class PdfExport(BasePdfExport):
+    """API PDF export that adds optional S3 upload after Celery PDF generation."""
+
     def __init__(self, title: str, html_string: str = None):
⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if file exists and examine the context around line 25
if [ -f "apps/api/src/services/file_export/document_export/pdf_export.py" ]; then
  echo "=== File found. Examining lines 20-30 ==="
  cat -n apps/api/src/services/file_export/document_export/pdf_export.py | sed -n '20,30p'
else
  echo "File not found at expected path"
  find . -name "pdf_export.py" -type f
fi

Repository: aturret/FastFetchBot

Length of output: 528


🏁 Script executed:

#!/bin/bash
# Check Python version requirement to understand if str | None is supported
if [ -f "setup.py" ] || [ -f "pyproject.toml" ] || [ -f ".python-version" ]; then
  echo "=== Checking Python version requirements ==="
  [ -f "setup.py" ] && head -30 setup.py | grep -i python
  [ -f "pyproject.toml" ] && cat pyproject.toml | grep -A 5 "python"
  [ -f ".python-version" ] && cat .python-version
else
  echo "Checking for version hints in common config files"
  ls -la | grep -E "(setup|requirements|pyproject|\.python)"
fi

Repository: aturret/FastFetchBot

Length of output: 851


🏁 Script executed:

#!/bin/bash
# Check imports and overall structure of the file
if [ -f "apps/api/src/services/file_export/document_export/pdf_export.py" ]; then
  echo "=== First 50 lines of file ==="
  head -50 apps/api/src/services/file_export/document_export/pdf_export.py
fi

Repository: aturret/FastFetchBot

Length of output: 1420


🏁 Script executed:

#!/bin/bash
# Check the parent class BasePdfExport to ensure signature consistency
find . -name "pdf_export.py" -type f | xargs grep -l "class.*BasePdfExport\|class PdfExport" | grep -v "__pycache__"

Repository: aturret/FastFetchBot

Length of output: 202


🏁 Script executed:

#!/bin/bash
# Search for other potential implicit Optional issues in this file
if [ -f "apps/api/src/services/file_export/document_export/pdf_export.py" ]; then
  echo "=== Full file content ==="
  cat apps/api/src/services/file_export/document_export/pdf_export.py
fi

Repository: aturret/FastFetchBot

Length of output: 1415


🏁 Script executed:

#!/bin/bash
# Check the parent class BasePdfExport signature
if [ -f "packages/shared/fastfetchbot_shared/services/file_export/pdf_export.py" ]; then
  echo "=== BasePdfExport class definition ==="
  cat packages/shared/fastfetchbot_shared/services/file_export/pdf_export.py
fi

Repository: aturret/FastFetchBot

Length of output: 2745


Use an explicit Optional type annotation for `html_string`.

Line 25 has `html_string: str = None`, an implicit Optional that Ruff flags as RUF013. Since the project targets Python 3.12, use `str | None` (PEP 604 union syntax) instead.

✏️ Suggested change
-    def __init__(self, title: str, html_string: str = None):
+    def __init__(self, title: str, html_string: str | None = None):
🧰 Tools
🪛 Ruff (0.15.6)

[warning] 25-25: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/services/file_export/document_export/pdf_export.py` at line 25,
the `__init__` method of the `PdfExport` class annotates a nullable parameter as
`html_string: str = None`, which triggers RUF013; change the annotation to PEP 604
union syntax, `html_string: str | None = None`, so the type is explicit and valid
on Python 3.12.

-        self.title = title
-        self.html_string = html_string
+        super().__init__(
+            title=title,
+            html_string=html_string,
+            celery_app=celery_app,
+            timeout=DOWNLOAD_VIDEO_TIMEOUT,
+        )

     async def export(self) -> str:
-        html_string = self.wrap_html_string(self.html_string)
-        output_filename = f"{self.title}-{uuid.uuid4()}.pdf"
-
-        logger.info(f"submitting pdf export task: {output_filename}")
-        result = celery_app.send_task("file_export.pdf_export", kwargs={
-            "html_string": html_string,
-            "output_filename": output_filename,
-        })
-        try:
-            response = await asyncio.to_thread(result.get, timeout=int(DOWNLOAD_VIDEO_TIMEOUT))
-            output_filename = response["output_filename"]
-        except Exception:
-            logger.exception(
-                f"file_export.pdf_export task failed: output_filename={output_filename}, "
-                f"timeout={DOWNLOAD_VIDEO_TIMEOUT}"
-            )
-            raise
-        logger.info(f"pdf export success: {output_filename}")
+        output_filename = await super().export()

         if AWS_STORAGE_ON:
             local_filename = output_filename
             output_filename = await upload_file_to_s3(Path(output_filename))
             await aiofiles.os.remove(local_filename)
         return output_filename
-
-    @staticmethod
-    def wrap_html_string(html_string: str) -> str:
-        soup = BeautifulSoup(
-            '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
-            '<meta charset="UTF-8"></head><body></body></html>',
-            "html.parser",
-        )
-        soup.body.append(BeautifulSoup(html_string, "html.parser"))
-        for tag in soup.find_all(True):
-            if "style" in tag.attrs:
-                del tag["style"]
-        for style_tag in soup.find_all("style"):
-            style_tag.decompose()
-        return soup.prettify()
+        return output_filename
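
The resulting shape of `export` (delegate to the base class, then optionally upload and delete the local copy) can be illustrated in isolation. This sketch rests on several assumptions: `BaseExport` merely writes a placeholder file instead of submitting a Celery task, `fake_upload` is a hypothetical stand-in for the S3 helper and returns a made-up URL, and `os.remove` run in a thread replaces `aiofiles.os.remove` so that only the standard library is needed.

```python
import asyncio
import os
import tempfile
from pathlib import Path


class BaseExport:
    """Hypothetical base: pretends a worker produced a PDF and returns its path."""

    def __init__(self, title: str):
        self.title = title

    async def export(self) -> str:
        fd, path = tempfile.mkstemp(prefix=f"{self.title}-", suffix=".pdf")
        with os.fdopen(fd, "wb") as handle:
            handle.write(b"%PDF-placeholder")
        return path


async def fake_upload(path: Path) -> str:
    """Hypothetical uploader; a real one would push the file to object storage."""
    return f"https://storage.example/{path.name}"


class UploadingExport(BaseExport):
    """Adds an optional upload step after the base export."""

    def __init__(self, title: str, storage_on: bool):
        super().__init__(title)
        self.storage_on = storage_on

    async def export(self) -> str:
        output_filename = await super().export()
        if self.storage_on:
            local_filename = output_filename
            output_filename = await fake_upload(Path(output_filename))
            # Remove the local copy once the upload has returned.
            await asyncio.to_thread(os.remove, local_filename)
        return output_filename


remote = asyncio.run(UploadingExport("report", storage_on=True).export())
local = asyncio.run(UploadingExport("report", storage_on=False).export())
print(remote, local)
```

With storage enabled the caller receives a remote location and no local file remains; with storage disabled the local path is returned unchanged, so callers handle a single string result in both cases.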