
Commit eddea14

Merge commit eddea14 (2 parents: e6c8e8d + 3e1366b)

50 files changed

Lines changed: 3191 additions & 347 deletions


.idea/FastFetchBot.iml

Lines changed: 1 addition & 0 deletions

.idea/runConfigurations/fullstack_polling.xml renamed to .idea/runConfigurations/fullstack_polling_api.xml

Lines changed: 2 additions & 2 deletions

CLAUDE.md

Lines changed: 9 additions & 3 deletions
@@ -2,7 +2,7 @@
 
 ## Project Overview
 
-FastFetchBot is a social media content fetching service built as a **UV workspace monorepo** with three microservices: a FastAPI server (API), a Telegram Bot client, and a Celery worker for file operations. It scrapes and archives content from various social media platforms including Twitter, Weibo, Xiaohongshu, Reddit, Bluesky, Instagram, Zhihu, Douban, YouTube, and Bilibili.
+FastFetchBot is a social media content fetching service built as a **UV workspace monorepo** with four microservices: a FastAPI server (API), a Telegram Bot client, a Celery worker for file operations, and an ARQ-based async worker for off-path scraping. It scrapes and archives content from various social media platforms including Twitter, Weibo, Xiaohongshu, Reddit, Bluesky, Instagram, Zhihu, Douban, YouTube, and Bilibili.
 
 ## Architecture
 
@@ -23,11 +23,13 @@ FastFetchBot/
 │ │ ├── twitter/ bluesky/ weibo/ xiaohongshu/ reddit/
 │ │ ├── instagram/ zhihu/ douban/ threads/ wechat/
 │ │ └── general/ # Firecrawl + Zyte generic scraping
+│ ├── file_export/ # Async Celery task wrappers (PDF, video, audio transcription)
 │ └── telegraph/ # Telegraph content publishing
-├── packages/file-export/ # fastfetchbot-file-export: video download, PDF export, transcription
+├── packages/file-export/ # fastfetchbot-file-export: synchronous Celery worker jobs (yt-dlp, WeasyPrint, OpenAI)
 ├── apps/api/ # FastAPI server: enriched service, routing, storage
 ├── apps/telegram-bot/ # Telegram Bot: webhook/polling, message handling
-├── apps/worker/ # Celery worker: async file operations (video, PDF, audio)
+├── apps/worker/ # Celery worker: sync file operations (video, PDF, audio)
+├── apps/async-worker/ # ARQ async worker: off-path scraping + enrichment
 ├── pyproject.toml # Root workspace configuration
 └── uv.lock # Lockfile for the entire workspace
 ```
@@ -37,6 +39,7 @@ FastFetchBot/
 | **API Server** (`apps/api/src/`) | `fastfetchbot-api` | 10450 | `gunicorn -k uvicorn.workers.UvicornWorker src.main:app --preload` |
 | **Telegram Bot** (`apps/telegram-bot/core/`) | `fastfetchbot-telegram-bot` | 10451 | `python -m core.main` |
 | **Worker** (`apps/worker/worker_core/`) | `fastfetchbot-worker` || `celery -A worker_core.main:app worker --loglevel=info --concurrency=2` |
+| **Async Worker** (`apps/async-worker/async_worker/`) | `fastfetchbot-async-worker` || `arq async_worker.main.WorkerSettings` |
 | **Shared Library** (`packages/shared/fastfetchbot_shared/`) | `fastfetchbot-shared` |||
 | **File Export Library** (`packages/file-export/fastfetchbot_file_export/`) | `fastfetchbot-file-export` |||
 
@@ -74,6 +77,7 @@ The Telegram Bot communicates with the API server over HTTP (`API_SERVER_URL`).
 - **`templates/`** — 13 Jinja2 templates for platform-specific output formatting (bundled via `__file__`-relative paths)
 - **Platform modules**: `twitter/`, `bluesky/`, `weibo/`, `xiaohongshu/`, `reddit/`, `instagram/`, `zhihu/`, `douban/`, `threads/`, `wechat/`, `general/` (Firecrawl + Zyte)
 - **`services/telegraph/`** — Telegraph content publishing (creates telegra.ph pages from scraped content)
+- **`services/file_export/`** — Async Celery task wrappers for PDF export, video download, and audio transcription. These accept `celery_app` and `timeout` as constructor parameters (dependency injection) so any app can use them with its own Celery client
 
 The shared scrapers library can be used standalone without the API server:
 ```python
@@ -180,6 +184,8 @@ GitHub Actions (`.github/workflows/ci.yml`) builds and pushes all four images on
 7. Add any new pip dependencies to `packages/shared/pyproject.toml` under `[project.optional-dependencies] scrapers`
 
 ### Key Conventions
+- **`packages/shared/` (`fastfetchbot-shared`)** is for shared async logic — scrapers, templates, Telegraph, and async Celery task wrappers (file_export). Most code here is async and reusable across apps
+- **`packages/file-export/` (`fastfetchbot-file-export`)** is exclusively for synchronous Celery worker jobs — the heavy I/O operations that run inside the Celery worker process (yt-dlp video download, WeasyPrint PDF generation, OpenAI audio transcription). Apps never import this package directly; they use the async wrappers in `fastfetchbot_shared.services.file_export` which submit tasks to the Celery worker
 - **Scrapers, templates, and Telegraph live in `packages/shared/`** — they are framework-agnostic and reusable
 - Scraper config (platform credentials, Firecrawl/Zyte settings) lives in `fastfetchbot_shared.services.scrapers.config`, **not** in `apps/api/src/config.py`
 - API-only config (BASE_URL, MongoDB, Celery, AWS, Inoreader) stays in `apps/api/src/config.py`
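The constructor-injection convention documented above (shared async wrappers taking `celery_app` and `timeout` as parameters) can be sketched with plain classes and a stub client. All names besides the pattern itself are illustrative, not the real shared classes:

```python
# Illustrative sketch of the injection convention; StubCeleryApp and the
# attribute names are assumptions for demonstration only.
class BaseAudioTranscribe:
    """Shared async wrapper: the app supplies its own Celery client and timeout."""

    def __init__(self, audio_file: str, celery_app, timeout: int):
        self.audio_file = audio_file
        self.celery_app = celery_app  # injected, so any app can reuse the wrapper
        self.timeout = timeout


class StubCeleryApp:
    """Hypothetical stand-in for an app's Celery client."""

    def send_task(self, name, kwargs=None):
        raise NotImplementedError  # a real client would return an AsyncResult


class ApiAudioTranscribe(BaseAudioTranscribe):
    """App-layer subclass injecting app-specific config, as the commit's diffs do."""

    def __init__(self, audio_file: str):
        super().__init__(audio_file=audio_file, celery_app=StubCeleryApp(), timeout=60)
```

The payoff is that the wrapper never imports any app's config module, so it can live in `packages/shared/` without circular dependencies.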
Lines changed: 11 additions & 24 deletions
@@ -1,29 +1,16 @@
-import asyncio
+"""API-layer audio transcription — wraps the shared AudioTranscribe with API config."""
 
-from src.config import DOWNLOAD_VIDEO_TIMEOUT
+from fastfetchbot_shared.services.file_export.audio_transcribe import AudioTranscribe as BaseAudioTranscribe
 from src.services.celery_client import celery_app
-from fastfetchbot_shared.utils.logger import logger
+from src.config import DOWNLOAD_VIDEO_TIMEOUT
 
 
-class AudioTranscribe:
-    def __init__(self, audio_file: str):
-        self.audio_file = audio_file
+class AudioTranscribe(BaseAudioTranscribe):
+    """API AudioTranscribe that injects the API's Celery app and timeout."""
 
-    async def transcribe(self):
-        return await self._get_audio_text(self.audio_file)
-
-    @staticmethod
-    async def _get_audio_text(audio_file: str):
-        logger.info(f"submitting transcribe task: {audio_file}")
-        result = celery_app.send_task("file_export.transcribe", kwargs={
-            "audio_file": audio_file,
-        })
-        try:
-            response = await asyncio.to_thread(result.get, timeout=int(DOWNLOAD_VIDEO_TIMEOUT))
-            return response["transcript"]
-        except Exception:
-            logger.exception(
-                f"file_export.transcribe task failed: audio_file={audio_file}, "
-                f"timeout={DOWNLOAD_VIDEO_TIMEOUT}"
-            )
-            raise
+    def __init__(self, audio_file: str):
+        super().__init__(
+            audio_file=audio_file,
+            celery_app=celery_app,
+            timeout=DOWNLOAD_VIDEO_TIMEOUT,
+        )
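The deleted `_get_audio_text` body shows how these wrappers bridge Celery's blocking `AsyncResult.get()` into async code via `asyncio.to_thread`, a pattern the shared base class presumably keeps. A minimal self-contained sketch with a stub result object (no real Celery broker involved):

```python
import asyncio
import time


class StubResult:
    """Hypothetical stand-in for the AsyncResult returned by celery_app.send_task."""

    def get(self, timeout=None):
        time.sleep(0.01)  # simulate blocking on the worker's reply
        return {"transcript": "hello"}


async def get_audio_text(result, timeout: int) -> str:
    # Push the blocking .get() onto a worker thread so the event loop stays free
    response = await asyncio.to_thread(result.get, timeout=timeout)
    return response["transcript"]


print(asyncio.run(get_audio_text(StubResult(), timeout=5)))  # prints "hello"
```

`asyncio.to_thread` forwards extra positional and keyword arguments to the callable, which is why `timeout` reaches `result.get` unchanged.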
Lines changed: 14 additions & 39 deletions
@@ -1,10 +1,10 @@
-import asyncio
-import uuid
+"""API-layer PDF export — extends the shared PdfExport with S3 upload support."""
+
 from pathlib import Path
 
 import aiofiles.os
-from bs4 import BeautifulSoup
 
+from fastfetchbot_shared.services.file_export.pdf_export import PdfExport as BasePdfExport, wrap_html_string
 from src.config import DOWNLOAD_VIDEO_TIMEOUT, AWS_STORAGE_ON
 from src.services.celery_client import celery_app
 from src.services.amazon.s3 import upload as upload_to_s3
@@ -19,48 +19,23 @@ async def upload_file_to_s3(output_filename):
     )
 
 
-class PdfExport:
+class PdfExport(BasePdfExport):
+    """API PDF export that adds optional S3 upload after Celery PDF generation."""
+
     def __init__(self, title: str, html_string: str = None):
-        self.title = title
-        self.html_string = html_string
+        super().__init__(
+            title=title,
+            html_string=html_string,
+            celery_app=celery_app,
+            timeout=DOWNLOAD_VIDEO_TIMEOUT,
+        )
 
     async def export(self) -> str:
-        html_string = self.wrap_html_string(self.html_string)
-        output_filename = f"{self.title}-{uuid.uuid4()}.pdf"
-
-        logger.info(f"submitting pdf export task: {output_filename}")
-        result = celery_app.send_task("file_export.pdf_export", kwargs={
-            "html_string": html_string,
-            "output_filename": output_filename,
-        })
-        try:
-            response = await asyncio.to_thread(result.get, timeout=int(DOWNLOAD_VIDEO_TIMEOUT))
-            output_filename = response["output_filename"]
-        except Exception:
-            logger.exception(
-                f"file_export.pdf_export task failed: output_filename={output_filename}, "
-                f"timeout={DOWNLOAD_VIDEO_TIMEOUT}"
-            )
-            raise
-        logger.info(f"pdf export success: {output_filename}")
+        output_filename = await super().export()
 
         if AWS_STORAGE_ON:
             local_filename = output_filename
             output_filename = await upload_file_to_s3(Path(output_filename))
             await aiofiles.os.remove(local_filename)
-        return output_filename
 
-    @staticmethod
-    def wrap_html_string(html_string: str) -> str:
-        soup = BeautifulSoup(
-            '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
-            '<meta charset="UTF-8"></head><body></body></html>',
-            "html.parser",
-        )
-        soup.body.append(BeautifulSoup(html_string, "html.parser"))
-        for tag in soup.find_all(True):
-            if "style" in tag.attrs:
-                del tag["style"]
-        for style_tag in soup.find_all("style"):
-            style_tag.decompose()
-        return soup.prettify()
+        return output_filename
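After this refactor, the API's `export` reduces to a post-processing hook: generate via the base class, then optionally swap the local file for a remote name. A runnable sketch of that pattern with stubbed generation and upload (names are illustrative; the real code uses `aiofiles.os.remove` and the S3 helper above):

```python
import asyncio
import os
import tempfile


class BasePdfExport:
    """Stub for the shared Celery-backed export; just writes an empty temp PDF."""

    async def export(self) -> str:
        fd, path = tempfile.mkstemp(suffix=".pdf")
        os.close(fd)
        return path


class PdfExport(BasePdfExport):
    def __init__(self, storage_on: bool, upload):
        self.storage_on = storage_on
        self.upload = upload  # hypothetical async uploader returning a remote name

    async def export(self) -> str:
        output_filename = await super().export()
        if self.storage_on:
            local_filename = output_filename
            output_filename = await self.upload(local_filename)
            os.remove(local_filename)  # real code: await aiofiles.os.remove(...)
        return output_filename
```

Keeping the S3 step in the subclass means the shared base stays storage-agnostic, which is the point of moving it to `packages/shared/`.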
