Focus: Robust Ingestion Pipeline & Chroma Integration
The system follows a Producer-Consumer pattern using the filesystem as the queue. This decouples the "Capture" (Mobile/Web) from the "Processing" (Mac).
```mermaid
graph TD
    A[iOS Shortcut] -->|Save JSON| B(iCloud Drive / Inbox)
    C[Web Extension] -->|Save JSON| B
    D[Mac Daemon] -->|Watch| B
    B -->|Process| D
    D -->|Extract Content| E[Connectors]
    E -->|Web Scraper| F[Markdown Artifact]
    E -->|Twitter Fetcher| F
    F -->|Embed| G[ChromaDB]
    F -->|Save| H[Local Archive Folder]
```
**Watcher Daemon**

- Library: `watchdog` (Python).
- Responsibility: Monitors `~/Documents/OriginSteward/Inbox`.
- Logic:
  - On `FileCreated` event:
    - Wait 1 s (debounce, to ensure the write is complete).
    - Read the file content.
    - Determine the type (`url`, `tweet`, `text`) from the file extension or JSON content.
    - Dispatch to the matching connector.
    - On success: Move to `Archive/YYYY-MM-DD/`.
    - On failure: Move to `Error/` with an `error.log`.
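The watcher logic above can be sketched with `watchdog` roughly as follows (class and function names are illustrative; connector dispatch and archiving are stubbed out as comments):

```python
import json
import time
from pathlib import Path

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

class InboxHandler(FileSystemEventHandler):
    """Reacts to files dropped into the Inbox by the capture clients."""

    def on_created(self, event):
        if not event.is_directory:
            time.sleep(1)  # debounce: give iCloud/the writer time to finish
            self.handle(Path(event.src_path))

    def handle(self, path: Path) -> dict:
        """Parse one inbox file and report its capture type."""
        item = json.loads(path.read_text())
        print(f"New file found: {path.name} (type={item['type']})")
        # dispatch(item)  # hypothetical: route to web/twitter/text connector,
        # then move to Archive/YYYY-MM-DD/ on success, Error/ on failure.
        return item

def watch(inbox: Path) -> Observer:
    """Start watching the Inbox; the caller keeps the process alive."""
    observer = Observer()
    observer.schedule(InboxHandler(), str(inbox), recursive=False)
    observer.start()
    return observer

# Usage: watch(Path.home() / "Documents" / "OriginSteward" / "Inbox")
```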
**Web Scraper Connector**

- Libraries: `requests`, `beautifulsoup4`, `readability-lxml`.
- Logic:
  - `GET` the URL.
  - Pass the HTML to `Document(html).summary()` (Readability).
  - Convert the HTML summary to Markdown (`markdownify`).
  - Extract metadata: Title, Domain, Date.
- Error Handling:
  - If `requests` fails (404/500): Log the error and move the JSON to the `Error/` folder.
  - If parsing fails: Save the raw HTML to `Archive/` with a `_raw.html` extension.
**Twitter Fetcher**

- Library: `yt-dlp` (via subprocess).
- Logic:
  - Command: `yt-dlp --dump-json --skip-download [URL]`
  - Extract: `description` (text), `uploader_id` (handle), `upload_date`.
- Error Handling:
  - If `yt-dlp` fails (e.g., "Video unavailable" or a rate limit):
    - Log the specific error code.
    - Fallback: Create a Markdown file with just the URL and the tag `#to_read`.
    - Move the original JSON to `Archive/` (don't discard it).
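A sketch of this fetch-with-fallback flow (the extracted keys are standard fields of yt-dlp's `--dump-json` output; function names are illustrative):

```python
import json
import subprocess
from typing import Optional

def fetch_tweet_metadata(url: str) -> Optional[dict]:
    """Run yt-dlp in metadata-only mode; return None on failure so the
    caller can fall back to a bare #to_read stub and still archive the JSON."""
    result = subprocess.run(
        ["yt-dlp", "--dump-json", "--skip-download", url],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        # Log the specific error (e.g. "Video unavailable", rate limit).
        print(f"yt-dlp failed ({result.returncode}): {result.stderr.strip()[:200]}")
        return None
    info = json.loads(result.stdout)
    return {
        "description": info.get("description", ""),
        "uploader_id": info.get("uploader_id", ""),
        "upload_date": info.get("upload_date", ""),
    }

def fallback_markdown(url: str) -> str:
    """Stub artifact when fetching fails: keep the URL and tag it #to_read."""
    return f"# Saved link\n\n{url}\n\n#to_read\n"
```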
**Chroma Integration**

- Library: `chromadb`.
- Configuration: `PersistentClient(path="./chroma_db")`.
- Embedding Model: `sentence-transformers/all-MiniLM-L6-v2` (fast, local, good enough).
- Collection Schema:
  - `ids`: Filename (unique).
  - `documents`: The full Markdown content.
  - `metadatas`: `{"source": "twitter", "author": "...", "date": "..."}`.
The iOS Shortcut MUST generate a JSON file conforming to this schema (JSON Schema, draft-07):
```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "type": { "enum": ["url", "tweet", "text"] },
    "payload": { "type": "string" },
    "timestamp": { "type": "number" },
    "note": { "type": "string" }
  },
  "required": ["type", "payload", "timestamp"]
}
```

Example Payload:

```json
{
  "type": "url",
  "payload": "https://twitter.com/user/status/123456",
  "timestamp": 1702480000,
  "note": "Check this thread out later"
}
```

Saved as `Archive/2025-12-13/tweet_123456.md`:
```markdown
---
id: tweet_123456
source: https://twitter.com/user/status/123456
author: "@user"
date: 2025-12-13
tags: [twitter, ai]
---

# Tweet by @user

Here is the full text of the tweet...

> User comment: This is why I saved this.
```

- Setup Env: `pip install chromadb watchdog beautifulsoup4 markdownify readability-lxml requests yt-dlp`.
- Create Dir Structure: `Inbox`, `Archive`, `Error`.
- Write Daemon: Basic loop to print "New file found".
- Write Web Connector: Test with a sample URL.
- Integrate Chroma: Index the output of the Web Connector.
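During bring-up it is also worth enforcing the inbox contract above before dispatch. A minimal plain-Python check (a full JSON Schema validator such as `jsonschema` would be a drop-in upgrade; this sketch covers only the required fields and the `type` enum):

```python
REQUIRED_FIELDS = {"type", "payload", "timestamp"}
ALLOWED_TYPES = {"url", "tweet", "text"}

def validate_capture(item: dict) -> bool:
    """Return True iff `item` satisfies the required inbox schema fields."""
    if not REQUIRED_FIELDS <= item.keys():
        return False
    return (
        item["type"] in ALLOWED_TYPES
        and isinstance(item["payload"], str)
        and isinstance(item["timestamp"], (int, float))
        and isinstance(item.get("note", ""), str)  # note is optional
    )
```

Invalid files go straight to `Error/` without ever reaching a connector.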