Curator

Content acquisition and curation service for the PAI ecosystem.

Overview

Curator handles:

Content ingestion from multiple sources (YouTube today; RSS and podcasts are planned)
Subscription monitoring for new content
Content chunking for semantic search
Integration with Engram (embedding service) and Transcribe (transcription service)
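
As a rough illustration of the chunking step, the sketch below splits text into overlapping chunks. It is not the code in chunking.py: it assumes one whitespace-separated word counts as one token, and the `overlap` parameter and the function name `chunk_text` are hypothetical. The default of 500 mirrors CURATOR_DEFAULT_CHUNK_TOKENS.

```python
def chunk_text(text: str, max_tokens: int = 500, overlap: int = 50) -> list:
    """Split text into overlapping chunks of at most max_tokens words.

    Assumption: a "token" is approximated by a whitespace-separated word.
    A real tokenizer would produce different counts.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    words = text.split()
    step = max_tokens - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        # Stop once the current window has reached the end of the text.
        if start + max_tokens >= len(words):
            break
    return chunks
```

Overlapping chunks keep a sentence that straddles a boundary searchable from both sides, at the cost of storing some words twice.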

Architecture

curator/
├── src/curator/
│   ├── api.py              # FastAPI REST API
│   ├── cli.py              # Click CLI
│   ├── config.py           # Pydantic Settings
│   ├── storage.py          # SQLite subscription database
│   ├── daemon.py           # APScheduler background daemon
│   ├── orchestrator.py     # Ingestion workflow orchestrator
│   ├── chunking.py         # Text chunking (from pai-corpus)
│   ├── models.py           # Pydantic models
│   └── plugins/
│       ├── base.py         # BasePlugin interface
│       ├── youtube.py      # YouTube plugin (from pai-ingest)
│       ├── rss.py          # RSS/Atom plugin (planned)
│       └── podcast.py      # Podcast plugin (planned)
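
The plugin layout above suggests that each content source implements a shared interface. The sketch below shows one way such an interface could look. The method names `can_handle` and `fetch`, the `FetchedContent` type, `ExamplePlugin`, and `pick_plugin` are all hypothetical; the real contract is defined in plugins/base.py.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import urlparse


@dataclass
class FetchedContent:
    """Hypothetical result type; the real models live in models.py."""
    source_url: str
    title: str
    text: str


class BasePlugin(ABC):
    """Hypothetical shape of a content source plugin."""

    @abstractmethod
    def can_handle(self, url: str) -> bool:
        """Return True if this plugin knows how to ingest the URL."""

    @abstractmethod
    def fetch(self, url: str) -> FetchedContent:
        """Download the content behind the URL and return it as text."""


class ExamplePlugin(BasePlugin):
    """Toy plugin that claims example.com URLs and returns placeholder text."""

    def can_handle(self, url: str) -> bool:
        return urlparse(url).hostname in {"example.com", "www.example.com"}

    def fetch(self, url: str) -> FetchedContent:
        return FetchedContent(source_url=url, title="Example", text="placeholder body")


def pick_plugin(url: str, plugins: List[BasePlugin]) -> Optional[BasePlugin]:
    """Return the first plugin that accepts the URL, or None if none does."""
    for plugin in plugins:
        if plugin.can_handle(url):
            return plugin
    return None
```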

Installation

cd ~/code/curator
pip install -e .

Usage

CLI

# Start API server
curator serve

# Ingest a single URL
curator ingest https://youtube.com/watch?v=VIDEO_ID

# Manage subscriptions
curator subscription list
curator subscription add "Channel Name" https://youtube.com/@channel
curator subscription remove 1

# List ingested items
curator items

# Run subscription daemon
curator daemon

API

Start the server:

curator serve

API endpoints:

GET /health - Health check
POST /subscriptions - Create subscription
GET /subscriptions - List subscriptions
GET /subscriptions/{id} - Get subscription
PATCH /subscriptions/{id} - Update subscription
DELETE /subscriptions/{id} - Delete subscription
POST /ingest - Ingest a URL
GET /ingest/{job_id} - Get ingestion job status
GET /items - List ingested items
GET /items/{id} - Get ingested item
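
A minimal client sketch for the ingest endpoints, using only the standard library. The paths and port come from this README; the JSON body shape `{"url": ...}` and the helper names are assumptions, so check the service's actual request schema before relying on them.

```python
import json
import urllib.request

# Default CURATOR_API_PORT from the Configuration section.
BASE_URL = "http://localhost:8003"


def build_ingest_request(url: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST /ingest request. Assumed body shape: {"url": "<content URL>"}."""
    body = json.dumps({"url": url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/ingest",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_job_status_request(job_id: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a GET /ingest/{job_id} request."""
    return urllib.request.Request(f"{base_url}/ingest/{job_id}", method="GET")


def send(request: urllib.request.Request) -> dict:
    """Send a request to a running server and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `send(build_ingest_request("https://youtube.com/watch?v=VIDEO_ID"))` would submit a URL for ingestion; the fields of the response are not documented here.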

Configuration

All settings are read from environment variables that carry the CURATOR_ prefix:

# Service
CURATOR_SERVICE_NAME=curator
CURATOR_ENVIRONMENT=development
CURATOR_DEBUG=false

# API
CURATOR_API_HOST=0.0.0.0
CURATOR_API_PORT=8003

# Database
CURATOR_DATABASE_URL=sqlite:///./curator.db

# Storage
CURATOR_DATA_DIR=./data

# External services
CURATOR_ENGRAM_URL=http://localhost:8001
CURATOR_TRANSCRIBE_URL=http://localhost:8002

# Plugin configuration
CURATOR_YOUTUBE_COOKIES_PATH=/path/to/cookies.txt

# Ingestion
CURATOR_DEFAULT_CHUNK_TOKENS=500
CURATOR_MAX_CONCURRENT_INGESTIONS=3
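
To show how the prefix maps onto settings fields, here is a standard-library stand-in for the Pydantic Settings class in config.py. It is a sketch, not the project's implementation: the class name `CuratorSettings`, the helper `load_settings`, and the choice to treat the values listed above as defaults are assumptions, and only a subset of the variables is covered.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CuratorSettings:
    # Defaults mirror the example values in this README (an assumption).
    api_host: str = "0.0.0.0"
    api_port: int = 8003
    database_url: str = "sqlite:///./curator.db"
    debug: bool = False
    default_chunk_tokens: int = 500
    max_concurrent_ingestions: int = 3


def load_settings(env=None, prefix: str = "CURATOR_") -> CuratorSettings:
    """Read PREFIX + FIELD_NAME from the environment and coerce by field type."""
    env = os.environ if env is None else env
    defaults = CuratorSettings()
    values = {}
    for field in fields(CuratorSettings):
        raw = env.get(prefix + field.name.upper())
        if raw is None:
            continue  # variable not set: keep the default
        default = getattr(defaults, field.name)
        # bool must be checked before int because bool is a subclass of int.
        if isinstance(default, bool):
            values[field.name] = raw.strip().lower() in {"1", "true", "yes", "on"}
        elif isinstance(default, int):
            values[field.name] = int(raw)
        else:
            values[field.name] = raw
    return CuratorSettings(**values)
```

For example, `load_settings({"CURATOR_API_PORT": "9000"})` overrides the port and leaves every other field at its default.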

Development

Running tests

pytest tests/

Code structure

api.py: FastAPI REST endpoints
cli.py: Click command-line interface
storage.py: SQLite database operations
orchestrator.py: Ingestion workflow coordination
daemon.py: Background subscription monitoring
plugins/: Content source plugins (YouTube; RSS and podcast are planned)
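
One plausible way for an orchestrator to honor CURATOR_MAX_CONCURRENT_INGESTIONS is a semaphore around each ingestion. The sketch below illustrates that pattern only; `ingest_all` and the `ingest_one` callback are hypothetical names, and nothing here is taken from orchestrator.py.

```python
import asyncio


async def ingest_all(urls, ingest_one, max_concurrent: int = 3):
    """Run ingest_one(url) for every URL with at most max_concurrent in flight.

    ingest_one is any coroutine function taking a URL (hypothetical callback).
    Results are returned in the same order as the input URLs.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(url):
        # Each task waits here until one of the max_concurrent slots is free.
        async with semaphore:
            return await ingest_one(url)

    return await asyncio.gather(*(guarded(url) for url in urls))
```

A caller would run it with `asyncio.run(ingest_all(urls, some_coroutine_function))`.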

Deployment

See the deploy/ directory for Docker and Podman configurations.

License

MIT
