A comprehensive Discord message scraping system designed to collect, store, and provide access to community discussions across Bittensor subnet channels. This system combines historical data backfilling with real-time message collection, storing everything in PostgreSQL for easy querying and future website integration.
- Historical Backfill: Import past messages using DiscordChatExporter
- Real-Time Collection: Live Discord bot captures new messages as they're posted
- Scheduled Exports: Alternative approach for users without admin permissions (no bot needed)
- Subnet-Based Organization: Messages organized by subnet for easy filtering
- Full Message Metadata: Captures attachments, embeds, reactions, edits, and deletions
- PostgreSQL Storage: Reliable, queryable database with full-text search
- Lambda-Ready API: Future-proof endpoints for website integration
- Scalable Architecture: Handles single or multiple Discord servers
Best for: Admins or those with permission to add a bot to the server
- ✅ Real-time message capture
- ✅ Edit/delete tracking
- ✅ Instant updates
⚠️ Requires bot invite permissions
See: SETUP_GUIDE.md
Best for: Users without admin permissions
- ✅ No bot required
- ✅ Fully automated
- ✅ Same database/API
⚠️ Not real-time (scheduled updates)
See: QUICKSTART_SCHEDULED.md or SCHEDULED_SETUP_GUIDE.md
Want to run the API locally alongside the bittensor-ai-website for integrated testing?
See:
- 📖 API_LOCAL_TESTING.md - Complete guide for running the API and website locally
- ⚡ API_QUICK_REFERENCE.md - Quick reference card for endpoints and commands
# Terminal 1: Start API
python api/summaries_api.py
# Terminal 2: Start Website
cd ../bittensor-ai-website
npm run dev
# Then open http://localhost:3000 in browser
- Database Layer (db/)
  - PostgreSQL schema with optimized indexes
  - SQLAlchemy models for all entities
  - Service layer for database operations
- Discord Bot (bot/)
  - Real-time message collection via Discord Gateway API
  - Automatic server/channel synchronization
  - Handles message edits and deletions
- Backfill Scripts (scripts/)
  - DiscordChatExporter wrapper for historical data
  - Database setup and initialization
  - Subnet configuration management
- API Handlers (api/)
  - Lambda-ready query endpoints
  - RESTful message retrieval
  - Subnet-based filtering
- Python 3.9+
- PostgreSQL 12+
- Discord Bot Token (create one in the Discord Developer Portal)
- DiscordChatExporter CLI
- Clone the repository
  cd discordscraper
- Install Python dependencies
  pip install -r requirements.txt
- Set up PostgreSQL
  Option A: Using Docker (recommended for local development)
  docker-compose up -d postgres
  Option B: Use an existing PostgreSQL instance
  - Create a database named discord_scraper
  - Note the connection details for configuration
- Configure environment variables
  cp .env.example .env
  Edit .env with your credentials:
  # Discord
  DISCORD_BOT_TOKEN=your_bot_token_here
  DISCORD_SERVER_ID=your_server_id_here
  # Database
  DB_HOST=localhost
  DB_PORT=5432
  DB_NAME=discord_scraper
  DB_USER=postgres
  DB_PASSWORD=your_password
  # DiscordChatExporter
  DCE_PATH=path/to/DiscordChatExporter.Cli.exe
  DCE_TOKEN=your_discord_token
- Configure subnet mappings
  Edit config/subnets.yaml:
  subnets:
    - name: "subnet-1"
      channel_id: "1234567890123456789"
      description: "Subnet 1 discussions"
      tags: ["ai", "compute"]
    - name: "subnet-2"
      channel_id: "9876543210987654321"
      description: "Subnet 2 discussions"
      tags: ["storage"]
- Initialize the database
  python scripts/setup_db.py
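The usage examples in this README pass a PostgreSQL connection URL to DatabaseService, while .env stores the pieces separately. As a minimal sketch of how the two could be bridged, the helper below assembles a URL from the DB_* variables. The function name build_database_url is hypothetical and not part of this repository; the project may already do this differently.

```python
import os
from urllib.parse import quote


def build_database_url(env=None):
    """Assemble a PostgreSQL URL from the DB_* settings listed in .env.

    Hypothetical helper for illustration; not part of the repository.
    """
    env = os.environ if env is None else env
    required = ["DB_HOST", "DB_NAME", "DB_USER", "DB_PASSWORD"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ValueError("Missing required settings: " + ", ".join(missing))

    # Percent-encode credentials so characters such as "@" or "/" cannot
    # be mistaken for URL delimiters.
    user = quote(env["DB_USER"], safe="")
    password = quote(env["DB_PASSWORD"], safe="")
    port = env.get("DB_PORT", "5432")  # PostgreSQL's default port
    return f"postgresql://{user}:{password}@{env['DB_HOST']}:{port}/{env['DB_NAME']}"
```

Failing fast on missing settings tends to give a clearer error than a connection attempt that fails later with a partially built URL.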
- Create a Discord Application
  - Go to Discord Developer Portal
  - Click "New Application"
  - Go to "Bot" section and click "Add Bot"
- Configure Bot Permissions
  Required permissions:
  - Read Messages/View Channels
  - Read Message History
  - Send Messages (optional, for bot commands)
- Enable Intents
  In the Bot settings, enable:
  - Server Members Intent
  - Message Content Intent
- Invite Bot to Server
  - Go to OAuth2 > URL Generator
  - Select scopes: bot
  - Select permissions: Read Messages, Read Message History
  - Use generated URL to invite bot to your server
Start the bot to begin collecting messages in real-time:
python run_bot.py
The bot will:
- Connect to Discord
- Sync server and channel information
- Start collecting new messages
- Store everything in PostgreSQL
Import historical messages from configured channels:
python scripts/backfill.py
This will:
- Export messages using DiscordChatExporter
- Parse exported JSON files
- Import messages to database
- Track backfill job status
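The parse step above can be sketched as follows. This is an illustration, not the code in scripts/backfill.py: the function name parse_export is hypothetical, and the field names (channel.id, messages[].author.id, and so on) reflect the JSON layout DiscordChatExporter is assumed to produce, so check them against a real export before relying on them.

```python
import json


def parse_export(raw_json):
    """Flatten one DiscordChatExporter JSON export into row dicts.

    Hypothetical helper; the export field names are assumptions.
    """
    export = json.loads(raw_json)
    channel_id = export["channel"]["id"]
    rows = []
    for msg in export.get("messages", []):
        rows.append({
            "message_id": msg["id"],
            "channel_id": channel_id,
            "user_id": msg["author"]["id"],
            "content": msg.get("content", ""),
            # Kept as an ISO 8601 string; PostgreSQL can cast it on insert.
            "timestamp": msg["timestamp"],
            "attachments": msg.get("attachments", []),
            "embeds": msg.get("embeds", []),
            "reactions": msg.get("reactions", []),
        })
    return rows
```

The row keys mirror the fields shown in the API response example, so the same dictionaries could feed an idempotent insert into the messages table.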
You can query messages directly from the database or use the API handlers.
from db.service import DatabaseService
from datetime import datetime, timedelta
# Initialize database
db = DatabaseService("postgresql://user:pass@localhost/discord_scraper")
# Get recent messages for a subnet
messages = db.get_messages_by_subnet("subnet-1", limit=50)
# Get messages in a time range
start = datetime.now() - timedelta(days=7)
end = datetime.now()
messages = db.get_messages_by_timerange(start, end, subnet_name="subnet-1")
# Search messages
messages = db.search_messages("bittensor", subnet_name="subnet-1")
-- Get recent subnet messages
SELECT * FROM subnet_messages
WHERE subnet_name = 'subnet-1'
ORDER BY timestamp DESC
LIMIT 100;
-- Search message content
SELECT * FROM messages
WHERE content ILIKE '%bittensor%'
AND deleted = FALSE
ORDER BY timestamp DESC;
-- Get message count by subnet
SELECT s.name, COUNT(m.message_id) as message_count
FROM subnets s
JOIN channels c ON s.id = c.subnet_id
JOIN messages m ON c.channel_id = m.channel_id
WHERE m.deleted = FALSE
GROUP BY s.name;
- servers: Discord server metadata
- subnets: Subnet configuration and mappings
- channels: Channel information with subnet links
- users: Discord user cache
- messages: All messages with full metadata
- backfill_jobs: Backfill job tracking
- Idempotent inserts: Duplicate messages are automatically ignored
- Full-text search: Optimized indexes for content search
- Soft deletes: Deleted messages are marked, not removed
- Audit trail: Created/updated timestamps on all tables
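To illustrate how idempotent inserts and soft deletes behave together, here is a minimal sketch. It uses an in-memory SQLite database (version 3.24 or later, for ON CONFLICT support) purely so the example runs anywhere; the project itself stores data in PostgreSQL, which supports the same ON CONFLICT ... DO NOTHING clause. The table is a simplified stand-in for the real schema in db/schema.sql, and the function names are hypothetical.

```python
import sqlite3


def create_schema(conn):
    # Simplified stand-in for the real messages table.
    conn.execute(
        """
        CREATE TABLE messages (
            message_id TEXT PRIMARY KEY,
            content    TEXT,
            deleted    INTEGER NOT NULL DEFAULT 0
        )
        """
    )


def insert_message(conn, message_id, content):
    # A second insert with the same message_id is silently skipped,
    # so replaying a backfill does not create duplicates.
    conn.execute(
        "INSERT INTO messages (message_id, content) VALUES (?, ?) "
        "ON CONFLICT (message_id) DO NOTHING",
        (message_id, content),
    )


def soft_delete(conn, message_id):
    # Mark the row as deleted instead of removing it.
    conn.execute(
        "UPDATE messages SET deleted = 1 WHERE message_id = ?",
        (message_id,),
    )


def visible_messages(conn):
    # Queries filter on the flag, as the SQL examples in this README do.
    cur = conn.execute(
        "SELECT message_id FROM messages WHERE deleted = 0 ORDER BY message_id"
    )
    return [row[0] for row in cur.fetchall()]
```

Because deleted rows stay in the table, queries need the deleted filter to hide them, while audits can still read the original content.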
The api/handler.py file contains Lambda-ready functions for website integration:
GET /messages?subnet=subnet-1&from=2024-01-01T00:00:00Z&to=2024-01-31T23:59:59Z&limit=50
Parameters:
- subnet: Subnet name
- channel_id: Specific channel ID (optional)
- from: Start timestamp (ISO format)
- to: End timestamp (ISO format)
- limit: Max results (default 100)
- offset: Pagination offset
- search: Full-text search term
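As a rough sketch of how a Lambda-style handler might read these parameters, the function below parses an API Gateway proxy event into typed values. It is not the code in api/handler.py: parse_query and MAX_LIMIT are hypothetical, and the upper bound on limit is an assumption added for illustration.

```python
from datetime import datetime

DEFAULT_LIMIT = 100  # matches the documented default
MAX_LIMIT = 500      # hypothetical cap, not stated in the README


def parse_iso(value):
    # datetime.fromisoformat rejects a trailing "Z" before Python 3.11,
    # so rewrite it as an explicit UTC offset.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def parse_query(event):
    """Extract query parameters from an API Gateway proxy event."""
    # queryStringParameters is None when the request has no query string.
    params = event.get("queryStringParameters") or {}
    limit = min(max(int(params.get("limit", DEFAULT_LIMIT)), 1), MAX_LIMIT)
    offset = max(int(params.get("offset", 0)), 0)
    return {
        "subnet": params.get("subnet"),
        "channel_id": params.get("channel_id"),
        "start": parse_iso(params["from"]) if "from" in params else None,
        "end": parse_iso(params["to"]) if "to" in params else None,
        "limit": limit,
        "offset": offset,
        "search": params.get("search"),
    }
```

A non-numeric limit or a malformed timestamp raises ValueError here; a real handler would probably catch that and return a 400 response.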
Response:
{
  "messages": [
    {
      "message_id": "123456789",
      "channel_id": "987654321",
      "user_id": "111111111",
      "content": "Message text",
      "timestamp": "2024-01-15T10:30:00Z",
      "attachments": [],
      "embeds": [],
      "reactions": []
    }
  ],
  "count": 1,
  "limit": 50,
  "offset": 0
}
The system is designed for easy integration with Bittensor.ai:
- Deploy Lambda Functions: Use api/handler.py functions
- Setup API Gateway: Configure REST API endpoints
- Frontend Integration: Call API endpoints to display messages
- Real-Time Updates: Bot continues to collect data in background
Example frontend query:
// Fetch recent subnet discussions
const response = await fetch(
  'https://api.bittensor.ai/messages?subnet=subnet-1&limit=100'
);
const data = await response.json();
discordscraper/
├── bot/
│ ├── __init__.py
│ └── discord_bot.py # Real-time Discord bot
├── db/
│ ├── __init__.py
│ ├── schema.sql # PostgreSQL schema
│ ├── models.py # SQLAlchemy models
│ └── service.py # Database operations
├── scripts/
│ ├── __init__.py
│ ├── backfill.py # Historical data import
│ └── setup_db.py # Database initialization
├── api/
│ ├── __init__.py
│ └── handler.py # Lambda-ready endpoints
├── config/
│ ├── config.yaml # Main configuration
│ └── subnets.yaml # Subnet mappings
├── logs/ # Application logs
├── exports/ # DiscordChatExporter output
├── requirements.txt # Python dependencies
├── docker-compose.yml # Local PostgreSQL
├── run_bot.py # Bot entry point
└── README.md
Application logs are stored in logs/discord_bot.log. Monitor for:
- Connection errors
- Database issues
- Failed message processing
Periodic maintenance tasks:
-- Vacuum and analyze
VACUUM ANALYZE messages;
-- Check table sizes
SELECT
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
-- Reindex for performance
REINDEX TABLE messages;
Check backfill job status:
from db.service import DatabaseService
db = DatabaseService("postgresql://user:pass@localhost/discord_scraper")
jobs = db.get_backfill_jobs(status='failed')
for job in jobs:
    print(f"Job {job.id}: {job.error_message}")

- Check bot token is correct in .env
- Verify bot is invited to the server
- Ensure Message Content Intent is enabled

- Verify PostgreSQL is running: docker-compose ps
- Check connection details in .env
- Test connection: psql -h localhost -U postgres -d discord_scraper

- Verify DCE_PATH points to correct executable
- Ensure DCE_TOKEN is valid
- Check exports directory exists and is writable

- Check bot has permission to read channel
- Verify channel is mapped in subnets.yaml
- Check backfill job status for errors
pytest tests/
black .
flake8 .
- Edit config/subnets.yaml
- Add new subnet configuration
- Restart the bot or run setup_db.py again
- Optionally run backfill for the new channel
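Before restarting the bot after editing config/subnets.yaml, it can help to sanity-check the new entry. The sketch below validates an already-parsed configuration (for example, the dictionary a YAML loader would return). The function validate_subnets is hypothetical and not part of the repository; the rules it enforces (unique names, unique channel IDs written as quoted numeric strings) are assumptions inferred from the sample file.

```python
def validate_subnets(config):
    """Return a list of problems found in a parsed subnets.yaml mapping.

    Hypothetical helper; an empty list means no problems were found.
    """
    problems = []
    seen_names = set()
    seen_channels = set()
    for index, subnet in enumerate(config.get("subnets", [])):
        name = subnet.get("name")
        channel_id = subnet.get("channel_id")

        if not name:
            problems.append(f"entry {index}: missing name")
        elif name in seen_names:
            problems.append(f"entry {index}: duplicate name {name!r}")
        else:
            seen_names.add(name)

        # The sample config quotes channel IDs, so expect numeric strings.
        if not (isinstance(channel_id, str) and channel_id.isdecimal()):
            problems.append(
                f"entry {index}: channel_id must be a quoted numeric string"
            )
        elif channel_id in seen_channels:
            problems.append(f"entry {index}: duplicate channel_id {channel_id}")
        else:
            seen_channels.add(channel_id)
    return problems
```

Collecting every problem instead of stopping at the first one lets a single run report all mistakes in a freshly edited file.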
[Add your license here]
[Add contribution guidelines here]
For issues or questions:
- GitHub Issues: [Your repo URL]
- Discord: [Your Discord server]
- Email: [Your contact email]