Telegram Scraper

The Telegram Scraper is a Go-based application designed to scrape messages and metadata from Telegram channels using the TDLib library. It collects and processes messages, storing the results in either local files or Azure Blob Storage for further analysis.

Features

Connects to Telegram channels via the TDLib API.
Scrapes and processes messages, metadata, and engagement statistics.
Supports various message types (e.g., text, video, photo).
Saves data in JSONL format for easy integration with data pipelines.
Optionally integrates with Azure Blob Storage for scalable cloud storage.
Supports incremental crawling with progress tracking to resume interrupted crawls.

Requirements

Go: Ensure Go is installed on your system.
TDLib: The Telegram Database Library must be installed. See TDLib Installation section.
Telegram API Credentials: You will need:
- TG_API_ID
- TG_API_HASH
- TG_PHONE_NUMBER
- TG_PHONE_CODE (OTP sent to your phone)
Environment Variables: Set up the following based on your use case.

TDLib Installation

macOS

brew install tdlib

Ubuntu/Debian

sudo apt-get update
sudo apt-get install -y build-essential cmake gperf libssl-dev zlib1g-dev
git clone https://github.com/tdlib/td.git
cd td
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build .
sudo make install

Environment Variables

Required for Telegram API

TG_API_ID: Telegram API ID (obtained from https://my.telegram.org).
TG_API_HASH: Telegram API Hash (obtained from https://my.telegram.org).
TG_PHONE_NUMBER: Your Telegram phone number with country code (e.g., +12025551234).
TG_PHONE_CODE: OTP sent to your phone by Telegram during authentication.

Optional for Azure Blob Storage

CONTAINER_NAME: Name of the Azure Blob Storage container.
BLOB_NAME: Name of the blob path to store scraped data.
AZURE_STORAGE_ACCOUNT_URL: Azure Storage account URL.

Optional for Custom Configuration

SEED_LIST: Comma-separated list of Telegram channel usernames to scrape (or provide as a command-line argument with -seed-list).
STORAGE_DIR: Custom directory path for local storage (default: ./storage).
MAX_MESSAGES: Maximum number of messages to scrape per channel (default: all messages).
LOG_LEVEL: Logging level (default: "info", options: "debug", "info", "warn", "error").

Directory Structure

/storage: Default local directory for storing progress and scraped data.
- /crawls/{crawlid}/channel_name/data.jsonl: Output for scraped channel data.
- /crawls/{crawlid}/progress.json: Tracks crawling progress for resumption.

Installation

Clone the repository:

git clone https://github.com/your-repo/telegram-scraper.git
cd telegram-scraper

Install Dependencies:
```
go mod tidy
```
Build the project:
```
go build -o telegram-scraper
```

Usage

CLI Commands and Options

The scraper supports the following command-line arguments:

Usage: ./telegram-scraper [options]

Options:
  -seed-list string      Comma-separated list of Telegram channel usernames to scrape
  -crawl-id string       Specify a custom crawl ID for tracking (default: auto-generated UUID)
  -storage-dir string    Directory for storing data locally (default: "./storage")
  -max-messages int      Maximum number of messages to scrape per channel (default: all)
  -resume                Resume from the last successful crawl point (default: false)
  -azure                 Enable Azure Blob Storage upload (default: false if env vars not set)
  -log-level string      Set logging level: debug, info, warn, error (default: "info")
  -help                  Display this help message

Running the Scraper

Basic Usage

Run the scraper with a seed list of channel usernames:

./telegram-scraper -seed-list "channel1,channel2,channel3"

With Environment Setup on macOS

For macOS users, you need to set specific environment variables for the TDLib compilation:

export CGO_CFLAGS=-I/opt/homebrew/include
export CGO_LDFLAGS=-L/opt/homebrew/lib -lssl -lcrypto
./telegram-scraper -seed-list "channel1,channel2,channel3"

Resuming a Crawl

To resume an interrupted crawl:

./telegram-scraper -resume -crawl-id "your-previous-crawl-id"

Limiting Message Count

To limit the number of messages scraped per channel:

./telegram-scraper -seed-list "channel1,channel2" -max-messages 1000

Custom Storage Directory

To specify a custom storage location:

./telegram-scraper -seed-list "channel1" -storage-dir "/path/to/custom/dir"

Azure Blob Storage Integration

If Azure Blob Storage is enabled via environment variables (CONTAINER_NAME, BLOB_NAME, and AZURE_STORAGE_ACCOUNT_URL), the scraper will automatically upload data to the specified container and blob.

To explicitly enable Azure upload:

./telegram-scraper -seed-list "channel1" -azure

Authentication Flow

When running the scraper for the first time:

Enter your phone number when prompted (with country code, e.g., +12025551234)
Enter the authentication code sent to your Telegram app
If you have Two-Factor Authentication enabled, enter your password when prompted

The auth session will be saved locally for future use.

Key Files and Functions

main.go: Entry point for the application, managing scraping and progress tracking.
telegramhelper/: Handles TDLib client connections and Telegram API interactions.
state/: Manages progress, seed list setup, and storage (local and Azure Blob).
models/: Defines data structures for storing and processing Telegram messages.
storage/: Implements storage adapters for both local and cloud storage.

Data Output Format

The scraper outputs data in JSONL format with the following structure:

{
  "message_id": 12345,
  "channel_id": 1001234567890,
  "channel_username": "example_channel",
  "date": "2023-04-01T12:34:56Z",
  "content": {
    "type": "text",
    "text": "Example message content",
    "media_url": "",
    "media_type": ""
  },
  "views": 1000,
  "forwards": 50,
  "replies": 25,
  "is_forwarded": false,
  "forward_source": "",
  "metadata": {
    "scraped_at": "2023-04-02T09:00:00Z",
    "crawl_id": "abc123def456"
  }
}

Example

To scrape Telegram data for @examplechannel and @anotherchannel, storing results in Azure Blob Storage:

Set environment variables:

export TG_API_ID="your_api_id"
export TG_API_HASH="your_api_hash"
export TG_PHONE_NUMBER="your_phone_number"
export TG_PHONE_CODE="your_telegram_code"
export CONTAINER_NAME="your_container_name"
export BLOB_NAME="your_blob_path"
export AZURE_STORAGE_ACCOUNT_URL="https://youraccount.blob.core.windows.net"

Run the scraper:

./telegram-scraper -seed-list "examplechannel,anotherchannel"

Handling Rate Limits and Errors

The scraper implements exponential backoff for handling rate limits from the Telegram API. If you encounter persistent rate limiting:

Reduce the number of channels in your seed list
Increase delay between requests by modifying the code in telegramhelper/client.go
Consider using a different Telegram account with fewer API calls

Troubleshooting

Authentication Issues: Ensure Telegram API credentials are correct. Delete the ./tdlib-db directory to restart authentication.
TDLib Errors: Check that TDLib is properly installed and accessible.
Storage Errors: Verify write permissions to the storage directory.
Azure Upload Failures: Confirm your Azure Blob Storage configuration and credentials.
macOS Compilation Errors: Set the required CGO environment variables as described in the Usage section.
Log Analysis: Set -log-level debug for more detailed logging information.

License

This project is licensed under Apache 2.0 License.

Name		Name	Last commit message	Last commit date
Latest commit History 253 Commits
.github		.github
.idea		.idea
common		common
crawl		crawl
crawler		crawler
dapr		dapr
model		model
resources		resources
standalone		standalone
state		state
telegramhelper		telegramhelper
CLAUDE.md		CLAUDE.md
Dockerfile		Dockerfile
Dockerfile.tdlib		Dockerfile.tdlib
README.md		README.md
go.mod		go.mod
go.sum		go.sum
main.go		main.go
start-dapr-debug.sh		start-dapr-debug.sh
tdlib-scraper.iml		tdlib-scraper.iml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Telegram Scraper

Features

Requirements

TDLib Installation

macOS

Ubuntu/Debian

Environment Variables

Required for Telegram API

Optional for Azure Blob Storage

Optional for Custom Configuration

Directory Structure

Installation

Usage

CLI Commands and Options

Running the Scraper

Basic Usage

With Environment Setup on macOS

Resuming a Crawl

Limiting Message Count

Custom Storage Directory

Azure Blob Storage Integration

Authentication Flow

Key Files and Functions

Data Output Format

Example

Handling Rate Limits and Errors

Troubleshooting

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Telegram Scraper

Features

Requirements

TDLib Installation

macOS

Ubuntu/Debian

Environment Variables

Required for Telegram API

Optional for Azure Blob Storage

Optional for Custom Configuration

Directory Structure

Installation

Usage

CLI Commands and Options

Running the Scraper

Basic Usage

With Environment Setup on macOS

Resuming a Crawl

Limiting Message Count

Custom Storage Directory

Azure Blob Storage Integration

Authentication Flow

Key Files and Functions

Data Output Format

Example

Handling Rate Limits and Errors

Troubleshooting

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages