A straightforward tool to scrape entire subreddits, including all posts and comments, and save them to CSV or JSON files.
- Scrape posts from any subreddit
- Automatically scrape all comments from each post
- Export data to CSV or JSON format
- Simple command-line interface
- Progress bars to track scraping status
- Filter by time period and sort method
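Under the hood, that workflow amounts to fetching a listing of posts, expanding each post's comment tree, and writing rows out. The sketch below illustrates the idea with PRAW only; it is not the toolkit's actual code, and the subreddit name, limits, and output file names are placeholders:

```python
# Minimal illustration of the posts-then-comments workflow; not the toolkit's actual code.
import csv

import praw

reddit = praw.Reddit(
    client_id="...",            # see the credential setup in Quick Start below
    client_secret="...",
    user_agent="MyRedditScraper/1.0",
)

posts, comments = [], []
for submission in reddit.subreddit("python").hot(limit=10):
    posts.append([submission.id, submission.title, submission.score, submission.num_comments])
    submission.comments.replace_more(limit=0)        # resolve "load more comments" stubs
    for comment in submission.comments.list():       # every comment in the thread, flattened
        comments.append([comment.id, submission.id, comment.body, comment.score])

with open("posts.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows([["post_id", "title", "score", "num_comments"], *posts])
with open("comments.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows([["comment_id", "post_id", "body", "score"], *comments])
```

Expanding the comment tree is the expensive step, which is why a posts-only mode (`--no-comments`) is noticeably faster.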
To get started:

```bash
# Clone or download this repository
cd reddit-research-toolkit

# Install dependencies
pip install -r requirements.txt
```

Next, create Reddit API credentials:

- Go to https://www.reddit.com/prefs/apps
- Click "Create App" or "Create Another App"
- Fill in the form:
  - name: Your app name (e.g., "My Scraper")
  - App type: Select "script"
  - redirect uri: http://localhost:8080 (required but not used)
- Click "Create app"
- Note your client_id (under the app name) and client_secret
Create a .env file in the project directory:
```
REDDIT_CLIENT_ID=your_client_id_here
REDDIT_CLIENT_SECRET=your_client_secret_here
REDDIT_USER_AGENT=MyRedditScraper/1.0
```
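For reference, this is roughly how those values reach PRAW (a sketch assuming python-dotenv; the toolkit's own loading code may differ):

```python
# Illustrative: how the .env values feed a PRAW client (the toolkit's code may differ).
import os

import praw
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current directory

reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    user_agent=os.environ["REDDIT_USER_AGENT"],
)
print(reddit.read_only)  # True: read-only access is all the scraper needs
```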
```bash
# Scrape a subreddit (basic usage)
python main.py python

# Scrape with custom limit
python main.py python -l 500

# Scrape top posts from all time
python main.py python --sort top -t all

# Get subreddit info only
python main.py python --info
```

```bash
# Scrape the first 1000 hot posts from r/python
python main.py python
# Scrape 500 posts
python main.py python -l 500
# Scrape without comments (faster)
python main.py python --no-comments
```

```bash
# Scrape top posts from the past year
python main.py MachineLearning --sort top -t year -l 1000
# Scrape new posts and save as JSON
python main.py technology --sort new -f json
# Save in both CSV and JSON formats
python main.py datascience -f both
```

```bash
# Display info about a subreddit
python main.py python --info
```

Command-line options:

```
positional arguments:
  subreddit             Subreddit name to scrape (without r/)

optional arguments:
  -h, --help            Show help message
  -l, --limit LIMIT     Maximum number of posts to scrape (default: 1000)
  --sort {hot,new,top,rising}
                        How to sort posts (default: hot)
  -t, --time-filter {hour,day,week,month,year,all}
                        Time filter for "top" sort (default: all)
  -f, --format {csv,json,both}
                        Output format (default: csv)
  --no-comments         Skip scraping comments (posts only)
  --info                Display subreddit information and exit
```
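These options correspond to a fairly standard argparse interface; the sketch below shows an equivalent parser and is illustrative only, not necessarily identical to main.py:

```python
# Illustrative argparse sketch mirroring the options listed above;
# the real main.py may differ in details.
import argparse

parser = argparse.ArgumentParser(description="Scrape posts and comments from a subreddit")
parser.add_argument("subreddit", help="Subreddit name to scrape (without r/)")
parser.add_argument("-l", "--limit", type=int, default=1000,
                    help="Maximum number of posts to scrape")
parser.add_argument("--sort", choices=["hot", "new", "top", "rising"], default="hot",
                    help="How to sort posts")
parser.add_argument("-t", "--time-filter", choices=["hour", "day", "week", "month", "year", "all"],
                    default="all", help='Time filter for "top" sort')
parser.add_argument("-f", "--format", choices=["csv", "json", "both"], default="csv",
                    help="Output format")
parser.add_argument("--no-comments", action="store_true", help="Skip scraping comments")
parser.add_argument("--info", action="store_true", help="Display subreddit information and exit")

args = parser.parse_args()
```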
The scraper creates an output/ directory with the following files:
- `{subreddit}_posts_{timestamp}.csv` - All post data
- `{subreddit}_comments_{timestamp}.csv` - All comment data
- `{subreddit}_data_{timestamp}.json` - Combined posts and comments
Posts CSV/JSON includes:
- post_id, title, author, created_utc
- score, upvote_ratio, num_comments
- permalink, url, selftext
- is_self, is_video, over_18, spoiler, stickied
- flair, domain
Comments CSV/JSON includes:
- comment_id, post_id, parent_id
- author, body, score
- created_utc, edited
- is_submitter, stickied, depth
- permalink
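As one way to work with the exports, the sketch below loads the two CSV files and joins comments to their posts; it assumes pandas is installed, and the timestamped file names are placeholders:

```python
# Illustrative: load the CSV exports and join comments to their posts.
# File names are placeholders; substitute the timestamped files in output/.
import pandas as pd

posts = pd.read_csv("output/python_posts_20240101_120000.csv")
comments = pd.read_csv("output/python_comments_20240101_120000.csv")

# Attach each comment to its parent post's title and score
merged = comments.merge(
    posts[["post_id", "title", "score"]],
    on="post_id",
    suffixes=("_comment", "_post"),
)

print(merged[["title", "author", "body", "score_comment"]].head())
```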
```
reddit-research-toolkit/
├── main.py # Simple scraper entry point
├── simple_scraper.py # Simple scraper (posts + comments → CSV/JSON)
├── requirements.txt # Core dependencies
├── .env.example # Example credentials (copy to .env)
├── config.yaml.example # Example config (copy to config.yaml)
├── core/
│ └── reddit_scraper.py # Universal scraper (user-centric, SQLite)
├── analyzers/
│ ├── temporal_analyzer.py # Time-series analysis
│ ├── text_analyzer.py # NLP / topic modeling
│ └── network_analyzer.py # User interaction networks
├── tools/
│ └── cli.py # Full CLI for scrape/analyze/config/status
├── config/
│ ├── config_manager.py # Config loading and validation
│ ├── requirements.txt # Full dependency list (analysis features)
│ └── templates/ # YAML config templates
├── examples/ # Usage examples
├── scripts/ # Helper shell scripts
└── README.md
```
- Rate Limiting: The scraper respects Reddit's API rate limits automatically
- Reddit API Limits: Reddit's API returns a maximum of ~1000 posts per request
- Large Subreddits: For very large subreddits, consider using filters (time period, sort method)
- Comments: Scraping comments can take longer; use `--no-comments` for faster posts-only scraping
- Credentials: Never commit your `.env` file with real credentials to version control
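If you still hit occasional 429 responses despite PRAW's built-in throttling, a simple manual backoff around your calls is one option. This is an illustrative sketch, not something the toolkit necessarily does:

```python
# Illustrative only: a manual backoff wrapper for occasional 429 responses.
# PRAW already throttles requests on its own, so this is a belt-and-braces extra.
import time

# Recent prawcore versions raise TooManyRequests on HTTP 429
from prawcore.exceptions import TooManyRequests


def fetch_with_retry(fetch, retries=3, wait=60):
    """Call fetch() and retry after a growing pause if Reddit returns 429."""
    for attempt in range(retries):
        try:
            return fetch()
        except TooManyRequests:
            if attempt == retries - 1:
                raise
            time.sleep(wait * (attempt + 1))


# Example: posts = fetch_with_retry(lambda: list(subreddit.hot(limit=100)))
```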
If something goes wrong:

- Authentication errors: Make sure you have a `.env` file with valid credentials. See step 3 in Quick Start.
- Rate-limit errors: Reddit is rate limiting your requests. The scraper should handle this automatically, but you may need to wait a bit and try again.
- Missing comments: Some posts may have deleted comments, or the subreddit might not allow comments. Check the `num_comments` field in the posts data.
- Slow scraping: Scraping comments can be time-consuming. Use `--no-comments` to scrape only posts, or reduce the `--limit` value.
This project is licensed under the MIT License - see the LICENSE file for details.
- Built upon the excellent PRAW (Python Reddit API Wrapper)
- Network analysis powered by NetworkX
- Text analysis using NLTK and scikit-learn
- Visualizations created with Plotly
If you use this toolkit in academic research, please cite:
```bibtex
@software{reddit_research_toolkit,
  title={Reddit Research Toolkit: Comprehensive Social Media Data Collection and Analysis},
  author={zmuhls},
  year={2024},
  url={https://github.com/zmuhls/reddit-research-toolkit}
}
```
- Respect Reddit's Terms of Service: This toolkit is designed to work within Reddit's API guidelines
- Rate Limiting: Built-in delays prevent server overload
- Data Privacy: No personal information beyond usernames is collected
- Academic Use: Designed for research purposes with appropriate ethical oversight
- Digital Humanities: Online community analysis, discourse studies
- Computer Science: Social network analysis, natural language processing
- Psychology: Online behavior patterns, community psychology
- Marketing: Consumer sentiment, brand analysis
- Sociology: Digital sociology, community formation studies
- PRAW Documentation: Python Reddit API Wrapper
- Reddit Research Tools: Alternative Reddit data access
- Social Network Analysis Papers: Academic research using similar methodologies
- Issues: Report bugs and request features via GitHub Issues
- Discussions: Join conversations in GitHub Discussions
Happy scraping and analyzing! 🎉
Remember to always respect Reddit's terms of service and rate limits. This toolkit is designed for ethical research and analysis purposes.