
Reddit Research Toolkit

A straightforward tool for scraping posts and their comments from a subreddit and saving them to CSV or JSON files.

Features

  • Scrape posts from any subreddit
  • Automatically scrape all comments from each post
  • Export data to CSV or JSON format
  • Simple command-line interface
  • Progress bars to track scraping status
  • Filter by time period and sort method

Quick Start

1. Installation

# Clone or download this repository
cd reddit-research-toolkit

# Install dependencies
pip install -r requirements.txt

2. Get Reddit API Credentials

  1. Go to https://www.reddit.com/prefs/apps
  2. Click "Create App" or "Create Another App"
  3. Fill in the form:
    • name: Your app name (e.g., "My Scraper")
    • App type: Select "script"
    • redirect uri: http://localhost:8080 (required but not used)
  4. Click "Create app"
  5. Note your client_id (under the app name) and client_secret

3. Configure Credentials

Create a .env file in the project directory:

REDDIT_CLIENT_ID=your_client_id_here
REDDIT_CLIENT_SECRET=your_client_secret_here
REDDIT_USER_AGENT=MyRedditScraper/1.0
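
For reference, here is a minimal sketch of how these variables can be loaded and handed to PRAW (assuming python-dotenv and praw are installed; the actual wiring inside main.py may differ):

import os
import praw
from dotenv import load_dotenv

# Load the REDDIT_* variables from .env into the environment
load_dotenv()

reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    user_agent=os.environ["REDDIT_USER_AGENT"],
)

# Read-only access is sufficient for scraping public subreddits
print(reddit.read_only)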

4. Run the Scraper

# Scrape a subreddit (basic usage)
python main.py python

# Scrape with custom limit
python main.py python -l 500

# Scrape top posts from all time
python main.py python --sort top -t all

# Get subreddit info only
python main.py python --info

Usage Examples

Basic Scraping

# Scrape the first 1000 hot posts from r/python
python main.py python

# Scrape 500 posts
python main.py python -l 500

# Scrape without comments (faster)
python main.py python --no-comments
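
Comment scraping is the slow part: each post's "load more comments" stubs must be expanded with additional API calls. A sketch of the standard PRAW pattern, reusing the reddit instance from the credentials sketch above (the toolkit's own implementation may differ; the post id is hypothetical):

# Expand the full comment tree; limit=0 replaces every MoreComments stub
submission = reddit.submission(id="abc123")
submission.comments.replace_more(limit=0)
for comment in submission.comments.list():  # flattened comment forest
    print(comment.id, comment.parent_id, comment.body[:60])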

Advanced Options

# Scrape top posts from the past year
python main.py MachineLearning --sort top -t year -l 1000

# Scrape new posts and save as JSON
python main.py technology --sort new -f json

# Save in both CSV and JSON formats
python main.py datascience -f both

Get Subreddit Information

# Display info about a subreddit
python main.py python --info

Command-Line Options

positional arguments:
  subreddit             Subreddit name to scrape (without r/)

optional arguments:
  -h, --help            Show help message
  -l, --limit LIMIT     Maximum number of posts to scrape (default: 1000)
  --sort {hot,new,top,rising}
                        How to sort posts (default: hot)
  -t, --time-filter {hour,day,week,month,year,all}
                        Time filter for "top" sort (default: all)
  -f, --format {csv,json,both}
                        Output format (default: csv)
  --no-comments         Skip scraping comments (posts only)
  --info                Display subreddit information and exit
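
For reference, the interface above maps onto a straightforward argparse definition; a sketch of how main.py might declare it (illustrative, not a copy of the actual source):

import argparse

parser = argparse.ArgumentParser(description="Scrape a subreddit to CSV/JSON")
parser.add_argument("subreddit", help="Subreddit name to scrape (without r/)")
parser.add_argument("-l", "--limit", type=int, default=1000,
                    help="Maximum number of posts to scrape")
parser.add_argument("--sort", choices=["hot", "new", "top", "rising"],
                    default="hot", help="How to sort posts")
parser.add_argument("-t", "--time-filter", default="all",
                    choices=["hour", "day", "week", "month", "year", "all"],
                    help='Time filter for "top" sort')
parser.add_argument("-f", "--format", choices=["csv", "json", "both"],
                    default="csv", help="Output format")
parser.add_argument("--no-comments", action="store_true",
                    help="Skip scraping comments")
parser.add_argument("--info", action="store_true",
                    help="Display subreddit information and exit")
args = parser.parse_args()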

Output Files

The scraper creates an output/ directory with the following files:

CSV Format

  • {subreddit}_posts_{timestamp}.csv - All post data
  • {subreddit}_comments_{timestamp}.csv - All comment data

JSON Format

  • {subreddit}_data_{timestamp}.json - Combined posts and comments

Data Fields

Posts CSV/JSON includes:

  • post_id, title, author, created_utc
  • score, upvote_ratio, num_comments
  • permalink, url, selftext
  • is_self, is_video, over_18, spoiler, stickied
  • flair, domain
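
created_utc is a Unix epoch timestamp; a quick pandas sketch for loading a posts CSV and converting it (the filename is illustrative):

import pandas as pd

posts = pd.read_csv("output/python_posts_20240101_120000.csv")
posts["created_utc"] = pd.to_datetime(posts["created_utc"], unit="s", utc=True)
print(posts[["title", "score", "created_utc"]].head())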

Comments CSV/JSON includes:

  • comment_id, post_id, parent_id
  • author, body, score
  • created_utc, edited
  • is_submitter, stickied, depth
  • permalink
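
Because every comment row carries post_id, parent_id, and depth, the flat comments file can be regrouped into threads. A minimal sketch with pandas (filename illustrative; ordering within a thread is simplified):

import pandas as pd

comments = pd.read_csv("output/python_comments_20240101_120000.csv")

# Print each post's comments indented by their depth in the tree
for post_id, thread in comments.groupby("post_id"):
    print(f"--- post {post_id} ---")
    for _, c in thread.sort_values("created_utc").iterrows():
        print("  " * int(c["depth"]) + f'{c["author"]}: {str(c["body"])[:60]}')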

Project Structure

reddit-research-toolkit/
├── main.py                    # Simple scraper entry point
├── simple_scraper.py          # Simple scraper (posts + comments → CSV/JSON)
├── requirements.txt           # Core dependencies
├── .env.example               # Example credentials (copy to .env)
├── config.yaml.example        # Example config (copy to config.yaml)
├── core/
│   └── reddit_scraper.py      # Universal scraper (user-centric, SQLite)
├── analyzers/
│   ├── temporal_analyzer.py   # Time-series analysis
│   ├── text_analyzer.py       # NLP / topic modeling
│   └── network_analyzer.py    # User interaction networks
├── tools/
│   └── cli.py                 # Full CLI for scrape/analyze/config/status
├── config/
│   ├── config_manager.py      # Config loading and validation
│   ├── requirements.txt       # Full dependency list (analysis features)
│   └── templates/             # YAML config templates
├── examples/                  # Usage examples
├── scripts/                   # Helper shell scripts
└── README.md

Important Notes

  • Rate Limiting: The scraper respects Reddit's API rate limits automatically
  • Reddit API Limits: Reddit's API returns at most ~1000 posts per listing, no matter how high the limit is set
  • Large Subreddits: For very large subreddits, combine filters (time period, sort method) to reach more posts; see the sketch after this list
  • Comments: Scraping comments can take longer; use --no-comments for faster posts-only scraping
  • Credentials: Never commit your .env file with real credentials to version control
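
A common workaround for the ~1000-post cap is to combine several "top" listings with different time filters and deduplicate by post id. A sketch with PRAW, reusing the reddit instance from the credentials sketch:

# Each listing is capped at ~1000 posts, but different time windows
# surface different posts; merge them and deduplicate by id
seen = {}
for window in ["week", "month", "year", "all"]:
    for post in reddit.subreddit("python").top(time_filter=window, limit=1000):
        seen.setdefault(post.id, post)
print(f"collected {len(seen)} unique posts")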

Troubleshooting

"Reddit API credentials not found"

Make sure you have a .env file with valid credentials. See step 3 in Quick Start.

"403 Forbidden" or Rate Limit Errors

Reddit is rate limiting your requests. The scraper should handle this automatically, but you may need to wait a bit and try again.

No Comments Scraped

Some posts may have had their comments deleted, or the posts may be locked or archived. Check the num_comments field in the posts data.

Slow Scraping

Scraping comments can be time-consuming. Use --no-comments to only scrape posts, or reduce the --limit value.

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

📚 Academic Usage

Citation

If you use this toolkit in academic research, please cite:

@software{reddit_research_toolkit,
  title={Reddit Research Toolkit: Comprehensive Social Media Data Collection and Analysis},
  author={zmuhls},
  year={2024},
  url={https://github.com/zmuhls/reddit-research-toolkit}
}

Ethical Considerations

  • Respect Reddit's Terms of Service: This toolkit is designed to work within Reddit's API guidelines
  • Rate Limiting: Built-in delays prevent server overload
  • Data Privacy: No personal information beyond usernames is collected
  • Academic Use: Designed for research purposes with appropriate ethical oversight

Research Applications

  • Digital Humanities: Online community analysis, discourse studies
  • Computer Science: Social network analysis, natural language processing
  • Psychology: Online behavior patterns, community psychology
  • Marketing: Consumer sentiment, brand analysis
  • Sociology: Digital sociology, community formation studies


Happy scraping and analyzing! 🎉

Remember to always respect Reddit's terms of service and rate limits. This toolkit is designed for ethical research and analysis purposes.
