# RS-simple-scraper

A simple Rust-based web scraper for Korean community websites. This tool periodically scrapes posts from DC Inside, Femco, and MLB Park, saves them to JSON files, and downloads images from specified posts.

## Features
- Multi-site Scraping: Supports DC Inside (dc), Femco (fm), and MLB Park (mp/mp_low)
- Automatic Image Download: Downloads images from DC Inside posts that match download criteria
- JSON Storage: Saves scraped posts to JSON files with timestamps
- Continuous Monitoring: Runs in a loop, checking for new posts every 5 minutes
- Firefox Integration: Uses Selenium WebDriver for JavaScript-heavy pages
- Nick Filtering: Filters out posts from specified users
- Timezone Support: Uses Seoul timezone for timestamps
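Seoul (KST) is a fixed UTC+9 offset with no daylight saving, which keeps timestamp arithmetic simple. The project handles dates with `chrono`; the following dependency-free sketch (the helper name is mine, not the project's) shows the underlying arithmetic for deriving the KST time of day from a Unix timestamp:

```rust
/// Seoul (KST) is UTC+9 with no daylight saving, so the local time of day
/// can be derived from a Unix timestamp with plain arithmetic.
/// (Illustrative helper; the project itself uses chrono.)
fn kst_time_of_day(unix_secs: u64) -> (u64, u64, u64) {
    let local = unix_secs + 9 * 3600;   // shift UTC to UTC+9
    let secs_of_day = local % 86_400;   // seconds since local midnight
    (secs_of_day / 3600, (secs_of_day % 3600) / 60, secs_of_day % 60)
}
```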
## Project Structure

```
.
├── src/
│   ├── main.rs      # Main application logic
│   ├── utils.rs     # Utility functions for HTTP requests and file operations
│   └── foxfox.rs    # Firefox WebDriver integration
├── site.json        # Configuration for sites to scrape
├── save.json        # Configuration for save paths
├── down.json        # Configuration for download targets
├── nick.json        # Configuration for filtered nicknames
├── Cargo.toml
├── Cargo.lock
└── README.md
```
## Dependencies

- `tokio`: Asynchronous runtime
- `reqwest`: HTTP client
- `scraper`: HTML parsing
- `serde`: Serialization
- `chrono`: Date/time handling
- `thirtyfour`: Selenium WebDriver client, used with Firefox
- `regex`: Regular expressions
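For reference, the list above corresponds to a `Cargo.toml` along these lines (the version numbers and feature flags are illustrative assumptions, not taken from the repository):

```toml
[dependencies]
tokio = { version = "1", features = ["full"] }  # feature set assumed
reqwest = "0.11"
scraper = "0.13"
serde = { version = "1", features = ["derive"] }
chrono = "0.4"
thirtyfour = "0.27"
regex = "1"
```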
## Setup

1. **Install Rust**: Make sure you have Rust installed. Download it from [rustup.rs](https://rustup.rs).

2. **Clone the repository**:

   ```sh
   git clone https://github.com/sipubot/RS-simple-scraper.git
   cd RS-simple-scraper
   ```

3. **Build the project** (Cargo fetches the dependencies automatically):

   ```sh
   cargo build --release
   ```

4. **Install GeckoDriver**: For Firefox automation, download GeckoDriver from [mozilla/geckodriver](https://github.com/mozilla/geckodriver) and make sure it is on your `PATH`.

5. **Configure the JSON files**:
   - `site.json`: list of sites to scrape, with host and URL
   - `save.json`: save paths for each host
   - `down.json`: download targets (titles to match for image downloads)
   - `nick.json`: nicknames to filter out

6. **Run the scraper**:

   ```sh
   cargo run --release
   ```

The application will start monitoring the configured sites and save new posts to JSON files. Images will be downloaded to the specified paths when matching posts are found.
## Configuration Examples

`site.json`: sites to scrape

```json
[
  {
    "host": "dc",
    "url": "https://gall.dcinside.com/board/lists/?id=baseball_new12&exception_mode=recommend"
  }
]
```

`save.json`: save path per host

```json
[
  {
    "host": "dc",
    "json_path": "./data/dc_posts.json"
  }
]
```

`down.json`: download targets

```json
[
  {
    "host": "dc",
    "title": "some_title",
    "path": "./downloads/dc/"
  }
]
```

`nick.json`: filtered nicknames

```json
[
  {
    "nick": "filtered_user"
  }
]
```

## How It Works

- Loads configuration from JSON files
- Enters a loop that runs every 5 minutes
- For each configured site:
  - Fetches the HTML content
  - Parses posts using CSS selectors
  - Filters posts by time (last 48 hours) and nickname
  - Saves new posts to JSON
- For DC Inside posts matching download criteria:
  - Uses the Firefox WebDriver to load the page (handles dynamic content)
  - Extracts image URLs
  - Downloads images with proper referer headers
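The time-and-nickname filtering step above can be sketched with only the standard library. The struct and function names here are illustrative, not the project's actual types:

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Minimal stand-in for a scraped post; `timestamp` is Unix seconds.
struct Post {
    title: String,
    nick: String,
    timestamp: u64,
}

/// Keep posts newer than 48 hours whose author is not on the filter list,
/// mirroring the filtering step described above.
fn filter_posts(posts: Vec<Post>, filtered_nicks: &[String], now: SystemTime) -> Vec<Post> {
    let cutoff = now - Duration::from_secs(48 * 3600);
    let cutoff_secs = cutoff
        .duration_since(UNIX_EPOCH)
        .expect("time before Unix epoch")
        .as_secs();
    posts
        .into_iter()
        .filter(|p| p.timestamp >= cutoff_secs)
        .filter(|p| !filtered_nicks.iter().any(|n| n == &p.nick))
        .collect()
}
```

Taking `now` as a parameter keeps the 48-hour cutoff testable; the real scraper would simply pass the current time.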
## Notes

- The scraper respects rate limits by running only once every 5 minutes
- Images are downloaded with referer headers to avoid 403 errors
- Posts older than 48 hours are automatically cleaned up
- Logs are saved to the `./log/` directory with monthly rotation
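Monthly rotation amounts to one log file per calendar year and month. The project derives dates with `chrono`; as a dependency-free illustration (both helper names are mine), the year and month can be computed from a Unix timestamp with Howard Hinnant's civil-calendar algorithm:

```rust
/// Compute (year, month) from a Unix timestamp using Howard Hinnant's
/// civil-from-days algorithm. (Sketch only; the project uses chrono.)
fn year_month(unix_secs: i64) -> (i64, u32) {
    let z = unix_secs.div_euclid(86_400) + 719_468; // days since 0000-03-01
    let era = z.div_euclid(146_097);
    let doe = z - era * 146_097;                    // day of era [0, 146096]
    let yoe = (doe - doe / 1460 + doe / 36_524 - doe / 146_096) / 365;
    let y = yoe + era * 400;
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of year, March-based
    let mp = (5 * doy + 2) / 153;                      // month index, 0 = March
    let m = (if mp < 10 { mp + 3 } else { mp - 9 }) as u32;
    (if m <= 2 { y + 1 } else { y }, m)
}

/// Hypothetical helper: monthly rotation means one file per (year, month).
fn monthly_log_path(unix_secs: i64) -> String {
    let (y, m) = year_month(unix_secs);
    format!("./log/{:04}-{:02}.log", y, m)
}
```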
## Technical Details

- Built with Rust 2018 edition
- Uses async/await for concurrent operations
- Modular design with separate modules for scraping logic
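The overall control flow (scrape every configured site, sleep five minutes, repeat) can be sketched synchronously. The real implementation runs on tokio's async runtime, and `scrape` here is a hypothetical callback standing in for the per-site fetch/parse/save work:

```rust
use std::thread::sleep;
use std::time::Duration;

/// Minimal synchronous sketch of the polling cycle described above:
/// scrape each site, then wait for the interval and repeat.
/// `rounds` bounds the loop so the sketch is testable; the real
/// scraper loops indefinitely with a 5-minute interval.
fn run_cycle(sites: &[&str], rounds: usize, interval: Duration, mut scrape: impl FnMut(&str)) {
    for _ in 0..rounds {
        for site in sites {
            scrape(site); // fetch + parse + save for one site
        }
        sleep(interval); // 5 minutes in the real scraper
    }
}
```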
## License

This project is private and for personal use.

## Contact

SIPU (ddasik00@naver.com)