RS Simple Scraper (TextMiner)

A simple Rust-based web scraper for Korean community websites. This tool periodically scrapes posts from DC Inside, Femco, and MLB Park, saves them to JSON files, and downloads images from specified posts.

Features

Multi-site Scraping: Supports DC Inside (dc), Femco (fm), and MLB Park (mp/mp_low)
Automatic Image Download: Downloads images from DC Inside posts that match download criteria
JSON Storage: Saves scraped posts to JSON files with timestamps
Continuous Monitoring: Runs in a loop, checking for new posts every 5 minutes
Firefox Integration: Uses Selenium WebDriver for JavaScript-heavy pages
Nick Filtering: Filters out posts from specified users
Timezone Support: Uses Seoul timezone for timestamps

Project Structure

.
├── src/
│   ├── main.rs          # Main application logic
│   ├── utils.rs         # Utility functions for HTTP requests, file operations
│   └── foxfox.rs        # Firefox WebDriver integration
├── site.json            # Configuration for sites to scrape
├── save.json            # Configuration for save paths
├── down.json            # Configuration for download targets
├── nick.json            # Configuration for filtered nicknames
├── Cargo.toml
├── Cargo.lock
└── README.md

Dependencies

tokio: Asynchronous runtime
reqwest: HTTP client
scraper: HTML parsing
serde: Serialization
chrono: Date/time handling
thirtyfour: Selenium WebDriver for Firefox
regex: Regular expressions

Setup

Install Rust: Make sure you have Rust installed. Download from rustup.rs

Clone the repository:

git clone https://github.com/sipubot/RS-simple-scraper.git
cd RS-simple-scraper

Build the project (Cargo fetches and compiles the dependencies automatically):
```
cargo build --release
```
Install GeckoDriver: For Firefox automation, download GeckoDriver from mozilla/geckodriver and ensure it's in your PATH.
Configure JSON files:
- site.json: List of sites to scrape with host and URL
- save.json: Save paths for each host
- down.json: Download targets (titles to match for image downloads)
- nick.json: Nicknames to filter out

Usage

Run the scraper:
```
cargo run --release
```
The application will start monitoring the configured sites and save new posts to JSON files.
Images will be downloaded to the specified paths when matching posts are found.

Configuration Files

site.json

[
    {
        "host": "dc",
        "url": "https://gall.dcinside.com/board/lists/?id=baseball_new12&exception_mode=recommend"
    }
]

save.json

[
    {
        "host": "dc",
        "json_path": "./data/dc_posts.json"
    }
]

down.json

[
    {
        "host": "dc",
        "title": "some_title",
        "path": "./downloads/dc/"
    }
]

nick.json

[
    {
        "nick": "filtered_user"
    }
]
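
To illustrate how these entries might be used together, here is a minimal sketch in plain Rust. The struct fields mirror the JSON keys shown above; the helper names `save_path_for` and `download_dir_for` are hypothetical, and the substring match on `title` is an assumption, since the exact matching rule is not documented here. The real project presumably loads these files with serde rather than building the structs by hand.

```rust
/// One entry of save.json (field names taken from the example above).
#[derive(Debug, Clone, PartialEq)]
pub struct SaveTarget {
    pub host: String,
    pub json_path: String,
}

/// One entry of down.json (field names taken from the example above).
#[derive(Debug, Clone, PartialEq)]
pub struct DownTarget {
    pub host: String,
    pub title: String,
    pub path: String,
}

/// Hypothetical helper: look up the JSON output path configured for a host.
pub fn save_path_for<'a>(targets: &'a [SaveTarget], host: &str) -> Option<&'a str> {
    targets
        .iter()
        .find(|t| t.host == host)
        .map(|t| t.json_path.as_str())
}

/// Hypothetical helper: look up the download directory for a post.
/// Assumption: a post matches when the host is equal and the post title
/// contains the configured `title` as a substring.
pub fn download_dir_for<'a>(
    targets: &'a [DownTarget],
    host: &str,
    post_title: &str,
) -> Option<&'a str> {
    targets
        .iter()
        .find(|t| t.host == host && post_title.contains(t.title.as_str()))
        .map(|t| t.path.as_str())
}
```

With the example files above, `save_path_for(&saves, "dc")` would yield `./data/dc_posts.json`, and a DC Inside post whose title contains `some_title` would map to `./downloads/dc/`.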

How It Works

Loads configuration from JSON files
Enters a loop that runs every 5 minutes
For each configured site:
- Fetches HTML content
- Parses posts using CSS selectors
- Filters posts by time (last 48 hours) and nickname
- Saves new posts to JSON
For DC Inside posts matching download criteria:
- Uses Firefox WebDriver to load the page (handles dynamic content)
- Extracts image URLs
- Downloads images with proper referer headers
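
The filtering step above (last 48 hours, excluded nicknames) can be sketched roughly as follows. This is not the project's actual code: the `Post` struct, the `filter_posts` function, and the use of Unix seconds for timestamps are all assumptions made for illustration (the project itself uses chrono with the Seoul timezone).

```rust
use std::collections::HashSet;

/// Posts older than this are ignored (48 hours, per the steps above).
pub const MAX_AGE_SECS: u64 = 48 * 60 * 60;

/// Hypothetical post record; the real struct likely has more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct Post {
    pub id: String,
    pub nick: String,
    pub title: String,
    /// Posting time as Unix seconds (an assumption for this sketch).
    pub posted_at: u64,
}

/// Keep posts that fall within the last 48 hours and whose author
/// is not in the blocked nickname set (the contents of nick.json).
pub fn filter_posts(posts: Vec<Post>, blocked: &HashSet<String>, now: u64) -> Vec<Post> {
    // saturating_sub avoids underflow when `now` is smaller than the window.
    let cutoff = now.saturating_sub(MAX_AGE_SECS);
    posts
        .into_iter()
        .filter(|p| p.posted_at >= cutoff && !blocked.contains(&p.nick))
        .collect()
}
```

Passing `now` in as a parameter, rather than reading the clock inside the function, keeps the filter deterministic and easy to test.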

Notes

The scraper polls each site only once every 5 minutes to keep the request rate low
Images are downloaded with referer headers to avoid 403 errors
Posts older than 48 hours are automatically cleaned up
Logs are written to the ./log/ directory with monthly rotation
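
As a rough illustration of the 48-hour cleanup, the sketch below merges newly scraped posts into the stored list and then drops expired entries. The `StoredPost` struct, the `merge_and_prune` function, deduplication by `id`, and Unix-second timestamps are assumptions for this example, not the project's confirmed implementation.

```rust
use std::collections::HashSet;

/// Retention window for stored posts (48 hours, per the note above).
pub const RETENTION_SECS: u64 = 48 * 60 * 60;

/// Hypothetical stored record, reduced to the fields this sketch needs.
#[derive(Debug, Clone, PartialEq)]
pub struct StoredPost {
    pub id: String,
    /// Posting time as Unix seconds (an assumption for this sketch).
    pub posted_at: u64,
}

/// Append posts whose id has not been seen yet, then remove every
/// post older than the retention window.
pub fn merge_and_prune(stored: &mut Vec<StoredPost>, fresh: Vec<StoredPost>, now: u64) {
    // Collect the ids already on disk so duplicates are skipped.
    let mut seen: HashSet<String> = stored.iter().map(|p| p.id.clone()).collect();
    for post in fresh {
        // `insert` returns false when the id was already present.
        if seen.insert(post.id.clone()) {
            stored.push(post);
        }
    }
    let cutoff = now.saturating_sub(RETENTION_SECS);
    stored.retain(|p| p.posted_at >= cutoff);
}
```

After this call the list could be serialized back to the JSON file configured in save.json.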

Development

Built with the Rust 2018 edition
Uses async/await for concurrent operations
Modular design with separate modules for scraping logic

License

This project is private and for personal use.

Author

SIPU ddasik00@naver.com
