Skip to content

Latest commit

 

History

History
42 lines (29 loc) · 1.91 KB

File metadata and controls

42 lines (29 loc) · 1.91 KB

advanced-data-mining-project

The project provides tools for scraping restaurant textual reviews, which are subsequently analysed using either traditional or advanced NLP approaches in order to get insight into relationship between written-form review and actual rating, helpfulness etc.

We recommend you to check out the page describing the conducted experiments.

Project structure

- .devcontainer   # Devcontainer setup.
- doc/            # Documentation, experiments description.
- scripts/        # Contains scripts for running scraping process, EDA etc.
- src/            # Source code.
- justfile        # Contains setup recipes, check out for installing deps etc.
- pyproject.toml  # Core project configuration (version, dependencies, dev deps).

Usage

The project's public API, available for the user, can be found in the scripts directory. It contains scripts for running data scraping & processing pipelines, models training and obtaining visualizations and summaries.

just setup_env
just build_project
uv run python scripts/scrape_google_reviews.py +proxy.server=SERVER +proxy.username=USER +proxy.password=PASSWORD

The recommended workflow of using the project assumes the following order of running the scripts:

  • scrape_google_reviews - create the raw dataset
  • process_dataset - extract all numerical features from the data
  • perform_eda - collect and visualize statistics describing the processed data
  • train_model - train a neural network that predicts the review sentiment
  • summarize_experiment - if multiple models are trained, use this script to compose stats and visualizations based on the test results

Development

In Contribution.md, there's a list of best practices a developer of this repository should follow.

Changelog

The changes made to the project are recorded to the Changelog.md file.