RestaurantReviewer

End-to-end mini-pipeline that collects Brazilian restaurant reviews from TripAdvisor (PT-BR), normalizes and enriches them into a structured dataset, and then analyzes that dataset with statistical plots and text mining.

At a high level:

We collect review cards from locally saved TripAdvisor pages.
We parse/normalize fields (dates, ratings, categories, locations) and enrich with AI-powered representations (embeddings + 2D projections).
We explore the dataset in a notebook: distributions, correlations, seasonality, topics/keywords, and outliers.

Install

This repo uses Conda via environment.yml.

conda env create -f environment.yml
conda activate restaurantreviewer

If you plan to run the embedding or LLM steps, create a .env file with:

OPENAI_API_KEY=...your-key...

See .env.example for a template.
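A minimal sketch of how a script could pick up that key, assuming a plain KEY=value file. The helpers load_env_file and get_openai_key are hypothetical names used for illustration only; the repo's services may load the key differently (for example through a dotenv package).

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=value lines (hypothetical helper, not the repo's loader)."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_openai_key(path=".env"):
    """Prefer the process environment, then fall back to the .env file."""
    return os.environ.get("OPENAI_API_KEY") or load_env_file(path).get("OPENAI_API_KEY")
```

If get_openai_key() returns None, the embedding and LLM steps have no key to work with, so a caller would probably want to fail early with a clear message.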

Key files (pipeline)

1-extract-data.py (main)
- Purpose: load each HTML page under full_page/tripadvisor/ in a browser (Selenium) and extract each review card (data-automation="reviewCard").
- Outcome: saves individual cards to raw_data/tripadvisor/card_<page>_<idx>.html (pretty-indented for debugging).
2-normalize-and-enrich.py (main)
- Purpose: parse the raw card HTML files, normalize to a consistent schema, and enrich with AI features (text embeddings + dimensionality reduction columns).
- Outcome: writes the final dataset CSV.
dataframes/tripadvisor.csv (main)
- Purpose: the canonical dataset produced by the pipeline.
- Outcome: 1 row per review with normalized fields (ratings, text, dates, location, etc.) plus enrichment columns.
analysis.ipynb (main)
- Purpose: statistical + textual analysis of the dataset.
- Outcome: plots and tables for review behavior, seasonality, sponsorship effects, topic discovery, keyword associations, sentiment proxy, dimensionality reduction visualizations, and outlier detection.
generate_reports.py (main)
- Purpose: generate comprehensive analysis reports in English and Portuguese based on the dataset.
- Outcome: markdown reports in reports/ folder with sentiment analysis, behavioral patterns, temporal trends, recommendations, and key findings.
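To make the first stage more concrete, here is a rough sketch of locating review cards by their data-automation="reviewCard" attribute and deriving the card_<page>_<idx>.html names. It uses only the standard library instead of Selenium, and both the zero-based index and the page_name argument are assumptions; 1-extract-data.py may number and name things differently.

```python
from html.parser import HTMLParser


class ReviewCardCounter(HTMLParser):
    """Counts start tags carrying data-automation="reviewCard"."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if dict(attrs).get("data-automation") == "reviewCard":
            self.count += 1


def card_filenames(page_name, page_html):
    """Return one output filename per review card found in the page."""
    parser = ReviewCardCounter()
    parser.feed(page_html)
    # Assumption: cards are numbered from 0 in document order.
    return [f"card_{page_name}_{idx}.html" for idx in range(parser.count)]
```

This only counts cards and names the output files; saving each card's own HTML, as the real script does, needs a parser that tracks nesting, such as BeautifulSoup.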

`generate_reports.py` vs `analysis.ipynb`

Both use the same dataset (dataframes/tripadvisor.csv), but with different goals:

analysis.ipynb: exploratory, interactive analysis (plots/tables + deeper text/embedding exploration).
generate_reports.py: batch generator that writes a consistent, shareable narrative report in EN + PT-BR to reports/.

For a detailed metric-by-metric mapping (what the script computes vs which notebook cells cover it), see generate_reports_vs_analysis.md.

Data flow

full_page/tripadvisor/*.html: TripAdvisor pages downloaded and saved locally (input)
1-extract-data.py: extracts each review card into raw_data/tripadvisor/card_*.html
2-normalize-and-enrich.py: normalizes and enriches the raw cards
dataframes/tripadvisor.csv: the resulting dataset
analysis.ipynb: exploratory analysis of the dataset
generate_reports.py: generates the comprehensive reports in reports/

Run the pipeline

Put the TripAdvisor pages you saved locally into full_page/tripadvisor/.
Extract raw review-card HTML:

python 1-extract-data.py

Normalize + enrich into a dataset:

python 2-normalize-and-enrich.py

Open analysis.ipynb and run all cells.
Generate comprehensive reports:

python generate_reports.py
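The script steps above can also be chained from a single runner. This is only a sketch: build_commands and run_pipeline are hypothetical helpers that do not exist in the repo, and the notebook step is left out because it is run interactively.

```python
import subprocess
import sys

# Script names as listed in this README, in execution order.
PIPELINE_SCRIPTS = [
    "1-extract-data.py",
    "2-normalize-and-enrich.py",
    "generate_reports.py",
]


def build_commands(scripts=PIPELINE_SCRIPTS, python=sys.executable):
    """Build one argv list per pipeline script."""
    return [[python, script] for script in scripts]


def run_pipeline(commands):
    """Run each command in order and stop at the first failure."""
    for command in commands:
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(command, check=True)
```

Calling run_pipeline(build_commands()) from the repository root would run the three scripts in sequence inside the active Conda environment.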

Analysis Reports

Objective

The reports are meant to give a comprehensive, data-driven analysis of the TripAdvisor restaurant reviews. They aim to:

Analyze sentiment - Understand customer satisfaction levels and identify positive vs. negative sentiment patterns
Examine behavioral patterns - Study how reviewers interact with the platform (review length, image sharing, contribution levels)
Extract key themes - Identify main compliments and complaints through keyword analysis
Detect relationships - Explore correlations between ratings, review characteristics, temporal patterns, and sponsorship
Assess consensus - Determine if public opinion is unified, divided, or polarized
Provide recommendations - Offer actionable insights for restaurant improvement

Approach

We developed an automated Python-based analysis pipeline (generate_reports.py) that:

Loads the structured dataset from dataframes/tripadvisor.csv (75 reviews, 29 columns)
Computes statistical metrics including distributions, averages, and correlations across all available dimensions
Performs text analysis by extracting and ranking keywords while filtering Portuguese stopwords
Segments data by rating levels, temporal periods, sponsorship status, and user characteristics
Generates insights through comparative analysis of different customer segments
Produces bilingual reports in both English and Portuguese with identical analytical depth
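The keyword step can be pictured roughly as follows. This is a sketch, not the code in generate_reports.py: the stopword set below is a tiny illustrative sample, and top_keywords with its min_length parameter is a hypothetical helper.

```python
import re
from collections import Counter

# Small illustrative sample; the script's real stopword list is not shown in this README.
PT_STOPWORDS = {
    "a", "o", "e", "de", "do", "da", "que", "em", "um", "uma",
    "para", "com", "muito", "mas", "foi",
}


def top_keywords(reviews, stopwords=PT_STOPWORDS, n=30, min_length=3):
    """Count lowercase word tokens, skipping stopwords, digits, and short tokens."""
    counts = Counter()
    for text in reviews:
        # \w+ also matches accented Portuguese letters in Python 3 str patterns.
        for token in re.findall(r"\w+", text.lower()):
            if len(token) < min_length or token.isdigit() or token in stopwords:
                continue
            counts[token] += 1
    return counts.most_common(n)
```

Running it separately on the 4-5★ and 1-2★ subsets is one simple way to contrast compliment and complaint vocabulary.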

Success Criteria

✅ Successfully accomplished:

Created the reports/ folder with both language versions
Generated comprehensive English report (en-us_report.md)
Generated comprehensive Portuguese report (pt-br_report.md)
Both reports include 9 major sections with detailed subsections
Analyzed all key metrics: sentiment (65.3% positive), behavioral patterns, temporal trends, sponsorship effects
Extracted top 30 keywords and identified recurring themes
Provided actionable recommendations based on data insights
Maintained identical analytical structure across both language versions

Report Contents

Both reports contain:

Executive Summary - High-level overview of findings
Dataset Overview - Basic statistics, rating distribution, review characteristics
Sentiment Analysis - Positive/negative/neutral breakdown, consensus assessment, review length patterns
Behavioral Patterns - Geographic distribution, dining company, review length vs rating, image sharing behavior
Temporal Analysis - Reviews by year, rating trends over time, weekday vs weekend patterns
Sponsored Content Analysis - Comparison of sponsored vs non-sponsored reviews
Keywords and Themes - Top 30 most common words, positive and negative themes identified
Rating Components - Sub-score analysis (food, service, cost, ambiance where available)
Key Findings and Recommendations - Main compliments, complaints, actionable recommendations, success factors
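As an illustration of the weekday vs weekend comparison in the Temporal Analysis section, the sketch below groups ratings by the day of the week of each review date. The (ISO date, rating) input shape and the weekday_weekend_means helper are assumptions made for the example; the dataset's actual column names are not listed in this README.

```python
from datetime import date
from statistics import mean


def weekday_weekend_means(rows):
    """rows: iterable of (iso_date, rating) pairs; returns the mean rating per bucket."""
    buckets = {"weekday": [], "weekend": []}
    for iso_date, rating in rows:
        # date.weekday() returns 0 for Monday, so 5 and 6 are Saturday and Sunday.
        day_index = date.fromisoformat(iso_date).weekday()
        bucket = "weekend" if day_index >= 5 else "weekday"
        buckets[bucket].append(rating)
    # Buckets with no reviews are omitted instead of reporting a fake mean.
    return {name: mean(values) for name, values in buckets.items() if values}
```

With only 75 reviews in the current dataset, differences between the two buckets should be read as indicative rather than conclusive.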

Key Findings

Overall sentiment: 65.3% positive (4-5★), 21.3% negative (1-2★), 13.3% neutral (3★)
Average rating: 3.81 out of 5 stars
Negative reviews are longer: Dissatisfied customers write 698-character reviews on average vs. 376 for positive reviews
Image behavior: Positive reviewers (4-5★) share images 57% of the time vs. only 14-22% for negative reviewers
Rating trend: Average rating varied by year, with 2023 showing the lowest average (3.43★) and 2022/2025 the highest (~4.2★)
Main strengths: Food quality, Chef Jacquin's reputation, beautiful ambiance
Main weaknesses: Service inconsistency, wait times, management issues, pricing concerns
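The sentiment split and the average rating follow directly from the rating counts listed under Results (current dataset). A short sketch of that arithmetic, where sentiment_summary is a hypothetical helper and the 4-5★ / 3★ / 1-2★ buckets are the ones defined above:

```python
# Rating counts copied from "Results (current dataset)".
RATING_COUNTS = {1: 7, 2: 9, 3: 10, 4: 14, 5: 35}


def sentiment_summary(counts):
    """Derive sentiment shares and the mean rating from per-star counts."""
    total = sum(counts.values())
    positive = counts.get(4, 0) + counts.get(5, 0)
    negative = counts.get(1, 0) + counts.get(2, 0)
    neutral = counts.get(3, 0)
    mean_rating = sum(stars * n for stars, n in counts.items()) / total
    return {
        "total": total,
        "positive_pct": round(100 * positive / total, 1),
        "negative_pct": round(100 * negative / total, 1),
        "neutral_pct": round(100 * neutral / total, 1),
        "mean_rating": round(mean_rating, 2),
    }
```

For the counts above this gives 75 reviews, 65.3% positive, 21.3% negative, 13.3% neutral, and a mean of 3.81. The three shares add up to 99.9% because each one is rounded separately.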

Access Reports

📄 English Report (en-us_report.md)

📄 Portuguese Report (pt-br_report.md)

Libraries used

Core scraping/parsing:

selenium for browser automation, with webdriver-manager to download and manage the matching browser driver for local runs.
beautifulsoup4 + lxml for HTML parsing.
tqdm for progress bars.

Data + visualization:

pandas, numpy
matplotlib, seaborn
wordcloud

Machine learning / text analytics:

scikit-learn for:
- dimensionality reduction (PCA, TruncatedSVD, t-SNE, NMF)
- topic discovery & clustering (TfidfVectorizer, KMeans)
- outlier detection (e.g., LocalOutlierFactor and IsolationForest in the notebook)

AI / LLM integration:

Embeddings: the pipeline builds text embeddings for titles and reviews, then projects them into 2D with PCA/SVD/t-SNE to visualize structure and detect outliers.
- The current implementation in services/embeddings_service.py calls OpenAI embeddings (defaults to text-embedding-3-large).
LLM summaries: services/chatgpt_service.py provides a helper to generate a PT-BR sentiment summary of a list of reviews.

Results (current dataset)

Numbers below refer to the dataset currently saved at dataframes/tripadvisor.csv:

Total reviews: 75 rows / 29 columns
Date range: 2022-01-17 → 2025-12-29
Ratings distribution:
- 1★: 7
- 2★: 9
- 3★: 10
- 4★: 14
- 5★: 35
Sponsored reviews (is_parceria_patrocinada=True): 5 (6.67%)
Reviews with at least one image: 33 (44%)
- Mean images per review: 1.867 (max 13)
Top states by volume (UF parsed from cidade_e_estado): SP (28), MG (9), RJ (7), RS (4), PE (3), BA (2)
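The state counts depend on pulling the two-letter UF out of cidade_e_estado. A sketch of one way to do that, assuming values shaped like "São Paulo, SP"; that format is an assumption, and parse_uf and top_states are hypothetical helpers, not the normalizer used in the repo.

```python
import re
from collections import Counter

# Assumed format: "<city>, <UF>" with a two-letter uppercase state code at the end.
UF_PATTERN = re.compile(r",\s*([A-Z]{2})\s*$")


def parse_uf(cidade_e_estado):
    """Return the trailing two-letter state code, or None when it is absent."""
    if not cidade_e_estado:
        return None
    match = UF_PATTERN.search(cidade_e_estado)
    return match.group(1) if match else None


def top_states(values, n=6):
    """Count parsed UFs and return the n most frequent as (uf, count) pairs."""
    return Counter(uf for uf in map(parse_uf, values) if uf).most_common(n)
```

Values without a recognizable UF, such as foreign locations or empty strings, are simply skipped, so the counts may add up to fewer than the total number of reviews.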

Notes

This project assumes the TripAdvisor pages are already saved locally (under full_page/tripadvisor/).
Please respect TripAdvisor’s terms and applicable laws when collecting data.
