End-to-end mini-pipeline to collect Brazilian restaurant reviews from TripAdvisor (PT-BR), normalize + enrich them into a structured dataset, and then analyze the dataset with statistical plots and text-mining.
At a high level:
- We collect review cards from locally saved TripAdvisor pages.
- We parse/normalize fields (dates, ratings, categories, locations) and enrich with AI-powered representations (embeddings + 2D projections).
- We explore the dataset in a notebook: distributions, correlations, seasonality, topics/keywords, and outliers.
This repo uses Conda via environment.yml.
conda env create -f environment.yml
conda activate restaurantreviewer
If you will run embeddings or LLM calls, create a .env file with:
OPENAI_API_KEY=...your-key...
See .env.example for a template.
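A minimal sketch of how the key could be read from .env at runtime, using only the standard library. The helper names parse_env and load_env are hypothetical; the repo's own services may load the file differently (for example through a dotenv package).

```python
import os
from pathlib import Path


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value
        values[key.strip()] = value.strip().strip("\"'")
    return values


def load_env(path=".env"):
    """Copy values from the file into os.environ without overriding existing ones."""
    env_file = Path(path)
    if not env_file.exists():
        return {}
    values = parse_env(env_file.read_text(encoding="utf-8"))
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values
```

Calling load_env() once at startup and then reading os.environ.get("OPENAI_API_KEY") keeps the key out of the source code.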
- 1-extract-data.py (main)
  - Purpose: load each HTML page under full_page/tripadvisor/ in a browser (Selenium) and extract each review card (data-automation="reviewCard").
  - Outcome: saves individual cards to raw_data/tripadvisor/card_<page>_<idx>.html (pretty-indented for debugging).
- 2-normalize-and-enrich.py (main)
  - Purpose: parse the raw card HTML files, normalize to a consistent schema, and enrich with AI features (text embeddings + dimensionality reduction columns).
  - Outcome: writes the final dataset CSV.
- dataframes/tripadvisor.csv (main)
  - Purpose: the canonical dataset produced by the pipeline.
  - Outcome: 1 row per review with normalized fields (ratings, text, dates, location, etc.) plus enrichment columns.
- analysis.ipynb (main)
  - Purpose: statistical + textual analysis of the dataset.
  - Outcome: plots and tables for review behavior, seasonality, sponsorship effects, topic discovery, keyword associations, sentiment proxy, dimensionality reduction visualizations, and outlier detection.
- generate_reports.py (main)
  - Purpose: generate comprehensive analysis reports in English and Portuguese based on the dataset.
  - Outcome: markdown reports in the reports/ folder with sentiment analysis, behavioral patterns, temporal trends, recommendations, and key findings.
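The card-extraction step in 1-extract-data.py can be illustrated with a simplified, standard-library sketch. It is not the script's implementation: the real script drives a browser with Selenium and saves each card's HTML, whereas this version parses a static string and keeps only the visible text. The class name ReviewCardCollector, the helper card_filename, and the meaning given to `<page>` (a page identifier) are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change nesting depth
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ReviewCardCollector(HTMLParser):
    """Collect the visible text of every element marked data-automation="reviewCard"."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.cards = []    # text of each finished card
        self._depth = 0    # nesting depth inside the current card (0 = outside)
        self._chunks = []  # text fragments of the card being read

    def handle_starttag(self, tag, attrs):
        if self._depth == 0:
            if dict(attrs).get("data-automation") == "reviewCard":
                self._depth = 1
                self._chunks = []
        elif tag not in VOID_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._depth == 0 or tag in VOID_TAGS:
            return
        self._depth -= 1
        if self._depth == 0:  # the card's own closing tag
            self.cards.append(" ".join(self._chunks))

    def handle_data(self, data):
        if self._depth and data.strip():
            self._chunks.append(data.strip())


def card_filename(page, idx):
    """Build a name following the card_<page>_<idx>.html pattern."""
    return f"card_{page}_{idx}.html"


SAMPLE = """
<div data-automation="reviewCard"><span>Excelente</span><br><div><p>Comida boa</p></div></div>
<div class="ad">ignored</div>
<div data-automation="reviewCard"><span>Ruim</span></div>
"""

collector = ReviewCardCollector()
collector.feed(SAMPLE)
names = [card_filename("page1", i) for i, _ in enumerate(collector.cards)]
```

The depth counter is what lets nested div elements inside a card stay attached to that card instead of ending it early.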
analysis.ipynb and generate_reports.py both use the same dataset (dataframes/tripadvisor.csv), but with different goals:
- analysis.ipynb: exploratory, interactive analysis (plots/tables + deeper text/embedding exploration).
- generate_reports.py: batch generator that writes a consistent, shareable narrative report in EN + PT-BR to reports/.
For a detailed metric-by-metric mapping (what the script computes vs which notebook cells cover it), see generate_reports_vs_analysis.md.
- Download full page content to full_page/tripadvisor/*.html
- 1-extract-data.py: extract HTML cards
- 2-normalize-and-enrich.py: normalize and enrich the reviews in raw_data/tripadvisor/card_*.html
- dataframes/tripadvisor.csv: dataframe created
- analysis.ipynb: present data analysis
- generate_reports.py: generate comprehensive reports
- Put the TripAdvisor pages you saved locally into full_page/tripadvisor/.
- Extract raw review-card HTML:
  python 1-extract-data.py
- Normalize + enrich into a dataset:
  python 2-normalize-and-enrich.py
- Open analysis.ipynb and run all cells.
- Generate comprehensive reports:
  python generate_reports.py
The goal of this analysis task was to create comprehensive, data-driven reports that analyze restaurant reviews from TripAdvisor. The reports aim to:
- Analyze sentiment - Understand customer satisfaction levels and identify positive vs. negative sentiment patterns
- Examine behavioral patterns - Study how reviewers interact with the platform (review length, image sharing, contribution levels)
- Extract key themes - Identify main compliments and complaints through keyword analysis
- Detect relationships - Explore correlations between ratings, review characteristics, temporal patterns, and sponsorship
- Assess consensus - Determine if public opinion is unified, divided, or polarized
- Provide recommendations - Offer actionable insights for restaurant improvement
We developed an automated Python-based analysis pipeline (generate_reports.py) that:
- Loads the structured dataset from dataframes/tripadvisor.csv (75 reviews, 29 columns)
- Computes statistical metrics including distributions, averages, and correlations across all available dimensions
- Performs text analysis by extracting and ranking keywords while filtering Portuguese stopwords
- Segments data by rating levels, temporal periods, sponsorship status, and user characteristics
- Generates insights through comparative analysis of different customer segments
- Produces bilingual reports in both English and Portuguese with identical analytical depth
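The keyword step described above (extract words, drop Portuguese stopwords, rank by frequency) can be sketched as follows. This is an illustration, not the code in generate_reports.py: the function name top_keywords, the tiny stopword set, and the minimum word length are assumptions, and the sample reviews are invented.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (assumption); the script's real list is not shown here
PT_STOPWORDS = {"a", "o", "e", "de", "da", "do", "que", "em", "um", "uma",
                "muito", "com", "para", "mas", "foi", "não"}


def top_keywords(reviews, n=30, min_len=3):
    """Return the n most frequent non-stopword tokens as (word, count) pairs."""
    counts = Counter()
    for text in reviews:
        # [^\W\d_]+ matches runs of letters, including accented ones
        tokens = re.findall(r"[^\W\d_]+", text.lower())
        counts.update(t for t in tokens
                      if len(t) >= min_len and t not in PT_STOPWORDS)
    return counts.most_common(n)


# Invented sample reviews, not taken from the dataset
sample_reviews = [
    "A comida foi muito boa",
    "Comida boa, mas o atendimento foi lento",
    "Atendimento lento e caro",
]
keywords = dict(top_keywords(sample_reviews))
```

Running the same function separately on the 4-5★ and 1-2★ subsets is one way to contrast positive and negative themes.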
✅ Successfully accomplished:
- Created /reports folder with both language versions
- Generated comprehensive English report (en-us_report.md)
- Generated comprehensive Portuguese report (pt-br_report.md)
- Both reports include 9 major sections with detailed subsections
- Analyzed all key metrics: sentiment (65.3% positive), behavioral patterns, temporal trends, sponsorship effects
- Extracted top 30 keywords and identified recurring themes
- Provided actionable recommendations based on data insights
- Maintained identical analytical structure across both language versions
Both reports contain:
- Executive Summary - High-level overview of findings
- Dataset Overview - Basic statistics, rating distribution, review characteristics
- Sentiment Analysis - Positive/negative/neutral breakdown, consensus assessment, review length patterns
- Behavioral Patterns - Geographic distribution, dining company, review length vs rating, image sharing behavior
- Temporal Analysis - Reviews by year, rating trends over time, weekday vs weekend patterns
- Sponsored Content Analysis - Comparison of sponsored vs non-sponsored reviews
- Keywords and Themes - Top 30 most common words, positive and negative themes identified
- Rating Components - Sub-score analysis (food, service, cost, ambiance where available)
- Key Findings and Recommendations - Main compliments, complaints, actionable recommendations, success factors
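As an illustration of the weekday vs weekend comparison listed under Temporal Analysis, here is a minimal sketch using only the standard library. The function name weekday_weekend_means is hypothetical, and the sample rows are invented for the example; they are not values from the dataset.

```python
from datetime import date
from statistics import mean

# Invented sample rows (ISO date, rating); not taken from the dataset
SAMPLE_ROWS = [
    ("2022-01-17", 5),  # Monday
    ("2022-01-22", 3),  # Saturday
    ("2025-12-28", 4),  # Sunday
    ("2025-12-29", 1),  # Monday
]


def weekday_weekend_means(rows):
    """Average rating for reviews dated on weekdays vs weekends."""
    groups = {"weekday": [], "weekend": []}
    for iso_day, rating in rows:
        # date.weekday(): Monday is 0, Saturday is 5, Sunday is 6
        is_weekend = date.fromisoformat(iso_day).weekday() >= 5
        groups["weekend" if is_weekend else "weekday"].append(rating)
    # Skip empty groups so mean() never receives an empty list
    return {name: mean(vals) for name, vals in groups.items() if vals}


result = weekday_weekend_means(SAMPLE_ROWS)
```

With only 75 reviews in the dataset, differences between such groups should be read as indicative rather than conclusive.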
- Overall sentiment: 65.3% positive (4-5★), 21.3% negative (1-2★), 13.3% neutral (3★)
- Average rating: 3.81 out of 5 stars
- Negative reviews are longer: Dissatisfied customers write 698-character reviews on average vs. 376 for positive reviews
- Image behavior: Positive reviewers (4-5★) share images 57% of the time vs. only 14-22% for negative reviewers
- Rating trend: Average rating varied by year, with 2023 showing the lowest average (3.43★) and 2022 and 2025 the highest (~4.2★)
- Main strengths: Food quality, Chef Jacquin's reputation, beautiful ambiance
- Main weaknesses: Service inconsistency, wait times, management issues, pricing concerns
📄 English Report (en-us_report.md)
📄 Portuguese Report (pt-br_report.md)
- selenium, webdriver-manager for browser automation and stable local runs.
- beautifulsoup4 + lxml for HTML parsing.
- tqdm for progress bars.
- pandas, numpy
- matplotlib, seaborn
- wordcloud
- scikit-learn for:
  - dimensionality reduction (PCA, TruncatedSVD, t-SNE, NMF)
  - topic discovery & clustering (TfidfVectorizer, KMeans)
  - outlier detection (e.g., LocalOutlierFactor, IsolationForest, etc. in the notebook)
- Embeddings: the pipeline builds text embeddings for titles and reviews, then projects them into 2D with PCA/SVD/t-SNE to visualize structure and detect outliers.
  - The current implementation in services/embeddings_service.py calls OpenAI embeddings (defaults to text-embedding-3-large).
- LLM summaries: services/chatgpt_service.py provides a helper to generate a PT-BR sentiment summary of a list of reviews.
Numbers below refer to the dataset currently saved at dataframes/tripadvisor.csv:
- Total reviews: 75 rows / 29 columns
- Date range: 2022-01-17 → 2025-12-29
- Ratings distribution:
  - 1★: 7
  - 2★: 9
  - 3★: 10
  - 4★: 14
  - 5★: 35
- Sponsored reviews (is_parceria_patrocinada=True): 5 (6.67%)
- Reviews with at least one image: 33 (44%)
- Mean images per review: 1.867 (max 13)
- Top states by volume (UF parsed from cidade_e_estado): SP (28), MG (9), RJ (7), RS (4), PE (3), BA (2)
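The headline figures in this README follow directly from the rating counts above. The short check below recomputes them; only the counts come from the dataset snapshot, while the variable and function names are illustrative.

```python
# Rating counts from the dataset snapshot (stars -> number of reviews)
rating_counts = {1: 7, 2: 9, 3: 10, 4: 14, 5: 35}

total = sum(rating_counts.values())
mean_rating = sum(stars * n for stars, n in rating_counts.items()) / total


def share(stars):
    """Percentage of all reviews whose rating is in `stars`."""
    return 100 * sum(rating_counts[s] for s in stars) / total


positive = share((4, 5))  # 4-5 stars
neutral = share((3,))     # 3 stars
negative = share((1, 2))  # 1-2 stars

sponsored_pct = 100 * 5 / total     # 5 sponsored reviews
with_image_pct = 100 * 33 / total   # 33 reviews with at least one image

print(f"{total} reviews, mean {mean_rating:.2f}")
print(f"positive {positive:.1f}%, neutral {neutral:.1f}%, negative {negative:.1f}%")
```

The three sentiment shares are each rounded to one decimal, which is why they sum to 99.9% rather than exactly 100%.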
- This project assumes the TripAdvisor pages are already saved locally (under full_page/tripadvisor/).
- Please respect TripAdvisor’s terms and applicable laws when collecting data.