LucasPoncet/wine-quality-forecasting

DL-Wine: Predicting Wine Quality from Weather Patterns

This repository contains modular pipelines for predicting French wine quality from historical weather data. To anticipate vintage quality from yearly climate features, the workflow combines:

  • Weather datasets from Météo-France
  • Wine ratings scraped from Vivino
  • Geographic AOC mapping and weather matching
  • Deep learning models (MLP, TabNet, FT-Transformer)

Pipeline Overview

flowchart TD

    %% RAW INPUTS
    WRAW["Raw Weather Data\n(Météo-France)"]
    WPREP["Weather Preprocessing\n(cleaning, yearly parquet)"]
    A["Vivino Raw Wine Data\n(scraped)"]

    %% WINE CLEANING
    B["extract_dominant_cepage"]

    %% COORDINATE PIPELINE
    C["build_wines_coord.py\n(AOC fuzzy match + coordinate inference)"]

    %% WEATHER → WINE FUSION
    D["merge_wine_weather.py\n(BallTree nearest station per year)"]

    %% FEATURES
    E["Feature Engineering\n(numeric + categorical)"]

    %% TRAINING
    F["Train Deep Models\n(MLP / TabNet / FT-Transformer)"]

    %% OUTPUTS
    G["Evaluation & Visualization"]

    %% WEATHER PIPELINE FLOW
    WRAW --> WPREP --> D

    %% WINE PIPELINE FLOW
    A --> B --> C --> D --> E --> F --> G

The diagram reflects the actual code structure in src/preprocessing, src/models, and src/visualization.


Repository Contents

  • Preprocessing pipelines (src/preprocessing/)
    • build_wines_coord.py: AOC matching, centroid computation, coordinate correction
    • merge_wine_weather.py: merge Vivino ratings with nearest weather stations
    • feature_engineering.py: compute derived numeric and categorical indicators for models
  • Model training (src/models/)
    • MLP, TabNet, and FT-Transformer architectures and training utilities
  • Visualization tools (src/visualization/)
    • Plotly maps for geographic data
    • Metric and distribution plots
  • Scraper module (src/scrapper/, under refactor)
    • Automated Vivino data extraction

Installation

This project uses uv for dependency and environment management.

1. Install uv

pipx install uv
# or
pip install uv

2. Install dependencies

uv sync

Install dev tools (pytest, ruff, mypy, pre-commit):

uv sync --group dev

3. Optional: Set PYTHONPATH

# macOS/Linux
export PYTHONPATH=$(pwd)

# Windows PowerShell
$env:PYTHONPATH = (Get-Location).Path

Data Overview

Weather

Daily weather observations from Météo-France are organized by French department and span approximately 1950–2025.

Processed files by year are stored in:

data/weather_by_year_cleaned/

(Intermediate folders such as data/weather/ or data/weather_by_year/ may also be present, depending on your local preprocessing steps.)

Wine

Vivino wine ratings and metadata (region, vintage, grape variety, rating) are stored in:

data/Wine/

Regional coordinates are defined in:

data/Wine/regions.csv

Corrected region centroids, derived from Vivino data, are stored in:

data/Wine/region_centroids_from_wines_corrected.csv

An interactive wine region map is published at: Wine map 🍷


Quick Start

1. Generate Wine Coordinates (AOC + Centroids)

python -m src.preprocessing.build_wines_coord

This script:

  • fuzzy-matches AOC polygons to Vivino regions,
  • computes centroids in a metric CRS then reprojects to WGS84,
  • applies manual centroid corrections,
  • writes cleaned coordinates to data/out/.
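
The fuzzy-matching step above can be illustrated with a minimal sketch. It assumes plain name-to-name matching with the standard library's difflib; the helper names (normalize, match_region), the sample AOC names, and the cutoff values are hypothetical, and the matching logic actually used in build_wines_coord.py may differ.

```python
import difflib
import unicodedata
from typing import Optional


def normalize(name: str) -> str:
    """Lowercase, strip accents, and replace hyphens so names compare fairly."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().replace("-", " ").split())


def match_region(vivino_region: str, aoc_names: list, cutoff: float = 0.8) -> Optional[str]:
    """Return the AOC name closest to a Vivino region, or None below the cutoff."""
    by_normalized = {normalize(name): name for name in aoc_names}
    hits = difflib.get_close_matches(
        normalize(vivino_region), list(by_normalized), n=1, cutoff=cutoff
    )
    return by_normalized[hits[0]] if hits else None


# Hypothetical sample names, for illustration only.
aoc_names = ["Saint-Émilion", "Pauillac", "Châteauneuf-du-Pape"]
print(match_region("Saint Emilion Grand Cru", aoc_names, cutoff=0.6))
print(match_region("Napa Valley", aoc_names))
```

Normalizing accents and hyphens before comparing keeps the similarity cutoff meaningful for French appellation names. Regions that fall below the cutoff return None, so they can be reviewed or corrected manually.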

2. Merge Vivino and Weather Data

python -m src.preprocessing.merge_wine_weather

This script:

  • expands regions across years (e.g. 2010–2024),
  • associates each region-year with the nearest weather station (within a distance threshold),
  • merges Vivino wines on (region, year),
  • saves:
    • data/out/vivino_wines_with_weather.csv
    • data/out/vivino_wines_with_weather.parquet
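
The region-year to station association described above can be sketched as follows. The pipeline diagram mentions a BallTree; this sketch swaps in a brute-force haversine search using only the standard library, which gives the same nearest neighbor on small inputs. The field names, the sample coordinates, and the 50 km threshold are assumptions for illustration, not values taken from merge_wine_weather.py.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_station(region, stations, max_km=50.0):
    """Return (station_id, distance_km), or None if no station lies within max_km."""
    best = min(
        ((s["id"], haversine_km(region["lat"], region["lon"], s["lat"], s["lon"]))
         for s in stations),
        key=lambda pair: pair[1],
        default=None,
    )
    if best is None or best[1] > max_km:
        return None
    return best


# Hypothetical inputs: one region and the stations available in each year.
regions = [{"region": "Pauillac", "lat": 45.20, "lon": -0.75}]
stations_by_year = {
    2010: [{"id": "S1", "lat": 44.83, "lon": -0.69}, {"id": "S2", "lat": 48.85, "lon": 2.35}],
    2011: [{"id": "S2", "lat": 48.85, "lon": 2.35}],
}

# Expand regions across years, then attach the nearest station for each region-year.
rows = []
for year, stations in stations_by_year.items():
    for region in regions:
        hit = nearest_station(region, stations)
        rows.append((region["region"], year, hit[0] if hit else None))
print(rows)
```

Searching per year matters because the set of reporting stations can change from one year to the next. A region-year with no station inside the threshold gets no match, which is preferable to silently pairing it with a distant station.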

(The feature_engineering module is used programmatically by training code to add derived features on top of these merged datasets.)

3. Inspect the Resulting Dataset

import pandas as pd

wine = pd.read_csv("data/out/vivino_wines_with_weather.csv")
print(wine.head())

Project Structure

DL_Project/
├─ data/                         # Raw and processed datasets
│  ├─ weather_by_year_cleaned/   # Yearly cleaned weather files
│  ├─ Wine/                      # Vivino raw & corrected data
│  └─ out/                       # Outputs from preprocessing pipelines
├─ src/
│  ├─ models/
│  │   ├─ architectures/         # MLP, TabNet, FT-Transformer implementations
│  │   ├─ builders/              # Model-building helpers (e.g. TabNet, FTT)
│  │   ├─ training/              # Training pipelines (mlp_runner, tabnet_runner, ftt_runner, etc.)
│  │   └─ data/                  # Dataset loaders and modules
│  ├─ preprocessing/
│  │   ├─ build_wines_coord.py   # Build coordinates for wines (AOC matching + centroids)
│  │   ├─ merge_wine_weather.py  # Merge Vivino and weather by region/year
│  │   ├─ feature_engineering.py # Engineered features for tabular models
│  │   └─ utils/                 # Shared text, geo, feature & weather helpers
│  ├─ visualization/
│  │   ├─ plots/                 # Metrics plots, histograms, etc.
│  │   └─ maps/                  # Plotly maps (e.g. wine region map)
│  └─ scrapper/                  # Vivino scrapers (currently under refactor)
├─ scripts/                      # High-level experiment / baseline runners
├─ tests/                        # Pytest suite for all components
├─ models/                       # Trained model checkpoints
├─ pyproject.toml
└─ README.md

Workflow

Data Acquisition (Scraping)

Raw wine data is collected from Vivino using the automated scrapers in src/scrapper/.

Raw weather data can be downloaded from Météo-France.

Data Preprocessing

  1. Weather cleaning (upstream, one-off)
    Prepare yearly cleaned weather files in data/weather_by_year_cleaned/.

  2. Build wine coordinates

    python -m src.preprocessing.build_wines_coord

  3. Merge wines with weather

    python -m src.preprocessing.merge_wine_weather

  4. Feature engineering (in-code)

    • src/preprocessing/feature_engineering.py and
    • src/preprocessing/utils/feature_utils.py
      define derived numeric and categorical indicators.
      These are used directly by the training pipelines (e.g. the TabNet and FT-Transformer runners).
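
As an illustration of what derived numeric and categorical indicators could look like, the sketch below aggregates daily weather records into a few yearly features. The function name, the record layout, the April to October season, the 10 °C base temperature, and the 300 mm rainfall cut are all assumptions chosen for the example; the features actually computed in feature_engineering.py are not listed in this README.

```python
from statistics import mean


def yearly_features(daily, base_temp_c=10.0, season=range(4, 11), dry_below_mm=300.0):
    """Aggregate (month, mean_temp_c, rain_mm) daily records into yearly indicators."""
    in_season = [d for d in daily if d[0] in season]
    if not in_season:
        raise ValueError("no records fall inside the growing season")
    # Growing degree days: daily excess over the base temperature, floored at zero.
    gdd = sum(max(temp - base_temp_c, 0.0) for _, temp, _ in in_season)
    rain = sum(r for _, _, r in in_season)
    return {
        "gdd": gdd,
        "season_rain_mm": rain,
        "season_mean_temp_c": mean(temp for _, temp, _ in in_season),
        # Categorical indicator derived from a numeric one (illustrative threshold).
        "rain_class": "dry" if rain < dry_below_mm else "wet",
    }


# Hypothetical records for one station-year: (month, mean temperature in °C, rain in mm).
daily = [(1, 5.0, 10.0), (5, 15.0, 2.0), (7, 25.0, 0.0), (9, 18.0, 5.0)]
print(yearly_features(daily))
```

Collapsing daily observations into one row per station-year matches the (region, year) granularity of the merged dataset, so such features could be joined onto it without changing its shape.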

Model Training and Evaluation

Model-specific training pipelines live under src/models/training/ and are exercised by the test suite. Example high-level runners (depending on your experiment setup):

# Baseline training / comparison
python -m scripts.run_baselines

# MLP model
python -m scripts.run_mlp

Trained weights are stored under models/; evaluation plots are written to data/out/ and to the src/visualization/ outputs.


Technical Report

The project is documented in a LaTeX report, which details:

  • Data collection and preprocessing design
  • Modeling choices (architectures, loss functions, evaluation protocol)
  • Experiments and results (metrics, ablations)
  • Limitations and future work

docs/
└─ report/
   └─ wine_quality_report.pdf    # Compiled report

For full methodological details, see the [technical report](https://lucasponcet.github.io/report/wine_quality_report.pdf).

Testing

A comprehensive pytest suite covers:

  • Preprocessing utilities (text_utils, geo_utils, weather_utils, feature engineering)
  • Model components (architectures, builders, trainers)
  • End-to-end runners (e.g. TabNet/FTT/MLP pipelines)
  • Script entrypoints (scripts/run_baselines.py, scripts/search_optuna.py, etc.)

Run all tests with:

pytest

Contributing

Contributions are welcome. To propose changes:

  1. Fork the repository
  2. Create your feature branch:
    git checkout -b feature/new-analysis
  3. Commit your changes:
    git commit -m "Add new analysis"
  4. Push your branch:
    git push origin feature/new-analysis
  5. Open a Pull Request

License

This project is licensed under the MIT License.
See the LICENSE file for details.


This repository is intended to be self-contained and reproducible, so that reviewers can:

  • Understand the data pipeline end-to-end,
  • Re-run preprocessing and training with a few commands,
  • Inspect both the code and the accompanying technical report.
