An end-to-end machine learning pipeline for fine-grained housing price prediction in France, from raw data preprocessing to production deployment.
📖 Full documentation: aiml4os.github.io/funathon-project1
This project walks through the complete lifecycle of a machine learning application applied to real estate tabular data. Using synthetic data reproducing the French DVF+ (Demandes de Valeurs Foncières) and land registry dataset, the goal is to build a model that predicts housing prices at a fine geographic level, and then deploy it as a production-ready API.
The project is structured as a progressive, hands-on tutorial organized in five parts:
- Data description — understanding the input variables
- Pre-processing — cleaning and preparing the dataset
- ML model training — training and comparing gradient boosting models
- Model logging — loggingt.
- Model deployment — API deployment.
.
├── starting_point/ # Notebooks with exercises (to be completed)
├── intermediate_solutions/ # Step-by-step partial solutions
├── solution/ # Full reference solutions
├── 1-preprocessing.qmd # Part 1: Data preprocessing
├── 2-GB_model.qmd # Part 2: Gradient Boosting model
├── 2-RF_model.qmd # Part 2: Random Forest model
├── 3-metrics.qmd # Part 3: Model evaluation
├── 4-logging.qmd # Part 4: MLFlow experiment tracking
├── 5-deployment.qmd # Part 5: FastAPI deployment
├── pyproject.toml # Python dependencies (managed with uv)
└── _quarto.yaml # Quarto site configuration
The dataset is based on synthetic data mimicking the DVF+ (French land registry transactions) and land registry. Each row represents a single real estate sale and contains ~47 variables describing the property, the transaction, and its geographic location.
Key variables include:
| Variable | Description |
|---|---|
price |
Transaction value (target) |
farea |
Floor area (m²) |
prop_type |
Property type (1 = house, 2 = flat) |
prop_loc_citycode |
Municipality code |
prop_loc_x, prop_loc_y |
Geographic coordinates |
n_mrooms |
Number of main rooms |
n_slr |
Number of bedrooms |
prop_year_harm |
Year of construction |
trans_year |
Year of transaction |
dist_tosea |
Distance to the coastline |
n_garage, n_pool, n_terrace, ... |
Outbuildings and amenities |
See the full variable dictionary in the dedicated page.
Tools: pandas
This step covers:
- Handling missing values and outliers (e.g. filtering extreme
price_per_sqmvalues) - Selecting a relevant feature subset from the 47 available variables
- Computing derived features such as
price_per_sqm - Exploratory data analysis with
seaborn.pairplotandpandas.DataFrame.hist
Tools: scikit-learn
Reference: INSEE working document on bagging and boosting methods, crospint
Goals:
- Split data into training and test sets
- Train and compare three models:
GradientBoostingRegressor,HistGradientBoostingRegressor,RandomForestRegressor - Explore location encoding strategies (One-Hot Encoding, native categorical support)
- Hyperparameter tuning via cross-validation (grid search, random search, optionally Optuna)
- Apply early stopping and metric logging during training
- Evaluate models using MAPE and R²
This part covers:
- Experiment tracking with MLFlow: saving all runs, models, and associated metrics
- API deployment with FastAPI: expose a prediction endpoint that takes property attributes (surface, location, number of rooms, etc.) and returns a predicted price
- Python >= 3.13
uv- SSPCloud account (recommended)
- GitHub account (recommended)
# Fork the repository
git clone https://github.com/AIML4OS/funathon-project1.git
cd funathon-project1
# Install dependencies with uv
uv sync# Render the full Quarto website locally
uv run quarto render
# Or preview it
uv run quarto previewContributions are welcome! Whether you spotted a bug, a typo, an outdated dependency, or have an idea to improve the tutorial, here's how to get involved:
- Open an issue — head to the Issues tab and describe what you found or what you'd like to see. Please check that a similar issue doesn't already exist before opening a new one.
- Submit a Pull Request — fork the repository, make your changes on a dedicated branch, and open a PR against
main. Briefly describe what you changed and why. Linking the PR to the relevant issue is appreciated.
No contribution is too small — fixing a broken link or clarifying a comment is just as valuable as adding a new feature.
This project was developed as part of the AIML4OS Funathon — a collaborative hackathon focused on applying AI and machine learning methods to open statistical data.