Funathon Project 1 – Applying Machine Learning to Tabular Data

An end-to-end machine learning pipeline for fine-grained housing price prediction in France, from raw data preprocessing to production deployment.

📖 Full documentation: aiml4os.github.io/funathon-project1

Overview

This project walks through the complete lifecycle of a machine learning application applied to real estate tabular data. Working with synthetic data that reproduces the French DVF+ (Demandes de Valeurs Foncières) and land registry datasets, you build a model that predicts housing prices at a fine geographic level, then deploy it as a production-ready API.

The project is structured as a progressive, hands-on tutorial organized in five parts:

Data description — understanding the input variables
Pre-processing — cleaning and preparing the dataset
ML model training — training and comparing gradient boosting and random forest models
Model logging — tracking experiments, models, and metrics with MLFlow
Model deployment — serving predictions through a FastAPI API

Project Structure

.
├── starting_point/          # Notebooks with exercises (to be completed)
├── intermediate_solutions/  # Step-by-step partial solutions
├── solution/                # Full reference solutions
├── 1-preprocessing.qmd      # Part 1: Data preprocessing
├── 2-GB_model.qmd           # Part 2: Gradient Boosting model
├── 2-RF_model.qmd           # Part 2: Random Forest model
├── 3-metrics.qmd            # Part 3: Model evaluation
├── 4-logging.qmd            # Part 4: MLFlow experiment tracking
├── 5-deployment.qmd         # Part 5: FastAPI deployment
├── pyproject.toml           # Python dependencies (managed with uv)
└── _quarto.yaml             # Quarto site configuration

The Five Parts

1. Data Description

The dataset is based on synthetic data mimicking the DVF+ (French land registry transactions) and land registry. Each row represents a single real estate sale and contains ~47 variables describing the property, the transaction, and its geographic location.

Key variables include:

Variable	Description
`price`	Transaction value (target)
`farea`	Floor area (m²)
`prop_type`	Property type (1 = house, 2 = flat)
`prop_loc_citycode`	Municipality code
`prop_loc_x`, `prop_loc_y`	Geographic coordinates
`n_mrooms`	Number of main rooms
`n_slr`	Number of bedrooms
`prop_year_harm`	Year of construction
`trans_year`	Year of transaction
`dist_tosea`	Distance to the coastline
`n_garage`, `n_pool`, `n_terrace`, ...	Outbuildings and amenities

See the full variable dictionary in the dedicated page.

2. Pre-processing

Tools: pandas

This step covers:

Handling missing values and outliers (e.g. filtering extreme price_per_sqm values)
Selecting a relevant feature subset from the 47 available variables
Computing derived features such as price_per_sqm
Exploratory data analysis with seaborn.pairplot and pandas.DataFrame.hist
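
The steps listed above can be sketched on a toy table. This is a minimal illustration, not the tutorial's reference solution: the rows are invented, and the `LOWER` and `UPPER` bounds on `price_per_sqm` are assumptions chosen only to show the filtering pattern.

```python
import pandas as pd

# Toy stand-in for the transaction table; all values are invented for illustration.
df = pd.DataFrame(
    {
        "price": [250_000, 180_000, 5_000_000, 90_000, None],
        "farea": [100.0, 60.0, 20.0, 45.0, 70.0],
        "n_mrooms": [4, 3, 1, 2, 3],
        "prop_type": [1, 2, 2, 2, 1],
    }
)

# Drop rows where the target or the floor area is missing or non-positive.
df = df.dropna(subset=["price", "farea"])
df = df[df["farea"] > 0].copy()

# Derived feature: price per square metre.
df["price_per_sqm"] = df["price"] / df["farea"]

# Keep only plausible values. Both bounds are assumptions, not the tutorial's thresholds.
LOWER, UPPER = 500, 20_000
clean = df[df["price_per_sqm"].between(LOWER, UPPER)]

print(clean[["price", "farea", "price_per_sqm"]])
```

Computing the derived column before filtering keeps the outlier rule readable: the row sold at 5,000,000 for 20 m² is removed because of its price per square metre, not because of its raw price.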

3. Training a ML Model

Tools: scikit-learn

Reference: INSEE working document on bagging and boosting methods, crospint

Goals:

Split data into training and test sets
Train and compare three models: GradientBoostingRegressor, HistGradientBoostingRegressor, RandomForestRegressor
Explore location encoding strategies (One-Hot Encoding, native categorical support)
Hyperparameter tuning via cross-validation (grid search, random search, optionally Optuna)
Apply early stopping and metric logging during training
Evaluate models using MAPE and R²

4. & 5. Model logging and deployment

Tools: MLFlow, FastAPI

This part covers:

Experiment tracking with MLFlow: saving all runs, models, and associated metrics
API deployment with FastAPI: expose a prediction endpoint that takes property attributes (surface, location, number of rooms, etc.) and returns a predicted price

Getting Started

Prerequisites

Python >= 3.13
uv
SSPCloud account (recommended)
GitHub account (recommended)

Installation of this repo

# Clone the repository (fork it first on GitHub if you plan to contribute)
git clone https://github.com/AIML4OS/funathon-project1.git
cd funathon-project1

# Install dependencies with uv
uv sync

Running the notebooks

# Render the full Quarto website locally
uv run quarto render

# Or preview it
uv run quarto preview

Contributing

Contributions are welcome! Whether you spotted a bug, a typo, an outdated dependency, or have an idea to improve the tutorial, here's how to get involved:

Open an issue — head to the Issues tab and describe what you found or what you'd like to see. Please check that a similar issue doesn't already exist before opening a new one.
Submit a Pull Request — fork the repository, make your changes on a dedicated branch, and open a PR against main. Briefly describe what you changed and why. Linking the PR to the relevant issue is appreciated.

No contribution is too small — fixing a broken link or clarifying a comment is just as valuable as adding a new feature.

About

This project was developed as part of the AIML4OS Funathon — a collaborative hackathon focused on applying AI and machine learning methods to open statistical data.

🔗 AIML4OS Organization
