astro-physics-sl

Final course project — Statistical Learning (MD2SL) University of Florence — A.Y. 2024-2025

A systematic comparative analysis of supervised learning methods applied to two real-world datasets: one from physics (predicting the critical temperature of superconductors — regression) and one from astrophysics (classifying pulsar candidates — binary classification).

Overview

This project benchmarks 13 regression methods and 5 classification methods on real-world data, covering the full spectrum from classical linear models to Bayesian ensemble methods.

Regression — Superconductivity ($T_c$)

Group	Methods
Baselines	OLS, Polynomial Regression
Regularization	Ridge, Lasso, Elastic Net, Adaptive Lasso, Group Lasso
Subset Selection	Best Subset (ABESS)
Tree-based	CART, Bagging, Random Forest, XGBoost, BART

Classification — Pulsar (HTRU2)

CART, Bagging, Random Forest, XGBoost, BART — each tested with and without SMOTE to handle class imbalance (91% noise / 9% pulsar).

Additional Analyses

Variable Importance comparison across methods (Lasso, RF, XGBoost, BART)
Post-selection inference: Knockoff Filter and Stability Selection after Lasso
DeLong test for statistical comparison of ROC curves
SMOTE impact analysis on classification metrics
Coefficient path visualization (Ridge, Lasso, Elastic Net)
Bias-variance tradeoff discussion on the observed results

Datasets

Superconductivity (regression)

Source: UCI ML Repository
n = 21,263 superconducting materials, p = 81 features
Target: critical temperature $T_c$ (Kelvin)
The 81 features are aggregate statistics (mean, std, range, entropy, …) of elemental atomic properties

HTRU2 Pulsar (classification)

Source: UCI ML Repository
n = 17,898 candidates, p = 8 features
Target: pulsar (1) vs. noise (0)
The 8 features are mean, std, kurtosis and skewness of the integrated profile and DM-SNR curve

Project Structure

astro-physics-sl/
├── data/
│   ├── raw/               # Original CSVs (downloaded from UCI)
│   └── processed/         # Preprocessed data (.rds)
├── R/
│   ├── 00_setup.R         # Libraries, seed, global config
│   ├── 01_data_loading.R  # Download and load datasets
│   ├── 02_eda.R           # Exploratory data analysis
│   ├── 03_preprocessing.R # tidymodels recipes
│   ├── 04_cv_setup.R      # Shared train/test split and CV folds
│   ├── regression/
│   │   ├── 05a_baseline.R       # OLS and polynomial
│   │   ├── 05b_regularization.R # Ridge, Lasso, EN, Adaptive, Group
│   │   ├── 05c_subset.R         # ABESS
│   │   ├── 05d_trees.R          # CART, Bagging, RF, XGBoost, BART
│   │   ├── 06_comparison.R      # Final comparison and plots
│   │   └── 07_post_selection.R  # Knockoff and Stability Selection
│   ├── classification/
│   │   └── 05_pipeline.R        # All 5 methods ± SMOTE
│   └── utils/
│       ├── helpers.R             # Helper functions
│       └── plotting.R            # Plotting functions
├── output/
│   ├── figures/           # Plots (PNG and PDF)
│   ├── models/            # Saved models (.rds)
│   └── tables/            # Results tables (.csv, .rds)
└── report/                # Final report (optional)

Getting Started

Prerequisites

R ≥ 4.3 (tested with R 4.5.2)
Internet connection for first-time data download

Installation

Clone the repository:

git clone https://github.com/battles5/astro-physics-sl.git
cd astro-physics-sl

Install dependencies. Open R in the project folder and run:

# If you have renv (recommended):
renv::restore()

# Otherwise install manually:
install.packages(c(
  "tidyverse", "glmnet", "rpart", "rpart.plot",
  "ranger", "xgboost", "BART", "abess", "leaps",
  "grpreg", "knockoff", "stabs",
  "vip", "pROC", "themis", "rsample",
  "caret", "yardstick", "recipes", "parsnip",
  "ggcorrplot", "gridExtra", "scales"
))

Running

Scripts must be executed in numerical order. Each script saves results to output/ and subsequent scripts read from there.

# 1. Setup and data loading
source("R/00_setup.R")
source("R/01_data_loading.R")    # Downloads CSVs from UCI if not present
source("R/02_eda.R")             # EDA + exploratory plots
source("R/03_preprocessing.R")   # Preprocessing recipes
source("R/04_cv_setup.R")        # 75/25 split + 10-fold CV x5

# 2. Regression (in order)
source("R/regression/05a_baseline.R")
source("R/regression/05b_regularization.R")
source("R/regression/05c_subset.R")
source("R/regression/05d_trees.R")
source("R/regression/06_comparison.R")
source("R/regression/07_post_selection.R")

# 3. Classification
source("R/classification/05_pipeline.R")

Data

Raw CSV files are not included in the repository due to size. The script 01_data_loading.R downloads them automatically from UCI on first run. Processed files (.rds) are generated automatically by the scripts.

Output

All plots and tables are saved to output/figures/ and output/tables/. Trained models are saved to output/models/.

Main Results

Regression

Method	RMSE	R²
Random Forest	9.33	0.925
XGBoost	9.37	0.924
Bagging	9.39	0.924
BART	11.59	0.884
CART	12.30	0.870
OLS	17.66	0.730
Lasso	17.72	0.729
Ridge	18.87	0.691

Classification

Method	AUC	Balanced Acc.
XGBoost	0.976	0.944
BART	0.975	0.935
RF	0.969	0.938
Bagging	0.967	0.927
CART	0.939	0.913

References

James, G., Witten, D., Hastie, T., Tibshirani, R. (2021). An Introduction to Statistical Learning with Applications in R (2nd ed.). Springer.
Hamidieh, K. (2018). A data-driven statistical model for predicting the critical temperature of a superconductor. Computational Materials Science, 154, 346–354.
Lyon, R.J. et al. (2016). Fifty years of pulsar candidate selection. MNRAS, 459(1), 1104–1123.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
R		R
data/raw		data/raw
output		output
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

astro-physics-sl

Overview

Regression — Superconductivity ($T_c$)

Classification — Pulsar (HTRU2)

Additional Analyses

Datasets

Superconductivity (regression)

HTRU2 Pulsar (classification)

Project Structure

Getting Started

Prerequisites

Installation

Running

Data

Output

Main Results

Regression

Classification

References

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

astro-physics-sl

Overview

Regression — Superconductivity ($T_c$)

Classification — Pulsar (HTRU2)

Additional Analyses

Datasets

Superconductivity (regression)

HTRU2 Pulsar (classification)

Project Structure

Getting Started

Prerequisites

Installation

Running

Data

Output

Main Results

Regression

Classification

References

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages