Predictive Maintenance for Pumps

Academic R project: predict which water pumps will need preventive maintenance after the first 6 months of operation, using only sensor readings from months 1–5. Random Forest reaches 79.8% test accuracy on a 99-pump hold-out set.

🎯 Objective

The business question is "of these pumps in the field, which 20,000 should we send a technician to?" A pump that needs servicing but is missed loses revenue (degraded volume / wasted energy); a pump that gets serviced but didn't need it wastes a technician visit. The trade-off is captured by maximizing net profit (revenue from extracted liquid minus maintenance and energy costs).
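The profit objective can be made concrete as an expected-value ranking. A minimal base-R sketch, assuming hypothetical per-pump economics (neither figure appears in the repo):

```r
# Expected-profit ranking sketch. The revenue and cost figures are
# placeholders: the repo does not publish its economic parameters.
set.seed(1)
p_needs_service <- runif(10)        # model's class-1 probabilities (toy)
revenue_if_fixed <- 1000            # hypothetical revenue recovered per fix
visit_cost <- 300                   # hypothetical cost of one technician visit

# Expected net profit of dispatching a technician to each pump
expected_profit <- p_needs_service * revenue_if_fixed - visit_cost

# Dispatch only where expected profit is positive, best pumps first
to_service <- order(expected_profit, decreasing = TRUE)
to_service <- to_service[expected_profit[to_service] > 0]
```

In the actual pipeline the ranking uses the Random Forest class-1 probabilities with a fixed budget of 20,000 pumps rather than a profit cutoff.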

The modeling task is a binary classification on a synthetic target (Entretien_Necessaire) engineered from three operational signals: efficiency drop after month 6, projected month-12 cost above the third quartile, and below-average efficiency in the first 5 months.
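The three signals combine with a logical OR. A toy sketch (column names follow the repo's conventions; all values are invented):

```r
# Toy data: 3 pumps. Column names mirror the repo; values are made up.
eff <- data.frame(
  Efficacite6  = c(2.0, 1.8, 2.2),
  Efficacite7  = c(1.9, 1.9, 2.3),
  Cost12       = c(100, 900, 200),
  mean_eff_1_5 = c(2.1, 1.7, 2.3),
  Treatment    = c(FALSE, FALSE, TRUE)
)
global_mean <- 2.0   # global efficiency mean (placeholder value)

Entretien_Necessaire <- as.integer(
  (!eff$Treatment & eff$Efficacite7 < eff$Efficacite6) |  # (a) post-month-6 drop
  (eff$Cost12 > 750) |                                    # (b) cost above Q3
  (eff$mean_eff_1_5 < global_mean)                        # (c) below-average early efficiency
)
```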

📊 Results

Random Forest test confusion matrix

Test set: 99 pumps held out from a 500-pump study, 80/20 split with set.seed(123).

| Metric | Value |
| --- | --- |
| Test accuracy | 79.8 % (95 % CI 70.5 – 87.2 %) |
| Balanced accuracy | 79.6 % |
| Sensitivity (recall on "needs maintenance") | 80.7 % |
| Specificity | 78.6 % |
| Precision (positive predictive value) | 83.6 % |
| Cohen's kappa | 0.589 |
| Accuracy vs. no-information rate (p-value) | 2.6 × 10⁻⁶ |

The accuracy is significantly better than the no-information baseline (57.6 %) at p ≈ 2.6 × 10⁻⁶ on a one-sided test.
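This test can be reproduced in base R with an exact one-sided binomial test, which is what `caret::confusionMatrix` reports as "P-Value [Acc > NIR]". The counts (79 of 99 correct, NIR = 57/99) are inferred from the reported percentages:

```r
# Accuracy vs. no-information-rate test: one-sided exact binomial.
# 79/99 correct (79.8 %) against a baseline of 57/99 (57.6 %).
bt <- binom.test(x = 79, n = 99, p = 57 / 99, alternative = "greater")
bt$p.value   # should land near the reported 2.6e-6
```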

What ended up mattering

A surprising finding: of the 70 candidate features (raw Volume, Energy, 11 PSD frequency channels at 5 monthly snapshots, plus engineered efficiency ratios), filtering on Random Forest's mean-decrease-Gini importance left just 5 features above the threshold, all of them efficiency ratios:

```
Efficacite3   13.02
Efficacite5   11.28
Efficacite2    8.87
Efficacite4    7.98
Efficacite1    6.41
PSD30003       5.29   ← below threshold (6.0)
PSD17504       4.13
...
```

Raw sensor channels and PSD bands carry useful signal individually, but the Volume / Energy ratio captures it more compactly. The final model trains on those 5 efficiency variables only.
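The filter step in miniature, using the importance values from the table above (the full model ranks all 70 features; only the head of the list is shown here):

```r
# Keep features whose MeanDecreaseGini exceeds the cutoff of 6.
# Values copied from the importance table above (top of the ranking).
imp <- c(Efficacite3 = 13.02, Efficacite5 = 11.28, Efficacite2 = 8.87,
         Efficacite4 = 7.98,  Efficacite1 = 6.41,
         PSD30003 = 5.29, PSD17504 = 4.13)
keep <- names(imp)[imp > 6]
keep   # the five efficiency ratios
```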

🏗️ Methodology

  1. Reshape: sensors-study.csv is long (one row per pump-month); pivot wide to one row per pump with monthly columns (Volume1–Volume12, Energy1–Energy12, PSD500_1–PSD3000_12).
  2. Engineer: derive Efficaciteₘ = Volumeₘ / Energyₘ for each month.
  3. Join: merge with repairs.csv on pump ID; create Treatment = (Cost6 > 0).
  4. Validate: ANOVA on monthly efficiency yields F = 6.11, p = 4.86 × 10⁻¹⁰ — month-to-month differences are highly significant.
  5. Engineer target (Entretien_Necessaire): 1 if any of (a) untreated AND Efficacite7 < Efficacite6, (b) Cost12 > 750 (3rd quartile), (c) mean of Efficacite1–Efficacite5 below the global mean. Result: 287 positives / 213 negatives (out of 500).
  6. Restrict to first-5-month features only — the model has to predict before month 6 service decisions are made.
  7. Filter features with Random Forest variable importance (cutoff: MeanDecreaseGini > 6) — drops 65 of 70 features.
  8. Fit Random Forest (500 trees) on the 5 retained features over an 80/20 split.
  9. Score the held-out sensors-score.csv set; rank by class-1 probability; select the top 20,000.
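Steps 1–2 can be sketched on toy data. The Rmd uses tidyr::pivot_wider and dplyr; base R's `reshape()` is shown here so the sketch is self-contained (column and ID names are illustrative):

```r
# Toy long-format sensor data: one row per pump-month, 2 pumps, 3 months.
long <- data.frame(
  Pompe  = rep(1:2, each = 3),
  Mois   = rep(1:3, times = 2),
  Volume = c(10, 12, 11, 8, 9, 7),
  Energy = c(5, 6, 5, 4, 3, 7)
)

# 1. Reshape long -> wide: one row per pump, columns Volume1..3, Energy1..3
wide <- reshape(long, idvar = "Pompe", timevar = "Mois",
                direction = "wide", sep = "")

# 2. Engineer Efficacite_m = Volume_m / Energy_m for each month
for (m in 1:3) wide[[paste0("Efficacite", m)]] <-
  wide[[paste0("Volume", m)]] / wide[[paste0("Energy", m)]]
```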

🛠️ Tech Stack

  • Language: R
  • Modeling: randomForest, caret (createDataPartition, confusionMatrix)
  • Statistical test: base R aov (ANOVA)
  • Data wrangling: dplyr, tidyr, stringr
  • Plots: ggplot2
  • Reporting: R Markdown → PDF (Simulation_affaire_11353138.pdf)
  • Chart for this README: Python + matplotlib (assets/generate_charts.py)

📁 Repository Structure

```
Maintenance-Predictive/
├── Code.R                                # Exploratory script: stats, efficiency curves
├── Simulation_affaire_11353138.Rmd       # Main pipeline: target engineering + Random Forest
├── Simulation_affaire_11353138.pdf       # Knitted PDF report (with rendered plots)
├── Maintenance-Assignment_FR.pdf         # Original assignment (French)
├── assets/
│   ├── confusion_matrix.png              # Test confusion matrix (used in this README)
│   └── generate_charts.py                # Reproducible chart from PDF-reported numbers
└── README.md
```

🚀 How to Run

```r
# In R / RStudio
install.packages(c("caret", "randomForest", "rpart", "ggplot2",
                   "dplyr", "tidyr", "stringr"))

# Open and knit the main report:
rmarkdown::render("Simulation_affaire_11353138.Rmd")
```

Update the three read.csv() paths at the top of the Rmd to point to your local copies of repairs.csv, sensors-study.csv, and sensors-score.csv — they aren't in the repo (see Notes / Limitations).

To regenerate the README chart from the recorded numbers:

```shell
python assets/generate_charts.py
```

📝 Notes / Limitations

  • Academic project, not deployed. Submitted as Assignment 1 (Devoir 1) for a graduate Statistical Learning course (Apprentissage statistique), 2024-11-22. The downstream "select top 20,000 pumps" decision is a course exercise, not a real procurement workflow.
  • CSV files are not committed. Paths in the .R and .Rmd files point at a local OneDrive folder (C:/Users/samso/OneDrive/Bureau/Apprentissage statistique/Devoir 1/). The data was provided as part of the course; reproducing the pipeline requires the original files.
  • Synthetic target. Entretien_Necessaire is engineered from three rules over the same dataset the model is trained on. The 79.8 % accuracy measures how well RF reproduces those rules from first-5-month features, not a held-out clinical or operational ground truth.
  • Possible target leakage in the global efficiency threshold. Condition (c) compares each pump's first-5-month mean against the global efficiency mean computed over all 12 months × all 500 pumps. That threshold is fit on the same data later split into train/test — a strict no-leakage protocol would compute the threshold on the train fold only.
  • Small test set (99 pumps). With so few test cases, any point estimate is noisy: the 95 % CI (70.5 – 87.2 %) spans nearly 17 points. K-fold cross-validation would give a tighter estimate.
  • Variable importance is OOB on the full dataset before splitting. Strictly, the feature filter (MeanDecreaseGini > 6) sees the test rows. In practice the result is so dominated by Efficacite1–Efficacite5 that the filter is unlikely to be sensitive to the split, but the ordering of operations is worth documenting.
  • Mixed _ and non-_ column conventions. The Rmd uses pivot_wider with default _ separators, then strips them with rename_with(~ gsub("_", "", .)). Consequence: Volume_1 becomes Volume1, but PSD500_1 becomes PSD5001, which is easy to misread as PSD-5001. PSD30003 in the importance table means PSD-3000 month-3, not PSD-30003.
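For the leakage point above, a minimal base-R sketch of the stricter protocol (toy data; a full fix would also recompute the Cost12 quartile per fold):

```r
# Leakage-free variant of rule (c): fit the efficiency threshold on the
# training fold only, then apply it unchanged to the test fold.
set.seed(123)
eff_mean_1_5 <- runif(500, min = 1, max = 3)   # per-pump mean efficiency (toy)
idx_train <- sample(500, 400)

threshold <- mean(eff_mean_1_5[idx_train])     # computed on train fold only
rule_c_train <- eff_mean_1_5[idx_train] < threshold
rule_c_test  <- eff_mean_1_5[-idx_train] < threshold   # applied, never re-fit
```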

