Predictive Maintenance for Pumps

Academic R project: predict which water pumps will need preventive maintenance after the first 6 months of operation, using only sensor readings from months 1–5. Random Forest reaches 79.8% test accuracy on a 99-pump hold-out set.

🎯 Objective

The business question is "of these pumps in the field, which 20,000 should we send a technician to?" A pump that needs servicing but is missed loses revenue (degraded volume / wasted energy); a pump that gets serviced but didn't need it wastes a technician visit. The trade-off is captured by maximizing net profit (revenue from extracted liquid minus maintenance and energy costs).

The modeling task is a binary classification on a synthetic target (Entretien_Necessaire) engineered from three operational signals: efficiency drop after month 6, projected month-12 cost above the third quartile, and below-average efficiency in the first 5 months.
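As a rough illustration of how the three rules combine into one flag, the sketch below builds the target on a tiny made-up table. The column names follow the Methodology section (Treatment, Cost12, Efficacite1 to Efficacite7) and the 750 cost cutoff is the one reported there; the values, the `global_mean` stand-in, and the exact comparison logic are assumptions for illustration, not the code from the Rmd.

```r
# Toy wide table: one row per pump (values are made up for illustration).
pumps <- data.frame(
  ID          = 1:4,
  Treatment   = c(FALSE, TRUE, FALSE, FALSE),  # TRUE if Cost6 > 0
  Cost12      = c(100, 900, 200, 300),
  Efficacite1 = c(1.2, 1.1, 0.6, 1.3),
  Efficacite2 = c(1.2, 1.1, 0.6, 1.3),
  Efficacite3 = c(1.2, 1.1, 0.6, 1.3),
  Efficacite4 = c(1.2, 1.1, 0.6, 1.3),
  Efficacite5 = c(1.2, 1.1, 0.6, 1.3),
  Efficacite6 = c(1.2, 1.1, 0.6, 1.3),
  Efficacite7 = c(1.0, 1.2, 0.6, 1.3)
)

cost_cutoff <- 750   # third quartile of Cost12 reported in this README
global_mean <- 1.0   # assumed stand-in for the global mean efficiency

# Mean efficiency over the first five months, per pump.
early_mean <- rowMeans(pumps[, paste0("Efficacite", 1:5)])

rule_a <- !pumps$Treatment & pumps$Efficacite7 < pumps$Efficacite6  # untreated and dropping
rule_b <- pumps$Cost12 > cost_cutoff                                # expensive at month 12
rule_c <- early_mean < global_mean                                  # weak early efficiency

# A pump is flagged as soon as any one rule fires.
pumps$Entretien_Necessaire <- as.integer(rule_a | rule_b | rule_c)
print(pumps[, c("ID", "Entretien_Necessaire")])
```

In this toy table each of the first three pumps trips exactly one rule and the fourth trips none, which makes it easy to see that the flag is a logical OR of the three conditions.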

📊 Results

Test set: 99 pumps held out from a 500-pump study, 80/20 split with set.seed(123).

Metric	Value
Test accuracy	79.8 % (95 % CI 70.5 – 87.2 %)
Balanced accuracy	79.6 %
Sensitivity (recall on needs maintenance)	80.7 %
Specificity	78.6 %
Precision (positive predictive value)	83.6 %
Cohen's Kappa	0.589
Accuracy vs. no-info rate (P-value)	2.6 × 10⁻⁶

The accuracy is significantly better than the no-information baseline (57.6 %) at p ≈ 2.6 × 10⁻⁶ on a one-sided test.
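The table can be cross-checked from a single 2 × 2 confusion matrix. The counts below are not copied from the report; they are inferred as the only integer split of 99 pumps that matches the reported sensitivity, specificity, and no-information rate, so treat them as an assumption. The sketch uses base R only and mirrors the exact binomial test that caret's confusionMatrix reports.

```r
# Counts inferred from the reported metrics (assumption, see the text above).
tp <- 46; fn <- 11   # pumps that need maintenance (57 in total)
fp <- 9;  tn <- 33   # pumps that do not (42 in total)
n  <- tp + fn + fp + tn

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)
balanced    <- (sensitivity + specificity) / 2
no_info     <- max(tp + fn, fp + tn) / n   # share of the majority class

# Cohen's kappa: observed agreement versus agreement expected by chance.
p_chance    <- ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n^2
cohen_kappa <- (accuracy - p_chance) / (1 - p_chance)

# Exact binomial CI, and a one-sided test against the no-information rate.
ci      <- binom.test(tp + tn, n)$conf.int
p_value <- binom.test(tp + tn, n, p = no_info, alternative = "greater")$p.value

print(round(100 * c(accuracy = accuracy, balanced = balanced,
                    sensitivity = sensitivity, specificity = specificity,
                    precision = precision, no_info = no_info), 1))
print(round(cohen_kappa, 3))
print(ci)
print(p_value)
```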

What ended up mattering

A surprising finding: of the 70 candidate features (raw Volume, Energy, 11 PSD frequency channels at 5 monthly snapshots, plus engineered efficiency ratios), Random Forest's mean-decrease-Gini importance reduced the model to just 5 features above the threshold — all of them efficiency ratios:

Efficacite3   13.02
Efficacite5   11.28
Efficacite2    8.87
Efficacite4    7.98
Efficacite1    6.41
PSD30003       5.29   ← below threshold (6.0)
PSD17504       4.13
...

Raw sensor channels and PSD bands carry useful signal individually, but the Volume / Energy ratio captures it more compactly. The final model trains on those 5 efficiency variables only.
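The filter itself is a one-line threshold on the importance vector. A minimal sketch, using only the seven scores listed in the table above (the remaining features are elided there and stay elided here); the object name `rf` in the closing comment is hypothetical.

```r
# Mean-decrease-Gini scores copied from the table above (top 7 of 70).
imp <- c(
  Efficacite3 = 13.02, Efficacite5 = 11.28, Efficacite2 = 8.87,
  Efficacite4 = 7.98,  Efficacite1 = 6.41,
  PSD30003    = 5.29,  PSD17504    = 4.13
)

cutoff <- 6                       # threshold stated in this README
keep   <- names(imp)[imp > cutoff]
print(sort(keep))

# With a fitted randomForest classifier `rf` (hypothetical name), the same
# named vector would come from:
#   imp <- randomForest::importance(rf)[, "MeanDecreaseGini"]
```

Because the cutoff is an absolute Gini value, it depends on the sample size and the number of trees; it would need to be re-tuned if either changes.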

🏗️ Methodology

Reshape: sensors-study.csv is long (one row per pump-month); pivot wide to one row per pump with monthly columns (Volume1–Volume12, Energy1–Energy12, PSD500_1–PSD3000_12).
Engineer: derive Efficaciteₘ = Volumeₘ / Energyₘ for each month.
Join: merge with repairs.csv on pump ID; create Treatment = (Cost6 > 0).
Validate: ANOVA on monthly efficiency yields F = 6.11, p = 4.86 × 10⁻¹⁰ — month-to-month differences are highly significant.
Engineer target (Entretien_Necessaire): 1 if any of (a) untreated AND Efficacite7 < Efficacite6, (b) Cost12 > 750 (3rd quartile), (c) mean of Efficacite1–Efficacite5 below the global mean. Result: 287 positives / 213 negatives (out of 500).
Restrict to first-5-month features only — the model has to predict before month 6 service decisions are made.
Filter features with Random Forest variable importance (cutoff: MeanDecreaseGini > 6) — drops 65 of 70 features.
Fit Random Forest (500 trees) on the 5 retained features over an 80/20 split.
Score the held-out sensors-score.csv set; rank by class-1 probability; select the top 20,000.
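The last two steps (fit, then rank by class-1 probability) can be sketched as follows. This is an illustration under stated assumptions, not the Rmd code: the data is randomly generated, the split uses a plain `sample()` instead of caret's stratified createDataPartition, the toy test fold stands in for sensors-score.csv, and `top_n` is 20 rather than 20,000 because the toy set is small.

```r
library(randomForest)

set.seed(123)

# Toy stand-in for the wide pump table: five efficiency ratios per pump.
n <- 300
pumps <- data.frame(matrix(rnorm(n * 5, mean = 1, sd = 0.2), ncol = 5))
names(pumps) <- paste0("Efficacite", 1:5)

# Toy target: flag pumps whose mean early efficiency is below 1 (assumption).
pumps$Entretien_Necessaire <- factor(as.integer(rowMeans(pumps[, 1:5]) < 1))

# 80/20 split (unstratified here; the README uses caret::createDataPartition).
train_idx <- sample(n, size = floor(0.8 * n))
train <- pumps[train_idx, ]
test  <- pumps[-train_idx, ]

# Random Forest with 500 trees on the five retained features.
rf <- randomForest(Entretien_Necessaire ~ ., data = train, ntree = 500)

# Class-1 probability for each pump to score, then keep the top_n highest.
test$prob <- predict(rf, newdata = test, type = "prob")[, "1"]
top_n     <- 20
ranked    <- test[order(test$prob, decreasing = TRUE), ]
selected  <- head(ranked, top_n)

print(head(selected[, c("prob", "Entretien_Necessaire")]))
```

Ranking by probability rather than by the hard 0/1 prediction is what lets the selection size be fixed in advance, independently of the 0.5 classification threshold.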

🛠️ Tech Stack

Language: R
Modeling: randomForest, caret (createDataPartition, confusionMatrix)
Statistical test: base R aov (ANOVA)
Data wrangling: dplyr, tidyr, stringr
Plots: ggplot2
Reporting: R Markdown → PDF (Simulation_affaire_11353138.Rmd → .pdf)
Chart for this README: Python + matplotlib (assets/generate_charts.py)

📁 Repository Structure

Maintenance-Predictive/
├── Code.R                                # Exploratory script: stats, efficiency curves
├── Simulation_affaire_11353138.Rmd       # Main pipeline: target engineering + Random Forest
├── Simulation_affaire_11353138.pdf       # Knit'd PDF report (with rendered plots)
├── Maintenance-Assignment_FR.pdf         # Original assignment (French)
├── assets/
│   ├── confusion_matrix.png              # Test confusion matrix (used in this README)
│   └── generate_charts.py                # Reproducible chart from PDF-reported numbers
└── README.md

🚀 How to Run

# In R / RStudio
install.packages(c("caret", "randomForest", "rpart", "ggplot2",
                   "dplyr", "tidyr", "stringr"))

# Open and knit the main report:
rmarkdown::render("Simulation_affaire_11353138.Rmd")

Update the three read.csv() paths at the top of the Rmd to point to your local copies of repairs.csv, sensors-study.csv, and sensors-score.csv — they aren't in the repo (see Notes / Limitations).
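One way to avoid editing three hard-coded paths is to build them from a single directory. The sketch below is a suggestion, not what the Rmd currently does; the environment variable name `PUMP_DATA_DIR` and the variable names are hypothetical.

```r
# Hypothetical: read the data folder from an environment variable,
# falling back to the current working directory.
data_dir <- Sys.getenv("PUMP_DATA_DIR", unset = ".")

files <- c("repairs.csv", "sensors-study.csv", "sensors-score.csv")
paths <- file.path(data_dir, files)

# Report missing files up front instead of failing halfway through the knit.
missing <- files[!file.exists(paths)]
if (length(missing) > 0) {
  message("Missing in ", data_dir, ": ", paste(missing, collapse = ", "))
} else {
  repairs       <- read.csv(paths[1])
  sensors_study <- read.csv(paths[2])
  sensors_score <- read.csv(paths[3])
}
```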

To regenerate the README chart from the recorded numbers:

python assets/generate_charts.py

📝 Notes / Limitations

Academic project, not deployed. Submitted as Devoir 1 for a graduate Statistical Learning course (Apprentissage statistique), 2024-11-22. The downstream "select top 20,000 pumps" decision is a course exercise, not a real procurement workflow.
CSV files are not committed. Paths in the .R and .Rmd files point at a local OneDrive folder (C:/Users/samso/OneDrive/Bureau/Apprentissage statistique/Devoir 1/). The data was provided as part of the course; reproducing the pipeline requires the original files.
Synthetic target. Entretien_Necessaire is engineered from three rules over the same dataset the model is trained on. The 79.8 % accuracy measures how well RF reproduces those rules from first-5-month features, not agreement with an independent operational ground truth.
Possible target leakage in the global efficiency threshold. Condition (c) compares each pump's first-5-month mean against the global efficiency mean computed over all 12 months × all 500 pumps. That threshold is fit on the same data later split into train/test — a strict no-leakage protocol would compute the threshold on the train fold only.
Small test set (99 pumps). The 22-point edge over the 57.6 % no-information baseline is statistically significant, but the 95 % CI (70.5 – 87.2 %) is wide, so the point estimate itself is imprecise. K-fold CV would give a more stable estimate.
Variable importance is computed on the full dataset before splitting. MeanDecreaseGini is an impurity-based score accumulated over the trees' training splits, not an out-of-bag estimate, so the feature filter (MeanDecreaseGini > 6) sees the test rows. In practice the result is so dominated by Efficacite1–Efficacite5 that the filter is unlikely to be sensitive to the split, but documenting the order matters.
Mixed _ and non-_ column conventions. The Rmd uses pivot_wider with default _ separators, then strips them with rename_with(~ gsub("_", "", .)). Consequence: Volume_1 becomes Volume1, but PSD500_1 becomes PSD5001, which is easy to misread as PSD-5001. PSD30003 in the importance table means PSD-3000 month-3, not PSD-30003.
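The naming ambiguity in the last note is easy to reproduce in base R. The first half of the sketch applies the same `gsub` call quoted above; the second half is a hypothetical alternative (the helper `split_name` is not in the repo) that keeps the separator and parses signal and month only when needed.

```r
wide_names <- c("Volume_1", "Energy_12", "PSD500_1", "PSD3000_3")

# Stripping every underscore, as described in the note above.
stripped <- gsub("_", "", wide_names)
print(stripped)  # "Volume1" "Energy12" "PSD5001" "PSD30003"

# Hypothetical alternative: keep the "_" and split on it when needed.
split_name <- function(x) {
  data.frame(
    signal = sub("_[0-9]+$", "", x),         # everything before the final "_<month>"
    month  = as.integer(sub("^.*_", "", x)), # digits after the last "_"
    stringsAsFactors = FALSE
  )
}
print(split_name(wide_names))
```

With the separator kept, PSD3000_3 parses unambiguously as signal PSD3000 at month 3, whereas the stripped form PSD30003 cannot be split without knowing the list of PSD frequencies in advance.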
