Academic R project: predict which water pumps will need preventive maintenance after the first 6 months of operation, using only sensor readings from months 1–5. Random Forest reaches 79.8% test accuracy on a 99-pump hold-out set.
The business question is "of these pumps in the field, which 20,000 should we send a technician to?" A pump that needs servicing but is missed loses revenue (degraded volume / wasted energy); a pump that gets serviced but didn't need it wastes a technician visit. The trade-off is captured by maximizing net profit (revenue from extracted liquid minus maintenance and energy costs).
The modeling task is a binary classification on a synthetic target (Entretien_Necessaire) engineered from three operational signals: efficiency drop after month 6, projected month-12 cost above the third quartile, and below-average efficiency in the first 5 months.
Test set: 99 pumps held out from a 500-pump study, 80/20 split with set.seed(123).
| Metric | Value |
|---|---|
| Test accuracy | 79.8 % (95 % CI 70.5 – 87.2 %) |
| Balanced accuracy | 79.6 % |
| Sensitivity (recall on needs maintenance) | 80.7 % |
| Specificity | 78.6 % |
| Precision (positive predictive value) | 83.6 % |
| Cohen's Kappa | 0.589 |
| Accuracy vs. no-info rate (P-value) | 2.6 × 10⁻⁶ |
The accuracy is significantly better than the no-information baseline (57.6 %) at p ≈ 2.6 × 10⁻⁶ on a one-sided test.
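The figures in the table are mutually consistent with a single 2×2 confusion matrix. The sketch below recomputes them in base R. The four counts (46, 11, 9, 33) are back-calculated from the reported percentages and the 99-pump test size; they are an inference, not numbers quoted from the report. `binom.test` is used here as the exact binomial test for the confidence interval and the one-sided comparison against the no-information rate, on the assumption that this matches what `caret::confusionMatrix` reports.

```r
# Counts back-calculated from the reported rates (assumption, not from the report).
tp <- 46; fn <- 11   # pumps that need maintenance: caught / missed
fp <- 9;  tn <- 33   # pumps that do not: flagged / correctly left alone
n  <- tp + fn + fp + tn

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)
balanced    <- (sensitivity + specificity) / 2

# No-information rate: always predict the majority class.
nir <- max(tp + fn, fp + tn) / n

# Cohen's kappa: observed agreement vs. agreement expected by chance.
p_chance    <- ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n^2
cohen_kappa <- (accuracy - p_chance) / (1 - p_chance)

# Exact binomial interval for accuracy, and one-sided test against the NIR.
ci_acc <- binom.test(tp + tn, n)$conf.int
p_nir  <- binom.test(tp + tn, n, p = nir, alternative = "greater")$p.value

round(100 * c(accuracy = accuracy, balanced = balanced,
              sensitivity = sensitivity, specificity = specificity,
              precision = precision, nir = nir), 1)
round(cohen_kappa, 3)
```

With only 99 test pumps, moving a single pump from one cell to another shifts accuracy by about one percentage point, which is worth keeping in mind when reading the table.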
A surprising finding: of the 70 candidate features (raw Volume, Energy, and 11 PSD frequency channels, each at 5 monthly snapshots, plus 5 engineered efficiency ratios), only 5 clear the Random Forest mean-decrease-Gini threshold of 6.0, and all of them are efficiency ratios:

    Efficacite3   13.02
    Efficacite5   11.28
    Efficacite2    8.87
    Efficacite4    7.98
    Efficacite1    6.41
    PSD30003       5.29   ← below threshold (6.0)
    PSD17504       4.13
    ...

Raw sensor channels and PSD bands carry useful signal individually, but the Volume / Energy ratio captures it more compactly. The final model trains on those 5 efficiency variables only.
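The filter itself is a one-line comparison. As a minimal sketch, the snippet below applies the cutoff to the importance values listed above (only the rows this README shows, not the full 70-row table) and checks the feature-count arithmetic. The variable names are illustrative and do not come from the Rmd.

```r
# Importance values copied from the listing above (the README shows only these rows).
importance <- c(Efficacite3 = 13.02, Efficacite5 = 11.28, Efficacite2 = 8.87,
                Efficacite4 = 7.98,  Efficacite1 = 6.41,  PSD30003 = 5.29,
                PSD17504 = 4.13)

cutoff <- 6  # MeanDecreaseGini threshold
kept   <- names(importance)[importance > cutoff]
kept

# Candidate count: Volume, Energy, 11 PSD channels and Efficacite,
# each at 5 monthly snapshots.
n_months     <- 5
n_signals    <- 1 + 1 + 11 + 1
n_candidates <- n_signals * n_months
c(candidates = n_candidates, kept = length(kept),
  dropped = n_candidates - length(kept))

# Model formula for the retained features (base R helper from stats).
final_formula <- reformulate(sort(kept), response = "Entretien_Necessaire")
final_formula
```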
- Reshape: `sensors-study.csv` is long (one row per pump-month); pivot wide to one row per pump with monthly columns (`Volume1`–`Volume12`, `Energy1`–`Energy12`, `PSD500_1`–`PSD3000_12`).
- Engineer: derive `Efficaciteₘ = Volumeₘ / Energyₘ` for each month.
- Join: merge with `repairs.csv` on pump ID; create `Treatment = (Cost6 > 0)`.
- Validate: ANOVA on monthly efficiency yields F = 6.11, p = 4.86 × 10⁻¹⁰; month-to-month differences are highly significant.
- Engineer target (`Entretien_Necessaire`): 1 if any of (a) untreated AND `Efficacite7 < Efficacite6`, (b) `Cost12 > 750` (3rd quartile), (c) mean of `Efficacite1`–`Efficacite5` below the global mean. Result: 287 positives / 213 negatives (out of 500).
- Restrict to first-5-month features only: the model has to predict before month 6 service decisions are made.
- Filter features with Random Forest variable importance (cutoff: MeanDecreaseGini > 6), which drops 65 of 70 features.
- Fit Random Forest (500 trees) on the 5 retained features over an 80/20 split.
- Score the held-out `sensors-score.csv` set; rank by class-1 probability; select the top 20,000.
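The target-engineering step is the part of the pipeline that is easiest to misread, so here is a minimal base-R sketch of the three-rule label on a toy table. The function name `label_maintenance`, the five toy pumps and the `global_mean = 1.0` value are all hypothetical; only the column names, the three conditions and the 750 cost cutoff come from the steps above. The Rmd may implement the same logic differently (for example with `dplyr::mutate`).

```r
# Label a pump 1 if any of the three rules fires, else 0.
# `global_mean` is passed in explicitly; its value here is made up.
label_maintenance <- function(df, global_mean, cost_q3 = 750) {
  early_cols <- paste0("Efficacite", 1:5)
  treated    <- df$Cost6 > 0
  cond_a <- !treated & df$Efficacite7 < df$Efficacite6   # (a) untreated and efficiency drops
  cond_b <- df$Cost12 > cost_q3                          # (b) month-12 cost above 3rd quartile
  cond_c <- rowMeans(df[, early_cols]) < global_mean     # (c) weak first five months
  as.integer(cond_a | cond_b | cond_c)
}

# Toy data: every value is invented for illustration.
pumps <- data.frame(
  id          = 1:5,
  Efficacite6 = c(1.2, 1.1, 1.3, 0.8, 1.2),
  Efficacite7 = c(1.1, 0.9, 1.3, 0.8, 1.25),
  Cost6       = c(100, 0, 0, 50, 0),
  Cost12      = c(200, 300, 900, 100, 150)
)
# Months 1-5 share one value per pump to keep the toy small.
early <- c(1.2, 1.1, 1.3, 0.8, 1.2)
for (m in 1:5) pumps[[paste0("Efficacite", m)]] <- early

pumps$Entretien_Necessaire <- label_maintenance(pumps, global_mean = 1.0)
pumps[, c("id", "Entretien_Necessaire")]
```

In this toy, pump 1 is treated and healthy (label 0), pumps 2, 3 and 4 each trigger exactly one of rules (a), (b) and (c), and pump 5 is untreated but shows no drop (label 0). Taking the threshold as an argument, rather than computing it inside the function, leaves the caller free to decide which rows it is estimated from.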
- Language: R
- Modeling: `randomForest`, `caret` (`createDataPartition`, `confusionMatrix`)
- Statistical test: base R `aov` (ANOVA)
- Data wrangling: `dplyr`, `tidyr`, `stringr`
- Plots: `ggplot2`
- Reporting: R Markdown → PDF (`Simulation_affaire_11353138.Rmd` → `.pdf`)
- Chart for this README: Python + matplotlib (`assets/generate_charts.py`)

    Maintenance-Predictive/
    ├── Code.R                              # Exploratory script: stats, efficiency curves
    ├── Simulation_affaire_11353138.Rmd     # Main pipeline: target engineering + Random Forest
    ├── Simulation_affaire_11353138.pdf     # Knitted PDF report (with rendered plots)
    ├── Maintenance-Assignment_FR.pdf       # Original assignment (French)
    ├── assets/
    │   ├── confusion_matrix.png            # Test confusion matrix (used in this README)
    │   └── generate_charts.py              # Reproducible chart from PDF-reported numbers
    └── README.md


    # In R / RStudio
    install.packages(c("caret", "randomForest", "rpart", "ggplot2",
                       "dplyr", "tidyr", "stringr"))
    # Open and knit the main report:
    rmarkdown::render("Simulation_affaire_11353138.Rmd")

Update the three `read.csv()` paths at the top of the Rmd to point to your local copies of `repairs.csv`, `sensors-study.csv`, and `sensors-score.csv`; they aren't in the repo (see Notes / Limitations).

To regenerate the README chart from the recorded numbers:

    python assets/generate_charts.py

- Academic project, not deployed. Submitted as Devoir 1 for a graduate Statistical Learning course (Apprentissage statistique), 2024-11-22. The downstream "select top 20,000 pumps" decision is a course exercise, not a real procurement workflow.
- CSV files are not committed. Paths in the `.R` and `.Rmd` files point at a local OneDrive folder (`C:/Users/samso/OneDrive/Bureau/Apprentissage statistique/Devoir 1/`). The data was provided as part of the course; reproducing the pipeline requires the original files.
- Synthetic target. `Entretien_Necessaire` is engineered from three rules over the same dataset the model is trained on. The 79.8 % accuracy measures how well RF reproduces those rules from first-5-month features, not agreement with an independent operational ground truth.
- Possible target leakage in the global efficiency threshold. Condition (c) compares each pump's first-5-month mean against the global efficiency mean computed over all 12 months × all 500 pumps. That threshold is fit on the same data later split into train/test; a strict no-leakage protocol would compute the threshold on the train fold only.
- Small test set (99 pumps). The 95 % CI (70.5 – 87.2 %) is wide, so a difference of a few points around the 79.8 % accuracy is plausibly noise. K-fold CV would give a tighter estimate.
- Variable importance is computed on the full dataset before splitting. Strictly, the feature filter (MeanDecreaseGini > 6) sees the test rows. In practice the result is so dominated by `Efficacite1`–`Efficacite5` that the filter is unlikely to be sensitive to the split, but documenting the order matters.
- Mixed `_` and non-`_` column conventions. The Rmd uses `pivot_wider` with default `_` separators, then strips them with `rename_with(~ gsub("_", "", .))`. Consequence: `Volume_1` becomes `Volume1`, but `PSD500_1` becomes `PSD5001`, which is easy to misread as PSD-5001. `PSD30003` in the importance table means PSD-3000 month 3, not PSD-30003.
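
For the leakage note on condition (c), the stricter protocol amounts to estimating the efficiency threshold from training pumps only and then applying it unchanged to the test pumps. The sketch below shows the difference on simulated data. The efficiency matrix is random noise standing in for the real values, and the plain `sample()` split is a stand-in for the stratified `caret::createDataPartition` split the project uses; none of the numbers it produces describe the actual dataset.

```r
set.seed(123)  # same seed as the project, but the data below is simulated

# Simulated stand-in: 500 pumps x 12 monthly efficiency values.
n_pumps <- 500
eff <- matrix(rnorm(n_pumps * 12, mean = 1, sd = 0.2), nrow = n_pumps,
              dimnames = list(NULL, paste0("Efficacite", 1:12)))

# Plain 80/20 split (not stratified, unlike createDataPartition).
train_idx <- sample(n_pumps, size = 0.8 * n_pumps)
test_idx  <- setdiff(seq_len(n_pumps), train_idx)

# Per-pump mean over the first five months, as in condition (c).
early_mean <- rowMeans(eff[, 1:5])

# Leaky: the threshold uses every pump, including future test rows.
thr_all   <- mean(eff)
# Stricter: the threshold uses training pumps only.
thr_train <- mean(eff[train_idx, ])

cond_c_leaky  <- early_mean < thr_all
cond_c_strict <- early_mean < thr_train

# Number of test pumps whose condition (c) flips between the two protocols.
sum(cond_c_leaky[test_idx] != cond_c_strict[test_idx])
```

On a dataset this size the two thresholds are likely to be close, so few labels should flip; the point of the stricter version is that the test rows never influence how the target is defined.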
