┌─────────────────────────────────────────────────────────────────┐
│ INPUT: Polishing Metrics │
│ (QUAST + BUSCO results for rounds 1-5) │
└────────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ STEP 1: Exploratory Data Analysis (2_exploratory_analysis.py) │
│ • Validate data integrity │
│ • Analyze distributions │
│ • Check for outliers │
│ • Generate summary statistics │
└────────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ STEP 2: Feature Engineering (3_feature_engineering.py) │
│ • Generate 40+ derived features │
│ • Delta features (Δ between rounds) │
│ • Ratio features (efficiency metrics) │
│ • R1-normalized features │
│ • Domain-specific scores │
│ • Plateau detection indicators │
└────────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ STEP 3: Label Optimal Rounds (4_label_optimal_round.py) │
│ • Identify optimal stopping round per group │
│ • Map 5 rounds → 3 classes (Early/Medium/Late) │
│ • Validate label │
└────────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ STEP 4: Train Models (5_train_models.py) │
│ • XGBoost with class weights │
│ • Random Forest (800 trees) │
│ • Ordinal Regression (LogisticAT) │
│ • Ensemble (Voting Classifier) │
│ • 5-fold stratified group CV │
│ • SMOTE for class balancing │
└────────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ STEP 5: Evaluate Models (8_evaluate_models.py) │
│ • Classification metrics (Balanced Acc, Macro F1) │
│ • Ordinal metrics (MAE, QWK, Acc±1) │
│ • Confusion matrices │
│ • Feature importance analysis │
│ • Practical impact assessment │
└────────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ OUTPUT: Trained Models + Reports │
│ • models/*.pkl (trained classifiers) │
│ • outputs/plots/ (visualizations) │
│ • outputs/results/ (metrics, reports) │
└─────────────────────────────────────────────────────────────────┘
| Script | Purpose |
|---|---|
| 1_csv_merge.py | Merge polishing metrics from multiple sources |
| 2_exploratory_analysis.py | Data validation and visualization |
| 3_feature_engineering.py | Generate 40+ derived features |
| 4_label_optimal_round.py | Assign 3-class labels |
| 5_train_models.py | Train XGBoost, RF, Ordinal, Ensemble |
| 7_inference_pipeline.py | Make predictions on new data |
| 8_evaluate_models.py | Comprehensive evaluation |
| run_pipeline.sh | Execute complete pipeline |
The tool expects a CSV file with polishing metrics from QUAST and BUSCO for each round.
Sample,Genus,Coverage,round,n50,qv,error_rate,busco_complete,busco_fragmented,busco_missing,assembly_frac,num_contigs,total_length
sample_001,Escherichia,40X,1,234567,35.2,0.082,95.3,2.1,2.6,0.987,45,4856234
sample_001,Escherichia,40X,2,345678,38.1,0.045,97.8,1.2,1.0,0.993,32,4862341
sample_001,Escherichia,40X,3,456789,39.5,0.032,98.5,0.8,0.7,0.995,28,4865123
sample_001,Escherichia,40X,4,467890,39.8,0.029,98.7,0.6,0.7,0.996,26,4866234
sample_001,Escherichia,40X,5,468901,40.0,0.028,98.8,0.5,0.7,0.996,25,4866789

| Column | Description | Source | Example |
|---|---|---|---|
| Sample | Unique sample identifier | User-defined | sample_001 |
| Genus | Bacterial genus | User-defined | Escherichia |
| Coverage | Sequencing coverage group | User-defined | 40X |
| round | Polishing round (1-5) | Sequential | 3 |
| n50 | N50 contig length | QUAST | 456789 |
| qv | Consensus quality value | QUAST | 39.5 |
| error_rate | Per-base error rate | QUAST | 0.032 |
| busco_complete | Complete BUSCO genes (%) | BUSCO | 98.5 |
| busco_fragmented | Fragmented BUSCO genes (%) | BUSCO | 0.8 |
| busco_missing | Missing BUSCO genes (%) | BUSCO | 0.7 |
| assembly_frac | Fraction of reference covered | QUAST | 0.995 |
| num_contigs | Number of contigs | QUAST | 28 |
| total_length | Total assembly length | QUAST | 4865123 |
- Minimum 5 rounds per sample: each sample must have metrics for rounds 1-5
- Unique groups: each (Sample, Coverage) combination forms a group
- No missing values: all required columns must be populated
- Consistent coverage labels: use the same labels expected by the pipeline (e.g. 10X, 20X, 40X, FULL)
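A quick sanity check of these requirements can be sketched in Python. This helper is hypothetical (it is not part of the pipeline's scripts); `rows` is a list of dicts such as `csv.DictReader` produces:

```python
from collections import defaultdict

REQUIRED = ["Sample", "Genus", "Coverage", "round", "n50", "qv", "error_rate",
            "busco_complete", "busco_fragmented", "busco_missing",
            "assembly_frac", "num_contigs", "total_length"]

def validate_rows(rows, coverage_labels=("10X", "20X", "40X", "FULL")):
    """Return a list of problems violating the input requirements:
    missing values, unknown coverage labels, or groups lacking rounds 1-5."""
    problems = []
    rounds_per_group = defaultdict(set)
    for i, row in enumerate(rows, start=2):  # line 1 is the CSV header
        for col in REQUIRED:
            if not row.get(col):
                problems.append(f"line {i}: missing value in '{col}'")
        if row.get("Coverage") not in coverage_labels:
            problems.append(f"line {i}: unexpected coverage label {row.get('Coverage')!r}")
        rounds_per_group[(row.get("Sample"), row.get("Coverage"))].add(row.get("round"))
    for group, rounds in rounds_per_group.items():
        if rounds != {"1", "2", "3", "4", "5"}:
            problems.append(f"group {group}: expected rounds 1-5, got {sorted(rounds)}")
    return problems
```

An empty return value means the CSV satisfies all three requirements.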
To prepare the input dataset, see docs/BUILD_DATASET.md.
A minimal example, for when the input data is already prepared.
By default, run_pipeline.sh starts at 2_exploratory_analysis.py assuming that
data/all_samples_polishing_metrics.csv already exists.
# 1. Clone and install
git clone https://github.com/jimmlucas/ESDP-Early-Stop-Decision-Polishing.git
cd ESDP-Early-Stop-Decision-Polishing
pip install -r requirements.txt
# 2. Prepare your data
# use your own CSV (must follow "Input Data Format")
cp your_polishing_metrics.csv data/all_samples_polishing_metrics.csv
# 3. Run the complete pipeline (starts at Step 1: EDA)
bash run_pipeline.sh
# 4. Check results
ls outputs/plots/ # Visualizations
ls outputs/results/ # Metrics and reports
ls models/ # Trained models + scaler + feature list

The easiest way to run the entire pipeline:

bash run_pipeline.sh

This executes all steps sequentially:
- Exploratory data analysis
- Feature engineering
- Label assignment (3-class system)
- Model training (XGBoost, RF, Ordinal, Ensemble)
- Comprehensive evaluation
You can also run each step independently:
If you start from raw polishing outputs (QUAST/BUSCO/Flye per round), first build the consolidated CSV that all subsequent steps expect:
python 1_csv_merge.py

Output:
- Creates data/all_samples_polishing_metrics.csv
python 2_exploratory_analysis.py

Outputs:
- outputs/plots/correlation_matrix.png – feature correlation heatmap
- outputs/plots/coverage_distribution.png – coverage distribution
- outputs/plots/genus_distribution.png – genus distribution
- outputs/plots/metric_distributions.png – distributions of base polishing metrics
- outputs/plots/improvement_distributions.png – distributions of metric improvements
- outputs/plots/metrics_by_round.png – metrics evolution across rounds
- outputs/results/eda_summary.txt – data quality / EDA summary report
What it does:
- Validates data integrity
- Checks for missing values
- Summarizes distributions of key metrics (QV, BUSCO, error rate, N50, contigs)
- Analyzes class distribution (genus, coverage, classes)
- Identifies outliers
- Generates summary statistics and core EDA plots
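One common rule for the outlier check is Tukey's IQR fence. A minimal sketch of that idea (illustrative only; the script's exact rule may differ):

```python
def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR], using simple
    index-based quartiles for brevity."""
    s = sorted(values)
    n = len(s)
    q1, q3 = s[n // 4], s[(3 * n) // 4]
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]
```

Applied per metric (QV, error rate, N50, ...), this surfaces samples worth inspecting before training.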
python 3_feature_engineering.py

Outputs:
- data/training_dataset_engineered.csv – dataset with base metrics + engineered features
Generated features:
- Delta features (Δ between rounds: delta_qv, delta_busco_complete, delta_error_rate, etc.)
- Ratio features (efficiency metrics: qv_improvement_rate, busco_per_contig, n50_fraction, cost_benefit_ratio)
- R1-normalized features (qv_from_r1, busco_complete_from_r1, error_rate_from_r1, etc.)
- Domain-specific scores (completeness_score, assembly_quality, polishing_effectiveness)
- Plateau indicators (is_plateau, plateau_streak)
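The delta features above amount to a grouped round-over-round difference. A minimal pandas sketch (a hypothetical re-implementation; the script's actual feature set is broader):

```python
import pandas as pd

def add_delta_features(df):
    """Compute per-group round-over-round deltas for a few base metrics.
    Round 1 has no previous round, so its deltas are filled with 0."""
    df = df.sort_values(["Sample", "Coverage", "round"]).copy()
    grp = df.groupby(["Sample", "Coverage"])
    for col in ["qv", "busco_complete", "error_rate"]:
        df[f"delta_{col}"] = grp[col].diff().fillna(0.0)
    return df
```

Ratio and R1-normalized features follow the same grouped pattern, dividing by or subtracting each group's round-1 value instead of the previous round's.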
python 4_label_optimal_round.py

Outputs:
- data/training_dataset_with_target.csv – labeled dataset
Labeling strategy:
- Computes the optimal stopping round per (Sample, Coverage) group and maps the 5 rounds → 3 classes
- Applies conservative early-stop rules when R1 is already stable
- Detects plateaus using score improvement and a relative threshold from config.yaml
- Validates label consistency
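The plateau rule and the 5→3 mapping can be sketched as follows. This is a simplified, hypothetical version: the real script also applies the R1 stability checks, and its score definition is its own.

```python
def optimal_round(scores, relative_threshold=0.12):
    """Pick the first round after which the relative score improvement
    falls below the threshold. `scores` holds one value per round (1-5)."""
    for r in range(1, len(scores)):
        prev, cur = scores[r - 1], scores[r]
        if prev > 0 and (cur - prev) / prev < relative_threshold:
            return r  # improvement from round r to r+1 has plateaued
    return len(scores)

def round_to_class(r):
    """Map the 5 rounds onto the 3 training classes."""
    if r <= 2:
        return 1  # Early (R1-R2)
    if r <= 4:
        return 2  # Medium (R3-R4)
    return 3      # Late (R5)
```

With a score trace that jumps from round 1 to 2 and then flattens, this returns round 2, which maps to class 1 (Early).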
python 5_train_models.py

Outputs:

Models and artifacts:
- models/best_model.pkl – best-performing classifier (XGBoost, RF, ordinal, or ensemble)
- models/scaler.pkl – fitted StandardScaler used for features
- models/feature_names.txt – exact list of features used for training
Metrics:
- outputs/results/model_comparison.csv – metrics for all trained models
- outputs/results/training_metrics.json – same metrics in JSON format
Plots:
- outputs/plots/cm_xgboost.png – confusion matrix (XGBoost)
- outputs/plots/cm_random_forest.png – confusion matrix (Random Forest)
- outputs/plots/cm_ordinal.png – confusion matrix (Ordinal regression, if available)
- outputs/plots/cm_ensemble.png – confusion matrix (Ensemble, if available)
- outputs/plots/fi_xgboost.png – top feature importances (XGBoost)
- outputs/plots/fi_random_forest.png – top feature importances (Random Forest)
Training configuration:
- Group-aware split: train/test split stratified by class at group level (Sample+Coverage)
- SMOTE for class balancing (configurable in config.yaml)
- Class weights for imbalanced classes
- Multiple models trained:
- XGBoost (multi-class)
- Random Forest (multi-class)
- Ordinal regression
- Soft-voting ensemble (excluding ordinal model)
- Best model selected by highest balanced accuracy on the test set
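The key point of the group-aware split is that all rows of a (Sample, Coverage) group stay on the same side of the split, so no sample leaks between train and test. A pure-Python sketch of that constraint (simplified: it ignores class stratification, which the pipeline additionally applies):

```python
import random

def group_split(groups, test_size=0.20, seed=42):
    """Split row indices into train/test by splitting the group KEYS,
    so every row of a group lands on the same side."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, round(len(unique) * test_size))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx
```

scikit-learn's StratifiedGroupKFold implements the stratified variant of this idea for the 5-fold CV mentioned above.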
python 8_evaluate_models.py

Outputs:
- outputs/results/baseline_comparison.csv – comparison between the trained model and simple baselines
What it compares:
- Best_Model – the best_model.pkl from Step 4
- Baseline_Always_Late – heuristic that always predicts class 3 (Late)
- Baseline_QV_Threshold_30 – heuristic: Early if QV > 30, else Late
- Baseline_R1_Only_RF – Random Forest using only R1-level features
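The two heuristic baselines are trivial to express; a sketch using the 3-class codes from the label mapping (illustrative, not the script's code):

```python
def baseline_always_late(_row):
    """Always predict class 3 (Late): run all 5 polishing rounds."""
    return 3

def baseline_qv_threshold(r1_qv, threshold=30.0):
    """Predict Early (class 1) when the round-1 QV already exceeds
    the threshold, otherwise Late (class 3)."""
    return 1 if r1_qv > threshold else 3
```

Beating these cheap rules on balanced accuracy is what justifies the trained model.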
Once models are trained, use them for inference:
# Predict on new data
python 7_inference_pipeline.py \
--input data/new_samples.csv \
--output predictions.csv \
--model models/best_model.pkl

Prediction output format:
Sample,Coverage,predicted_class,predicted_strategy,confidence,recommended_rounds,rationale,warnings
sample_001,40X,1,Early,0.87,2,High R1 quality and stable convergence,
sample_002,20X,2,Medium,0.65,3,Moderate improvement still detected,low_confidence
sample_003,10X,3,Late,0.91,5,Continued polishing recommended,

Edit config.yaml to customize:
Click to see key configuration options
# Data paths
data:
input_csv: "data/all_samples_polishing_metrics.csv"
merged_csv: "data/all_samples_polishing_metrics.csv"
labeled_csv: "data/training_dataset_with_target.csv"
engineered_csv: "data/training_dataset_engineered.csv"
# Class configuration
classes:
n_classes: 3
class_mapping:
1: "Early (R1-R2)"
2: "Medium (R3-R4)"
3: "Late (R5)"
# R1 quality thresholds
r1_thresholds:
min_busco: 95.0
max_assembly_error: 0.02
max_error_rate: 0.07
min_coverage_est: 8.0
max_align_err_cons: 0.15
# Stability thresholds
stability:
eps_qv: 0.05
eps_error: 0.0005
eps_busco: 1.0
eps_assembly_frac: 0.01
use_assembly_frac: false
# Plateau detection
plateau:
relative_threshold: 0.12
# Model configuration
models:
random_state: 42
test_size: 0.20
cv_folds: 5
# Imbalance handling
imbalance:
use_smote: true
smote_k_neighbors: 3
smote_sampling_strategy: "auto"
# Evaluation
evaluation:
primary_metric: "balanced_accuracy"
# Hierarchical policy
hierarchical:
stage_a:
threshold: 0.5
model: "logistic_regression"
stage_b:
model: "random_forest"
# Post-processing
postprocessing:
conservative_bias: true
smooth_predictions: true
use_domain_rules: true

For the full set of configurable options, see config.yaml.
ESDP-Early-Stop-Decision-Polishing/
├── README.md
├── LICENSE
├── requirements.txt
├── config.yaml
├── Dockerfile
├── docker-compose.yml
├── docker-entrypoint.sh
├── run_pipeline.sh
├── run_test.sh
│
├── 1_csv_merge.py
├── 2_exploratory_analysis.py
├── 3_feature_engineering.py
├── 4_label_optimal_round.py
├── 5_train_models.py
├── 7_inference_pipeline.py
├── 8_evaluate_models.py
├── 9_benchmark_resources.py
├── 10_sensitivity_analysis.py
│
├── api_service.py
├── esdp_cli.py
├── esdp_decide.py
│
├── docs/
│ ├── INSTALL.md
│ ├── USAGE.md
│ └── BUILD_DATASET.md
│
├── data/
│ ├── all_samples_polishing_metrics.csv
│ ├── training_dataset_engineered.csv
│ └── training_dataset_with_target.csv
│
├── models/
│ ├── best_model.pkl
│ ├── best_model_pipeline.pkl
│ ├── feature_names.txt
│ ├── imputer.pkl
│ └── scaler.pkl
│
├── outputs/
│ ├── plots/
│ ├── results/
│ ├── baseline_comparison.csv
│ ├── baseline_comparison_table.csv
│ ├── best_model_bootstrap_ci.json
│ ├── eda_summary.txt
│ ├── model_comparison.csv
│ ├── publication_summary_table.csv
│ ├── resource_benchmark_results.csv
│ ├── resource_benchmark_summary.csv
│ ├── train_test_split_samples.json
│ └── training_metrics.json
│
├── dataSet_preparation/
│ ├── subsample_reads.sh
│ └── src/
│ └── polish_advisor/
│ ├── __init__.py
│ ├── features.py
│ ├── rstar.py
│ ├── run_pipeline.py
│ └── features/
│ └── collect_metrics.py
│
└── test/
├── conftest.py
├── test_api_service.py
├── test_esdp_decide.py
└── test_pipeline_integration.py