An end-to-end ML/AI project that predicts stroke risk, clusters patients into meaningful groups, and generates treatment recommendations using association rule mining. It also includes a Streamlit app for interactive predictions.

**Data Preprocessing**
- Cleans and encodes the stroke dataset
- Handles missing values and categorical encoding
- Saves processed data
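
The two cleaning steps listed above can be illustrated on a toy table. This is a minimal standard-library sketch, not the code in `src/data/preprocess.py`: the column names (`bmi`, `gender`), the median fill, and the one-hot scheme are all assumptions chosen for illustration.

```python
from statistics import median

# Toy records; column names and values are assumed for illustration only.
rows = [
    {"age": 67, "bmi": 36.6, "gender": "Male"},
    {"age": 61, "bmi": None, "gender": "Female"},
    {"age": 80, "bmi": 32.5, "gender": "Male"},
]


def impute_median(records, column):
    """Replace missing (None) values in `column` with the column median."""
    observed = [r[column] for r in records if r[column] is not None]
    fill = median(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in records]


def one_hot(records, column):
    """Replace a categorical column with one 0/1 indicator column per category."""
    categories = sorted({r[column] for r in records})
    encoded = []
    for r in records:
        new_row = {k: v for k, v in r.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = int(r[column] == cat)
        encoded.append(new_row)
    return encoded


processed = one_hot(impute_median(rows, "bmi"), "gender")
print(processed[1])  # the row whose bmi was missing, now filled and encoded
```
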

**Supervised Learning**
- Logistic Regression, Random Forest, XGBoost baselines
- 5-fold Cross Validation
- Evaluation with ROC AUC, Precision, Recall, F1
- SHAP-based feature importance
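
For readers unfamiliar with the evaluation metrics listed above, the sketch below computes them by hand on made-up labels and scores. It is not the project's evaluation code (which presumably relies on scikit-learn); the 0.5 decision threshold and the sample data are assumptions. In a 5-fold setup, each metric would typically be computed per fold and then averaged.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Assumes both classes are present.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
y_pred = [int(s >= 0.5) for s in scores]  # assumed 0.5 threshold

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} auc={auc:.2f}")
```

Note that ROC AUC is computed from the raw scores, while precision, recall, and F1 depend on the chosen threshold.
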

**Unsupervised Learning**
- KMeans & DBSCAN clustering
- Cluster profiles with mean feature summaries
- Risk-based cluster naming (High/Moderate/Low Risk groups)
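
One plausible way to derive risk-based names from cluster profiles is to rank clusters by their mean stroke rate. The sketch below assumes exactly three clusters and uses invented assignments and labels; the helper names are hypothetical and the naming logic in `src/unsupervised/clustering.py` may differ.

```python
from collections import defaultdict
from statistics import mean


def cluster_means(assignments, values):
    """Mean of `values` within each cluster id."""
    grouped = defaultdict(list)
    for cluster, value in zip(assignments, values):
        grouped[cluster].append(value)
    return {cluster: mean(vals) for cluster, vals in grouped.items()}


def name_clusters_by_risk(stroke_rate_by_cluster):
    """Map cluster ids to risk labels, ordered by descending stroke rate."""
    labels = ["High Risk", "Moderate Risk", "Low Risk"]
    if len(stroke_rate_by_cluster) != len(labels):
        raise ValueError("this sketch assumes exactly three clusters")
    ranked = sorted(stroke_rate_by_cluster, key=stroke_rate_by_cluster.get, reverse=True)
    return dict(zip(ranked, labels))


# Invented cluster assignments and stroke labels (not project results).
assignments = [0, 0, 1, 1, 2, 2, 0, 1]
stroke = [1, 1, 0, 1, 0, 0, 0, 0]

rates = cluster_means(assignments, stroke)
names = name_clusters_by_risk(rates)
print(names)
```
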

**Association Rules**
- Simulated patient symptom → treatment transactions
- Apriori + FP-Growth mining
- Top-10 rules exported for recommendations
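
To make the rule metrics concrete, the sketch below scores a single rule by hand over a few simulated transactions. It does not perform Apriori or FP-Growth mining (the project uses mlxtend for that); it only shows how support, confidence, and lift are defined. The transactions, including the `symptom:smoker`, `symptom:diabetes`, and `treatment:insulin` items, are invented for illustration.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(antecedent <= t for t in transactions)      # contain the antecedent
    n_c = sum(consequent <= t for t in transactions)      # contain the consequent
    n_both = sum((antecedent | consequent) <= t for t in transactions)
    if n_a == 0 or n_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    support = n_both / n
    confidence = n_both / n_a
    lift = confidence / (n_c / n)
    return support, confidence, lift


# Invented transactions: each set mixes symptom and treatment items.
transactions = [
    {"symptom:hypertension", "symptom:obese",
     "treatment:antihypertensive", "treatment:lifestyle_change"},
    {"symptom:hypertension", "treatment:antihypertensive"},
    {"symptom:smoker", "treatment:lifestyle_change"},
    {"symptom:diabetes", "treatment:insulin"},
]

support, confidence, lift = rule_metrics(
    transactions,
    antecedent={"symptom:hypertension"},
    consequent={"treatment:antihypertensive"},
)
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
# support=0.50 confidence=1.00 lift=2.00
```

A lift above 1 means the consequent appears more often alongside the antecedent than it does across all transactions.
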

**Streamlit App**
- Single patient risk prediction
- Cluster assignment with profile interpretation
- Recommended treatments from association rules

ai-healthcare-system/
│
├── data/
│   ├── raw/                          # Original dataset
│   └── processed/                    # Cleaned & preprocessed data
│       └── stroke_data_processed.csv
│
├── models/                           # Trained & saved models
│   ├── model.pkl                     # Best supervised model (LogReg / RF / XGB)
│   ├── kmeans.pkl                    # Saved KMeans clustering model
│   └── scaler.pkl                    # Scaler used for clustering
│
├── notebooks/                        # Jupyter notebooks (experiments & reports)
│   ├── 01-eda.ipynb                  # Exploratory Data Analysis
│   ├── 02-supervised-baseline.ipynb  # Baseline supervised models
│   ├── 03-clustering.ipynb           # Clustering experiments
│   └── 04-association.ipynb          # Association rule mining
│
├── src/                              # Source code
│   ├── data/
│   │   ├── load.py                   # Load raw/processed data
│   │   └── preprocess.py             # Data cleaning & feature engineering
│   │
│   ├── models/
│   │   ├── baseline.py               # Pipelines for baseline models
│   │   ├── trainer.py                # Training & cross-validation
│   │   └── evaluate.py               # Model evaluation & metrics
│   │
│   ├── unsupervised/
│   │   └── clustering.py             # KMeans & DBSCAN + cluster profiling
│   │
│   ├── association/
│   │   └── apriori_rules.py          # Association rules (Apriori/FP-Growth)
│   │
│   └── app/
│       └── streamlit_app.py          # Streamlit web app integration
│
├── cluster_profiles.md               # Cluster summaries (generated in Milestone 3)
├── association_rules.csv             # Top-10 rules (generated in Milestone 4)
├── requirements.txt                  # Python dependencies
├── README.md                         # Project documentation
└── .gitignore                        # Files ignored by Git

1. Clone the repo:
git clone https://github.com/yourusername/ai-healthcare-system.git
cd ai-healthcare-system
2. Create a virtual environment:
conda create -n ai-healthcare python=3.10 -y
conda activate ai-healthcare
3. Install dependencies:
pip install -r requirements.txt

Follow this order to execute the project end-to-end:
# Run data loading
python src/data/load.py
# Run preprocessing
python src/data/preprocess.py
# Train baseline models (LogReg, RF, XGBoost)
python src/models/trainer.py
# Evaluate saved model
python src/models/evaluate.py
# Run clustering (KMeans + DBSCAN)
python src/unsupervised/clustering.py
# Mine Apriori + FP-Growth rules
python src/association/apriori_rules.py
# Launch the interactive app
streamlit run src/app/streamlit_app.py

**Example Outputs**

**Model Performance**
Logistic Regression (best CV ROC AUC ≈ 0.84)
XGBoost: tunable for higher recall/precision

**Clustering**
KMeans Silhouette Score ≈ 0.15
Clusters:
Cluster 0 → High-Risk Group
Cluster 1 → Moderate-Risk Group
Cluster 2 → Low-Risk Younger Group
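
As context for the silhouette score reported above: the coefficient ranges from -1 to 1, and values near 0 generally indicate overlapping clusters, so a score around 0.15 suggests the groups are only weakly separated. The sketch below is a hand-rolled version for 1-D points with absolute distance, written purely for illustration; the function name and sample points are hypothetical, and the project itself presumably uses scikit-learn's implementation on the scaled feature matrix.

```python
from statistics import mean


def silhouette_score_1d(points, labels):
    """Mean silhouette coefficient for 1-D points using absolute distance."""
    clusters = set(labels)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, (x, own) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the point's own cluster
        same = [abs(x - y) for j, (y, lab) in enumerate(zip(points, labels))
                if lab == own and j != i]
        if not same:  # singleton cluster: coefficient is defined as 0
            scores.append(0.0)
            continue
        a = mean(same)
        # b: mean distance to the nearest other cluster
        b = min(
            mean([abs(x - y) for y, lab in zip(points, labels) if lab == other])
            for other in clusters if other != own
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return mean(scores)


# Invented points: one well-separated labelling, one interleaved labelling.
separated = silhouette_score_1d([1.0, 2.0, 10.0, 11.0], [0, 0, 1, 1])
interleaved = silhouette_score_1d([1.0, 2.0, 3.0, 4.0], [0, 1, 0, 1])
print(f"separated={separated:.2f} interleaved={interleaved:.2f}")
```
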

**Example Rule**
symptom:hypertension, symptom:obese → treatment:antihypertensive, treatment:lifestyle_change
(Lift: 18.51, Confidence: 1.00)
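
Under the standard definition, lift equals confidence divided by the support of the consequent, so the reported figures imply how common the consequent is overall. The short calculation below works backwards from the two numbers above; because they are rounded, the result is approximate.

```python
confidence = 1.00  # reported for the example rule
lift = 18.51       # reported for the example rule

# lift = confidence / support(consequent)  =>  support(consequent) = confidence / lift
consequent_support = confidence / lift
print(f"{consequent_support:.3f}")  # 0.054
```

In other words, roughly 5% of the simulated transactions contain both treatments, while every transaction with both symptoms does.
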
Programming Language
- Python 3.10+
Libraries & Frameworks
- Data Handling: pandas, numpy
- Visualization: matplotlib, seaborn
- Machine Learning: scikit-learn, xgboost, imblearn
- Clustering: scikit-learn (KMeans, DBSCAN)
- Association Rules: mlxtend (Apriori, FP-Growth)
- Explainability: shap
- App Framework: streamlit
- Serialization: joblib

- Deploy via Docker or Cloud: Package the app using Docker or deploy on platforms like Heroku, AWS, or GCP for wider accessibility.
- Integrate Real Clinical Datasets: Incorporate real-world patient datasets with treatment + outcome mappings to improve the reliability of predictions.
- Temporal Association Rules: Enhance the association rule mining by including temporal patient history (sequence of symptoms → treatments → outcomes).
- Improved Interpretability: Add interactive LIME/SHAP dashboards within the app for doctors and researchers to better understand model decisions.

