This project explores the Pima Indians Diabetes Dataset to understand which medical and lifestyle-related factors are most strongly associated with diabetes outcomes.
The analysis is implemented in Python using a Jupyter notebook and focuses on data understanding, exploratory data analysis, visualization, and correlation-based interpretation.
- Project Overview
- Dataset Context
- Files in This Repository
- Tools and Libraries
- Analysis Workflow
- Visualizations Included
- Key Insights
- Why This Project Matters
- How to Run
- Future Improvements
- Author
The goal of this project is to study how variables such as:
- glucose
- insulin
- BMI
- age
- pregnancies
- blood pressure
- skin thickness
- diabetes pedigree function
relate to the likelihood of diabetes.
The notebook examines the structure of the dataset, highlights potential data quality issues, and uses visual exploration to identify the factors that appear most important when predicting the Outcome variable.
This dataset comes from the National Institute of Diabetes and Digestive and Kidney Diseases.
It is commonly used for introductory machine learning and exploratory data analysis tasks related to diabetes prediction.
Important characteristics of the dataset:
- 768 patient records
- 8 predictor variables
- 1 target variable: Outcome
- patients represented in this dataset are females at least 21 years old of Pima Indian heritage
Target variable:
- Outcome = 1 indicates diabetes
- Outcome = 0 indicates no diabetes
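The dataset shape described above can be checked before any analysis. The following is a minimal sketch, not the notebook's own code: the helper names `check_schema` and `class_balance` are hypothetical, and the column names are assumed from the commonly distributed version of diabetes.csv, so they may need adjusting for the copy in this repository.

```python
import pandas as pd

# Column names assumed from the commonly distributed diabetes.csv;
# adjust them if the copy in this repository differs.
PREDICTORS = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]
TARGET = "Outcome"


def check_schema(df: pd.DataFrame) -> None:
    """Raise ValueError if expected columns are missing or the target is not binary."""
    missing = [c for c in PREDICTORS + [TARGET] if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if not set(df[TARGET].unique()) <= {0, 1}:
        raise ValueError("Outcome must contain only 0 and 1")


def class_balance(df: pd.DataFrame) -> pd.Series:
    """Return the share of rows per Outcome value (0 = no diabetes, 1 = diabetes)."""
    return df[TARGET].value_counts(normalize=True).sort_index()
```

With the real file, `check_schema(pd.read_csv("diabetes.csv"))` would be expected to pass silently, and `class_balance` shows how evenly the two outcome classes are represented.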
- Diabetic Factors.ipynb: main notebook containing the analysis
- diabetes.csv: dataset used in the notebook
- README.md: project documentation
This project uses:
- Python
- NumPy
- Pandas
- Matplotlib
- Seaborn
- Jupyter Notebook
The notebook follows a simple exploratory data analysis pipeline:
- Import Python libraries for analysis and visualization.
- Load the diabetes dataset into a Pandas DataFrame.
- Inspect sample rows and summary statistics.
- Review data types and dataset structure.
- Identify unrealistic zero values in medical columns such as Glucose, BloodPressure, SkinThickness, Insulin, and BMI.
- Visualize the data distribution and feature behavior.
- Build a correlation matrix to understand relationships between features and diabetes outcome.
- Summarize the most influential factors and practical observations.
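The loading, inspection, and zero-value steps of this workflow could look roughly like the sketch below. It is an illustration under assumptions rather than a copy of the notebook: the function names (`load_data`, `summarize`, `count_zeros`) are hypothetical, and the file path assumes diabetes.csv sits next to the notebook.

```python
import pandas as pd

# Columns where a literal 0 is not a plausible measurement (from the workflow list).
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def load_data(path: str = "diabetes.csv") -> pd.DataFrame:
    """Load the CSV into a DataFrame (path assumed relative to the working directory)."""
    return pd.read_csv(path)


def summarize(df: pd.DataFrame) -> dict:
    """Collect the basic views used for a first look at the data."""
    return {
        "head": df.head(),          # sample rows
        "describe": df.describe(),  # summary statistics
        "dtypes": df.dtypes,        # data types per column
        "shape": df.shape,          # (rows, columns)
    }


def count_zeros(df: pd.DataFrame, columns=ZERO_AS_MISSING) -> pd.Series:
    """Count rows holding 0 in each listed column that exists in df."""
    present = [c for c in columns if c in df.columns]
    return (df[present] == 0).sum()
```

In a notebook cell this might be used as `df = load_data()` followed by `count_zeros(df)`, which returns one count per medical column so the affected features are visible at a glance.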
The notebook includes the following visuals:
- histogram
- horizontal bar plot
- box plot
- correlation heatmap
Based on the notebook analysis, the project highlights the following patterns:
- BMI, Age, and Pregnancies emerge as important external or demographic factors connected to diabetes outcome.
- Glucose and Insulin appear as strongly influential internal medical indicators.
- correlation analysis suggests that diabetes risk is associated with a combination of physiological and demographic variables rather than a single feature alone.
- the notebook notes that several medical columns contain unrealistic zero values, which is an important data quality consideration for downstream modeling work.
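One way to check which features are most strongly associated with Outcome is to rank them by the absolute value of their Pearson correlation with the target. The sketch below assumes all columns are numeric, as they are in diabetes.csv; the function name `rank_by_outcome_correlation` is hypothetical and not taken from the notebook.

```python
import pandas as pd


def rank_by_outcome_correlation(df: pd.DataFrame, target: str = "Outcome") -> pd.Series:
    """Return each feature's Pearson correlation with target, strongest first.

    Ordering uses the absolute value, so strong negative correlations rank
    alongside strong positive ones; the returned values keep their sign.
    """
    corr = df.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)
```

Because the unrealistic zero values are still present in the raw data, rankings computed before and after handling them may differ, so it can be worth running this on both versions.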
The notebook’s concluding interpretation emphasizes:
- keeping BMI under control may help reduce risk connected with high glucose and insulin levels
- glucose and insulin should be monitored more closely as age increases
- pregnancies may also interact with diabetes-related risk factors and should be observed carefully
This project is valuable because it demonstrates:
- practical exploratory data analysis on a real healthcare dataset
- basic medical-feature interpretation using Python
- correlation-driven reasoning before predictive modeling
- awareness of data quality issues in health datasets
- communication of results through charts and summary findings
It works well as a portfolio project for showcasing foundational data analysis skills in healthcare analytics and predictive problem framing.
- Clone the repository.
- Open the project folder.
- Install the required Python libraries if needed:
    pip install numpy pandas matplotlib seaborn jupyter
- Launch Jupyter Notebook:
    jupyter notebook
- Open Diabetic Factors.ipynb and run the cells.
This project could be extended by:
- handling zero values more rigorously through imputation or filtering
- adding feature engineering
- comparing diabetic and non-diabetic groups more formally
- building classification models such as logistic regression, decision trees, or random forests
- evaluating model accuracy and feature importance
- exporting cleaned charts directly into the repository for more reliable README rendering
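As an illustration of the first item, zero values in the medical columns could be treated as missing and filled with each column's median. This is only one possible approach and is not part of the current notebook; the function name `impute_zeros_with_median` is hypothetical, and median imputation is an assumption chosen for simplicity.

```python
import numpy as np
import pandas as pd

# Columns where a literal 0 is treated as a missing measurement.
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def impute_zeros_with_median(df: pd.DataFrame, columns=ZERO_AS_MISSING) -> pd.DataFrame:
    """Return a copy of df where zeros in the given columns become the column median.

    The median is computed after the zeros are set to NaN, so the implausible
    values do not pull it downward. Other columns are left untouched.
    """
    cols = [c for c in columns if c in df.columns]
    out = df.copy()
    out[cols] = out[cols].replace(0, np.nan)
    out[cols] = out[cols].fillna(out[cols].median())
    return out
```

Filtering is the alternative mentioned in the same item: after the `replace` step, `out.dropna(subset=cols)` would drop the affected rows instead of filling them, at the cost of a smaller sample.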
Divya Thakur
- GitHub: DivyaThakur24
- LinkedIn: divya-thakurr
- Portfolio: divyathakur24.github.io/DivyaThakurPortfolio



