Data Analysis of Factors Affecting Diabetes

This project explores the Pima Indians Diabetes Dataset to understand which medical and demographic factors are most strongly associated with diabetes outcomes.

The analysis is implemented in Python using a Jupyter notebook and focuses on data understanding, exploratory data analysis, visualization, and correlation-based interpretation.

Project Overview

The goal of this project is to study how variables such as:

glucose
insulin
BMI
age
pregnancies
blood pressure
skin thickness
diabetes pedigree function

relate to the likelihood of diabetes.

The notebook examines the structure of the dataset, highlights potential data quality issues, and uses visual exploration to identify the factors that appear most strongly associated with the Outcome variable.

Dataset Context

This dataset comes from the National Institute of Diabetes and Digestive and Kidney Diseases.

It is commonly used for introductory machine learning and exploratory data analysis tasks related to diabetes prediction.

Important characteristics of the dataset:

768 patient records
8 predictor variables
1 target variable: Outcome
all patients are females at least 21 years old and of Pima Indian heritage

Target variable:

Outcome = 1 indicates diabetes
Outcome = 0 indicates no diabetes
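
The characteristics listed above can be checked once the data is in a DataFrame. The following is a minimal sketch, not code taken from the notebook: `describe_dataset` is a hypothetical helper, and the three-row `toy` frame is made-up illustration data rather than real patient records.

```python
import pandas as pd


def describe_dataset(df: pd.DataFrame, target: str = "Outcome") -> dict:
    """Return basic structural facts about a diabetes-style DataFrame."""
    predictors = [col for col in df.columns if col != target]
    return {
        "n_records": len(df),
        "n_predictors": len(predictors),
        "n_diabetic": int((df[target] == 1).sum()),
        "n_non_diabetic": int((df[target] == 0).sum()),
    }


# Made-up rows for illustration only (not taken from diabetes.csv).
toy = pd.DataFrame(
    {
        "Glucose": [148, 85, 183],
        "BMI": [33.6, 26.6, 23.3],
        "Outcome": [1, 0, 1],
    }
)
summary = describe_dataset(toy)
print(summary)

# With the real file: describe_dataset(pd.read_csv("diabetes.csv"))
```

If the file matches the description above, the real call should report 768 records and 8 predictors.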

Files in This Repository

Diabetic Factors.ipynb: main notebook containing the analysis
diabetes.csv: dataset used in the notebook
README.md: project documentation

Tools and Libraries

This project uses:

Python
NumPy
Pandas
Matplotlib
Seaborn
Jupyter Notebook

Analysis Workflow

The notebook follows a simple exploratory data analysis pipeline:

Import Python libraries for analysis and visualization.
Load the diabetes dataset into a Pandas DataFrame.
Inspect sample rows and summary statistics.
Review data types and dataset structure.
Identify unrealistic zero values in medical columns such as Glucose, BloodPressure, SkinThickness, Insulin, and BMI.
Visualize the data distribution and feature behavior.
Build a correlation matrix to understand relationships between features and diabetes outcome.
Summarize the most influential factors and practical observations.
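
The zero-value check in the workflow can be sketched as follows. This is one possible way to do it, not necessarily how the notebook does it: `count_zero_values` is a hypothetical helper, and the `toy` frame contains made-up values for illustration.

```python
import pandas as pd

# Columns this README lists as containing unrealistic zeros.
MEDICAL_COLUMNS = ("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")


def count_zero_values(df: pd.DataFrame, columns=MEDICAL_COLUMNS) -> pd.Series:
    """Count entries equal to zero in each listed column present in df."""
    present = [col for col in columns if col in df.columns]
    return (df[present] == 0).sum()


# Made-up rows for illustration only.
toy = pd.DataFrame(
    {
        "Glucose": [148, 0, 183, 89],
        "BloodPressure": [72, 66, 0, 66],
        "SkinThickness": [35, 29, 0, 23],
        "Insulin": [0, 0, 0, 94],
        "BMI": [33.6, 26.6, 23.3, 0.0],
        "Outcome": [1, 0, 1, 0],
    }
)
zero_counts = count_zero_values(toy)
print(zero_counts)
```

A zero in these columns is physiologically implausible, so it most likely marks a missing measurement rather than a true reading.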

Visualizations Included

The notebook includes the following visuals:

histogram
horizontal bar plot
box plot
correlation heatmap
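
Two of these chart types, a histogram and a box plot, can be sketched with Matplotlib alone. This is an assumed layout, not the notebook's actual plotting code: `plot_feature_by_outcome` is a hypothetical helper, the data is made up, and the `Agg` backend line is only there so the sketch runs without a display (omit it inside a notebook).

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line in Jupyter
import matplotlib.pyplot as plt
import pandas as pd


def plot_feature_by_outcome(df: pd.DataFrame, feature: str, target: str = "Outcome"):
    """Draw a histogram of one feature and a box plot split by the target."""
    fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

    # Left panel: overall distribution of the feature.
    ax_hist.hist(df[feature].to_numpy(), bins=10)
    ax_hist.set_xlabel(feature)
    ax_hist.set_ylabel("Count")

    # Right panel: one box per target class.
    groups = sorted(df[target].unique())
    data = [df.loc[df[target] == g, feature].to_numpy() for g in groups]
    ax_box.boxplot(data)
    ax_box.set_xticks(range(1, len(groups) + 1))
    ax_box.set_xticklabels([str(g) for g in groups])
    ax_box.set_xlabel(target)
    ax_box.set_ylabel(feature)

    fig.tight_layout()
    return fig


# Made-up rows for illustration only.
toy = pd.DataFrame(
    {
        "Glucose": [85, 90, 150, 160, 95, 170],
        "Outcome": [0, 0, 1, 1, 0, 1],
    }
)
fig = plot_feature_by_outcome(toy, "Glucose")
```

Calling `fig.savefig("glucose.png")` would write the figure to disk; the file name is only an example.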

Visual previews from the original analysis (images not reproduced here): Histogram, Bar Plot, Box Plot, Correlation Heatmap.

Key Insights

Based on the notebook analysis, the project highlights the following patterns:

BMI, Age, and Pregnancies emerge as important external or demographic factors connected to diabetes outcome.
Glucose and Insulin appear as strongly influential internal medical indicators.
Correlation analysis suggests that diabetes risk is associated with a combination of physiological and demographic variables rather than with any single feature.
Several medical columns contain unrealistic zero values, an important data quality consideration for downstream modeling work.
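
A correlation-based ranking of this kind can be sketched as follows. It is an illustration of the general technique, not the notebook's code: `rank_by_outcome_correlation` is a hypothetical helper, and the `toy` values are made up, so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd


def rank_by_outcome_correlation(df: pd.DataFrame, target: str = "Outcome") -> pd.Series:
    """Return Pearson correlations with the target, strongest (by |r|) first."""
    numeric = df.select_dtypes("number")
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Made-up rows for illustration only.
toy = pd.DataFrame(
    {
        "Glucose": [85, 90, 150, 160, 95, 170],
        "BloodPressure": [70, 80, 72, 78, 74, 76],
        "Outcome": [0, 0, 1, 1, 0, 1],
    }
)
ranked = rank_by_outcome_correlation(toy)
print(ranked)
```

Pearson correlation only captures linear association and does not establish causation, so a ranking like this is a starting point for interpretation rather than a conclusion.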

The notebook’s concluding interpretation emphasizes:

keeping BMI under control may help reduce risk connected with high glucose and insulin levels
glucose and insulin should be monitored more closely as age increases
pregnancies may also interact with diabetes-related risk factors and should be observed carefully

Why This Project Matters

This project is valuable because it demonstrates:

practical exploratory data analysis on a real healthcare dataset
basic medical-feature interpretation using Python
correlation-driven reasoning before predictive modeling
awareness of data quality issues in health datasets
communication of results through charts and summary findings

It works well as a portfolio project for showcasing foundational data analysis skills in healthcare analytics and predictive problem framing.

How to Run

Clone the repository.
Open the project folder.
Install the required Python libraries if needed:

pip install numpy pandas matplotlib seaborn jupyter

Launch Jupyter Notebook:

jupyter notebook

Open Diabetic Factors.ipynb and run the cells.

Future Improvements

This project could be extended by:

handling zero values more rigorously through imputation or filtering
adding feature engineering
comparing diabetic and non-diabetic groups more formally
building classification models such as logistic regression, decision trees, or random forests
evaluating model accuracy and feature importance
exporting cleaned charts directly into the repository for more reliable README rendering
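
As an illustration of the first item, zero values could be treated as missing and filled with the column median. This is only one of several reasonable strategies, sketched under the assumption that zeros mark missing measurements: `impute_zeros_with_median` is a hypothetical helper, and the `toy` values are made up.

```python
import numpy as np
import pandas as pd


def impute_zeros_with_median(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Return a copy of df where zeros in the given columns are replaced
    by the median of that column's non-zero values."""
    cleaned = df.copy()
    for col in columns:
        # Treat zeros as missing, then fill them with the median of the rest.
        values = cleaned[col].astype(float).replace(0, np.nan)
        cleaned[col] = values.fillna(values.median())
    return cleaned


# Made-up rows for illustration only.
toy = pd.DataFrame(
    {
        "Glucose": [148, 0, 183, 89],
        "BMI": [33.6, 26.6, 0.0, 23.3],
        "Outcome": [1, 0, 1, 0],
    }
)
cleaned = impute_zeros_with_median(toy, ["Glucose", "BMI"])
print(cleaned)
```

The original frame is left unchanged. A column consisting only of zeros would end up as all NaN, because no median can be computed for it.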

Author

Divya Thakur

GitHub: DivyaThakur24
LinkedIn: divya-thakurr
Portfolio: divyathakur24.github.io/DivyaThakurPortfolio
