DivyaThakur24/DataAnalysisDiabetesFactor

Data Analysis of Factors Affecting Diabetes


This project explores the Pima Indians Diabetes Dataset to understand which medical and lifestyle-related factors are most strongly associated with diabetes outcomes.

The analysis is implemented in Python using a Jupyter notebook and focuses on data understanding, exploratory data analysis, visualization, and correlation-based interpretation.

Project Overview

The goal of this project is to study how variables such as:

  • glucose
  • insulin
  • BMI
  • age
  • pregnancies
  • blood pressure
  • skin thickness
  • diabetes pedigree function

relate to the likelihood of diabetes.

The notebook examines the structure of the dataset, highlights potential data quality issues, and uses visual exploration to identify the factors most strongly associated with the Outcome variable.

Dataset Context

This dataset comes from the National Institute of Diabetes and Digestive and Kidney Diseases.

It is commonly used for introductory machine learning and exploratory data analysis tasks related to diabetes prediction.

Important characteristics of the dataset:

  • 768 patient records
  • 8 predictor variables
  • 1 target variable: Outcome
  • all patients in the dataset are women of Pima Indian heritage, at least 21 years old

Target variable:

  • Outcome = 1 indicates diabetes
  • Outcome = 0 indicates no diabetes
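
As a minimal sketch of how these characteristics could be checked after loading, the snippet below validates the column layout and counts the Outcome classes. The column names follow the commonly distributed CSV release of this dataset and are an assumption here, as is the helper name summarize_structure; the three sample rows are made up for illustration and are not real patient records.

```python
import io
import pandas as pd

# Column names as found in the commonly distributed CSV release (assumed).
EXPECTED_COLUMNS = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]


def summarize_structure(df: pd.DataFrame) -> dict:
    """Return row count, predictor count, and Outcome class counts."""
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return {
        "n_rows": len(df),
        "n_predictors": len(df.columns) - 1,  # every column except Outcome
        "outcome_counts": df["Outcome"].value_counts().to_dict(),
    }


# Three made-up rows, for illustration only (not real patient records).
SAMPLE_CSV = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
2,150,70,30,120,33.5,0.60,45,1
1,90,65,25,0,24.0,0.20,23,0
4,110,0,0,0,29.1,0.35,31,0
"""

sample = pd.read_csv(io.StringIO(SAMPLE_CSV))
summary = summarize_structure(sample)
print(summary)
```

Run against the full dataset, n_rows should be 768 and n_predictors should be 8, matching the characteristics listed above.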

Files in This Repository

Tools and Libraries

This project uses:

  • Python
  • NumPy
  • Pandas
  • Matplotlib
  • Seaborn
  • Jupyter Notebook

Analysis Workflow

The notebook follows a simple exploratory data analysis pipeline:

  1. Import Python libraries for analysis and visualization.
  2. Load the diabetes dataset into a Pandas DataFrame.
  3. Inspect sample rows and summary statistics.
  4. Review data types and dataset structure.
  5. Identify unrealistic zero values in the medical columns Glucose, BloodPressure, SkinThickness, Insulin, and BMI.
  6. Visualize the data distribution and feature behavior.
  7. Build a correlation matrix to understand relationships between features and diabetes outcome.
  8. Summarize the most influential factors and practical observations.
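
Step 5 can be sketched roughly as follows. This is not necessarily how the notebook does it: the helper names count_zeros and zeros_to_nan are hypothetical, and the small DataFrame is made-up illustration data. The idea is that a zero in these columns is physiologically implausible, so it is counted and then marked as missing.

```python
import numpy as np
import pandas as pd

# Columns where a value of 0 is treated as "not recorded" (from step 5).
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def count_zeros(df: pd.DataFrame) -> pd.Series:
    """Count zero entries per medical column."""
    return (df[ZERO_AS_MISSING] == 0).sum()


def zeros_to_nan(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with zeros in the medical columns replaced by NaN."""
    out = df.copy()
    out[ZERO_AS_MISSING] = out[ZERO_AS_MISSING].replace(0, np.nan)
    return out


# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Glucose": [150, 90, 110],
    "BloodPressure": [70, 65, 0],
    "SkinThickness": [30, 25, 0],
    "Insulin": [120, 0, 0],
    "BMI": [33.5, 24.0, 29.1],
    "Outcome": [1, 0, 0],
})

zero_counts = count_zeros(toy)
cleaned = zeros_to_nan(toy)
print(zero_counts)
print(cleaned.isna().sum())
```

Outcome is deliberately left out of the replacement, since 0 is a valid class label there.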

Visualizations Included

The notebook includes the following visuals:

  • histogram
  • horizontal bar plot
  • box plot
  • correlation heatmap
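
The numbers behind a correlation heatmap come from a pairwise correlation matrix. A minimal sketch, assuming Pearson correlation (the pandas default) and a hypothetical helper named outcome_correlations, ranks each feature by its correlation with Outcome. The four rows are made-up illustration data.

```python
import pandas as pd


def outcome_correlations(df: pd.DataFrame, target: str = "Outcome") -> pd.Series:
    """Rank features by their Pearson correlation with the target column."""
    corr = df.corr()  # pairwise Pearson correlation of all numeric columns
    return corr[target].drop(target).sort_values(ascending=False)


# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Glucose": [85, 95, 150, 170],
    "Age": [50, 30, 40, 20],
    "Outcome": [0, 0, 1, 1],
})

ranked = outcome_correlations(toy)
print(ranked)
```

The full matrix returned by df.corr() is the kind of input that seaborn's heatmap function accepts. Correlation describes association only, so a high rank here does not by itself show that a feature causes diabetes.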

Visual preview from the original analysis:

Histogram

[Figure: histogram]

Bar Plot

[Figure: bar plot]

Box Plot

[Figure: box plot]

Correlation Heatmap

[Figure: correlation heatmap]

Key Insights

Based on the notebook analysis, the project highlights the following patterns:

  • BMI, Age, and Pregnancies emerge as important external or demographic factors connected to diabetes outcome.
  • Glucose and Insulin appear as strongly influential internal medical indicators.
  • Correlation analysis suggests that diabetes risk is associated with a combination of physiological and demographic variables rather than with any single feature.
  • Several medical columns contain unrealistic zero values, an important data quality consideration for downstream modeling work.

The notebook’s concluding interpretation emphasizes:

  • keeping BMI under control may help reduce risk connected with high glucose and insulin levels
  • glucose and insulin should be monitored more closely as age increases
  • pregnancies may also interact with diabetes-related risk factors and should be observed carefully

Why This Project Matters

This project is valuable because it demonstrates:

  • practical exploratory data analysis on a real healthcare dataset
  • basic medical-feature interpretation using Python
  • correlation-driven reasoning before predictive modeling
  • awareness of data quality issues in health datasets
  • communication of results through charts and summary findings

It works well as a portfolio project for showcasing foundational data analysis skills in healthcare analytics and predictive problem framing.

How to Run

  1. Clone the repository.
  2. Open the project folder.
  3. Install the required Python libraries if needed:
     pip install numpy pandas matplotlib seaborn jupyter
  4. Launch Jupyter Notebook:
     jupyter notebook
  5. Open Diabetic Factors.ipynb and run the cells.

Future Improvements

This project could be extended by:

  • handling zero values more rigorously through imputation or filtering
  • adding feature engineering
  • comparing diabetic and non-diabetic groups more formally
  • building classification models such as logistic regression, decision trees, or random forests
  • evaluating model accuracy and feature importance
  • exporting cleaned charts directly into the repository for more reliable README rendering
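
As one possible starting point for the first and third items, the sketch below replaces zeros with the column median and then compares group means by Outcome. Median imputation is only one option among several and is not taken from the notebook; the helper names and the four rows of data are hypothetical and made up for illustration.

```python
import numpy as np
import pandas as pd

MEDICAL_COLUMNS = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def impute_zeros_with_median(df: pd.DataFrame) -> pd.DataFrame:
    """Replace zeros in the medical columns with the median of the non-zero values."""
    out = df.copy()
    for col in MEDICAL_COLUMNS:
        valid = out[col].replace(0, np.nan)  # treat 0 as missing
        out[col] = valid.fillna(valid.median())  # median skips NaN by default
    return out


def compare_groups(df: pd.DataFrame) -> pd.DataFrame:
    """Mean of each medical column for Outcome = 0 and Outcome = 1."""
    return df.groupby("Outcome")[MEDICAL_COLUMNS].mean()


# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Glucose": [150, 90, 160, 100],
    "BloodPressure": [70, 0, 80, 60],
    "SkinThickness": [30, 25, 35, 20],
    "Insulin": [120, 0, 80, 0],
    "BMI": [33.0, 24.0, 35.0, 26.0],
    "Outcome": [1, 0, 1, 0],
})

imputed = impute_zeros_with_median(toy)
group_means = compare_groups(imputed)
print(group_means)
```

A more formal comparison would add a statistical test on top of these group means, and imputing within each Outcome group rather than across the whole column is a design choice worth evaluating separately.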

Author

Divya Thakur

About

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger …
