AbdulHadi17/Regression-Projects-ML

Machine Learning Regression Projects

This repository contains two comprehensive machine learning projects demonstrating both Linear Regression and Logistic Regression techniques using scikit-learn.


🏦 Project 1: Loan Approval Classification (Logistic Regression)

Overview

A binary classification project that predicts whether a loan application will be approved or rejected based on applicant information and loan characteristics.

Dataset Description

Features:

  • Demographic Information:

    • person_age: Age of the applicant
    • person_gender: Gender (male/female)
    • person_education: Education level (High School, Associate, Bachelor, Master, Doctorate)
    • person_income: Annual income
    • person_emp_exp: Employment experience in years
    • person_home_ownership: Home ownership status (RENT, OWN, MORTGAGE, OTHER)
  • Loan Information:

    • loan_amnt: Loan amount requested
    • loan_intent: Purpose of loan (PERSONAL, EDUCATION, MEDICAL, VENTURE, HOMEIMPROVEMENT, DEBTCONSOLIDATION)
    • loan_int_rate: Interest rate
    • loan_percent_income: Loan amount as percentage of income
  • Credit Information:

    • cb_person_cred_hist_length: Credit history length
    • credit_score: Credit score
    • previous_loan_defaults_on_file: Previous loan defaults (Yes/No)

Problem Statement

Predict whether a loan application will be approved based on the applicant's demographic information, loan characteristics, and credit history. This is a binary classification problem using Logistic Regression.

Exploratory Data Analysis (EDA)

  • Data Quality Check: No missing values found in the dataset

  • Data Distribution Analysis:

    • Visualized categorical variable distributions using bar plots with percentages
    • Analyzed demographic patterns across gender, education, home ownership, and loan intent
  • Outlier Detection: Identified outliers in numerical features using boxplots and IQR method:

    • person_age: 4.86% outliers
    • person_income: 4.93% outliers
    • person_emp_exp: 3.83% outliers
    • loan_amnt: 5.22% outliers
    • loan_int_rate: 0.28% outliers
    • loan_percent_income: 1.65% outliers
    • cb_person_cred_hist_length: 3.04% outliers
    • credit_score: 1.04% outliers
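The outlier percentages above can be reproduced with a small helper. This is a minimal sketch, assuming the conventional 1.5 × IQR fences (the multiplier is not stated in this README); the function name and the toy data are hypothetical and not taken from the loan dataset.

```python
import pandas as pd


def iqr_outlier_percentage(series: pd.Series, k: float = 1.5) -> float:
    """Return the percentage of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    is_outlier = (series < lower) | (series > upper)
    return 100.0 * is_outlier.mean()


# Toy data (hypothetical): nine typical ages and one extreme value.
ages = pd.Series([22, 23, 24, 25, 26, 27, 28, 29, 30, 95])
outlier_pct = iqr_outlier_percentage(ages)
print(f"{outlier_pct:.2f}% outliers")
```

Applied column by column to the numerical features, a helper like this would yield one percentage per feature, as in the list above.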

Feature Engineering

1. Outlier Treatment

Applied capping method (winsorization) to handle outliers in:

  • loan_amnt
  • person_income
  • loan_percent_income

Outliers were capped at the upper and lower bounds calculated using IQR (Interquartile Range) method.
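A minimal sketch of IQR-based capping, assuming 1.5 × IQR fences and pandas `Series.clip`; the helper name and sample values are hypothetical. Values beyond a fence are replaced by the fence itself rather than dropped, so the row count is unchanged.

```python
import pandas as pd


def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to the IQR fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Toy loan amounts (hypothetical): the last value is far above the rest.
loan_amnt = pd.Series([1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 100000])
capped = cap_outliers_iqr(loan_amnt)
print(capped.tolist())
```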

2. Encoding Techniques

  • Binary Encoding:

    • person_gender: female = 0, male = 1
    • previous_loan_defaults_on_file: No = 0, Yes = 1
  • Ordinal Encoding:

    • person_education:
      • High School = 0
      • Associate = 1
      • Bachelor = 2
      • Master = 3
      • Doctorate = 4
  • One-Hot Encoding:

    • person_home_ownership: Created dummy variables (OTHER, OWN, RENT); MORTGAGE is the omitted baseline category
    • loan_intent: Created dummy variables (EDUCATION, HOMEIMPROVEMENT, MEDICAL, PERSONAL, VENTURE); DEBTCONSOLIDATION is the omitted baseline category
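The three encodings can be sketched with pandas as follows. The mappings come from the lists above; the use of `Series.map` and `pd.get_dummies(..., drop_first=True)` is an assumption, inferred from the fact that each dummy list leaves out one category. The three-row DataFrame is hypothetical.

```python
import pandas as pd

# Hypothetical sample rows using the column names from the dataset description.
df = pd.DataFrame({
    "person_gender": ["female", "male", "male"],
    "previous_loan_defaults_on_file": ["No", "Yes", "No"],
    "person_education": ["High School", "Master", "Doctorate"],
    "person_home_ownership": ["MORTGAGE", "RENT", "OWN"],
    "loan_intent": ["DEBTCONSOLIDATION", "MEDICAL", "VENTURE"],
})

# Binary encoding
df["person_gender"] = df["person_gender"].map({"female": 0, "male": 1})
df["previous_loan_defaults_on_file"] = df["previous_loan_defaults_on_file"].map({"No": 0, "Yes": 1})

# Ordinal encoding: the integer order mirrors the education ranking
education_order = {"High School": 0, "Associate": 1, "Bachelor": 2, "Master": 3, "Doctorate": 4}
df["person_education"] = df["person_education"].map(education_order)

# One-hot encoding; drop_first=True (assumed) drops the alphabetically first category
encoded = pd.get_dummies(
    df,
    columns=["person_home_ownership", "loan_intent"],
    drop_first=True,
    dtype=int,
)
print(encoded.columns.tolist())
```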

3. Feature Scaling

Applied StandardScaler to standardize numerical features (zero mean, unit variance), which puts them on a comparable scale for Logistic Regression.
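A minimal sketch of the scaling step on hypothetical values. Fitting the scaler on the training split only and reusing it on the test split is an assumption about good practice, not something this README states.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features: [person_age, person_income]
X_train = np.array([[20.0, 30000.0], [30.0, 50000.0], [40.0, 70000.0]])
X_test = np.array([[30.0, 50000.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training data
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```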

4. Multicollinearity Check

Calculated Variance Inflation Factor (VIF) to detect multicollinearity among features.
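For reference, the VIF of feature j is 1 / (1 − R²ⱼ), where R²ⱼ is obtained by regressing feature j on all the other features. The sketch below computes this directly with NumPy to show the definition; it is not the project's code, which per the technology list relies on statsmodels. The function name and synthetic data are hypothetical.

```python
import numpy as np


def variance_inflation_factors(X: np.ndarray) -> np.ndarray:
    """Compute VIF_j = 1 / (1 - R_j^2) for every column of X."""
    n_samples, n_features = X.shape
    vifs = np.empty(n_features)
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_samples), others])  # intercept + other features
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs


# Synthetic data: `c` is nearly a copy of `a`, `b` is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
vifs = variance_inflation_factors(np.column_stack([a, b, c]))
print(vifs.round(2))
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity; here the correlated pair `a` and `c` should stand out while `b` stays near 1.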

Model Training

  • Algorithm: Logistic Regression
  • Cross-Validation: K-Fold Cross-Validation to ensure model generalization
  • Evaluation Metrics: Classification report including precision, recall, F1-score, and accuracy
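The training and evaluation steps above might look like the following sketch. It uses synthetic data from `make_classification` in place of the loan dataset, and the fold count, split ratio, `max_iter`, and random seeds are all assumptions, since this README does not list them for Project 1.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in for the preprocessed loan features and approval labels.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)

# K-Fold cross-validation on the training split (5 folds assumed).
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")

# Fit on the full training split and report per-class metrics on held-out data.
model.fit(X_train, y_train)
report = classification_report(y_test, model.predict(X_test))

print(f"Mean CV accuracy: {cv_scores.mean():.3f}")
print(report)
```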

Results

Loan approval predictions are evaluated with classification_report (precision, recall, F1-score, and accuracy) together with a confusion matrix analysis.

Key Techniques Demonstrated

✅ Exploratory Data Analysis (EDA)
✅ Outlier Detection and Treatment
✅ Multiple Encoding Techniques (Binary, Ordinal, One-Hot)
✅ Feature Scaling (StandardScaler)
✅ Multicollinearity Analysis (VIF)
✅ K-Fold Cross-Validation
✅ Model Evaluation with Classification Metrics


💰 Project 2: Salary Prediction (Linear Regression)

Overview

A regression project that predicts employee salary based on age and years of experience using Linear Regression.

Dataset Description

  • Source: Kaggle - Salary Prediction for Beginner
  • Features:
    • Age: Age of the employee
    • Gender: Gender of the employee
    • Education Level: Highest education level
    • Job Title: Job position
    • Years of Experience: Total work experience
    • Salary: Target variable (annual salary)

Problem Statement

Predict employee salary based on Age and Years of Experience. This is a continuous regression problem using Linear Regression.

Data Preprocessing

  • Missing Value Handling: Removed all rows with missing values
  • Feature Selection: Selected Age and Years of Experience as predictor variables
  • Train-Test Split: 80% training, 20% testing (random_state=42)
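The preprocessing steps above can be sketched as follows. The column names, the 80/20 split, and `random_state=42` come from this README; the small DataFrame is hypothetical and stands in for the Kaggle file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical rows; one row has a missing Age.
df = pd.DataFrame({
    "Age": [25, 30, np.nan, 40, 45, 50, 28, 35, 42, 38, 55],
    "Years of Experience": [2, 5, 8, 15, 20, 25, 3, 10, 16, 12, 30],
    "Salary": [40000, 60000, 75000, 110000, 130000, 150000,
               45000, 85000, 115000, 95000, 170000],
})

df = df.dropna()                          # remove rows with missing values
X = df[["Age", "Years of Experience"]]    # selected predictors
y = df["Salary"]                          # target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```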

Model Training

  • Algorithm: Linear Regression
  • Cross-Validation: 25-fold K-Fold Cross-Validation with shuffling
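A minimal sketch of 25-fold cross-validation with shuffling. The data is synthetic, with arbitrary coefficients and noise chosen for illustration; the `random_state` of the splitter and the R² scoring are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the salary data (all numbers here are arbitrary).
rng = np.random.default_rng(0)
n = 250
experience = rng.uniform(0, 25, size=n)
age = 22 + experience + rng.uniform(0, 5, size=n)
salary = 30000 + 2500 * age + 4000 * experience + rng.normal(0, 10000, size=n)
X = np.column_stack([age, experience])

# 25 folds with shuffling, as described above; each fold holds out 10 samples here.
cv = KFold(n_splits=25, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, salary, cv=cv, scoring="r2")

print(f"Mean R^2 across {len(scores)} folds: {scores.mean():.4f}")
```

With this many folds each validation set is small, so per-fold scores can vary widely; the mean is the figure usually reported.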

Model Coefficients

The trained model learned the following relationships:

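  • Age Coefficient: 2657.46 (predicted change in salary per additional year of age, holding experience constant)
  • Years of Experience Coefficient: 4008.06 (predicted change in salary per additional year of experience, holding age constant)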
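Because the intercept is not listed here, the coefficients can only be used to compare two profiles, not to compute an absolute salary. A short worked example, with a hypothetical helper name:

```python
# Coefficients as reported above; the intercept is not given in this README.
AGE_COEF = 2657.46
EXPERIENCE_COEF = 4008.06


def predicted_salary_change(delta_age: float, delta_experience: float) -> float:
    """Difference in predicted salary between two profiles under the linear model."""
    return AGE_COEF * delta_age + EXPERIENCE_COEF * delta_experience


# One year older with one more year of experience.
change = predicted_salary_change(1, 1)
print(f"{change:.2f}")  # 6665.52
```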

Results

Performance Metrics:

  • Cross-Validation Score: 84.67%
  • Training Score: 86.24%
  • Test Score (R² Score): 88.85%
  • Mean Squared Error (MSE): 267,299,022.86
  • Root Mean Squared Error (RMSE): 16,349.28
  • Mean Absolute Error (MAE): 12,358.46
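The metrics relate to one another in a simple way: RMSE is the square root of MSE, and the reported figures are consistent with that. The sketch below checks this relationship and then computes all four metrics with scikit-learn on hypothetical toy values, not the project's predictions.

```python
import math

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Consistency check on the reported figures: sqrt(MSE) should match RMSE (16,349.28).
reported_mse = 267_299_022.86
rmse_from_mse = math.sqrt(reported_mse)
print(f"sqrt(MSE) = {rmse_from_mse:.2f}")

# Toy actual and predicted salaries (hypothetical).
y_true = np.array([50000.0, 60000.0, 70000.0, 80000.0])
y_pred = np.array([52000.0, 58000.0, 71000.0, 79000.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(mse, rmse, mae, r2)
```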

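On the test set the model reaches an R² of 88.85%, meaning that age and years of experience together explain about 88.85% of the variance in held-out salaries. The test score is slightly higher than the training (86.24%) and cross-validation (84.67%) scores, so there is no sign of overfitting.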

Visualization

  • Actual vs Predicted Scatter Plot: Visualizes model predictions against actual salaries with a 45-degree reference line showing perfect prediction.
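A plot of this kind can be sketched as below. It uses matplotlib directly with hypothetical values; seaborn's `scatterplot` could be substituted for the scatter call. The `Agg` backend is selected only so the sketch runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical actual and predicted salaries.
y_test = np.array([45000.0, 60000.0, 85000.0, 110000.0, 150000.0])
y_pred = np.array([50000.0, 58000.0, 90000.0, 105000.0, 140000.0])

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(y_test, y_pred, alpha=0.7, label="Predictions")

# Reference line y = x: points on it are predicted perfectly.
low = min(y_test.min(), y_pred.min())
high = max(y_test.max(), y_pred.max())
ax.plot([low, high], [low, high], linestyle="--", color="red", label="Perfect prediction")

ax.set_xlabel("Actual salary")
ax.set_ylabel("Predicted salary")
ax.set_title("Actual vs Predicted")
ax.legend()
```

Points above the dashed line are over-predictions and points below it are under-predictions.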

Key Techniques Demonstrated

✅ Data Cleaning (Missing Value Handling)
✅ Feature Selection
✅ Train-Test Split
✅ K-Fold Cross-Validation (25 folds)
✅ Linear Regression Model Training
✅ Multiple Evaluation Metrics (R², MSE, RMSE, MAE)
✅ Result Visualization with Seaborn


🛠️ Technologies Used

  • Python 3.13.0
  • pandas: Data manipulation and analysis
  • numpy: Numerical computations
  • scikit-learn: Machine learning algorithms and tools
  • seaborn: Statistical data visualization
  • matplotlib: Plotting and visualization
  • statsmodels: Statistical modeling (VIF calculation)
  • joblib: Model serialization
  • Jupyter Notebook: Interactive development environment

📦 Installation

  1. Clone the repository:
     git clone https://github.com/AbdulHadi17/Regression-Projects-ML.git
     cd Regression-Projects-ML

