Machine Learning Regression Projects

This repository contains two machine learning projects built with scikit-learn: a Logistic Regression classifier for loan approval and a Linear Regression model for salary prediction.

📋 Table of Contents

Project 1: Loan Approval Classification (Logistic Regression)
Project 2: Salary Prediction (Linear Regression)
Technologies Used
Installation
Usage

🏦 Project 1: Loan Approval Classification (Logistic Regression)

Overview

A binary classification project that predicts whether a loan application will be approved or rejected based on applicant information and loan characteristics.

Dataset Description

Source: Kaggle - Loan Approval Classification Data
Size: 45,000 records with 14 columns (13 features plus the target)
Target Variable: loan_status (1 = Approved, 0 = Rejected)

Features:

Demographic Information:
- person_age: Age of the applicant
- person_gender: Gender (male/female)
- person_education: Education level (High School, Associate, Bachelor, Master, Doctorate)
- person_income: Annual income
- person_emp_exp: Employment experience in years
- person_home_ownership: Home ownership status (RENT, OWN, MORTGAGE, OTHER)
Loan Information:
- loan_amnt: Loan amount requested
- loan_intent: Purpose of loan (PERSONAL, EDUCATION, MEDICAL, VENTURE, HOMEIMPROVEMENT, DEBTCONSOLIDATION)
- loan_int_rate: Interest rate
- loan_percent_income: Loan amount as percentage of income
Credit Information:
- cb_person_cred_hist_length: Credit history length
- credit_score: Credit score
- previous_loan_defaults_on_file: Previous loan defaults (Yes/No)

Problem Statement

Predict whether a loan application will be approved based on the applicant's demographic information, loan characteristics, and credit history. This is a binary classification problem using Logistic Regression.

Exploratory Data Analysis (EDA)

Data Quality Check: No missing values found in the dataset
Data Distribution Analysis:
- Visualized categorical variable distributions using bar plots with percentages
- Analyzed demographic patterns across gender, education, home ownership, and loan intent
Outlier Detection: Identified outliers in numerical features using boxplots and IQR method:
- person_age: 4.86% outliers
- person_income: 4.93% outliers
- person_emp_exp: 3.83% outliers
- loan_amnt: 5.22% outliers
- loan_int_rate: 0.28% outliers
- loan_percent_income: 1.65% outliers
- cb_person_cred_hist_length: 3.04% outliers
- credit_score: 1.04% outliers
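The percentages above come from an IQR rule. A minimal sketch of how such a share could be computed is shown below; the 1.5 multiplier is the conventional choice and an assumption here, and the helper name `iqr_outlier_share` and the toy data are hypothetical, not taken from the notebook.

```python
import pandas as pd

def iqr_outlier_share(series: pd.Series, k: float = 1.5) -> float:
    """Return the percentage of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outside = (series < lower) | (series > upper)
    return 100.0 * outside.mean()

# Toy data (not from the loan dataset): one extreme value among ten.
ages = pd.Series([22, 23, 24, 25, 26, 27, 28, 29, 30, 95])
share = iqr_outlier_share(ages)
print(f"{share:.2f}% outliers")
```

Applying the same helper to each numerical column of the real dataset would give a per-feature table like the one above.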

Feature Engineering

1. Outlier Treatment

Applied capping (winsorization) to handle outliers in:

- loan_amnt
- person_income
- loan_percent_income

Values outside the lower and upper bounds computed with the IQR (Interquartile Range) method were replaced by the nearest bound.
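The capping step can be sketched with `Series.clip`. This is an illustration under assumptions: the 1.5 multiplier and the helper name `cap_iqr` are not stated in the project, and the data is made up.

```python
import pandas as pd

def cap_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to the IQR fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Toy frame (not the real dataset): the last loan amount is an extreme value.
df = pd.DataFrame({
    "loan_amnt": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0,
                  6000.0, 7000.0, 8000.0, 9000.0, 100000.0],
})
df["loan_amnt"] = cap_iqr(df["loan_amnt"])
print(df["loan_amnt"].max())
```

Capping keeps every row in the dataset, which is why it is often preferred over dropping outliers when the extreme values are plausible but distort the fit.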

2. Encoding Techniques

Binary Encoding:
- person_gender: female = 0, male = 1
- previous_loan_defaults_on_file: No = 0, Yes = 1
Ordinal Encoding:
- person_education:
  - High School = 0
  - Associate = 1
  - Bachelor = 2
  - Master = 3
  - Doctorate = 4
One-Hot Encoding:
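- person_home_ownership: Created dummy variables (OTHER, OWN, RENT), leaving MORTGAGE as the omitted baseline category
- loan_intent: Created dummy variables (EDUCATION, HOMEIMPROVEMENT, MEDICAL, PERSONAL, VENTURE), leaving DEBTCONSOLIDATION as the omitted baseline category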
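The three encodings above can be sketched with pandas as follows. The column names and mappings come from this README; the use of `drop_first=True` is an assumption inferred from the dummy columns listed, and the three sample rows are made up.

```python
import pandas as pd

# Toy rows (not from the real dataset) using the README's column names.
df = pd.DataFrame({
    "person_gender": ["female", "male", "male"],
    "previous_loan_defaults_on_file": ["No", "Yes", "No"],
    "person_education": ["High School", "Master", "Doctorate"],
    "person_home_ownership": ["MORTGAGE", "RENT", "OWN"],
    "loan_intent": ["DEBTCONSOLIDATION", "MEDICAL", "VENTURE"],
})

# Binary encoding: two-level categories become 0/1.
df["person_gender"] = df["person_gender"].map({"female": 0, "male": 1})
df["previous_loan_defaults_on_file"] = (
    df["previous_loan_defaults_on_file"].map({"No": 0, "Yes": 1})
)

# Ordinal encoding: education levels have a natural order.
education_order = {
    "High School": 0, "Associate": 1, "Bachelor": 2, "Master": 3, "Doctorate": 4,
}
df["person_education"] = df["person_education"].map(education_order)

# One-hot encoding: unordered categories become dummy columns.
# drop_first=True (an assumption) omits the first level as the baseline.
encoded = pd.get_dummies(
    df,
    columns=["person_home_ownership", "loan_intent"],
    drop_first=True,
    dtype=int,
)
print(encoded.columns.tolist())
```

Dropping one level per categorical feature avoids perfectly collinear dummy columns, which matters for the VIF check later in the pipeline.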

3. Feature Scaling

Applied StandardScaler to standardize numerical features (zero mean, unit variance) so that features measured on different scales contribute comparably to the model.
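A minimal sketch of the scaling step follows. Fitting the scaler on the training data only and reusing it for the test data is a common practice and an assumption here; the README does not say how the notebook applies it. The arrays are toy values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy numeric features (age, income); not taken from the real dataset.
X_train = np.array([[20.0, 30000.0], [30.0, 50000.0], [40.0, 70000.0]])
X_test = np.array([[30.0, 50000.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training data
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```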

4. Multicollinearity Check

Calculated the Variance Inflation Factor (VIF) for each feature to detect multicollinearity.
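To show what VIF measures, the sketch below computes it directly with NumPy least squares instead of the statsmodels helper the project lists. For feature j, VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing feature j on all other features. The function name `vif` and the synthetic data are hypothetical.

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """Return VIF_j = 1 / (1 - R_j^2) for every column j of X."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept plus other features
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic features: c is nearly a copy of a, b is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
vifs = vif(np.column_stack([a, b, c]))
print(np.round(vifs, 1))
```

A VIF near 1 indicates little linear overlap with the other features; a common rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity.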

Model Training

Algorithm: Logistic Regression
Cross-Validation: K-Fold Cross-Validation to ensure model generalization
Evaluation Metrics: Classification report including precision, recall, F1-score, and accuracy
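The training and evaluation steps above might look like the following sketch. It uses synthetic data from `make_classification` in place of the loan dataset, and the fold count, `max_iter`, split ratio, and seeds are assumptions, since the README does not state them for this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded loan features and loan_status target.
X, y = make_classification(n_samples=1000, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling inside the pipeline is refit on each training fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# K-Fold cross-validation on the training split (5 folds is an assumption).
cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
print(f"Mean CV accuracy: {cv_scores.mean():.3f}")

# Final fit and per-class precision, recall, and F1 on the held-out data.
model.fit(X_train, y_train)
report = classification_report(y_test, model.predict(X_test))
print(report)
```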

Results

The model's predictions of loan approval status are evaluated with scikit-learn's classification_report (precision, recall, F1-score, and accuracy) and a confusion matrix.

Key Techniques Demonstrated

✅ Exploratory Data Analysis (EDA)
✅ Outlier Detection and Treatment
✅ Multiple Encoding Techniques (Binary, Ordinal, One-Hot)
✅ Feature Scaling (StandardScaler)
✅ Multicollinearity Analysis (VIF)
✅ K-Fold Cross-Validation
✅ Model Evaluation with Classification Metrics

💰 Project 2: Salary Prediction (Linear Regression)

Overview

A regression project that predicts employee salary based on age and years of experience using Linear Regression.

Dataset Description

Source: Kaggle - Salary Prediction for Beginner
Features:
- Age: Age of the employee
- Gender: Gender of the employee
- Education Level: Highest education level
- Job Title: Job position
- Years of Experience: Total work experience
- Salary: Target variable (annual salary)

Problem Statement

Predict employee salary based on Age and Years of Experience. This is a regression problem with a continuous target, solved with Linear Regression.

Data Preprocessing

Missing Value Handling: Removed all rows with missing values
Feature Selection: Selected Age and Years of Experience as predictor variables
Train-Test Split: 80% training, 20% testing (random_state=42)
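The three preprocessing steps can be sketched as below. The column names, split ratio, and `random_state=42` follow this README; the rows are made-up placeholders for the Kaggle data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy rows (not the Kaggle data); two rows contain a missing value.
df = pd.DataFrame({
    "Age": [25, 30, None, 45, 38, 29, 52, 41, 33, 27, 48, 36],
    "Years of Experience": [2, 5, 7, 20, 12, 4, 25, 15, None, 3, 22, 10],
    "Salary": [40000, 60000, 75000, 150000, 110000, 55000,
               180000, 120000, 80000, 45000, 160000, 95000],
})

# Missing value handling: drop every row with at least one missing value.
df = df.dropna()

# Feature selection: keep the two predictors and the target.
X = df[["Age", "Years of Experience"]]
y = df["Salary"]

# Train-test split: 80% training, 20% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```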

Model Training

Algorithm: Linear Regression
Cross-Validation: 25-fold K-Fold Cross-Validation with shuffling
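A minimal sketch of 25-fold cross-validation with shuffling is shown below. The data is synthetic, and the KFold seed and the `r2` scoring choice are assumptions (R² is the default score for scikit-learn regressors).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the salary data: experience grows with age.
rng = np.random.default_rng(42)
age = rng.uniform(22, 60, size=300)
experience = np.maximum(age - 22 - rng.uniform(0, 5, size=300), 0)
salary = 30000 + 1500 * age + 4000 * experience + rng.normal(0, 10000, size=300)
X = np.column_stack([age, experience])

# 25 shuffled folds: each fold holds out 12 of the 300 samples.
cv = KFold(n_splits=25, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, salary, cv=cv, scoring="r2")
print(f"Mean R² over {len(scores)} folds: {scores.mean():.4f}")
```

With many folds, each validation set is small, so individual fold scores vary more; the mean across folds is the figure to report.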

Model Coefficients

The trained model learned the following relationships:

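Age Coefficient: 2657.46 (estimated change in salary per additional year of age, holding experience constant)
Years of Experience Coefficient: 4008.06 (estimated change in salary per additional year of experience, holding age constant)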
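As a worked example of reading these coefficients: the README does not report the intercept, so only differences between predictions can be computed. The function name below is hypothetical.

```python
# Coefficients reported in this README.
AGE_COEF = 2657.46
EXPERIENCE_COEF = 4008.06

def predicted_salary_change(delta_age: float, delta_experience: float) -> float:
    """Change in predicted salary for given changes in age and experience."""
    return AGE_COEF * delta_age + EXPERIENCE_COEF * delta_experience

# One more year of age together with one more year of experience.
change = predicted_salary_change(1, 1)
print(f"{change:.2f}")  # 6665.52
```

Because age and experience usually increase together, the combined figure is often the more meaningful one to interpret.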

Results

Performance Metrics:

Cross-Validation Score: 84.67%
Training Score: 86.24%
Test Score (R² Score): 88.85%
Mean Squared Error (MSE): 267,299,022.86
Root Mean Squared Error (RMSE): 16,349.28
Mean Absolute Error (MAE): 12,358.46
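The metrics above can be computed with `sklearn.metrics`, as in this sketch on made-up values. The last lines check that the reported RMSE is the square root of the reported MSE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy actual and predicted salaries (not the project's results).
y_true = np.array([50000.0, 60000.0, 80000.0, 100000.0])
y_pred = np.array([52000.0, 58000.0, 85000.0, 95000.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
rmse = np.sqrt(mse)                        # same units as salary
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors
r2 = r2_score(y_true, y_pred)              # share of variance explained
print(f"MSE={mse:.2f} RMSE={rmse:.2f} MAE={mae:.2f} R2={r2:.4f}")

# Consistency check on the figures reported in this README.
reported_rmse = np.sqrt(267_299_022.86)
print(f"{reported_rmse:.2f}")  # 16349.28
```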

The model reaches an R² of 88.85% on the test set, meaning that age and years of experience explain 88.85% of the variance in salary on the held-out data. The RMSE of 16,349.28 gives the typical size of a prediction error in salary units.

Visualization

Actual vs Predicted Scatter Plot: Visualizes model predictions against actual salaries with a 45-degree reference line showing perfect prediction.
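An actual-vs-predicted plot of this kind can be sketched as follows. The project lists Seaborn for visualization; this example uses plain matplotlib to stay self-contained, and the salary values are made up.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Toy actual and predicted salaries (not the project's results).
y_test = np.array([45000.0, 60000.0, 82000.0, 110000.0, 150000.0])
y_pred = np.array([50000.0, 57000.0, 90000.0, 105000.0, 140000.0])

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(y_test, y_pred, alpha=0.7, label="Predictions")

# 45-degree reference line: points on it are predicted perfectly.
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
ax.plot(lims, lims, linestyle="--", color="red", label="Perfect prediction")

ax.set_xlabel("Actual salary")
ax.set_ylabel("Predicted salary")
ax.set_title("Actual vs Predicted")
ax.legend()
```

In a notebook, dropping the `matplotlib.use("Agg")` line and ending the cell with `plt.show()` displays the figure inline. Points above the line are over-predictions and points below it are under-predictions.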

Key Techniques Demonstrated

✅ Data Cleaning (Missing Value Handling)
✅ Feature Selection
✅ Train-Test Split
✅ K-Fold Cross-Validation (25 folds)
✅ Linear Regression Model Training
✅ Multiple Evaluation Metrics (R², MSE, RMSE, MAE)
✅ Result Visualization with Seaborn

🛠️ Technologies Used

Python 3.13.0
pandas: Data manipulation and analysis
numpy: Numerical computations
scikit-learn: Machine learning algorithms and tools
seaborn: Statistical data visualization
matplotlib: Plotting and visualization
statsmodels: Statistical modeling (VIF calculation)
joblib: Model serialization
Jupyter Notebook: Interactive development environment

📦 Installation

Clone the repository:

git clone https://github.com/AbdulHadi17/Regression-Projects-ML.git
cd Regression-Projects-ML
