This repository contains two machine learning projects built with scikit-learn: one demonstrates Logistic Regression (classification) and the other Linear Regression (regression).
- Project 1: Loan Approval Classification (Logistic Regression)
- Project 2: Salary Prediction (Linear Regression)
- Technologies Used
- Installation
- Usage
A binary classification project that predicts whether a loan application will be approved or rejected based on applicant information and loan characteristics.
- Source: Kaggle - Loan Approval Classification Data
- Size: 45,000 records with 14 features
- Target Variable: loan_status (1 = Approved, 0 = Rejected)
- Demographic Information:
  - person_age: Age of the applicant
  - person_gender: Gender (male/female)
  - person_education: Education level (High School, Associate, Bachelor, Master, Doctorate)
  - person_income: Annual income
  - person_emp_exp: Employment experience in years
  - person_home_ownership: Home ownership status (RENT, OWN, MORTGAGE, OTHER)
- Loan Information:
  - loan_amnt: Loan amount requested
  - loan_intent: Purpose of loan (PERSONAL, EDUCATION, MEDICAL, VENTURE, HOMEIMPROVEMENT, DEBTCONSOLIDATION)
  - loan_int_rate: Interest rate
  - loan_percent_income: Loan amount as percentage of income
- Credit Information:
  - cb_person_cred_hist_length: Credit history length
  - credit_score: Credit score
  - previous_loan_defaults_on_file: Previous loan defaults (Yes/No)
Predict whether a loan application will be approved based on the applicant's demographic information, loan characteristics, and credit history. This is a binary classification problem using Logistic Regression.
- Data Quality Check: No missing values found in the dataset
- Data Distribution Analysis:
  - Visualized categorical variable distributions using bar plots with percentages
  - Analyzed demographic patterns across gender, education, home ownership, and loan intent
- Outlier Detection: Identified outliers in numerical features using boxplots and the IQR method:
  - person_age: 4.86% outliers
  - person_income: 4.93% outliers
  - person_emp_exp: 3.83% outliers
  - loan_amnt: 5.22% outliers
  - loan_int_rate: 0.28% outliers
  - loan_percent_income: 1.65% outliers
  - cb_person_cred_hist_length: 3.04% outliers
  - credit_score: 1.04% outliers

Applied a capping method (winsorization) to handle outliers in:
- loan_amnt
- person_income
- loan_percent_income

Outliers were capped at the upper and lower bounds calculated using the IQR (Interquartile Range) method.
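The capping step described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the helper names `cap_outliers_iqr` and `outlier_share` are hypothetical, the toy values are made up, and the 1.5 multiplier is the conventional IQR fence, which the README does not state explicitly.

```python
import pandas as pd


def outlier_share(series: pd.Series, k: float = 1.5) -> float:
    """Percentage of values lying outside the IQR fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    outside = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return float(outside.mean() * 100)


def cap_outliers_iqr(df: pd.DataFrame, columns: list[str], k: float = 1.5) -> pd.DataFrame:
    """Return a copy of df with each listed column clipped to its IQR fences."""
    capped = df.copy()
    for col in columns:
        q1, q3 = capped[col].quantile(0.25), capped[col].quantile(0.75)
        iqr = q3 - q1
        # Values beyond the fences are replaced by the fence value, not dropped.
        capped[col] = capped[col].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
    return capped


# Toy data for illustration only (not taken from the Kaggle dataset).
toy = pd.DataFrame({"loan_amnt": [1000.0, 2000.0, 3000.0, 4000.0, 100000.0]})
capped = cap_outliers_iqr(toy, ["loan_amnt"])
print(outlier_share(toy["loan_amnt"]), capped["loan_amnt"].max())
```

Capping keeps every row in the dataset, which is one reason it may be preferred over deleting outlier rows when several columns each flag a few percent of records.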
- Binary Encoding:
  - person_gender: female = 0, male = 1
  - previous_loan_defaults_on_file: No = 0, Yes = 1
- Ordinal Encoding:
  - person_education:
    - High School = 0
    - Associate = 1
    - Bachelor = 2
    - Master = 3
    - Doctorate = 4
- One-Hot Encoding:
  - person_home_ownership: Created dummy variables (OTHER, OWN, RENT)
  - loan_intent: Created dummy variables (EDUCATION, HOMEIMPROVEMENT, MEDICAL, PERSONAL, VENTURE)
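The three encoding schemes above could be applied with pandas roughly as shown below. The function name `encode_loan_features` and the toy rows are hypothetical. The use of `drop_first=True` is an assumption inferred from the listed dummy columns, which omit the alphabetically first category of each feature (MORTGAGE and DEBTCONSOLIDATION); the notebook may achieve this differently.

```python
import pandas as pd

# Ordinal mapping taken from the README's education ordering.
EDUCATION_ORDER = {"High School": 0, "Associate": 1, "Bachelor": 2, "Master": 3, "Doctorate": 4}


def encode_loan_features(df: pd.DataFrame) -> pd.DataFrame:
    """Apply binary, ordinal, and one-hot encoding to the categorical columns."""
    out = df.copy()
    # Binary encoding for two-level categories.
    out["person_gender"] = out["person_gender"].map({"female": 0, "male": 1})
    out["previous_loan_defaults_on_file"] = out["previous_loan_defaults_on_file"].map({"No": 0, "Yes": 1})
    # Ordinal encoding preserves the natural order of education levels.
    out["person_education"] = out["person_education"].map(EDUCATION_ORDER)
    # One-hot encoding; drop_first removes one reference level per feature (assumption).
    return pd.get_dummies(out, columns=["person_home_ownership", "loan_intent"], drop_first=True, dtype=int)


# Made-up rows covering every category, for illustration only.
toy = pd.DataFrame({
    "person_gender": ["male", "female", "female", "male", "male", "female"],
    "person_education": ["Master", "High School", "Bachelor", "Associate", "Doctorate", "Bachelor"],
    "person_home_ownership": ["RENT", "OWN", "MORTGAGE", "OTHER", "RENT", "MORTGAGE"],
    "loan_intent": ["PERSONAL", "EDUCATION", "MEDICAL", "VENTURE", "HOMEIMPROVEMENT", "DEBTCONSOLIDATION"],
    "previous_loan_defaults_on_file": ["No", "Yes", "No", "No", "Yes", "No"],
})
encoded = encode_loan_features(toy)
print(encoded.columns.tolist())
```

Dropping one level per one-hot feature avoids perfectly collinear dummy columns, which matters for a linear model such as Logistic Regression.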
Applied StandardScaler to standardize numerical features (zero mean, unit variance) so that they are on a comparable scale for model training.
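A short sketch of the scaling step, using made-up numbers. Fitting the scaler on the training rows only and reusing it for the test rows is an assumption about good practice here; the README does not say how the notebook orders scaling and splitting.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative values only: columns stand in for two numerical features.
X_train = np.array([[25.0, 40000.0], [35.0, 60000.0], [45.0, 80000.0]])
X_test = np.array([[30.0, 50000.0]])

scaler = StandardScaler()
# fit_transform learns the per-column mean and standard deviation from the training rows.
X_train_scaled = scaler.fit_transform(X_train)
# transform reuses those training statistics, so no test information leaks into them.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```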
Calculated Variance Inflation Factor (VIF) to detect multicollinearity among features.
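For reference, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. The sketch below computes it directly from this definition with scikit-learn rather than through the statsmodels helper the project lists, so it is an illustration of the idea and not the notebook's code. The function name `compute_vif` and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def compute_vif(features: pd.DataFrame) -> pd.Series:
    """VIF per column: 1 / (1 - R^2) of that column regressed on the remaining columns."""
    vifs = {}
    for col in features.columns:
        others = features.drop(columns=[col])
        r2 = LinearRegression().fit(others, features[col]).score(others, features[col])
        vifs[col] = 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF")


# Synthetic data: experience is built to track age closely, credit score is independent.
rng = np.random.default_rng(0)
age = rng.normal(35, 8, 500)
toy = pd.DataFrame({
    "person_age": age,
    "person_emp_exp": age - 22 + rng.normal(0, 1, 500),
    "credit_score": rng.normal(650, 50, 500),
})
vif = compute_vif(toy)
print(vif.round(2))
```

A VIF near 1 suggests a feature is largely independent of the others, while values above roughly 5 to 10 are commonly treated as a sign of problematic multicollinearity. Those cutoffs are rules of thumb, not thresholds stated in this README.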
- Algorithm: Logistic Regression
- Cross-Validation: K-Fold Cross-Validation to ensure model generalization
- Evaluation Metrics: Classification report including precision, recall, F1-score, and accuracy
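The training and evaluation flow listed above might look like the following sketch. It runs on synthetic data from `make_classification`, and the fold count, split ratio, `max_iter`, and class names passed to `classification_report` are all assumptions chosen for illustration, since the README does not specify them for this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded loan features (not the Kaggle data).
X, y = make_classification(n_samples=1000, n_features=8, n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaling lives inside the pipeline so each CV fold is scaled with its own training statistics.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# K-Fold cross-validation on the training portion (5 folds is an assumed value).
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_train, y_train, cv=kfold, scoring="accuracy")
print("Mean CV accuracy:", cv_scores.mean())

# Final fit and held-out evaluation.
model.fit(X_train, y_train)
report = classification_report(y_test, model.predict(X_test), target_names=["Rejected", "Approved"])
print(report)
```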
Model performance is reported through classification_report (precision, recall, F1-score, and accuracy), and a confusion matrix was analyzed to see which kinds of classification errors the model makes.
- Exploratory Data Analysis (EDA)
- Outlier Detection and Treatment
- Multiple Encoding Techniques (Binary, Ordinal, One-Hot)
- Feature Scaling (StandardScaler)
- Multicollinearity Analysis (VIF)
- K-Fold Cross-Validation
- Model Evaluation with Classification Metrics
A regression project that predicts employee salary based on age and years of experience using Linear Regression.
- Source: Kaggle - Salary Prediction for Beginner
- Features:
  - Age: Age of the employee
  - Gender: Gender of the employee
  - Education Level: Highest education level
  - Job Title: Job position
  - Years of Experience: Total work experience
  - Salary: Target variable (annual salary)
Predict employee salary based on Age and Years of Experience. This is a continuous regression problem using Linear Regression.
- Missing Value Handling: Removed all rows with missing values
- Feature Selection: Selected Age and Years of Experience as predictor variables
- Train-Test Split: 80% training, 20% testing (random_state=42)
- Algorithm: Linear Regression
- Cross-Validation: 25-fold K-Fold Cross-Validation with shuffling
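The preprocessing and modeling steps above can be sketched end to end as follows. The data is synthetic and made up for illustration, so the printed coefficients will not match the project's results. Running cross-validation on the training split and seeding `KFold` with `random_state=42` are assumptions; the README only states 25 folds with shuffling.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in for the salary dataset (column names follow the README).
rng = np.random.default_rng(42)
n = 400
years = rng.uniform(0, 25, n)
age = 22 + years + rng.normal(0, 2, n)
salary = 30000 + 1000 * age + 4000 * years + rng.normal(0, 10000, n)
df = pd.DataFrame({"Age": age, "Years of Experience": years, "Salary": salary})
df.loc[::50, "Salary"] = np.nan  # inject a few missing values to demonstrate cleaning

# Missing value handling: drop every row that contains a missing value.
df = df.dropna()

# Feature selection and 80/20 split with the seed given in the README.
X = df[["Age", "Years of Experience"]]
y = df["Salary"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 25-fold cross-validation with shuffling; R^2 is the scoring metric here.
model = LinearRegression()
kfold = KFold(n_splits=25, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_train, y_train, cv=kfold, scoring="r2")

model.fit(X_train, y_train)
print("Mean CV R^2:", cv_scores.mean())
print("Coefficients:", dict(zip(X.columns, model.coef_)))
print("Test R^2:", model.score(X_test, y_test))
```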
The trained model learned the following relationships:
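- Age Coefficient: 2657.46 (predicted change in salary per additional year of age, holding experience constant)
- Years of Experience Coefficient: 4008.06 (predicted change in salary per additional year of experience, holding age constant)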
- Cross-Validation Score: 84.67%
- Training Score: 86.24%
- Test Score (R² Score): 88.85%
- Mean Squared Error (MSE): 267,299,022.86
- Root Mean Squared Error (RMSE): 16,349.28
- Mean Absolute Error (MAE): 12,358.46
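The error metrics listed above are related in a simple way: RMSE is the square root of MSE, and both RMSE and MAE are expressed in salary units. The sketch below computes all four metrics on a tiny made-up example and then checks that the reported MSE and RMSE are consistent with each other. The arrays are illustrative and not drawn from the project's test set.

```python
import math

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted salaries, for illustration only.
y_true = np.array([50000.0, 65000.0, 80000.0, 120000.0])
y_pred = np.array([55000.0, 60000.0, 90000.0, 110000.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
rmse = math.sqrt(mse)                      # back in salary units
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors
r2 = r2_score(y_true, y_pred)              # share of variance explained
print(mse, rmse, mae, r2)

# Consistency check on the figures reported in this README.
reported_mse = 267_299_022.86
reported_rmse = 16_349.28
rmse_consistent = math.isclose(math.sqrt(reported_mse), reported_rmse, abs_tol=0.01)
print("Reported RMSE matches sqrt(MSE):", rmse_consistent)
```

Because squaring weights large errors more heavily, RMSE is never smaller than MAE; the gap between the reported 16,349.28 and 12,358.46 suggests that some predictions miss by considerably more than the typical error.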
The model achieves an R² of 88.85% on the test set, meaning that 88.85% of the variance in salary is explained by age and years of experience. The RMSE of 16,349.28 gives the typical size of a prediction error in salary units.
- Actual vs Predicted Scatter Plot: Visualizes model predictions against actual salaries with a 45-degree reference line showing perfect prediction.
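A plot of this kind could be produced roughly as follows. This sketch uses plain matplotlib with the non-interactive Agg backend rather than Seaborn, and the salary values are made up, so it illustrates the layout and not the project's actual figure.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Made-up actual and predicted salaries, for illustration only.
y_test = np.array([45000.0, 60000.0, 75000.0, 90000.0, 130000.0])
y_pred = np.array([50000.0, 58000.0, 80000.0, 85000.0, 120000.0])

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(y_test, y_pred, alpha=0.7, label="Predictions")

# 45-degree reference line: points on it are predicted exactly.
low = min(y_test.min(), y_pred.min())
high = max(y_test.max(), y_pred.max())
ax.plot([low, high], [low, high], linestyle="--", color="red", label="Perfect prediction")

ax.set_xlabel("Actual salary")
ax.set_ylabel("Predicted salary")
ax.set_title("Actual vs Predicted")
ax.legend()
# Call fig.savefig("actual_vs_predicted.png") or plt.show() to render the figure.
```

Points above the reference line are over-predictions and points below it are under-predictions, so a systematic tilt away from the line would indicate bias at one end of the salary range.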
- Data Cleaning (Missing Value Handling)
- Feature Selection
- Train-Test Split
- K-Fold Cross-Validation (25 folds)
- Linear Regression Model Training
- Multiple Evaluation Metrics (R², MSE, RMSE, MAE)
- Result Visualization with Seaborn
- Python 3.13.0
- pandas: Data manipulation and analysis
- numpy: Numerical computations
- scikit-learn: Machine learning algorithms and tools
- seaborn: Statistical data visualization
- matplotlib: Plotting and visualization
- statsmodels: Statistical modeling (VIF calculation)
- joblib: Model serialization
- Jupyter Notebook: Interactive development environment
- Clone the repository:
git clone https://github.com/AbdulHadi17/Regression-Projects-ML.git
cd Regression-Projects-ML