Predict startup profitability using Multiple Linear Regression with Backward Elimination — implemented in both Python and R.
Models how R&D Spend, Administration, Marketing Spend, and State influence a startup's Profit using the 50 Startups dataset.
- Loads and preprocesses data (one-hot encodes the categorical `State` column)
- Splits into 80/20 train/test sets
- Fits a Multiple Linear Regression model and reports R² / RMSE
- Performs Backward Elimination (statsmodels OLS) to find the optimal predictor subset
50_Startups.csv — 50 records with columns:
| Column | Description |
|---|---|
| R&D Spend | Research & development expenditure |
| Administration | Administrative costs |
| Marketing Spend | Marketing expenditure |
| State | New York, California, or Florida |
| Profit | Target variable |
| | Language | Libraries |
|---|---|---|
| 🐍 | Python 3.10+ | numpy · pandas · matplotlib · scikit-learn · statsmodels |
| 📊 | R | caTools |

    pip install numpy pandas matplotlib scikit-learn statsmodels
    python multiple_linear_regression.py

    install.packages("caTools") # first time only
    source("multiple_linear_regression.R")

Both scripts expect `50_Startups.csv` in the same directory.

- No cross-validation or hyperparameter tuning (simple demonstration project).
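- The R script encodes `State` as numeric factor levels (1, 2, 3), which may appear to imply ordinality. This is acceptable for `lm()` because factor-typed columns are expanded into dummy variables rather than treated as numbers, but it is worth noting.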
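
To make the ordinality concern concrete, here is a small Python sketch (the four-row sample is hypothetical, and this is not the R script's own code) contrasting integer codes with one-hot columns for the same `State` values.

```python
import pandas as pd

# Hypothetical sample of State values.
states = pd.Series(["New York", "California", "Florida", "California"],
                   name="State")

# Integer codes: fed to a linear model as a plain number, this would place
# Florida (1) "between" California (0) and New York (2).
codes = states.astype("category").cat.codes
print(codes.tolist())  # [2, 0, 1, 0]

# One-hot (dummy) columns: each state gets its own coefficient,
# so no ordering is implied. California is the dropped baseline.
dummies = pd.get_dummies(states, prefix="State", drop_first=True, dtype=int)
print(dummies)
```

A model fitted on `codes` as a numeric column estimates a single slope across the three states, whereas the dummy columns let each state shift the intercept independently.
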
MIT © 2018 Kaustabh Ganguly