This project focuses on predicting a student's math score using factors such as reading score, writing score, gender, parental education, lunch type, and test preparation status.
The goal is to showcase a realistic, production-ready machine learning workflow rather than a simple notebook-based experiment.
- Streamlit Cloud: https://student-score-prediction-ml-zfcmhmfohejlxjlfq5kq8y.streamlit.app/
- Hugging Face Spaces: https://huggingface.co/spaces/Alamin-refat/student-score-prediction
- Lasso Regression: Implements L1 Regularization to enhance model generalization and prevent overfitting.
- Predictive Analytics: Predicts student math scores from academic and demographic inputs to provide data-driven academic insights.
- Data Visualization: Includes EDA with scatter plots and regression lines for clear data insights.
- Performance Metrics: Evaluated using Mean Absolute Error (MAE), RMSE, and R² to quantify prediction error.
- Streamlined Pipeline: Efficient workflow covering data preprocessing, model training, and testing.
The dataset contains student performance records with the following attributes:
- Gender
- Race/Ethnicity
- Parental level of education
- Lunch type
- Test preparation course
- Reading score
- Writing score
- Math score (target variable)
Note: EDA was performed on all score variables, and Math score was selected as the final prediction target to ensure a leakage-free modeling pipeline.
Key EDA steps included:
- Distribution analysis of student scores
- Outlier detection using box plots
- Numerical feature correlation analysis
- Categorical feature impact analysis
- Correlation heatmaps to identify relationships between scores
Insights from EDA guided feature selection and modeling decisions.
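The correlation and categorical-impact steps above can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's actual code; the column names, value labels, and generated numbers are all assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the student dataset; column names are assumptions.
rng = np.random.default_rng(0)
n = 200
reading = rng.normal(70, 12, n)
scores = pd.DataFrame({
    "reading score": reading,
    "writing score": reading + rng.normal(0, 4, n),  # closely tracks reading
    "math score": 0.8 * reading + rng.normal(10, 7, n),
    "lunch": rng.choice(["standard", "free/reduced"], n),
})

score_cols = ["math score", "reading score", "writing score"]

# Pairwise Pearson correlations between the numeric score columns.
corr = scores[score_cols].corr()

# Categorical impact: mean math score per lunch group.
math_by_lunch = scores.groupby("lunch")["math score"].mean()

print(corr.round(2))
print(math_by_lunch.round(1))
```

Passing `corr` to `seaborn.heatmap` would render the correlation heatmap mentioned in the list above.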
The project follows a six-stage machine learning pipeline, from data acquisition through deployment:
| Stage | Process | Key Tools |
|---|---|---|
| 01. Data Acquisition | Importing & cleaning student datasets | Pandas |
| 02. EDA | Distribution analysis & outlier detection | Seaborn, Matplotlib |
| 03. Preprocessing | Categorical encoding & Train-Test Split | Scikit-Learn |
| 04. Modeling | Lasso Regression with L1 Regularization | Scikit-Learn |
| 05. Evaluation | Performance tracking (MAE, R²) | NumPy, sklearn.metrics |
| 06. Deployment | Containerization & Cloud hosting | Docker, Streamlit |
- Data Acquisition: Integrated student performance datasets using `Pandas` for structured data handling and cleaning.
- Exploratory Data Analysis (EDA): Leveraged `Matplotlib` and `Seaborn` to analyze score distributions, detect outliers via box plots, and identify linear relationships between reading, writing, and math scores.
- Data Preprocessing:
  - Handled categorical encoding for features like Gender, Parental Education, and Lunch type.
  - Performed Train-Test Split (80/20) to ensure unbiased model validation and prevent data leakage.
- Model Training: Trained a Lasso Regression model. We utilized L1 Regularization to shrink less important coefficients, which helps in automatic feature selection and prevents overfitting.
- Model Evaluation:
  - Verified performance using MAE (~4.21) and R² Score (~0.88).
  - Conducted a comparative analysis against Ridge and Linear regression to ensure Lasso provided the most generalized fit.
- Deployment: Developed an interactive dashboard using Streamlit and containerized the application via Docker for seamless deployment on Hugging Face Spaces and Streamlit Cloud.
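The preprocessing, split, and training stages can be sketched end to end with scikit-learn. This is a hedged illustration, not the notebook's code: the data is synthetic, the column names and category labels are assumptions, one-hot encoding is one possible choice of categorical encoding, and `alpha=0.1` is an untuned placeholder.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data; column names and labels are assumptions.
rng = np.random.default_rng(42)
n = 200
reading = rng.normal(70, 12, n)
writing = reading + rng.normal(0, 4, n)
df = pd.DataFrame({
    "gender": rng.choice(["female", "male"], n),
    "lunch": rng.choice(["standard", "free/reduced"], n),
    "test preparation course": rng.choice(["none", "completed"], n),
    "reading score": reading,
    "writing score": writing,
})
df["math score"] = 0.5 * reading + 0.4 * writing + rng.normal(5, 5, n)

categorical = ["gender", "lunch", "test preparation course"]
numeric = ["reading score", "writing score"]
X = df[categorical + numeric]
y = df["math score"]

# 80/20 split made before any fitting, so the encoder never sees test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", "passthrough", numeric),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("lasso", Lasso(alpha=0.1)),  # illustrative alpha, not a tuned value
])
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("MAE:", round(mean_absolute_error(y_test, pred), 2))
print("R2:", round(r2_score(y_test, pred), 2))
```

Wrapping the encoder and the estimator in a single `Pipeline` means the same fitted preprocessing is applied at prediction time, which is one way to keep the training and serving paths consistent.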
| Metric | Value |
|---|---|
| MAE | ~4.21 |
| RMSE | ~5.39 |
| R² Score | ~0.88 |
The model explains approximately 88% of the variance in student math scores.
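For reference, the three metrics in the table can be computed from their definitions as below. The function name `regression_metrics` and the three sample scores are hypothetical and only illustrate the arithmetic; they are not the project's data.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R² for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mae = np.mean(np.abs(residuals))          # average absolute error
    rmse = np.sqrt(np.mean(residuals ** 2))   # penalizes large errors more
    ss_res = np.sum(residuals ** 2)           # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}


# Hypothetical true and predicted math scores.
metrics = regression_metrics([70, 80, 90], [72, 78, 93])
print(metrics)
```

In this toy case the squared residuals sum to 17 and the total variation is 200, so R² = 1 - 17/200 = 0.915. An R² of about 0.88 is read the same way: roughly 12% of the variance is left unexplained.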
The following regression models were trained and evaluated:
- Linear Regression
- Ridge Regression
- Lasso Regression
- Random Forest Regressor
Lasso Regression was selected due to:
- Strong generalization performance
- Built-in regularization
- Automatic feature selection
- High interpretability of coefficients
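A comparison of the four listed models might look like the sketch below. It assumes 5-fold cross-validated MAE as the selection criterion, which the source does not specify, and it runs on synthetic numeric features with untuned hyperparameters.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in features (reading, writing) and target (math).
rng = np.random.default_rng(0)
n = 200
reading = rng.normal(70, 12, n)
writing = reading + rng.normal(0, 4, n)
X = np.column_stack([reading, writing])
y = 0.5 * reading + 0.4 * writing + rng.normal(5, 5, n)

# Hyperparameters are illustrative defaults, not tuned values.
candidates = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Lasso Regression": Lasso(alpha=0.1),
    "Random Forest Regressor": RandomForestRegressor(
        n_estimators=50, random_state=0
    ),
}

cv_mae = {}
for name, estimator in candidates.items():
    # scikit-learn reports negated MAE so that higher is always better.
    scores = cross_val_score(
        estimator, X, y, cv=5, scoring="neg_mean_absolute_error"
    )
    cv_mae[name] = -scores.mean()

for name, mae in sorted(cv_mae.items(), key=lambda item: item[1]):
    print(f"{name}: {mae:.2f}")
```

On real data the linear models often score within noise of each other, in which case the interpretability and sparsity reasons listed above become the deciding factors.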
Model coefficients were analyzed to interpret feature impact.
Key observations:
- Writing score and reading score are the strongest predictors
- Gender and lunch type have noticeable influence
- Parental education level contributes moderately
- The model remains interpretable and explainable
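One way to inspect coefficients like this is sketched below, under stated assumptions: the data is synthetic, the `unrelated feature` column is a hypothetical noise variable added to show L1 sparsity, and features are standardized so that coefficient magnitudes are comparable (the source does not say whether scaling was applied).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 300
X = pd.DataFrame({
    "reading score": rng.normal(70, 12, n),
    "writing score": rng.normal(68, 13, n),
    "unrelated feature": rng.normal(0, 1, n),  # hypothetical noise column
})
# The target depends only on the two score columns.
y = 0.6 * X["reading score"] + 0.3 * X["writing score"] + rng.normal(0, 2, n)

# Standardize so each coefficient is "effect per one standard deviation".
X_scaled = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=1.0).fit(X_scaled, y)  # illustrative alpha

coefficients = pd.Series(lasso.coef_, index=X.columns)
order = coefficients.abs().sort_values(ascending=False).index
ranked = coefficients.reindex(order)
print(ranked.round(3))
```

The L1 penalty drives the coefficient of the noise column to exactly zero, which is the automatic feature selection behaviour referred to above.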
- Python
- Pandas, NumPy
- Scikit-learn
- Matplotlib, Seaborn
- Streamlit
- Docker (for Hugging Face deployment)
A detailed look at the repository's organization:
Student-Score-Prediction-ML/
├── .devcontainer/ # Development container configuration
├── .ipynb_checkpoints/ # Jupyter notebook checkpoints
├── Data/ # Dataset directory
├── assets/ # Images/GIFs for README
├── README.md # Main project documentation
├── Dockerfile # Docker configuration for deployment
├── app.py # Streamlit web application
├── requirements.txt # Python dependencies
├── LICENSE # MIT License file
├── student_score_model.pkl # Trained model (pickle file)
└── student_score_prediction.ipynb # Main Jupyter notebook with full ML pipeline
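The tree lists a pickled model next to `app.py`. As a rough sketch of how such a file can be written and read back, the example below round-trips a small Lasso model through `pickle`. The training data is synthetic and a temporary directory is used, so nothing here touches the real `student_score_model.pkl`; how `app.py` actually loads the model is not shown in this README.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import Lasso

# Train a small stand-in model on synthetic reading/writing scores.
rng = np.random.default_rng(0)
X = rng.normal(70, 12, size=(100, 2))
y = 0.5 * X[:, 0] + 0.4 * X[:, 1] + rng.normal(0, 3, 100)
model = Lasso(alpha=0.1).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "student_score_model.pkl"

    # Serialize the fitted estimator to disk.
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Load it back, as a serving script would at startup.
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

sample = np.array([[72.0, 74.0]])  # hypothetical reading and writing scores
print(restored.predict(sample))
```

A pickle is tied to the library versions that produced it, so pinning scikit-learn in `requirements.txt` helps keep the local and deployed environments compatible. Only unpickle files from a trusted source.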
Follow these steps to set up the project locally on your machine:
Open your terminal or command prompt and run:
    git clone https://github.com/Alamin-refat/Student-Score-Prediction-ML.git
    cd Student-Score-Prediction-ML

Create and activate a virtual environment, which keeps the project dependencies isolated and prevents conflicts:

    # For Windows
    python -m venv venv
    venv\Scripts\activate

    # For Mac/Linux
    python3 -m venv venv
    source venv/bin/activate

Install all necessary dependencies listed in the requirements file to ensure the environment is ready:

    pip install -r requirements.txt

Launch the web application locally to interact with the Student Math Score prediction model:

    streamlit run app.py

The application is deployed on two platforms:
- Streamlit Cloud
  - Direct deployment using Streamlit
  - Public live demo for real-time predictions
- Hugging Face Spaces
  - Dockerized Streamlit application
  - Demonstrates cross-platform deployment capability
During deployment, several real-world challenges were addressed:
- Python and library version mismatches
- Dependency installation failures
- Windows-to-Linux environment differences
This project is in its initial phase, and I plan to scale it with the following enhancements:
- Advanced Algorithms: Extend the current comparison (Linear, Ridge, Lasso, Random Forest) with ElasticNet and additional ensemble methods.
- AutoML Integration: Incorporate PyCaret or Auto-Sklearn to automate the model selection and hyperparameter tuning process.
- Feature Engineering: Add more variables such as Previous Grades, Attendance Percentage, and Sleep Hours to increase prediction precision.
- Modern Frontend: Move from the current Streamlit interface to a reactive framework like React.js for a smoother user experience.
- Database Integration: Implement SQLite or PostgreSQL to store user inputs and predicted results for historical tracking.
- Cloud Native: Deploy the existing Docker image on AWS/Azure using a robust CI/CD pipeline.
- Live Dashboard: Add an interactive dashboard using Plotly or Dash to visualize student progress and trends.
- Drift Detection: Implement model monitoring to detect "Data Drift" and trigger retraining when student performance patterns change.
This project is licensed under the MIT License - see the LICENSE file for details.
- Commercial use: You can use this software for commercial purposes.
- Modification: You can modify the code.
- Distribution: You can distribute the code to others.
- Private use: You can use it privately for your own projects.
If you have any questions, feedback, or would like to discuss potential collaborations, feel free to reach out!
Alamin Refat, Aspiring Data Scientist & Machine Learning Enthusiast
