A comprehensive machine learning project for analyzing and predicting traffic patterns at urban junctions using time series data and XGBoost regression models.
- Overview
- Features
- Dataset
- Project Structure
- Installation
- Usage
- Model Performance
- Web Application
- Results and Insights
- Future Improvements
- Contributing
- License
This project aims to enhance urban mobility and planning through comprehensive traffic data analysis and prediction. By analyzing hourly vehicle counts from multiple junctions, we provide insights into traffic behaviors, peak hours, seasonal patterns, and junction-specific differences.
Key Highlight: This is a comparative study of multiple statistical, machine learning, and deep learning approaches — SARIMA, XGBoost, Random Forest, Prophet, LSTM, and GRU — to determine the most effective approach for traffic prediction.
Learning Journey Note: This represents my first comprehensive exploration into time series analysis and forecasting. As with any learning project, there may be areas for improvement in methodology or implementation. I welcome feedback and suggestions from the community to enhance the analysis and learn best practices in time series modeling.
- Analyze Traffic Patterns: Identify hourly, daily, and monthly variations in traffic volume
- Peak Period Detection: Pinpoint congestion hours and compare weekday vs weekend patterns
- Junction Comparison: Investigate traffic differences among various junctions
- Temporal Trend Analysis: Examine seasonality and recurring patterns
- Anomaly Detection: Identify irregularities in traffic flows
- Exploratory Data Analysis (EDA): Comprehensive traffic pattern analysis
- Multiple ML Models: Comparative study of SARIMA, XGBoost, Random Forest, Prophet, LSTM, and GRU
- Interactive Visualizations: Real-time traffic data visualization
- Model Comparison: Performance evaluation across different algorithms
- Web Application: Streamlit-based user interface
- Responsive Design: Modern and intuitive UI
- Real-time Predictions: Live traffic volume forecasting
- Statistical Analysis: Time series decomposition and stationarity testing
- Data Preprocessing: Normalization, differencing, and stationarity checks
- Feature Engineering: Time-based feature extraction (hour, day, month, etc.)
- Residual Analysis: Model diagnostic plots and evaluation
- Correlation Analysis: Feature relationship exploration
The dataset contains hourly traffic data with the following structure:
- DateTime: Timestamp of traffic measurement
- Junction: Junction identifier (1-4)
- Vehicles: Number of vehicles counted
- ID: Unique record identifier
Data Source: Kaggle Traffic Prediction Dataset
Data Range: November 1, 2015 - June 30, 2017
Total Records: 48,000+ hourly observations
Junctions: 4 different urban junctions
Collection Method: Hourly vehicle counts from urban traffic junctions
```
Traffic-Prediction/
├── README.md                        # Project documentation
├── model.ipynb                      # Main analysis and modeling notebook
├── traffic.csv                      # Raw traffic dataset
├── Poster.pdf                       # Project poster presentation
├── Report.pdf                       # Detailed project report
└── app/                             # Web application
    ├── app.py                       # Main Streamlit application
    ├── appori.py                    # Alternative app version
    ├── XGBoost.ipynb                # XGBoost model development
    ├── traffic.csv                  # App dataset
    ├── analytics_icon.png           # App icon
    ├── traffic_prediction_model.pkl # Trained model (pickle)
    └── xgboost_model.pkl            # XGBoost model (pickle)
```
- Python 3.7+
- pip package manager
1. Clone the repository

   ```bash
   git clone https://github.com/KosolCHOU/traffic-prediction.git
   cd traffic-prediction
   ```

2. Install required packages

   ```bash
   pip install -r requirements.txt
   ```

   Or install manually:

   ```bash
   pip install streamlit pandas numpy xgboost scikit-learn matplotlib seaborn plotly statsmodels prophet jupyter
   ```

3. Verify installation

   ```bash
   python -c "import streamlit, pandas, xgboost, statsmodels, prophet; print('All packages installed successfully!')"
   ```
1. Main analysis notebook

   ```bash
   jupyter notebook model.ipynb
   ```

2. XGBoost model development

   ```bash
   jupyter notebook app/XGBoost.ipynb
   ```

3. Navigate to the app directory

   ```bash
   cd app
   ```

4. Launch the Streamlit app

   ```bash
   streamlit run app.py
   ```

5. Access the application
   - Open your browser and go to http://localhost:8501
   - Use the interactive interface to explore traffic predictions
```python
import pickle
import pandas as pd

# Load the trained model
with open('app/traffic_prediction_model.pkl', 'rb') as file:
    model = pickle.load(file)

# Make predictions
# (ensure your data has the same features as the training data)
predictions = model.predict(your_data)
```

This project implements and compares multiple machine learning approaches for traffic prediction:
- Type: Time series forecasting model
- Approach: Statistical modeling with seasonal components
- Parameters: Auto-tuned using grid search with AIC criterion
- Grid Search: p, q ∈ {0, 1}; P, Q ∈ {0, 1}; d = 0, D = 0, s = 24
- Seasonality: 24-hour (daily) patterns
- Stationarity: No differencing required (stationary data)
- Strengths: Captures seasonal trends and autocorrelation
- Type: Ensemble learning method
- Approach: Gradient boosting with feature engineering
- Parameters: Optimized using GridSearchCV (n_estimators: 100, 500, 1000; max_depth: 3, 5, 7; learning_rate: 0.01, 0.05, 0.1)
- Features: Time-based features (hour, day, month, year, etc.)
- Early Stopping: 50 rounds to prevent overfitting
- Strengths: High accuracy, handles non-linear patterns
- Type: Ensemble learning method
- Approach: Multiple decision trees with bagging
- Parameters: Grid search optimization (n_estimators: 100,500,1000; max_depth: 5,10,15)
- Features: Same time-based features as XGBoost
- Validation: 3-fold TimeSeriesSplit cross-validation
- Strengths: Robust to overfitting, feature importance insights
- Type: Time series forecasting tool
- Approach: Decomposable additive model
- Components: Trend, seasonality, holidays
- Features: Holiday effects, multiple seasonality patterns
- Holiday Integration: US Federal Holiday calendar support
- Strengths: Handles missing data, holiday effects, robust to outliers
- Type: Deep learning recurrent neural network
- Approach: Sequence-to-sequence learning with memory cells
- Architecture: Multi-layer LSTM with dropout regularization
- Framework: TensorFlow/Keras Sequential model
- Layers: LSTM layers with Dense output layer and Dropout
- Input: Sequential time windows for temporal pattern recognition
- Strengths: Captures long-term dependencies, temporal patterns
- Type: Deep learning recurrent neural network
- Approach: Simplified RNN architecture with gating mechanisms
- Architecture: Multi-layer GRU with batch normalization
- Framework: TensorFlow/Keras with Bidirectional GRU layers
- Layers: GRU, Bidirectional layers with Dense output
- Input: Sequential time windows with advanced feature engineering
- Strengths: Faster training than LSTM, good performance on sequences
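The two recurrent architectures can be outlined with Keras. The layer sizes and the 24-hour input window below are illustrative assumptions, not the project's exact configuration:

```python
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import LSTM, GRU, Dense, Dropout, Bidirectional

WINDOW = 24  # assumed: one day of hourly counts per input sequence

# Stacked LSTM with dropout regularization, as described above.
lstm_model = Sequential([
    Input(shape=(WINDOW, 1)),
    LSTM(64, return_sequences=True),
    Dropout(0.2),
    LSTM(32),
    Dense(1),
])

# Bidirectional GRU variant.
gru_model = Sequential([
    Input(shape=(WINDOW, 1)),
    Bidirectional(GRU(32)),
    Dense(1),
])

for model in (lstm_model, gru_model):
    model.compile(optimizer="adam", loss="mse")
```

Both models map a `(WINDOW, 1)` sequence to a single predicted vehicle count for the next hour.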
Comparative evaluation of all implemented models with actual performance metrics:
| Model | MAE | RMSE | R² Score | Performance Level |
|---|---|---|---|---|
| XGBoost | 3.38 | 4.79 | 0.9591 | Excellent |
| Random Forest | 4.01 | 5.62 | 0.9547 | Excellent |
| GRU | 4.01 | 5.62 | 0.9547 | Excellent |
| LSTM | 4.85 | 6.72 | 0.9191 | Very Good |
| Prophet | 12.46 | 14.54 | 0.6564 | Good |
| SARIMA | - | - | - | Implemented (metrics not reported) |
Gradient Boosting Excellence

- XGBoost: Best overall performance with MAE: 3.38, RMSE: 4.79, R²: 0.9591
- Highest R² score (95.91%), indicating excellent prediction accuracy
- Lowest error metrics, demonstrating superior capacity for traffic prediction
- Ensemble learning with feature importance insights

Strong Tree-Based Performance

- Random Forest: Excellent performance with MAE: 4.01, RMSE: 5.62, R²: 0.9547
- High R² score (95.47%), showing strong predictive capacity
- Robust ensemble method with minimal overfitting
- Reliable feature importance analysis
Neural Network Capabilities

- LSTM: Very good performance with MAE: 4.85, RMSE: 6.72, R²: 0.9191
  - Strong R² score (91.91%), demonstrating good temporal pattern recognition
  - Excellent capacity for capturing long-term dependencies
  - Advanced sequence-to-sequence learning
- GRU: Excellent performance with MAE: 4.01, RMSE: 5.62, R²: 0.9547
  - Outstanding R² score (95.47%), matching Random Forest performance
  - Efficient training with strong temporal pattern recognition
  - Simplified RNN architecture with effective gating mechanisms
Time Series Forecasting

- Prophet: Moderate performance with MAE: 12.46, RMSE: 14.54, R²: 0.6564
  - Solid R² score (65.64%) for trend and seasonality analysis
  - Handles holiday effects and missing data well
  - Strong interpretability for business insights
- SARIMA: Statistical time series analysis
  - Specialized capacity for seasonal pattern recognition
  - Strong foundation in time series statistical modeling
  - Excellent interpretability and seasonal decomposition
Prediction Accuracy Ranking:

1. XGBoost - Superior capacity (R²: 95.91%)
2. Random Forest - Excellent capacity (R²: 95.47%)
3. GRU - Excellent capacity (R²: 95.47%, tied with Random Forest)
4. LSTM - Very good capacity (R²: 91.91%)
5. Prophet - Moderate capacity (R²: 65.64%)
6. SARIMA - Statistical baseline

Error Performance:

- Lowest MAE: XGBoost (3.38), the best average error performance
- Lowest RMSE: XGBoost (4.79), the best overall prediction precision
- Highest variance explained: XGBoost (95.91%), superior model capacity
Based on actual performance results, here are recommended use cases:
| Use Case | Recommended Model | Performance Rationale |
|---|---|---|
| Production Deployment | XGBoost | Best overall accuracy (R²: 95.91%), lowest errors |
| Interpretable Predictions | Random Forest | Excellent performance (R²: 95.47%) with feature insights |
| Efficient Deep Learning | GRU | Excellent performance (R²: 95.47%) with faster training |
| Sequence Learning Research | LSTM | Strong deep learning performance (R²: 91.91%) |
| Trend & Seasonality Analysis | Prophet | Good interpretability (R²: 65.64%) with business insights |
| Statistical Foundation | SARIMA | Classical time series approach with theoretical grounding |
High-Performance Applications

- Best Choice: XGBoost (MAE: 3.38, RMSE: 4.79)
- Excellent Alternatives: Random Forest & GRU (MAE: 4.01, RMSE: 5.62)

Research & Development

- Deep Learning Excellence: GRU (MAE: 4.01, RMSE: 5.62, R²: 95.47%)
- Sequence Learning: LSTM (MAE: 4.85, RMSE: 6.72, R²: 91.91%)

Business Intelligence & Analysis

- Interpretable Results: Prophet (MAE: 12.46, RMSE: 14.54)
- Classical Analysis: SARIMA (statistical baseline)
- Comprehensive Comparison: All 6 models implemented and evaluated
- Diverse Approaches: Statistical, ensemble, and deep learning methods
- Time Series Focus: Specialized techniques for temporal data analysis
- Feature Engineering: Time-based features for enhanced model performance
- Hour of day (0-23)
- Day of week (0-6)
- Month (1-12)
- Year
- Day of year (1-365/366)
- Day of month (1-31)
- Week of year (1-52/53)
- Junction identifier (1-4)
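The feature list above boils down to pandas datetime accessors on the `DateTime` column. Column names follow the dataset description; the two sample rows below are made up for illustration:

```python
import pandas as pd

# Two hypothetical rows matching the dataset schema described earlier.
df = pd.DataFrame({
    "DateTime": pd.to_datetime(["2015-11-01 00:00", "2015-11-01 01:00"]),
    "Junction": [1, 1],
    "Vehicles": [15, 13],
})

# Extract the time-based features via the .dt accessor.
dt = df["DateTime"].dt
df = df.assign(
    hour=dt.hour,                                  # 0-23
    dayofweek=dt.dayofweek,                        # 0=Monday .. 6=Sunday
    month=dt.month,                                # 1-12
    year=dt.year,
    dayofyear=dt.dayofyear,                        # 1-365/366
    day=dt.day,                                    # 1-31
    weekofyear=dt.isocalendar().week.astype(int),  # 1-52/53 (ISO weeks)
)
print(df[["hour", "dayofweek", "month", "weekofyear"]])
```

These columns, plus the existing `Junction` identifier, form the model input matrix.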
- Cross-validation: Time series split validation
- Evaluation Metrics: RMSE, MAE, R² score
- Test Period: March 2017 - June 2017
- Training Period: November 2015 - February 2017
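The three evaluation metrics can be computed with scikit-learn; the toy arrays below are illustrative, but the reported table follows the same calculation:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative actual vs. predicted hourly vehicle counts.
y_true = np.array([20, 35, 50, 42, 30])
y_pred = np.array([22, 33, 48, 45, 29])

mae = mean_absolute_error(y_true, y_pred)            # mean absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # root mean squared error
r2 = r2_score(y_true, y_pred)                        # fraction of variance explained

print(f"MAE={mae:.2f} RMSE={rmse:.2f} R2={r2:.4f}")
```

RMSE penalizes large misses more than MAE, which is why the two metrics can rank models differently.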
- Augmented Dickey-Fuller (ADF) Test: Statistical test for stationarity
- Normalization: Z-score standardization for stable variance
- Differencing: Explored at different intervals (1 hour, 24 hours, 168 hours) during stationarity analysis
- ACF/PACF Analysis: Autocorrelation and partial autocorrelation function plots
- Trend Analysis: Long-term traffic volume trends
- Seasonal Decomposition: Weekly patterns (7-day period)
- Residual Analysis: Model diagnostic evaluation
- Additive Model: Used for seasonal decomposition
- TimeSeriesSplit: 3-fold time series cross-validation
- Test Size: 24 × 120 hours (120 days)
- Gap: 24 hours between train/test splits
- Methodology: Ensures temporal integrity in model evaluation
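The split-with-gap scheme above maps directly onto scikit-learn's `TimeSeriesSplit`; the one-month index below is illustrative:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_hours = 24 * 30  # one month of hourly rows, for illustration
tscv = TimeSeriesSplit(n_splits=3, gap=24)  # 24-hour gap between train and test

for fold, (train_idx, test_idx) in enumerate(tscv.split(np.arange(n_hours))):
    # The gap guarantees no test hour is within a day of the training data.
    assert train_idx.max() + 24 < test_idx.min()
    print(f"fold {fold}: train ends {train_idx.max()}, test starts {test_idx.min()}")
```

Because each test fold starts strictly after the training fold (plus the gap), no future information leaks into training, which is the temporal-integrity guarantee described above.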
The Streamlit web application provides:
- Interactive Interface: User-friendly traffic prediction interface
- Real-time Visualizations: Dynamic charts and graphs
- Date Selection: Choose specific dates for prediction
- Junction Analysis: Compare traffic across different junctions
- Trend Analysis: Historical and predicted traffic trends
- Traffic volume prediction for specific dates and junctions
- Historical data visualization
- Peak hour analysis
- Junction comparison charts
- Model performance metrics
- Peak Hours: Morning (7-9 AM) and evening (5-7 PM) rush hours
- Day Patterns: Weekdays show higher traffic than weekends
- Seasonal Trends: Traffic variations across different months
- Junction Differences: Significant variation in traffic volume between junctions
- Stationarity: Original series found to be stationary (no differencing required)
- Correlation Patterns: Strong temporal correlations identified through ACF/PACF analysis
- Traffic Distribution: Non-normal distribution with clear bimodal patterns
- Weekly Seasonality: Strong 7-day seasonal patterns identified
- Junction-Specific Patterns: Each junction exhibits unique traffic characteristics
- Data Quality: Junction 4 has limited data (only a few months available)
- Temporal Coverage: 1.5+ years of comprehensive hourly data
Based on actual evaluation results, the models demonstrate varying capacities:
- XGBoost Excellence: Achieves the highest performance with 95.91% variance explained, demonstrating superior capacity for traffic prediction with ensemble learning techniques
- Tied Excellence: Random Forest & GRU both achieve 95.47% accuracy (MAE: 4.01, RMSE: 5.62), showing excellent capacity with different approaches - ensemble learning vs. deep learning
- LSTM Deep Learning: Strong 91.91% performance demonstrates good capacity for temporal sequence learning and long-term dependency capture
- Prophet Interpretability: 65.64% performance provides moderate capacity but excellent interpretability for business insights and trend analysis
- SARIMA Foundation: Provides classical statistical approach with strong theoretical foundation for seasonal pattern analysis
- Learning-Oriented Approach: This project represents a comprehensive first exploration into time series analysis, implemented with careful research and best practices
- Focus: Junction 1 selected for detailed modeling (most complete dataset)
- Approach: Systematic comparison of statistical vs. machine learning methods
- Validation: Rigorous time series cross-validation to prevent data leakage
- Preprocessing: Comprehensive data transformation and stationarity testing
- Feature Engineering: Temporal feature extraction for enhanced model performance
- Forecasting Horizon: Both short-term and long-term prediction capabilities
Continuous Learning: As this is my first deep dive into time series forecasting, I've focused on implementing established methodologies and comparing multiple approaches. Any feedback on methodology improvements or best practices would be greatly appreciated for future iterations.
- Urban Planning: Data-driven insights for traffic management
- Infrastructure Development: Informed decisions for road expansion
- Public Transportation: Optimize bus/metro schedules based on predicted patterns
- Environmental Impact: Reduce congestion and emissions through better planning
- Model Selection: Guidance for choosing appropriate forecasting methods
- Policy Making: Evidence-based traffic flow optimization strategies
- Model Optimization: Fine-tune hyperparameters for all implemented models
- Methodology Refinement: Incorporate advanced time series best practices and validation techniques
- Weather Integration: Include weather data for better accuracy and external factor analysis
- Real-time Data: Connect to live traffic APIs for dynamic predictions
- Mobile App: Develop mobile application for wider accessibility
- Advanced Analytics: Anomaly detection and alerting system for traffic incidents
- Multi-city Support: Expand analysis to multiple cities and comparative studies
- Ensemble Methods: Combine multiple models for improved prediction accuracy
- Feature Engineering: Advanced temporal features like rolling statistics and lag variables
- Model Interpretability: Advanced SHAP analysis and feature importance visualization
- Peer Review: Seek feedback from time series experts to improve methodology and implementation
We welcome contributions! Please follow these steps:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
- CHOU Kosol - Initial work - @KosolCHOU
- Traffic data providers
- Urban planning community
- Open source machine learning libraries
- Streamlit for the amazing web framework
For questions or collaboration opportunities:
- Email: kosolchou@gmail.com
- LinkedIn: Kosol Chou
- Project Link: https://github.com/KosolCHOU/Traffic-Prediction
Star this repository if you found it helpful!