
🚦 Traffic Prediction Project

A comprehensive machine learning project for analyzing and predicting traffic patterns at urban junctions using time series data and XGBoost regression models.

🎯 Overview

This project aims to enhance urban mobility and planning through comprehensive traffic data analysis and prediction. By analyzing hourly vehicle counts from multiple junctions, we provide insights into traffic behaviors, peak hours, seasonal patterns, and junction-specific differences.

Key Highlight: This is a comparative study of six machine learning and statistical approaches (SARIMA, XGBoost, Random Forest, Prophet, LSTM, and GRU) to determine the most effective method for traffic prediction.

πŸ“ Learning Journey Note: This represents my first comprehensive exploration into time series analysis and forecasting. As with any learning project, there may be areas for improvement in methodology or implementation. I welcome feedback and suggestions from the community to enhance the analysis and learn best practices in time series modeling.

Key Objectives

  • Analyze Traffic Patterns: Identify hourly, daily, and monthly variations in traffic volume
  • Peak Period Detection: Pinpoint congestion hours and compare weekday vs weekend patterns
  • Junction Comparison: Investigate traffic differences among various junctions
  • Temporal Trend Analysis: Examine seasonality and recurring patterns
  • Anomaly Detection: Identify irregularities in traffic flows

✨ Features

  • 📊 Exploratory Data Analysis (EDA): Comprehensive traffic pattern analysis
  • 🤖 Multiple ML Models: Comparative study of SARIMA, XGBoost, Random Forest, and Prophet
  • 📈 Interactive Visualizations: Real-time traffic data visualization
  • ⚖️ Model Comparison: Performance evaluation across different algorithms
  • 🌐 Web Application: Streamlit-based user interface
  • 📱 Responsive Design: Modern and intuitive UI
  • ⚡ Real-time Predictions: Live traffic volume forecasting
  • 📐 Statistical Analysis: Time series decomposition and stationarity testing
  • 🔄 Data Preprocessing: Normalization, differencing, and stationarity checks
  • 🎯 Feature Engineering: Time-based feature extraction (hour, day, month, etc.)
  • 📋 Residual Analysis: Model diagnostic plots and evaluation
  • 🔍 Correlation Analysis: Feature relationship exploration

📊 Dataset

The dataset contains hourly traffic data with the following structure:

  • DateTime: Timestamp of traffic measurement
  • Junction: Junction identifier (1-4)
  • Vehicles: Number of vehicles counted
  • ID: Unique record identifier

Data Source: Kaggle Traffic Prediction Dataset
Data Range: November 1, 2015 - June 30, 2017
Total Records: 48,000+ hourly observations
Junctions: 4 different urban junctions
Collection Method: Hourly vehicle counts from urban traffic junctions

πŸ“ Project Structure

Traffic-Prediction/
├── README.md                    # Project documentation
├── model.ipynb                  # Main analysis and modeling notebook
├── traffic.csv                  # Raw traffic dataset
├── Poster.pdf                   # Project poster presentation
├── Report.pdf                   # Detailed project report
└── app/                         # Web application
    ├── app.py                   # Main Streamlit application
    ├── appori.py                # Alternative app version
    ├── XGBoost.ipynb            # XGBoost model development
    ├── traffic.csv              # App dataset
    ├── analytics_icon.png       # App icon
    ├── traffic_prediction_model.pkl  # Trained model (pickle)
    └── xgboost_model.pkl        # XGBoost model (pickle)

🚀 Installation

Prerequisites

  • Python 3.7+
  • pip package manager

Setup

  1. Clone the repository

    git clone https://github.com/KosolCHOU/traffic-prediction.git
    cd traffic-prediction
  2. Install required packages

    pip install -r requirements.txt

    Or install manually:

    pip install streamlit pandas numpy xgboost scikit-learn matplotlib seaborn plotly statsmodels prophet jupyter
  3. Verify installation

    python -c "import streamlit, pandas, xgboost, statsmodels, prophet; print('All packages installed successfully!')"

💻 Usage

Running the Jupyter Notebooks

  1. Main Analysis Notebook

    jupyter notebook model.ipynb
  2. XGBoost Model Development

    jupyter notebook app/XGBoost.ipynb

Running the Web Application

  1. Navigate to app directory

    cd app
  2. Launch Streamlit app

    streamlit run app.py
  3. Access the application

    • Open your browser and go to http://localhost:8501
    • Use the interactive interface to explore traffic predictions

Using the Models

import pickle
import pandas as pd

# Load the trained model
with open('app/traffic_prediction_model.pkl', 'rb') as file:
    model = pickle.load(file)

# Make predictions
# (Ensure your data has the same features as training data)
predictions = model.predict(your_data)

📈 Model Performance

This project implements and compares multiple machine learning approaches for traffic prediction:

🤖 Models Implemented

1. SARIMA (Seasonal AutoRegressive Integrated Moving Average)

  • Type: Time series forecasting model
  • Approach: Statistical modeling with seasonal components
  • Parameters: Auto-tuned using grid search with AIC criterion
  • Grid Search: p,q ∈ {0,1}, P,Q ∈ {0,1}, d=0, D=0, s=24
  • Seasonality: 24-hour (daily) patterns
  • Stationarity: No differencing required (stationary data)
  • Strengths: Captures seasonal trends and autocorrelation

2. XGBoost (Extreme Gradient Boosting)

  • Type: Ensemble learning method
  • Approach: Gradient boosting with feature engineering
  • Parameters: Optimized using GridSearchCV (n_estimators: 100,500,1000; max_depth: 3,5,7; learning_rate: 0.01,0.05,0.1)
  • Features: Time-based features (hour, day, month, year, etc.)
  • Early Stopping: 50 rounds to prevent overfitting
  • Strengths: High accuracy, handles non-linear patterns

3. Random Forest Regressor

  • Type: Ensemble learning method
  • Approach: Multiple decision trees with bagging
  • Parameters: Grid search optimization (n_estimators: 100,500,1000; max_depth: 5,10,15)
  • Features: Same time-based features as XGBoost
  • Validation: 3-fold TimeSeriesSplit cross-validation
  • Strengths: Robust to overfitting, feature importance insights
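The tuning setup for the tree-based models can be sketched with scikit-learn. The data below is a synthetic stand-in and the grid is smaller than the README's, purely for illustration; the XGBoost tuning follows the same GridSearchCV + TimeSeriesSplit pattern with its own parameter grid.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic stand-in for the hourly feature table; the real features
# come from traffic.csv and match the time-based features listed above.
rng = np.random.default_rng(0)
idx = pd.date_range("2016-01-01", periods=500, freq="h")
X = pd.DataFrame({"hour": idx.hour, "dayofweek": idx.dayofweek,
                  "month": idx.month})
y = 20 + 10 * np.sin(2 * np.pi * X["hour"] / 24) + rng.normal(0, 1, len(X))

# Smaller grid than the README's (n_estimators: 100,500,1000; max_depth:
# 5,10,15) so the sketch runs quickly; the structure is the same.
grid = {"n_estimators": [50, 100], "max_depth": [5, 10]}
cv = TimeSeriesSplit(n_splits=3)  # folds preserve temporal order
search = GridSearchCV(RandomForestRegressor(random_state=0), grid,
                      cv=cv, scoring="neg_mean_absolute_error")
search.fit(X, y)
print(search.best_params_)
```

Using TimeSeriesSplit instead of ordinary k-fold ensures each validation fold lies strictly after its training fold, which is what makes the grid search honest for forecasting.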

4. Prophet (Facebook's Time Series Forecasting)

  • Type: Time series forecasting tool
  • Approach: Decomposable additive model
  • Components: Trend, seasonality, holidays
  • Features: Holiday effects, multiple seasonality patterns
  • Holiday Integration: US Federal Holiday calendar support
  • Strengths: Handles missing data, holiday effects, robust to outliers

5. LSTM (Long Short-Term Memory)

  • Type: Deep learning recurrent neural network
  • Approach: Sequence-to-sequence learning with memory cells
  • Architecture: Multi-layer LSTM with dropout regularization
  • Framework: TensorFlow/Keras Sequential model
  • Layers: LSTM layers with Dense output layer and Dropout
  • Input: Sequential time windows for temporal pattern recognition
  • Strengths: Captures long-term dependencies, temporal patterns

6. GRU (Gated Recurrent Unit)

  • Type: Deep learning recurrent neural network
  • Approach: Simplified RNN architecture with gating mechanisms
  • Architecture: Multi-layer GRU with batch normalization
  • Framework: TensorFlow/Keras with Bidirectional GRU layers
  • Layers: GRU, Bidirectional layers with Dense output
  • Input: Sequential time windows with advanced feature engineering
  • Strengths: Faster training than LSTM, good performance on sequences
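Both recurrent models consume sliding time windows. A minimal sketch of that window construction (a hypothetical `make_windows` helper, independent of the notebook's actual preprocessing):

```python
import numpy as np

def make_windows(series, window=24):
    """Turn a 1-D series into (samples, window, 1) inputs plus next-step
    targets -- the input shape Keras LSTM/GRU layers expect."""
    series = np.asarray(series, dtype=np.float32)
    X = np.stack([series[i:i + window]
                  for i in range(len(series) - window)])
    y = series[window:]          # each target is the hour after its window
    return X[..., np.newaxis], y

X, y = make_windows(np.arange(100.0), window=24)
# X.shape == (76, 24, 1), y.shape == (76,)
```

A 24-hour window lets the network see one full daily cycle when predicting the next hour; longer windows (e.g. 168 hours) would expose weekly structure at the cost of fewer training samples.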

📊 Model Comparison Results

Comparative evaluation of all implemented models with actual performance metrics:

Model          MAE    RMSE   R² Score  Performance Level
XGBoost        3.38   4.79   0.9591    🏆 Excellent
Random Forest  4.01   5.62   0.9547    🏆 Excellent
GRU            4.01   5.62   0.9547    🏆 Excellent
LSTM           4.85   6.72   0.9191    🥈 Very Good
Prophet        12.46  14.54  0.6564    🥉 Good
SARIMA         -      -      -         ✅ Implemented

📈 Performance Analysis

Top Performing Models

πŸ† Gradient Boosting Excellence

  • XGBoost: Best overall performance with MAE: 3.38, RMSE: 4.79, RΒ²: 0.9591
    • Highest RΒ² score (95.91%) indicating excellent prediction accuracy
    • Lowest error metrics demonstrating superior capacity for traffic prediction
    • Ensemble learning with feature importance insights

πŸ₯ˆ Strong Tree-Based Performance

  • Random Forest: Excellent performance with MAE: 4.01, RMSE: 5.62, RΒ²: 0.9547
    • High RΒ² score (95.47%) showing strong predictive capacity
    • Robust ensemble method with minimal overfitting
    • Reliable feature importance analysis

Deep Learning Models

🧠 Neural Network Capabilities

  • LSTM: Very good performance with MAE: 4.85, RMSE: 6.72, R²: 0.9191

    • Strong R² score (91.91%) demonstrating good temporal pattern recognition
    • Excellent capacity for capturing long-term dependencies
    • Advanced sequence-to-sequence learning
  • GRU: Excellent performance with MAE: 4.01, RMSE: 5.62, R²: 0.9547

    • Outstanding R² score (95.47%) matching Random Forest performance
    • Excellent capacity for temporal pattern recognition with efficient training
    • Simplified RNN architecture with superior gating mechanisms

Statistical & Specialized Models

📊 Time Series Forecasting

  • Prophet: Moderate performance with MAE: 12.46, RMSE: 14.54, R²: 0.6564

    • Solid R² score (65.64%) for trend and seasonality analysis
    • Excellent capacity for handling holiday effects and missing data
    • Strong interpretability for business insights
  • SARIMA: Statistical time series analysis

    • Specialized capacity for seasonal pattern recognition
    • Strong foundation in time series statistical modeling
    • Excellent interpretability and seasonal decomposition

Model Capacity Insights

🏆 Prediction Accuracy Ranking:

  1. XGBoost - Superior capacity (R²: 95.91%)
  2. Random Forest - Excellent capacity (R²: 95.47%)
  3. GRU - Excellent capacity (R²: 95.47%, tied with Random Forest)
  4. LSTM - Very good capacity (R²: 91.91%)
  5. Prophet - Moderate capacity (R²: 65.64%)
  6. SARIMA - Statistical baseline capacity

📊 Error Performance:

  • Lowest MAE: XGBoost (3.38) → Best average error performance
  • Lowest RMSE: XGBoost (4.79) → Best overall prediction precision
  • Highest Variance Explained: XGBoost (95.91%) → Superior model capacity

🎯 Model Selection Guidelines

Based on actual performance results, here are recommended use cases:

Use Case                      Recommended Model  Performance Rationale
Production Deployment         XGBoost            Best overall accuracy (R²: 95.91%), lowest errors
Interpretable Predictions     Random Forest      Excellent performance (R²: 95.47%) with feature insights
Efficient Deep Learning       GRU                Excellent performance (R²: 95.47%) with faster training
Sequence Learning Research    LSTM               Strong deep learning performance (R²: 91.91%)
Trend & Seasonality Analysis  Prophet            Good interpretability (R²: 65.64%) with business insights
Statistical Foundation        SARIMA             Classical time series approach with theoretical grounding

Performance-Based Recommendations

🚀 High-Performance Applications

  • Best Choice: XGBoost (MAE: 3.38, RMSE: 4.79)
  • Excellent Alternatives: Random Forest & GRU (MAE: 4.01, RMSE: 5.62)

🔬 Research & Development

  • Deep Learning Excellence: GRU (MAE: 4.01, RMSE: 5.62, R²: 95.47%)
  • Sequence Learning: LSTM (MAE: 4.85, RMSE: 6.72, R²: 91.91%)

📈 Business Intelligence & Analysis

  • Interpretable Results: Prophet (MAE: 12.46, RMSE: 14.54)
  • Classical Analysis: SARIMA (statistical baseline)

πŸ” Implementation Notes

  • Comprehensive Comparison: All 6 models implemented and evaluated
  • Diverse Approaches: Statistical, ensemble, and deep learning methods
  • Time Series Focus: Specialized techniques for temporal data analysis
  • Feature Engineering: Time-based features for enhanced model performance

🎯 Key Features Used

  • Hour of day (0-23)
  • Day of week (0-6)
  • Month (1-12)
  • Year
  • Day of year (1-365/366)
  • Day of month (1-31)
  • Week of year (1-52/53)
  • Junction identifier (1-4)
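These features can be derived from the DateTime column with pandas datetime accessors. A minimal sketch on a two-row stand-in frame (the real input is traffic.csv):

```python
import pandas as pd

# Hypothetical mini-frame standing in for traffic.csv
df = pd.DataFrame({"DateTime": ["2016-11-01 00:00", "2016-11-01 01:00"],
                   "Junction": [1, 1], "Vehicles": [15, 13]})
df["DateTime"] = pd.to_datetime(df["DateTime"])

df["hour"] = df["DateTime"].dt.hour
df["dayofweek"] = df["DateTime"].dt.dayofweek        # Monday = 0
df["month"] = df["DateTime"].dt.month
df["year"] = df["DateTime"].dt.year
df["dayofyear"] = df["DateTime"].dt.dayofyear
df["day"] = df["DateTime"].dt.day
df["weekofyear"] = df["DateTime"].dt.isocalendar().week.astype(int)
```

The exact column names used in the notebook may differ; what matters is that prediction-time inputs are built with the same transformations as the training data.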

βš™οΈ Model Validation

  • Cross-validation: Time series split validation
  • Evaluation Metrics: RMSE, MAE, RΒ² score
  • Test Period: March 2017 - June 2017
  • Training Period: November 2015 - February 2017

🔬 Data Preprocessing & Analysis

Stationarity Testing

  • Augmented Dickey-Fuller (ADF) Test: Statistical test for stationarity
  • Normalization: Z-score standardization for stable variance
  • Differencing: Applied at different intervals (1 hour, 24 hours, 168 hours)
  • ACF/PACF Analysis: Autocorrelation and partial autocorrelation function plots

Time Series Decomposition

  • Trend Analysis: Long-term traffic volume trends
  • Seasonal Decomposition: Weekly patterns (7-day period)
  • Residual Analysis: Model diagnostic evaluation
  • Additive Model: Used for seasonal decomposition

Cross-Validation Strategy

  • TimeSeriesSplit: 3-fold time series cross-validation
  • Test Size: 24 × 120 = 2,880 hours (120 days)
  • Gap: 24 hours between train/test splits
  • Methodology: Ensures temporal integrity in model evaluation
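This splitting scheme maps directly onto scikit-learn's TimeSeriesSplit, shown here with scaled-down sizes for illustration:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 600 hourly points as a stand-in; the project uses test_size = 24 * 120
# on the full series.
n = 600
tscv = TimeSeriesSplit(n_splits=3, test_size=100, gap=24)

for train_idx, test_idx in tscv.split(np.arange(n)):
    # The gap leaves 24 hours between the end of training and the start
    # of testing, so lagged features cannot leak target information
    # across the boundary.
    assert train_idx.max() + 24 < test_idx.min()
```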

🌐 Web Application

The Streamlit web application provides:

  • πŸŽ›οΈ Interactive Interface: User-friendly traffic prediction interface
  • πŸ“Š Real-time Visualizations: Dynamic charts and graphs
  • πŸ“… Date Selection: Choose specific dates for prediction
  • 🚦 Junction Analysis: Compare traffic across different junctions
  • πŸ“ˆ Trend Analysis: Historical and predicted traffic trends

App Features

  • Traffic volume prediction for specific dates and junctions
  • Historical data visualization
  • Peak hour analysis
  • Junction comparison charts
  • Model performance metrics

πŸ” Results and Insights

Key Findings

  1. Peak Hours: Morning (7-9 AM) and evening (5-7 PM) rush hours
  2. Day Patterns: Weekdays show higher traffic than weekends
  3. Seasonal Trends: Traffic variations across different months
  4. Junction Differences: Significant variation in traffic volume between junctions
  5. Stationarity: Original series found to be stationary (no differencing required)
  6. Correlation Patterns: Strong temporal correlations identified through ACF/PACF analysis

Data Analysis Insights

  • Traffic Distribution: Non-normal distribution with clear bimodal patterns
  • Weekly Seasonality: Strong 7-day seasonal patterns identified
  • Junction-Specific Patterns: Each junction exhibits unique traffic characteristics
  • Data Quality: Junction 4 has limited data (only a few months available)
  • Temporal Coverage: 1.5+ years of comprehensive hourly data

Model Performance Insights

Based on actual evaluation results, the models demonstrate varying capacities:

  • XGBoost Excellence: Achieves the highest performance with 95.91% variance explained, demonstrating superior capacity for traffic prediction with ensemble learning techniques
  • Tied Excellence: Random Forest & GRU both achieve 95.47% accuracy (MAE: 4.01, RMSE: 5.62), showing excellent capacity with different approaches - ensemble learning vs. deep learning
  • LSTM Deep Learning: Strong 91.91% performance demonstrates good capacity for temporal sequence learning and long-term dependency capture
  • Prophet Interpretability: 65.64% performance provides moderate capacity but excellent interpretability for business insights and trend analysis
  • SARIMA Foundation: Provides classical statistical approach with strong theoretical foundation for seasonal pattern analysis

Research Methodology

  • Learning-Oriented Approach: This project represents a comprehensive first exploration into time series analysis, implemented with careful research and best practices
  • Focus: Junction 1 selected for detailed modeling (most complete dataset)
  • Approach: Systematic comparison of statistical vs. machine learning methods
  • Validation: Rigorous time series cross-validation to prevent data leakage
  • Preprocessing: Comprehensive data transformation and stationarity testing
  • Feature Engineering: Temporal feature extraction for enhanced model performance
  • Forecasting Horizon: Both short-term and long-term prediction capabilities

📚 Continuous Learning: As this is my first deep dive into time series forecasting, I've focused on implementing established methodologies and comparing multiple approaches. Any feedback on methodology improvements or best practices would be greatly appreciated for future iterations.

Business Impact

  • Urban Planning: Data-driven insights for traffic management
  • Infrastructure Development: Informed decisions for road expansion
  • Public Transportation: Optimize bus/metro schedules based on predicted patterns
  • Environmental Impact: Reduce congestion and emissions through better planning
  • Model Selection: Guidance for choosing appropriate forecasting methods
  • Policy Making: Evidence-based traffic flow optimization strategies

🚀 Future Improvements

  • Model Optimization: Fine-tune hyperparameters for all implemented models
  • Methodology Refinement: Incorporate advanced time series best practices and validation techniques
  • Weather Integration: Include weather data for better accuracy and external factor analysis
  • Real-time Data: Connect to live traffic APIs for dynamic predictions
  • Mobile App: Develop mobile application for wider accessibility
  • Advanced Analytics: Anomaly detection and alerting system for traffic incidents
  • Multi-city Support: Expand analysis to multiple cities and comparative studies
  • Ensemble Methods: Combine multiple models for improved prediction accuracy
  • Feature Engineering: Advanced temporal features like rolling statistics and lag variables
  • Model Interpretability: Advanced SHAP analysis and feature importance visualization
  • Peer Review: Seek feedback from time series experts to improve methodology and implementation

🤝 Contributing

We welcome contributions! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

👥 Authors

πŸ™ Acknowledgments

  • Traffic data providers
  • Urban planning community
  • Open source machine learning libraries
  • Streamlit for the amazing web framework

📞 Contact

For questions or collaboration opportunities:


⭐ Star this repository if you found it helpful!
