A comprehensive machine learning project for analyzing and predicting traffic patterns at urban junctions using time series data and XGBoost regression models.
- Overview
- Features
- Dataset
- Project Structure
- Installation
- Usage
- Model Performance
- Web Application
- Results and Insights
- Future Improvements
- Contributing
- License
This project aims to enhance urban mobility and planning through comprehensive traffic data analysis and prediction. By analyzing hourly vehicle counts from multiple junctions, we provide insights into traffic behaviors, peak hours, seasonal patterns, and junction-specific differences.
Key Highlight: This is a comparative study of multiple statistical, machine learning, and deep learning approaches — SARIMA, XGBoost, Random Forest, Prophet, LSTM, and GRU — to determine the most effective approach for traffic prediction.
Learning Journey Note: This represents my first comprehensive exploration into time series analysis and forecasting. As with any learning project, there may be areas for improvement in methodology or implementation. I welcome feedback and suggestions from the community to enhance the analysis and learn best practices in time series modeling.
- Analyze Traffic Patterns: Identify hourly, daily, and monthly variations in traffic volume
- Peak Period Detection: Pinpoint congestion hours and compare weekday vs weekend patterns
- Junction Comparison: Investigate traffic differences among various junctions
- Temporal Trend Analysis: Examine seasonality and recurring patterns
- Anomaly Detection: Identify irregularities in traffic flows
- Exploratory Data Analysis (EDA): Comprehensive traffic pattern analysis
- Multiple ML Models: Comparative study of SARIMA, XGBoost, Random Forest, Prophet, LSTM, and GRU
- Interactive Visualizations: Real-time traffic data visualization
- Model Comparison: Performance evaluation across different algorithms
- Web Application: Streamlit-based user interface
- Responsive Design: Modern and intuitive UI
- Real-time Predictions: Live traffic volume forecasting
- Statistical Analysis: Time series decomposition and stationarity testing
- Data Preprocessing: Normalization, differencing, and stationarity checks
- Feature Engineering: Time-based feature extraction (hour, day, month, etc.)
- Residual Analysis: Model diagnostic plots and evaluation
- Correlation Analysis: Feature relationship exploration
The dataset contains hourly traffic data with the following structure:
- DateTime: Timestamp of traffic measurement
- Junction: Junction identifier (1-4)
- Vehicles: Number of vehicles counted
- ID: Unique record identifier
Data Source: Kaggle Traffic Prediction Dataset
Data Range: November 1, 2015 - June 30, 2017
Total Records: 48,000+ hourly observations
Junctions: 4 different urban junctions
Collection Method: Hourly vehicle counts from urban traffic junctions
```
Traffic-Prediction/
├── README.md                        # Project documentation
├── model.ipynb                      # Main analysis and modeling notebook
├── traffic.csv                      # Raw traffic dataset
├── Poster.pdf                       # Project poster presentation
├── Report.pdf                       # Detailed project report
└── app/                             # Web application
    ├── app.py                       # Main Streamlit application
    ├── appori.py                    # Alternative app version
    ├── XGBoost.ipynb                # XGBoost model development
    ├── traffic.csv                  # App dataset
    ├── analytics_icon.png           # App icon
    ├── traffic_prediction_model.pkl # Trained model (pickle)
    └── xgboost_model.pkl            # XGBoost model (pickle)
```
- Python 3.7+
- pip package manager
1. Clone the repository

   ```bash
   git clone https://github.com/KosolCHOU/traffic-prediction.git
   cd traffic-prediction
   ```

2. Install required packages

   ```bash
   pip install -r requirements.txt
   ```

   Or install manually:

   ```bash
   pip install streamlit pandas numpy xgboost scikit-learn matplotlib seaborn plotly statsmodels prophet jupyter
   ```

3. Verify installation

   ```bash
   python -c "import streamlit, pandas, xgboost, statsmodels, prophet; print('All packages installed successfully!')"
   ```
1. Main analysis notebook

   ```bash
   jupyter notebook model.ipynb
   ```

2. XGBoost model development

   ```bash
   jupyter notebook app/XGBoost.ipynb
   ```

3. Navigate to the app directory

   ```bash
   cd app
   ```

4. Launch the Streamlit app

   ```bash
   streamlit run app.py
   ```

5. Access the application
   - Open your browser and go to http://localhost:8501
   - Use the interactive interface to explore traffic predictions
```python
import pickle
import pandas as pd

# Load the trained model
with open('app/traffic_prediction_model.pkl', 'rb') as file:
    model = pickle.load(file)

# Make predictions
# (ensure your data has the same features as the training data)
predictions = model.predict(your_data)
```

This project implements and compares multiple machine learning approaches for traffic prediction:
- Type: Time series forecasting model
- Approach: Statistical modeling with seasonal components
- Parameters: Auto-tuned using grid search with AIC criterion
- Grid Search: p, q ∈ {0, 1}; P, Q ∈ {0, 1}; d = 0, D = 0, s = 24
- Seasonality: 24-hour (daily) patterns
- Stationarity: No differencing required (stationary data)
- Strengths: Captures seasonal trends and autocorrelation
- Type: Ensemble learning method
- Approach: Gradient boosting with feature engineering
- Parameters: Optimized using GridSearchCV (n_estimators: 100, 500, 1000; max_depth: 3, 5, 7; learning_rate: 0.01, 0.05, 0.1)
- Features: Time-based features (hour, day, month, year, etc.)
- Early Stopping: 50 rounds to prevent overfitting
- Strengths: High accuracy, handles non-linear patterns
- Type: Ensemble learning method
- Approach: Multiple decision trees with bagging
- Parameters: Grid search optimization (n_estimators: 100,500,1000; max_depth: 5,10,15)
- Features: Same time-based features as XGBoost
- Validation: 3-fold TimeSeriesSplit cross-validation
- Strengths: Robust to overfitting, feature importance insights
- Type: Time series forecasting tool
- Approach: Decomposable additive model
- Components: Trend, seasonality, holidays
- Features: Holiday effects, multiple seasonality patterns
- Holiday Integration: US Federal Holiday calendar support
- Strengths: Handles missing data, holiday effects, robust to outliers
- Type: Deep learning recurrent neural network
- Approach: Sequence-to-sequence learning with memory cells
- Architecture: Multi-layer LSTM with dropout regularization
- Framework: TensorFlow/Keras Sequential model
- Layers: LSTM layers with Dense output layer and Dropout
- Input: Sequential time windows for temporal pattern recognition
- Strengths: Captures long-term dependencies, temporal patterns
- Type: Deep learning recurrent neural network
- Approach: Simplified RNN architecture with gating mechanisms
- Architecture: Multi-layer GRU with batch normalization
- Framework: TensorFlow/Keras with Bidirectional GRU layers
- Layers: GRU, Bidirectional layers with Dense output
- Input: Sequential time windows with advanced feature engineering
- Strengths: Faster training than LSTM, good performance on sequences
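The two recurrent architectures can be outlined with Keras. The layer sizes and the 24-hour input window below are illustrative assumptions, not the project's exact configuration:

```python
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import LSTM, GRU, Dense, Dropout, Bidirectional

WINDOW = 24  # assumed: one day of hourly counts per input sequence

# Stacked LSTM with dropout regularization, as described above.
lstm_model = Sequential([
    Input(shape=(WINDOW, 1)),
    LSTM(64, return_sequences=True),
    Dropout(0.2),
    LSTM(32),
    Dense(1),
])

# Bidirectional GRU variant.
gru_model = Sequential([
    Input(shape=(WINDOW, 1)),
    Bidirectional(GRU(32)),
    Dense(1),
])

for model in (lstm_model, gru_model):
    model.compile(optimizer="adam", loss="mse")
```

Both models map a `(WINDOW, 1)` sequence to a single predicted vehicle count for the next hour.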
Comparative evaluation of all implemented models with actual performance metrics:
| Model | MAE | RMSE | R² Score | Performance Level |
|---|---|---|---|---|
| XGBoost | 3.38 | 4.79 | 0.9591 | Excellent |
| Random Forest | 4.01 | 5.62 | 0.9547 | Excellent |
| GRU | 4.01 | 5.62 | 0.9547 | Excellent |
| LSTM | 4.85 | 6.72 | 0.9191 | Very Good |
| Prophet | 12.46 | 14.54 | 0.6564 | Good |
| SARIMA | - | - | - | Implemented (metrics not reported) |
Gradient Boosting Excellence

- XGBoost: Best overall performance with MAE: 3.38, RMSE: 4.79, R²: 0.9591
- Highest R² score (95.91%), indicating excellent prediction accuracy
- Lowest error metrics, demonstrating superior capacity for traffic prediction
- Ensemble learning with feature importance insights

Strong Tree-Based Performance

- Random Forest: Excellent performance with MAE: 4.01, RMSE: 5.62, R²: 0.9547
- High R² score (95.47%), showing strong predictive capacity
- Robust ensemble method with minimal overfitting
- Reliable feature importance analysis
Neural Network Capabilities

- LSTM: Very good performance with MAE: 4.85, RMSE: 6.72, R²: 0.9191
  - Strong R² score (91.91%), demonstrating good temporal pattern recognition
  - Excellent capacity for capturing long-term dependencies
  - Advanced sequence-to-sequence learning
- GRU: Excellent performance with MAE: 4.01, RMSE: 5.62, R²: 0.9547
  - Outstanding R² score (95.47%), matching Random Forest performance
  - Efficient training with strong temporal pattern recognition
  - Simplified RNN architecture with effective gating mechanisms
Time Series Forecasting

- Prophet: Moderate performance with MAE: 12.46, RMSE: 14.54, R²: 0.6564
  - Solid R² score (65.64%) for trend and seasonality analysis
  - Handles holiday effects and missing data well
  - Strong interpretability for business insights
- SARIMA: Statistical time series analysis
  - Specialized capacity for seasonal pattern recognition
  - Strong foundation in time series statistical modeling
  - Excellent interpretability and seasonal decomposition
Prediction Accuracy Ranking:

1. XGBoost - Superior capacity (R²: 95.91%)
2. Random Forest - Excellent capacity (R²: 95.47%)
3. GRU - Excellent capacity (R²: 95.47%, tied with Random Forest)
4. LSTM - Very good capacity (R²: 91.91%)
5. Prophet - Moderate capacity (R²: 65.64%)
6. SARIMA - Statistical baseline

Error Performance:

- Lowest MAE: XGBoost (3.38), the best average error performance
- Lowest RMSE: XGBoost (4.79), the best overall prediction precision
- Highest variance explained: XGBoost (95.91%), superior model capacity
Based on actual performance results, here are recommended use cases:
| Use Case | Recommended Model | Performance Rationale |
|---|---|---|
| Production Deployment | XGBoost | Best overall accuracy (R²: 95.91%), lowest errors |
| Interpretable Predictions | Random Forest | Excellent performance (R²: 95.47%) with feature insights |
| Efficient Deep Learning | GRU | Excellent performance (R²: 95.47%) with faster training |
| Sequence Learning Research | LSTM | Strong deep learning performance (R²: 91.91%) |
| Trend & Seasonality Analysis | Prophet | Good interpretability (R²: 65.64%) with business insights |
| Statistical Foundation | SARIMA | Classical time series approach with theoretical grounding |
High-Performance Applications

- Best Choice: XGBoost (MAE: 3.38, RMSE: 4.79)
- Excellent Alternatives: Random Forest & GRU (MAE: 4.01, RMSE: 5.62)

Research & Development

- Deep Learning Excellence: GRU (MAE: 4.01, RMSE: 5.62, R²: 95.47%)
- Sequence Learning: LSTM (MAE: 4.85, RMSE: 6.72, R²: 91.91%)

Business Intelligence & Analysis

- Interpretable Results: Prophet (MAE: 12.46, RMSE: 14.54)
- Classical Analysis: SARIMA (statistical baseline)
- Comprehensive Comparison: All 6 models implemented and evaluated
- Diverse Approaches: Statistical, ensemble, and deep learning methods
- Time Series Focus: Specialized techniques for temporal data analysis
- Feature Engineering: Time-based features for enhanced model performance
- Hour of day (0-23)
- Day of week (0-6)
- Month (1-12)
- Year
- Day of year (1-365/366)
- Day of month (1-31)
- Week of year (1-52/53)
- Junction identifier (1-4)
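The feature list above boils down to pandas datetime accessors on the `DateTime` column. Column names follow the dataset description; the two sample rows below are made up for illustration:

```python
import pandas as pd

# Two hypothetical rows matching the dataset schema described earlier.
df = pd.DataFrame({
    "DateTime": pd.to_datetime(["2015-11-01 00:00", "2015-11-01 01:00"]),
    "Junction": [1, 1],
    "Vehicles": [15, 13],
})

# Extract the time-based features via the .dt accessor.
dt = df["DateTime"].dt
df = df.assign(
    hour=dt.hour,                                  # 0-23
    dayofweek=dt.dayofweek,                        # 0=Monday .. 6=Sunday
    month=dt.month,                                # 1-12
    year=dt.year,
    dayofyear=dt.dayofyear,                        # 1-365/366
    day=dt.day,                                    # 1-31
    weekofyear=dt.isocalendar().week.astype(int),  # 1-52/53 (ISO weeks)
)
print(df[["hour", "dayofweek", "month", "weekofyear"]])
```

These columns, plus the existing `Junction` identifier, form the model input matrix.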
- Cross-validation: Time series split validation
- Evaluation Metrics: RMSE, MAE, R² score
- Test Period: March 2017 - June 2017
- Training Period: November 2015 - February 2017
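The three evaluation metrics can be computed with scikit-learn; the toy arrays below are illustrative, but the reported table follows the same calculation:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative actual vs. predicted hourly vehicle counts.
y_true = np.array([20, 35, 50, 42, 30])
y_pred = np.array([22, 33, 48, 45, 29])

mae = mean_absolute_error(y_true, y_pred)            # mean absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # root mean squared error
r2 = r2_score(y_true, y_pred)                        # fraction of variance explained

print(f"MAE={mae:.2f} RMSE={rmse:.2f} R2={r2:.4f}")
```

RMSE penalizes large misses more than MAE, which is why the two metrics can rank models differently.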
- Augmented Dickey-Fuller (ADF) Test: Statistical test for stationarity
- Normalization: Z-score standardization for stable variance
- Differencing: Explored at different intervals (1 hour, 24 hours, 168 hours) during stationarity analysis
- ACF/PACF Analysis: Autocorrelation and partial autocorrelation function plots
- Trend Analysis: Long-term traffic volume trends
- Seasonal Decomposition: Weekly patterns (7-day period)
- Residual Analysis: Model diagnostic evaluation
- Additive Model: Used for seasonal decomposition
- TimeSeriesSplit: 3-fold time series cross-validation
- Test Size: 24 × 120 hours (120 days)
- Gap: 24 hours between train/test splits
- Methodology: Ensures temporal integrity in model evaluation
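The split-with-gap scheme above maps directly onto scikit-learn's `TimeSeriesSplit`; the one-month index below is illustrative:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_hours = 24 * 30  # one month of hourly rows, for illustration
tscv = TimeSeriesSplit(n_splits=3, gap=24)  # 24-hour gap between train and test

for fold, (train_idx, test_idx) in enumerate(tscv.split(np.arange(n_hours))):
    # The gap guarantees no test hour is within a day of the training data.
    assert train_idx.max() + 24 < test_idx.min()
    print(f"fold {fold}: train ends {train_idx.max()}, test starts {test_idx.min()}")
```

Because each test fold starts strictly after the training fold (plus the gap), no future information leaks into training, which is the temporal-integrity guarantee described above.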
The Streamlit web application provides:
- Interactive Interface: User-friendly traffic prediction interface
- Real-time Visualizations: Dynamic charts and graphs
- Date Selection: Choose specific dates for prediction
- Junction Analysis: Compare traffic across different junctions
- Trend Analysis: Historical and predicted traffic trends
- Traffic volume prediction for specific dates and junctions
- Historical data visualization
- Peak hour analysis
- Junction comparison charts
- Model performance metrics
- Peak Hours: Morning (7-9 AM) and evening (5-7 PM) rush hours
- Day Patterns: Weekdays show higher traffic than weekends
- Seasonal Trends: Traffic variations across different months
- Junction Differences: Significant variation in traffic volume between junctions
- Stationarity: Original series found to be stationary (no differencing required)
- Correlation Patterns: Strong temporal correlations identified through ACF/PACF analysis
- Traffic Distribution: Non-normal distribution with clear bimodal patterns
- Weekly Seasonality: Strong 7-day seasonal patterns identified
- Junction-Specific Patterns: Each junction exhibits unique traffic characteristics
- Data Quality: Junction 4 has limited data (only a few months available)
- Temporal Coverage: 1.5+ years of comprehensive hourly data
Based on actual evaluation results, the models demonstrate varying capacities:
- XGBoost Excellence: Achieves the highest performance with 95.91% variance explained, demonstrating superior capacity for traffic prediction with ensemble learning techniques
- Tied Excellence: Random Forest & GRU both achieve 95.47% accuracy (MAE: 4.01, RMSE: 5.62), showing excellent capacity with different approaches - ensemble learning vs. deep learning
- LSTM Deep Learning: Strong 91.91% performance demonstrates good capacity for temporal sequence learning and long-term dependency capture
- Prophet Interpretability: 65.64% performance provides moderate capacity but excellent interpretability for business insights and trend analysis
- SARIMA Foundation: Provides classical statistical approach with strong theoretical foundation for seasonal pattern analysis
- Learning-Oriented Approach: This project represents a comprehensive first exploration into time series analysis, implemented with careful research and best practices
- Focus: Junction 1 selected for detailed modeling (most complete dataset)
- Approach: Systematic comparison of statistical vs. machine learning methods
- Validation: Rigorous time series cross-validation to prevent data leakage
- Preprocessing: Comprehensive data transformation and stationarity testing
- Feature Engineering: Temporal feature extraction for enhanced model performance
- Forecasting Horizon: Both short-term and long-term prediction capabilities
Continuous Learning: As this is my first deep dive into time series forecasting, I've focused on implementing established methodologies and comparing multiple approaches. Any feedback on methodology improvements or best practices would be greatly appreciated for future iterations.
- Urban Planning: Data-driven insights for traffic management
- Infrastructure Development: Informed decisions for road expansion
- Public Transportation: Optimize bus/metro schedules based on predicted patterns
- Environmental Impact: Reduce congestion and emissions through better planning
- Model Selection: Guidance for choosing appropriate forecasting methods
- Policy Making: Evidence-based traffic flow optimization strategies
- Model Optimization: Fine-tune hyperparameters for all implemented models
- Methodology Refinement: Incorporate advanced time series best practices and validation techniques
- Weather Integration: Include weather data for better accuracy and external factor analysis
- Real-time Data: Connect to live traffic APIs for dynamic predictions
- Mobile App: Develop mobile application for wider accessibility
- Advanced Analytics: Anomaly detection and alerting system for traffic incidents
- Multi-city Support: Expand analysis to multiple cities and comparative studies
- Ensemble Methods: Combine multiple models for improved prediction accuracy
- Feature Engineering: Advanced temporal features like rolling statistics and lag variables
- Model Interpretability: Advanced SHAP analysis and feature importance visualization
- Peer Review: Seek feedback from time series experts to improve methodology and implementation
We welcome contributions! Please follow these steps:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
- CHOU Kosol - Initial work - @KosolCHOU
- Traffic data providers
- Urban planning community
- Open source machine learning libraries
- Streamlit for the amazing web framework
For questions or collaboration opportunities:
- Email: kosolchou@gmail.com
- LinkedIn: Kosol Chou
- Project Link: https://github.com/KosolCHOU/Traffic-Prediction
Star this repository if you found it helpful!