This repository presents a product-oriented demand forecasting system built on NYC TLC FHVHV trip data. The focus is not just modeling accuracy, but decision-ready forecasts, clear evaluation, and product-relevant tradeoffs.
For details on EDA, modeling, and data engineering, please refer to the readouts.
For infrastructure setup, please review the repo for the Spark-Iceberg-MinIO setup.
Product question:
How much ride-hailing demand should we expect by area and time, and how reliable are those forecasts for operational decisions?
This project treats forecasting as a decision support problem, not a pure ML exercise.
Key product considerations:
- Forecasts must be stable, interpretable, and horizon-aware
- Evaluation must reflect how forecasts are consumed
- Models must support multiple planning cadences (daily vs hourly)
Raw Trip Events
↓
Validated & Cleaned Trips
↓
Time Alignment (Date / Hour)
↓
Spatial Hierarchy
(Zone → Borough → Cluster)
↓
Weather Context
(Rain / Snow / Temperature)
↓
Aggregated Demand Signals
↓
Decision-Ready Model Inputs
├─ Daily Planning (Prophet)
└─ Intraday Ops (LightGBM)
↓
Forecasts + OOS Scorecards
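As a rough illustration of the validation, time alignment, and spatial roll-up steps in the flow above, the sketch below counts trips per cluster, date, and hour. It is plain Python for readability and is not the repository's ETL code. The `ZONE_TO_CLUSTER` lookup, the cluster names, and the `hourly_cluster_demand` function are hypothetical names introduced for this example.

```python
from collections import Counter
from datetime import datetime

# Hypothetical zone -> cluster lookup (assumption for illustration only);
# the real mapping would come from the reference data.
ZONE_TO_CLUSTER = {161: "midtown", 162: "midtown", 132: "jfk"}


def hourly_cluster_demand(trips):
    """Count trips per (cluster, date, hour) from (pickup_ts, zone_id) events."""
    counts = Counter()
    for pickup_ts, zone_id in trips:
        cluster = ZONE_TO_CLUSTER.get(zone_id)
        if cluster is None:
            # Validation step: skip trips whose zone is not in the lookup.
            continue
        # Time alignment: truncate the timestamp to calendar date and hour.
        key = (cluster, pickup_ts.date().isoformat(), pickup_ts.hour)
        counts[key] += 1
    return counts


trips = [
    (datetime(2024, 1, 5, 8, 15), 161),
    (datetime(2024, 1, 5, 8, 40), 162),
    (datetime(2024, 1, 5, 9, 5), 132),
    (datetime(2024, 1, 5, 9, 30), 999),  # unknown zone, dropped
]
demand = hourly_cluster_demand(trips)
```

Summing the hourly counts over a day would give the daily signal used for planning, while the hourly counts feed the intraday model.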
.
├── src/
│ └── nyc_tlc/
│ ├── etl/
│ │ ├── populate_basedata_base.py
│ │ ├── populate_basedata_rog.py
│ │ ├── populate_daily_summary.py
│ │
│ │
│ ├── helpers/
│ │ ├── interactive_maps.py
│ │ └── static_maps.py
│ │
│ ├── model_pipeline/
│ │ ├── daily_borough_prophet.py
│ │ ├── daily_cluster_prophet.py
│ │ ├── hourly_cluster_gbm_cv.py
│ │ ├── hourly_cluster_gbm_ff.py
│ │
│ └── utils/
│ ├── extract_zone_weather.py
│ ├── loaders.py
│ ├── weather_downloads.py
│ └── weather2.py
│
├── notebooks/
│ ├── exploration/ -- Basic Trends, Spatial Demand Analysis, Trip Metrics, Fare Economics/Surge, Weather
│ ├── models_prototyping/ -- Model building and pipeline prototyping
│ ├── models_evals/ -- Model Evaluation and OOS Testing
│ └── weather_spatial_data/ -- Notebook to download and append weather based on zone centroids
│
│
├── docs/ -- All .md format files for readouts
│ ├── basic_trends/ -- Basic Demand Trends
│ ├── data_engineering/ -- End-to-end data journey with detailed data flow
│ ├── fares_pricing/ -- Fare Economics and Surge Index
│ ├── modeling/ -- Prophet and LightGBM models
│ ├── one_pager/ -- EDA One Pager
│ ├── spatial_demand_analysis/ -- Analyze Trip flows across Manhattan
│ ├── trip_metrics/ -- Trip Distance, Trip Duration and Trip Speed Analysis
│ └── weather_effects/ -- Impact of Precipitation and Snowfall
│
│
├── readouts/ -- PDF docs meant for internal readout and reviews
│ ├── data_engg/ -- Data Engineering readout
│ ├── eda/ -- Exploratory Data Analysis and Results
│ └── modeling/ -- Readouts for Daily and Hourly Models
│
│
└── README.md
raw/ → immutable trip events
reference/ → zones, clusters, weather
processed/ → cleaned & enriched trips
model_inputs/ → forecast-ready aggregates
forecasts/ → predictions + evaluation
Daily Forecasts — Prophet
- Used for: capacity planning, staffing, and medium-term trend visibility
- Granularity: Borough and Cluster
- Strengths:
  - Interpretable trends and seasonality
  - Stable multi-week forecasts
- Tradeoff:
  - Lower responsiveness to sudden intra-day shocks
Evaluation:
- Rolling cross-validation
- Horizon-specific error (7 / 14 / 28 days)
- Metrics reported in product-meaningful units (MAPE / WAPE)
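A minimal sketch of how rolling (rolling-origin) cross-validation folds could be laid out for the daily models. The function name `rolling_origin_folds` and the parameter values (90 days of initial training data, a 28-day step, a 28-day horizon) are assumptions for illustration, not settings taken from this project.

```python
from datetime import date, timedelta


def rolling_origin_folds(first_day, last_day, initial_days, step_days, horizon_days):
    """Yield (train_end, test_start, test_end) for each rolling-origin fold.

    Training always ends before testing starts, so every fold is out of sample.
    """
    train_end = first_day + timedelta(days=initial_days - 1)
    while train_end + timedelta(days=horizon_days) <= last_day:
        test_start = train_end + timedelta(days=1)
        test_end = train_end + timedelta(days=horizon_days)
        yield train_end, test_start, test_end
        # Move the forecast origin forward and repeat.
        train_end += timedelta(days=step_days)


folds = list(
    rolling_origin_folds(
        first_day=date(2024, 1, 1),
        last_day=date(2024, 6, 30),
        initial_days=90,   # assumed initial training window
        step_days=28,      # assumed spacing between forecast origins
        horizon_days=28,   # longest horizon reported above
    )
)
for train_end, test_start, test_end in folds:
    print(f"train <= {train_end} | test {test_start} .. {test_end}")
```

Within each fold, errors can then be sliced at 7, 14, and 28 days ahead. Prophet also ships a `cross_validation` helper in `prophet.diagnostics` that takes `initial`, `period`, and `horizon` arguments; whether the pipelines here rely on it is not stated in this README.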
Hourly Forecasts — LightGBM
- Used for: intraday operations and near-term adjustments
- Granularity: Cluster × Hour
- Signals:
  - Lagged demand
  - Rolling demand context
  - Weather conditions
  - Calendar effects
- Strengths:
  - High short-term accuracy
  - Better reaction to transient demand changes
- Tradeoff:
  - Less interpretable than additive time-series models
Evaluation:
- Short horizons (1–24h, 1–48h)
- Trip-weighted errors to reflect real impact
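The lagged and rolling demand signals listed above could be built along these lines. This is a dependency-free sketch, not the project's feature code: the lag set (1, 24, and 168 hours), the 24-hour rolling window, and the function name `lag_features` are assumptions chosen for illustration.

```python
def lag_features(series, lags=(1, 24, 168), window=24):
    """Build features for each hour t using only values strictly before t."""
    rows = []
    start = max(max(lags), window)  # first index with full history available
    for t in range(start, len(series)):
        row = {f"lag_{k}": series[t - k] for k in lags}
        past = series[t - window:t]  # slice excludes series[t], so no leakage
        row[f"roll_mean_{window}"] = sum(past) / window
        row["target"] = series[t]
        rows.append(row)
    return rows


# Synthetic demand for one cluster: 8 days with a repeating daily cycle.
hourly = [float(h % 24) for h in range(24 * 8)]
rows = lag_features(hourly)
print(rows[0])
```

One design point to keep in mind: a feature such as `lag_1` is only known one hour ahead, so forecasts at longer horizons (for example 1–48h) need lags at least as long as the horizon, or a recursive scheme that feeds predictions back in as inputs.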
This project intentionally avoids reporting a single aggregate accuracy number.
Instead:
- Metrics are computed at fixed, decision-relevant horizons
- Errors are weighted by trip volume
- Out-of-sample periods are clearly separated and reported
This mirrors how forecasts are actually reviewed in product, ops, and planning forums.
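The evaluation rules above can be sketched as follows. WAPE (sum of absolute errors divided by sum of actuals) is equivalent to a MAPE in which each observation is weighted by its trip volume, so low-volume cells cannot dominate the score. The record layout `(days_ahead, actual, forecast)` and the cumulative reading of each horizon (days 1 through h) are assumptions made for this example.

```python
def wape(actual, forecast):
    """Weighted absolute percentage error: sum|a - f| / sum|a|."""
    total_error = sum(abs(a - f) for a, f in zip(actual, forecast))
    return total_error / sum(abs(a) for a in actual)


def mape(actual, forecast):
    """Mean absolute percentage error; undefined if any actual is zero."""
    return sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)


def wape_by_horizon(records, horizons=(7, 14, 28)):
    """Compute WAPE over out-of-sample days 1..h for each horizon h.

    records: iterable of (days_ahead, actual, forecast) tuples.
    """
    scores = {}
    for h in horizons:
        subset = [(a, f) for d, a, f in records if d <= h]
        scores[h] = wape([a for a, _ in subset], [f for _, f in subset])
    return scores


# A busy cell and a quiet cell: the quiet one inflates MAPE but barely moves WAPE.
actual, forecast = [1000, 10], [900, 20]
print(f"MAPE={mape(actual, forecast):.3f}  WAPE={wape(actual, forecast):.3f}")

records = [(1, 100, 90), (7, 100, 110), (14, 200, 180), (28, 100, 150)]
scores = wape_by_horizon(records)
print(scores)
```

In the two-cell example, a 10-trip miss on the quiet cell counts as a 100% error under MAPE, while WAPE scores it in proportion to the trips involved.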
- Where demand is predictable vs inherently volatile
- How weather systematically shifts demand distribution
- When daily forecasts are sufficient vs when hourly models add value
- Tradeoffs between forecast stability and responsiveness
This repository showcases:
- Translating ambiguous product questions into measurable models
- Designing features that reflect real user and marketplace behavior
- Evaluating models the way decisions are made, not the way libraries default
- Communicating model limitations and tradeoffs clearly
✅ EDA complete with product-relevant hypotheses
✅ Feature engineering finalized
✅ Daily (Prophet) and Hourly (LightGBM) models finalized
✅ Out-of-sample scorecards complete
- Pricing and surge sensitivity modeling
- ETA prediction as a downstream consumer metric
- Scenario simulations (weather, holidays, shocks)
This project is licensed under the MIT License.