NYC TLC Demand Forecasting — Product Data Science Case Study

This repository presents a product-oriented demand forecasting system built on NYC TLC FHVHV trip data. The focus is not just modeling accuracy, but decision-ready forecasts, clear evaluation, and product-relevant tradeoffs.

See the PDF readouts under readouts/ for details on EDA, modeling, and data engineering.

For infrastructure setup, see the repo for the Spark-Iceberg-MinIO setup.

1. Product Framing

Product question:

How much ride-hailing demand should we expect by area and time, and how reliable are those forecasts for operational decisions?

This project treats forecasting as a decision support problem, not a pure ML exercise.

Key product considerations:

- Forecasts must be stable, interpretable, and horizon-aware
- Evaluation must reflect how forecasts are consumed
- Models must support multiple planning cadences (daily vs hourly)

2. End-to-End Analytical Flow


Raw Trip Events
        ↓
Validated & Cleaned Trips
        ↓
Time Alignment (Date / Hour)
        ↓
Spatial Hierarchy
  (Zone → Borough → Cluster)
        ↓
Weather Context
  (Rain / Snow / Temperature)
        ↓
Aggregated Demand Signals
        ↓
Decision-Ready Model Inputs
   ├─ Daily Planning (Prophet)
   └─ Intraday Ops (LightGBM)
        ↓
Forecasts + OOS Scorecards
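
The time-alignment and aggregation steps in the flow above can be sketched with pandas. This is a minimal illustration, not the repository's ETL code: the column names (`pickup_datetime`, `pickup_zone_id`, `borough`), the toy data, and the `hourly_demand` helper are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical cleaned trips: one row per trip with a pickup timestamp and zone id.
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2024-01-01 08:05", "2024-01-01 08:40", "2024-01-01 09:10",
        "2024-01-01 08:15", "2024-01-02 08:30",
    ]),
    "pickup_zone_id": [10, 10, 10, 20, 20],
})

# Hypothetical reference table mapping each zone to its borough.
zones = pd.DataFrame({
    "pickup_zone_id": [10, 20],
    "borough": ["Manhattan", "Brooklyn"],
})


def hourly_demand(trips: pd.DataFrame, zones: pd.DataFrame) -> pd.DataFrame:
    """Count trips per borough and pickup hour."""
    # Spatial hierarchy: attach the borough; each zone must map to exactly one row.
    enriched = trips.merge(zones, on="pickup_zone_id", how="left", validate="many_to_one")
    # Time alignment: truncate each pickup to the start of its hour.
    enriched["pickup_hour"] = enriched["pickup_datetime"].dt.floor("h")
    return (
        enriched.groupby(["borough", "pickup_hour"], as_index=False)
        .size()
        .rename(columns={"size": "trips"})
    )


demand = hourly_demand(trips, zones)
print(demand)
```

The same pattern extends to the cluster level by swapping the reference table, and to daily aggregates by flooring to `"D"` instead of `"h"`.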

3. Repository Structure

.
├── src/
│   └── nyc_tlc/
│       ├── etl/
│       │   ├── populate_basedata_base.py
│       │   ├── populate_basedata_rog.py
│       │   └── populate_daily_summary.py
│       │
│       ├── helpers/
│       │   ├── interactive_maps.py
│       │   └── static_maps.py
│       │
│       ├── model_pipeline/
│       │   ├── daily_borough_prophet.py
│       │   ├── daily_cluster_prophet.py
│       │   ├── hourly_cluster_gbm_cv.py
│       │   └── hourly_cluster_gbm_ff.py
│       │
│       └── utils/
│           ├── extract_zone_weather.py
│           ├── loaders.py
│           ├── weather_downloads.py
│           └── weather2.py
│
├── notebooks/
│   ├── exploration/              -- Basic Trends, Spatial Demand Analysis, Trip Metrics, Fare Economics/Surge, Weather
│   ├── models_prototyping/       -- Model building and pipeline prototyping
│   ├── models_evals/             -- Model Evaluation and OOS Testing
│   └── weather_spatial_data/     -- Notebook to download and append weather based on zone centroids
│
├── docs/                -- All .md format files for readouts
│   ├── basic_trends/             -- Basic Demand Trends
│   ├── data_engineering/         -- End-to-end data journey with detailed data flow
│   ├── fares_pricing/            -- Fare Economics and Surge Index
│   ├── modeling/                 -- Prophet and LightGBM models
│   ├── one_pager/                -- EDA One Pager
│   ├── spatial_demand_analysis/  -- Analyze Trip flows across Manhattan
│   ├── trip_metrics/             -- Trip Distance, Trip Duration and Trip Speed Analysis
│   └── weather_effects/          -- Impact of Precipitation and Snowfall
│
│
├── readouts/       -- PDF docs meant for internal readout and reviews
│   ├── data_engg/                -- Data Engineering readout
│   ├── eda/                      -- Exploratory Data Analysis and Results
│   └── modeling/                 -- Readouts for Daily and Hourly Models
│
└── README.md

4. Data Organization (Conceptual)

raw/            → immutable trip events
reference/      → zones, clusters, weather
processed/      → cleaned & enriched trips
model_inputs/   → forecast-ready aggregates
forecasts/      → predictions + evaluation

5. Modeling Strategy (Product-Driven)

Daily Forecasts — Prophet

Used for: capacity planning, staffing, and medium-term trend visibility

Granularity: Borough and Cluster
Strengths:
- Interpretable trends and seasonality
- Stable multi-week forecasts
Tradeoff:
- Lower responsiveness to sudden intra-day shocks

Evaluation:

- Rolling cross-validation
- Horizon-specific error (7 / 14 / 28 days)
- Metrics reported in product-meaningful units (MAPE / WAPE)
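
A rolling-origin evaluation with horizon-specific WAPE can be sketched as follows. To stay dependency-free, the sketch swaps in a seasonal-naive forecaster (repeat the last observed week) where the Prophet model would go; the function names, the `initial` and `step` defaults, and the synthetic series are assumptions for illustration. Prophet also ships its own `cross_validation` helper in `prophet.diagnostics`; whether the repository uses it is not stated here.

```python
import numpy as np
import pandas as pd


def seasonal_naive(history: pd.Series, steps: int, season: int = 7) -> np.ndarray:
    """Stand-in forecaster: cycle through the last `season` observed values."""
    last_cycle = history.to_numpy()[-season:]
    return np.resize(last_cycle, steps)


def rolling_origin_wape(y: pd.Series, horizons=(7, 14, 28), initial=56, step=7) -> pd.Series:
    """WAPE over the first h days after each cutoff, pooled across all cutoffs."""
    max_h = max(horizons)
    abs_err = {h: 0.0 for h in horizons}
    actual = {h: 0.0 for h in horizons}
    # Each cutoff trains on everything before it and scores the next max_h days.
    for cutoff in range(initial, len(y) - max_h + 1, step):
        train = y.iloc[:cutoff]
        test = y.iloc[cutoff:cutoff + max_h].to_numpy()
        pred = seasonal_naive(train, max_h)
        for h in horizons:
            abs_err[h] += np.abs(test[:h] - pred[:h]).sum()
            actual[h] += np.abs(test[:h]).sum()
    return pd.Series({h: abs_err[h] / actual[h] for h in horizons}, name="wape")


# Synthetic daily demand: a fixed weekly pattern plus a mild upward trend.
days = pd.date_range("2024-01-01", periods=140, freq="D")
weekly = np.array([100, 105, 110, 115, 140, 160, 120])
y = pd.Series(np.tile(weekly, 20) + 0.5 * np.arange(140), index=days)

scores = rolling_origin_wape(y)
print(scores)
```

On this synthetic series the error grows with the horizon, because the naive forecaster ignores the trend. Reporting each horizon separately makes that degradation visible instead of averaging it away.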

Hourly Forecasts — LightGBM

Used for: intraday operations and near-term adjustments

Granularity: Cluster × Hour
Signals:
- Lagged demand
- Rolling demand context
- Weather conditions
- Calendar effects
Strengths:
- High short-term accuracy
- Better reaction to transient demand changes
Tradeoff:
- Less interpretable than additive time-series models
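
The lagged, rolling, and calendar signals listed above can be built per cluster roughly as shown below. This is a sketch under assumptions: the column names (`cluster_id`, `pickup_hour`, `trips`), the specific lags, and the `add_demand_features` helper are hypothetical, and the code assumes a complete hourly grid per cluster (row-based shifts would misalign if hours were missing).

```python
import pandas as pd


def add_demand_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add lag, rolling, and calendar features per cluster without look-ahead."""
    out = df.sort_values(["cluster_id", "pickup_hour"]).copy()
    by_cluster = out.groupby("cluster_id")["trips"]
    # Lagged demand: previous hour and same hour on the previous day.
    out["lag_1h"] = by_cluster.shift(1)
    out["lag_24h"] = by_cluster.shift(24)
    # Shift before rolling so the window covers only hours strictly before the target hour.
    out["roll_mean_24h"] = by_cluster.transform(lambda s: s.shift(1).rolling(24).mean())
    # Calendar effects derived from the timestamp itself.
    out["hour_of_day"] = out["pickup_hour"].dt.hour
    out["day_of_week"] = out["pickup_hour"].dt.dayofweek
    return out


# Toy input: two clusters with 48 consecutive hours each.
hours = pd.date_range("2024-01-01", periods=48, freq="h")
demo = pd.concat(
    [
        pd.DataFrame({"cluster_id": 1, "pickup_hour": hours, "trips": list(range(48))}),
        pd.DataFrame({"cluster_id": 2, "pickup_hour": hours, "trips": list(range(100, 148))}),
    ],
    ignore_index=True,
)
features = add_demand_features(demo)
print(features.loc[[24, 72], ["cluster_id", "lag_1h", "lag_24h", "roll_mean_24h"]])
```

Grouping by cluster before shifting keeps one cluster's history from leaking into another's features, and the early rows of each cluster stay `NaN` until enough history exists.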

Evaluation:

- Short horizons (1–24h, 1–48h)
- Trip-weighted errors to reflect real impact

6. Evaluation Philosophy (Product-First)

This project intentionally avoids “single aggregate accuracy.”

Instead:

- Metrics are computed at fixed, decision-relevant horizons
- Errors are weighted by trip volume
- Out-of-sample periods are clearly separated and reported

This mirrors how forecasts are actually reviewed in product, ops, and planning forums.
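
To illustrate why volume weighting matters, the sketch below compares unweighted MAPE with WAPE on two hypothetical segments, one busy and one quiet. The numbers and the helper names are made up for the example, and `mape` assumes all actuals are non-zero.

```python
import numpy as np


def mape(actual, forecast) -> float:
    """Mean absolute percentage error: every row counts equally."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.mean(np.abs(actual - forecast) / actual))


def wape(actual, forecast) -> float:
    """Weighted absolute percentage error: total absolute error over total volume."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.abs(actual - forecast).sum() / actual.sum())


# Hypothetical segments: a busy cluster-hour and a quiet one.
actual = [1000, 10]
forecast = [950, 20]

print(f"MAPE: {mape(actual, forecast):.3f}")  # 0.525, dominated by the quiet segment
print(f"WAPE: {wape(actual, forecast):.3f}")  # 0.059, reflects where the trips are
```

The quiet segment misses by 100% but accounts for only 10 trips, so it inflates MAPE while barely moving WAPE. WAPE is equivalent to averaging the per-row percentage errors with trip volume as the weight.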

7. Key Product Insights Enabled

- Where demand is predictable vs inherently volatile
- How weather systematically shifts demand distribution
- When daily forecasts are sufficient vs when hourly models add value
- Tradeoffs between forecast stability and responsiveness

8. What This Demonstrates as a Product Data Scientist

This repository showcases:

- Translating ambiguous product questions into measurable models
- Designing features that reflect real user and marketplace behavior
- Evaluating models the way decisions are made, not the way libraries default
- Communicating model limitations and tradeoffs clearly

9. Status

✅ EDA complete with product-relevant hypotheses

✅ Feature engineering finalized

✅ Daily (Prophet) and Hourly (LightGBM) models finalized

✅ Out-of-sample scorecards complete

10. Next Enhancements (Explicitly Product-Scoped)

- Pricing and surge sensitivity modeling
- ETA prediction as a downstream consumer metric
- Scenario simulations (weather, holidays, shocks)

License

This project is licensed under the MIT License.
