NYC Taxi Data Engineering and SQL Analytics using PostgreSQL

Author: Meherab Hossain Shafin
Department: Software Engineering
Institution: Daffodil International University
Project Type: Large-Scale Data Engineering & SQL Analytics
Dataset Size: 40,435,831 records (~3.9 GB)

Project Overview

This project implements a scalable data engineering pipeline to process and analyze the New York City Yellow Taxi dataset (2024).

The system ingests raw transportation data, transforms it into a relational database format, and performs optimized analytical queries using PostgreSQL.

The objective is to demonstrate three professional competencies:

Area	Demonstration
Data Engineering	Large-scale dataset ingestion pipeline
Database Engineering	Schema design and indexing optimization
Data Analytics	SQL-based insight extraction from millions of records

The final system processes over 40 million taxi trip records and executes analytical queries efficiently using indexed relational structures.

Dataset Source

Dataset: NYC Taxi & Limousine Commission (TLC) Trip Record Data

Official Source
https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

The dataset contains detailed trip-level taxi data including pickup time, dropoff time, passenger counts, travel distance, pickup locations, and fare information.

Dataset Attributes

Column	Description
tpep_pickup_datetime	Trip start timestamp
tpep_dropoff_datetime	Trip end timestamp
passenger_count	Number of passengers
trip_distance	Distance traveled
pulocationid	Pickup taxi zone
dolocationid	Dropoff taxi zone
fare_amount	Base fare
total_amount	Total fare including fees

Dataset Size

Stage	Description	Size
Raw Parquet files	Monthly TLC datasets	~60 MB each
Combined dataset	All 2024 taxi trips	~40 million rows
Cleaned CSV dataset	Consolidated dataset for ETL	~3.9 GB

Total rows loaded into PostgreSQL:

40,435,831 taxi trips

Python Dependencies

The project requires the following Python libraries.

Library	Purpose
pandas	Data cleaning and transformation
pyarrow	Reading Parquet files
psycopg2	PostgreSQL database connectivity

Install dependencies using:

pip install -r requirements.txt
# System Architecture

The project follows a structured **data engineering pipeline**.

Parquet Dataset
│
Python Processing Scripts
│
Clean CSV Dataset (~3.9GB)
│
PostgreSQL ETL Pipeline
│
Analytics Table
│
SQL Analytical Queries

Each stage performs a specific transformation that ensures efficient storage and query performance.

Repository Structure

nyc-taxi-data-engineering-postgresql
│
├── scripts
│   ├── phase1_parquet_to_clean_csv.py
│   └── phase2_postgres_etl.py
│
├── reports
│   └── phase1_processing_summary.json
│
├── docs
│   └── taxi_data_engineering_report.pdf
│
├── sql
│   └── analytical_queries.sql
│
└── README.md

Software Stack

Tool	Purpose
Python	Data processing pipeline
Pandas	Data transformation
PyArrow	Parquet file reading
PostgreSQL	Relational data storage
pgAdmin	Database management
SQL	Analytical query execution

Data Processing Pipeline

The project consists of two main processing phases.

Phase 1 — Parquet to CSV Processing

Script:

phase1_parquet_to_clean_csv.py

Purpose:

Load monthly Parquet files
Validate schema consistency
Clean invalid rows
Normalize timestamp fields
Merge all months into a single dataset
Export a consolidated CSV dataset

Output file:

yellow_taxi_2024_cleaned.csv

Size:

~3.9 GB

The processing script uses chunk-based data handling to prevent memory overflow when processing multi-gigabyte datasets.

Phase 2 — PostgreSQL ETL Pipeline

Script:

phase2_postgres_etl.py

The ETL workflow performs:

Database connection initialization
Creation of staging tables
Bulk loading using PostgreSQL COPY
Data type conversion
Transfer into analytics schema

This staged ingestion strategy improves reliability and ensures data integrity.

Database Architecture

The PostgreSQL database is organized into two schema layers.

staging.yellow_taxi_trips_raw
│
│ transformation
▼
analytics.yellow_taxi_trips_2024

Staging Layer

Purpose:

Raw data ingestion
Fast bulk loading
Minimal transformation

Analytics Layer

Purpose:

Cleaned dataset
Typed relational structure
Optimized analytical queries

Query Optimization

To improve performance on large analytical queries, several indexes were created.

Time Index

CREATE INDEX idx_pickup_datetime
ON analytics.yellow_taxi_trips_2024 (tpep_pickup_datetime);

Purpose: accelerate time-range queries.

Location Index

CREATE INDEX idx_pulocationid
ON analytics.yellow_taxi_trips_2024 (pulocationid);

Purpose: optimize spatial grouping queries.

Composite Index

CREATE INDEX idx_pickup_location
ON analytics.yellow_taxi_trips_2024 (tpep_pickup_datetime, pulocationid);

Purpose: accelerate queries filtering by both location and time.

Analytical Queries

The dataset was analyzed using SQL queries executed directly on the PostgreSQL database.

These queries reveal patterns in taxi demand, passenger behavior, and revenue distribution.

Key analytical topics include:

Taxi demand by pickup zone
Monthly trip volume trends
Average trip distance patterns
Passenger count distribution
Revenue concentration across zones

Optimization over Raw Output

A simple revenue calculation could be performed using spreadsheets or basic scripts.
However, this project focuses on demonstrating scalable analytical engineering practices.

Scalability

The system processes over 40 million records efficiently within PostgreSQL.

Indexing

Instead of scanning the entire dataset sequentially, PostgreSQL uses B-Tree indexes to retrieve relevant records quickly.

Performance Improvement

Query Type	Before Index	After Index
Time filtered query	~1.2 seconds	~0.05 seconds

This demonstrates how proper indexing dramatically improves query performance.

Analytical Insights

Key findings from the analysis include:

Taxi demand is concentrated in airport and commercial zones
JFK Airport (Zone 132) is the most active pickup zone
Taxi demand increases during spring and autumn months
Single-passenger trips dominate taxi usage
Airport zones generate a large share of total taxi revenue

These insights illustrate how transportation datasets can support urban mobility analysis.

Engineering Competency Demonstrated

Tier	Output	What it Demonstrates
Database Engineering	SQL schema and indexing	Ability to design optimized relational databases
Data Analytics	Temporal and spatial trends	Ability to extract insights from large datasets
Communication	Performance benchmarking	Ability to document technical systems clearly

Lessons Learned

This project provided several important engineering insights:

Large datasets require careful memory-efficient ingestion
Indexing strategy significantly impacts query performance
Real-world datasets often contain data inconsistencies
PostgreSQL can efficiently support large analytical workloads
Materialized summaries improve repeated analytical queries

Conclusion

This project demonstrates a scalable data engineering workflow for processing and analyzing large transportation datasets.

By combining Python data processing, PostgreSQL relational storage, and optimized SQL queries, the system efficiently processes more than 40 million taxi trip records.

The resulting architecture supports high-performance analytical queries and highlights the importance of schema design, indexing strategies, and structured ETL pipelines in real-world data engineering systems.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

NYC Taxi Data Engineering and SQL Analytics using PostgreSQL

Project Overview

Dataset Source

Dataset Attributes

Dataset Size

Python Dependencies

Repository Structure

Software Stack

Data Processing Pipeline

Phase 1 — Parquet to CSV Processing

Phase 2 — PostgreSQL ETL Pipeline

Database Architecture

Staging Layer

Analytics Layer

Query Optimization

Time Index

Location Index

Composite Index

Analytical Queries

Optimization over Raw Output

Scalability

Indexing

Performance Improvement

Analytical Insights

Engineering Competency Demonstrated

Lessons Learned

Conclusion

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
docs		docs
reports		reports
scripts		scripts
sql		sql
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

NYC Taxi Data Engineering and SQL Analytics using PostgreSQL

Project Overview

Dataset Source

Dataset Attributes

Dataset Size

Python Dependencies

Repository Structure

Software Stack

Data Processing Pipeline

Phase 1 — Parquet to CSV Processing

Phase 2 — PostgreSQL ETL Pipeline

Database Architecture

Staging Layer

Analytics Layer

Query Optimization

Time Index

Location Index

Composite Index

Analytical Queries

Optimization over Raw Output

Scalability

Indexing

Performance Improvement

Analytical Insights

Engineering Competency Demonstrated

Lessons Learned

Conclusion

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages