AWS E-commerce Batch Data Engineering Pipeline

Overview

This project is an end-to-end batch data pipeline on AWS that follows a Medallion Architecture (Raw → Silver → Gold).
The pipeline ingests raw e-commerce CSV data into Amazon S3, transforms it into Parquet with Spark on AWS Glue, and exposes analytics-ready tables through Amazon Athena.

Architecture

Raw data → S3 → Glue (Spark ETL) → Parquet (Silver) → Athena CTAS → Gold tables
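
A minimal sketch of the Glue (Spark ETL) step in the flow above. It assumes the cleaning consists of normalizing column names and dropping exact duplicate rows; the real job may do more or something different. The function names, the cleaning rules, and the paths are illustrative assumptions, not taken from the project.

```python
import re


def normalize_column(name):
    """Turn a CSV header such as 'Order Status' into 'order_status'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()


def raw_to_silver(spark, raw_path, silver_path):
    """Read raw CSV files, apply basic cleaning, and write Parquet.

    `spark` is an existing SparkSession (a Glue job provides one).
    The cleaning rules below are assumptions for illustration.
    """
    df = spark.read.csv(raw_path, header=True, inferSchema=True)
    # Rename every column to a consistent snake_case form.
    df = df.toDF(*[normalize_column(c) for c in df.columns])
    # Drop rows that are exact duplicates of another row.
    df = df.dropDuplicates()
    df.write.mode("overwrite").parquet(silver_path)
    return df
```

Inside a Glue job this could be called as `raw_to_silver(spark, "s3://<bucket>/raw/orders/", "s3://<bucket>/silver/orders/")`, where the bucket and dataset names are placeholders.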

Tech Stack

Amazon S3 (Data Lake)
AWS Glue (Spark ETL, Crawlers)
Amazon Athena (SQL Analytics)
Apache Spark (PySpark)
SQL

Data Lake Structure

raw/ → source CSV files
silver/ → cleaned Parquet data
gold/ → analytics-ready tables
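
One way to keep the three prefixes consistent across jobs is a small path helper. This is a sketch only: the bucket name and the per-dataset subfolder are assumptions, since the layout inside each layer is not specified here.

```python
LAYERS = ("raw", "silver", "gold")


def layer_path(bucket, layer, dataset):
    """Return the S3 URI for a dataset inside one layer of the data lake."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    return f"s3://{bucket}/{layer}/{dataset}/"


# 'example-bucket' and 'orders' are hypothetical names.
print(layer_path("example-bucket", "silver", "orders"))
```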

Gold Tables

dim_customers
dim_products
fact_orders
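
The Gold tables are produced with Athena CTAS. The sketch below builds such a statement as a string so it can be reviewed before being submitted to Athena. The source table `ecommerce_silver_db.orders`, its columns, and the bucket name are hypothetical; only `ecommerce_gold_db.fact_orders` comes from this project.

```python
def build_ctas(database, table, select_sql, location):
    """Build an Athena CTAS statement that writes the result as Parquet."""
    return (
        f"CREATE TABLE {database}.{table}\n"
        f"WITH (format = 'PARQUET', external_location = '{location}')\n"
        f"AS {select_sql}"
    )


statement = build_ctas(
    database="ecommerce_gold_db",
    table="fact_orders",
    # Hypothetical source table and columns, for illustration only.
    select_sql=(
        "SELECT order_id, customer_id, order_status "
        "FROM ecommerce_silver_db.orders"
    ),
    location="s3://example-bucket/gold/fact_orders/",
)
print(statement)
```

Note that Athena expects the `external_location` prefix to be empty when the CTAS query runs.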

Sample Query

SELECT order_status, COUNT(*) AS order_count
FROM ecommerce_gold_db.fact_orders
GROUP BY order_status;
