guduri-data/aws-ecommerce-data-pipeline

AWS E-commerce Batch Data Engineering Pipeline

Overview

This project is an end-to-end batch data pipeline on AWS built on a Medallion Architecture (Raw → Silver → Gold).
The pipeline ingests raw e-commerce data, processes it with Spark on AWS Glue, and exposes analytics-ready tables through Amazon Athena.


Architecture

Architecture Diagram

Raw data → S3 → Glue (Spark ETL) → Parquet (Silver) → Athena CTAS → Gold tables
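The last hop in the flow above (Silver → Gold via Athena CTAS) can be sketched as a small helper that assembles the statement. This is a minimal illustration, not the repository's actual code: the function name `build_ctas`, the silver database `ecommerce_silver_db`, the column list, and the bucket `example-bucket` are all assumptions. Only `ecommerce_gold_db.fact_orders` comes from this README.

```python
def build_ctas(gold_db: str, table: str, select_sql: str, location: str) -> str:
    """Return an Athena CTAS statement that writes Parquet files to `location`."""
    # Drop surrounding whitespace and a trailing semicolon from the SELECT.
    select_body = select_sql.strip().rstrip(";")
    return (
        f"CREATE TABLE {gold_db}.{table}\n"
        "WITH (\n"
        "    format = 'PARQUET',\n"
        f"    external_location = '{location}'\n"
        ")\n"
        "AS\n"
        f"{select_body}"
    )


# Hypothetical inputs: the silver database, columns, and bucket are assumed.
statement = build_ctas(
    gold_db="ecommerce_gold_db",
    table="fact_orders",
    select_sql=(
        "SELECT order_id, customer_id, order_status "
        "FROM ecommerce_silver_db.orders;"
    ),
    location="s3://example-bucket/gold/fact_orders/",
)
print(statement)
```

The printed statement could then be pasted into the Athena console or submitted programmatically. Athena expects the `external_location` of a CTAS query to be empty before the query runs.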


Tech Stack

  • Amazon S3 (Data Lake)
  • AWS Glue (Spark ETL, Crawlers)
  • Amazon Athena (SQL Analytics)
  • Apache Spark (PySpark)
  • SQL

Data Lake Structure

raw/ → source CSV files
silver/ → cleaned Parquet data
gold/ → analytics-ready tables
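One way to keep jobs consistent with this layout is to derive every S3 location from a single helper. The sketch below assumes a one-bucket layout with a per-dataset subfolder under each layer. The bucket name `example-bucket`, the dataset name `orders`, and the helper `layer_uri` are hypothetical and not taken from the repository.

```python
LAYERS = ("raw", "silver", "gold")


def layer_uri(bucket: str, layer: str, dataset: str) -> str:
    """Build the S3 URI for a dataset inside one of the data lake layers."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer {layer!r}; expected one of {LAYERS}")
    return f"s3://{bucket}/{layer}/{dataset}/"


# Hypothetical bucket and dataset names, used only for illustration.
for layer in LAYERS:
    print(layer_uri("example-bucket", layer, "orders"))
```

Rejecting unknown layer names early helps catch typos such as `sliver/` before a job writes to the wrong prefix.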

Gold Tables

  • dim_customers
  • dim_products
  • fact_orders

Sample Query

SELECT order_status, COUNT(*) AS order_count
FROM ecommerce_gold_db.fact_orders
GROUP BY order_status;
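To see the shape of the result without an AWS account, the same aggregation can be tried locally. The sketch below uses SQLite from the Python standard library as a stand-in for Athena, so it only illustrates the query logic. The two-column table and its three rows are made-up sample data, and an `ORDER BY` is added purely to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Attach a second in-memory database so the qualified table name resolves.
conn.execute("ATTACH DATABASE ':memory:' AS ecommerce_gold_db")
conn.execute(
    "CREATE TABLE ecommerce_gold_db.fact_orders (order_id TEXT, order_status TEXT)"
)
# Made-up sample rows; the real fact_orders schema is not shown in this README.
conn.executemany(
    "INSERT INTO ecommerce_gold_db.fact_orders VALUES (?, ?)",
    [("o1", "delivered"), ("o2", "delivered"), ("o3", "shipped")],
)

rows = conn.execute(
    "SELECT order_status, COUNT(*) "
    "FROM ecommerce_gold_db.fact_orders "
    "GROUP BY order_status "
    "ORDER BY order_status"
).fetchall()
print(rows)  # [('delivered', 2), ('shipped', 1)]
conn.close()
```

Each returned row pairs an order status with the number of orders in that status, which is the same shape Athena returns for the query above.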
