This project is an end-to-end batch data pipeline on AWS built on a Medallion Architecture (Raw → Silver → Gold).
The pipeline ingests raw e-commerce data, processes it using Spark on AWS Glue, and exposes analytics-ready tables using Amazon Athena.
Raw data → S3 → Glue (Spark ETL) → Parquet (Silver) → Athena CTAS → Gold tables
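The Raw → Silver step in the flow above can be sketched in PySpark roughly as follows. This is a minimal sketch under stated assumptions, not the project's actual Glue job: the bucket layout (`raw/`, `silver/` prefixes), the CSV input format, and the helper names `layer_path` and `raw_to_silver` are all hypothetical and chosen only for illustration.

```python
def layer_path(bucket: str, layer: str, table: str) -> str:
    """Build an S3 URI such as s3://<bucket>/<layer>/<table>/.

    The <layer>/<table>/ layout is an assumption, not the project's confirmed layout.
    """
    return f"s3://{bucket}/{layer}/{table}/"


def raw_to_silver(spark, bucket: str, table: str, key_column: str) -> None:
    """Read raw CSV files, clean them, and write Parquet to the Silver layer.

    `spark` is an existing SparkSession (a Glue job provides one).
    """
    raw_df = (
        spark.read
        .option("header", True)        # assumes raw files are CSV with a header row
        .option("inferSchema", True)   # let Spark guess column types
        .csv(layer_path(bucket, "raw", table))
    )
    silver_df = (
        raw_df
        .dropna(subset=[key_column])       # discard rows with a missing key
        .dropDuplicates([key_column])      # keep one row per key
    )
    # Overwrite the Silver copy of this table with columnar Parquet files.
    silver_df.write.mode("overwrite").parquet(layer_path(bucket, "silver", table))
```

Inside a job, a call might look like `raw_to_silver(spark, "my-ecommerce-lake", "orders", "order_id")`, where the bucket and column names are again placeholders. Writing Parquet at this stage keeps the Silver layer columnar, which is what lets Athena scan only the columns a query needs.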
- Amazon S3 (Data Lake)
- AWS Glue (Spark ETL, Crawlers)
- Amazon Athena (SQL Analytics)
- Apache Spark (PySpark)
- SQL
- dim_customers
- dim_products
- fact_orders
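Gold tables like the ones listed above could be materialized with an Athena CTAS statement. The sketch below only builds the SQL text; it does not submit it to Athena. The helper name `build_ctas`, the bucket name, the `gold/` prefix, the `ecommerce_silver_db.orders` source, and the `order_id` and `customer_id` columns are assumptions for illustration; only `ecommerce_gold_db`, `fact_orders`, and `order_status` come from this document.

```python
from textwrap import dedent


def build_ctas(table: str, select_sql: str, bucket: str,
               database: str = "ecommerce_gold_db") -> str:
    """Return an Athena CTAS statement that writes a Gold table as Parquet."""
    # Assumed layout: one S3 prefix per Gold table.
    location = f"s3://{bucket}/gold/{table}/"
    return (
        f"CREATE TABLE {database}.{table}\n"
        f"WITH (format = 'PARQUET', external_location = '{location}')\n"
        f"AS\n{dedent(select_sql).strip()}"
    )


# Hypothetical SELECT: the Silver database, table, and column names are assumptions.
fact_orders_sql = build_ctas(
    table="fact_orders",
    select_sql="""
        SELECT order_id, customer_id, order_status
        FROM ecommerce_silver_db.orders
    """,
    bucket="my-ecommerce-lake",
)
print(fact_orders_sql)
```

The generated statement can be pasted into the Athena console. Note that Athena expects the `external_location` prefix to be empty when the CTAS query runs, so re-creating a table usually means dropping it and clearing that prefix first.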
SELECT order_status, COUNT(*) AS order_count
FROM ecommerce_gold_db.fact_orders
GROUP BY order_status;