This project automates and monitors a data warehouse ETL pipeline for a music streaming startup, Sparkify, using Apache Airflow. The pipeline extracts JSON logs and song metadata from AWS S3, stages them in Amazon Redshift, and transforms them into a star schema for analytical purposes.
The pipeline follows a specific dependency flow to ensure data integrity (wired up in the sketch after this list):
- Start: `Begin_execution` dummy task.
- Staging: `Stage_events` and `Stage_songs` run in parallel to load S3 data into Redshift.
- Fact Loading: `Load_songplays_fact_table` processes the staged data.
- Dimension Loading: Four tasks (`Load_user_dim_table`, etc.) run in parallel to populate the star schema.
- Quality Control: `Run_data_quality_checks` validates the final datasets.
- End: `Stop_execution` task.
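In `dags/final_project.py`, these dependencies might be wired together roughly as in the following minimal sketch; the placeholder `EmptyOperator` tasks and variable names are assumptions standing in for the project's real operators, not its exact code:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator  # Airflow 2.4+

# Placeholder tasks that stand in for the real operators, purely to
# illustrate the dependency wiring described above.
with DAG("final_project_flow_sketch", start_date=datetime(2024, 1, 1),
         schedule=None) as dag:
    begin = EmptyOperator(task_id="Begin_execution")
    stage_events = EmptyOperator(task_id="Stage_events")
    stage_songs = EmptyOperator(task_id="Stage_songs")
    load_fact = EmptyOperator(task_id="Load_songplays_fact_table")
    dims = [EmptyOperator(task_id=f"Load_{name}_dim_table")
            for name in ("user", "song", "artist", "time")]
    quality = EmptyOperator(task_id="Run_data_quality_checks")
    stop = EmptyOperator(task_id="Stop_execution")

    # Staging runs in parallel, then the fact load, then the four
    # dimension loads in parallel, then quality checks, then stop.
    begin >> [stage_events, stage_songs] >> load_fact >> dims
    dims >> quality >> stop
```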
- Custom Stage Operator: Uses a `COPY` statement to load JSON files from S3, with support for templated fields so backfills can target partitions derived from the execution time.
- Flexible Dimension Loading: Implements a "truncate-insert" pattern with a parameter to toggle between append-only and delete-load behavior.
- Automated Quality Checks: A parameterized operator that runs SQL test cases, compares results against expected values, and raises an exception on failure.
- Robust Configuration: The DAG is configured with 3 retries, 5-minute retry delays, and catchup disabled so missed intervals are not automatically backfilled.

Illustrative sketches of each of these features (with assumed names) follow.
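A plausible shape for the stage operator; the constructor parameters, the `IAM_ROLE` credential style, and the example `s3_key` template are assumptions rather than the project's exact code:

```python
from airflow.models import BaseOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


class StageToRedshiftOperator(BaseOperator):
    # Declaring s3_key as a template field lets a backfill run load only
    # its own partition, e.g.
    # s3_key="log-data/{{ execution_date.year }}/{{ execution_date.month }}/"
    template_fields = ("s3_key",)

    COPY_SQL = """
        COPY {table}
        FROM 's3://{bucket}/{key}'
        IAM_ROLE '{iam_role}'
        FORMAT AS JSON '{json_format}'
    """

    def __init__(self, table, s3_bucket, s3_key, iam_role,
                 json_format="auto", redshift_conn_id="redshift", **kwargs):
        super().__init__(**kwargs)
        self.table = table
        self.s3_bucket = s3_bucket
        self.s3_key = s3_key
        self.iam_role = iam_role
        self.json_format = json_format
        self.redshift_conn_id = redshift_conn_id

    def execute(self, context):
        # By the time execute() runs, Airflow has rendered s3_key for this run.
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        redshift.run(self.COPY_SQL.format(
            table=self.table, bucket=self.s3_bucket, key=self.s3_key,
            iam_role=self.iam_role, json_format=self.json_format))
```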
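The truncate-insert toggle on the dimension loader might look like this (parameter names assumed):

```python
from airflow.models import BaseOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


class LoadDimensionOperator(BaseOperator):
    def __init__(self, table, select_sql, truncate_insert=True,
                 redshift_conn_id="redshift", **kwargs):
        super().__init__(**kwargs)
        self.table = table
        self.select_sql = select_sql      # e.g. a SELECT from sql_queries.py
        self.truncate_insert = truncate_insert
        self.redshift_conn_id = redshift_conn_id

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        if self.truncate_insert:
            # Delete-load mode: empty the table before re-inserting.
            redshift.run(f"TRUNCATE TABLE {self.table}")
        # Append-only mode skips the TRUNCATE and goes straight to INSERT.
        redshift.run(f"INSERT INTO {self.table} {self.select_sql}")
```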
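For the quality checks, one common shape is a list of SQL/expected-value pairs; the `tests` parameter format below is an assumption:

```python
from airflow.models import BaseOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


class DataQualityOperator(BaseOperator):
    def __init__(self, tests, redshift_conn_id="redshift", **kwargs):
        super().__init__(**kwargs)
        self.tests = tests                # [{"sql": ..., "expected": ...}, ...]
        self.redshift_conn_id = redshift_conn_id

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        for test in self.tests:
            records = redshift.get_records(test["sql"])
            # Fail the task if the query returns nothing or the wrong value.
            if not records or records[0][0] != test["expected"]:
                raise ValueError(
                    f"Data quality check failed: {test['sql']!r} "
                    f"returned {records!r}, expected {test['expected']!r}")
        self.log.info("All %d data quality checks passed", len(self.tests))
```

A typical test case might be `{"sql": "SELECT COUNT(*) FROM users WHERE userid IS NULL", "expected": 0}`.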
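And the resilience settings might be expressed through `default_args`; the owner and schedule below are placeholders:

```python
from datetime import timedelta

import pendulum
from airflow import DAG

default_args = {
    "owner": "sparkify",                  # placeholder
    "depends_on_past": False,
    "retries": 3,                         # retry each failed task 3 times
    "retry_delay": timedelta(minutes=5),  # wait 5 minutes between retries
}

with DAG(
    "final_project",
    default_args=default_args,
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule="@hourly",  # placeholder schedule
    catchup=False,       # don't backfill missed intervals when unpaused
) as dag:
    ...  # task definitions and the dependency chain go here
```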
Project structure:

```
Data-Pipelines-Airflow
├── dags
│   └── final_project.py       # Main DAG definition with task dependencies
├── plugins
│   ├── helpers
│   │   └── sql_queries.py     # Provided SQL transformations
│   └── operators
│       ├── stage_redshift.py  # S3 to Redshift staging logic
│       ├── load_fact.py       # Fact table loading logic
│       ├── load_dimension.py  # Dimension table loading logic
│       └── data_quality.py    # Data validation logic
├── create_tables.sql          # SQL DDL for Redshift schema
└── README.md
```
🚀 How to Run
1. Prerequisites: Create an IAM user and a Redshift Serverless workgroup in AWS.
2. S3 Setup: Copy the source data from `s3://udacity-dend/` to your own S3 bucket using the AWS CLI.
3. Airflow Connections: Define `aws_credentials` and `redshift` connections in the Airflow UI (a scripted alternative is sketched after these steps).
4. Schema: Run the queries in `create_tables.sql` in your Redshift query editor.
5. Trigger: Turn the DAG on in the Airflow web server (`localhost:8080`).
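If you prefer scripting the connections over clicking through the UI, a one-off script like the following can register them; every host, login, and password value here is a placeholder you must replace:

```python
from airflow import settings
from airflow.models import Connection

session = settings.Session()
# AWS credentials used by the stage operator (placeholder values).
session.add(Connection(conn_id="aws_credentials", conn_type="aws",
                       login="<ACCESS_KEY_ID>", password="<SECRET_ACCESS_KEY>"))
# Redshift Serverless endpoint (placeholder values; default port is 5439).
session.add(Connection(conn_id="redshift", conn_type="postgres",
                       host="<workgroup-endpoint>", schema="dev",
                       login="<db-user>", password="<db-password>", port=5439))
session.commit()
```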