End-to-end ELT pipeline that consolidates data from three live APIs (GitHub, CoinGecko, Hacker News) into Snowflake and transforms it with dbt.
| Area | What this project does |
|---|---|
| Sources | GitHub, CoinGecko, and Hacker News live APIs |
| Pipeline | Airflow extracts -> S3 raw landing -> Snowflake COPY INTO -> dbt models |
| Cloud stack | Airflow, S3, Snowflake, dbt, Docker Compose |
| Data shape | Date-partitioned raw JSON with warehouse staging and analytics layers |
| Orchestration | Parallel API extraction with downstream load and transform steps |
| Warehouse | Snowflake RAW and ANALYTICS schemas |
| Modeling | dbt staging views plus mart-level ingestion analytics |
| Deployment style | Reproducible local environment via Docker Compose |
- Orchestration with Airflow (parallel extracts -> load -> dbt)
- Data lake landing zone on S3 (date-partitioned raw JSON)
- Cloud warehouse ingestion with Snowflake `COPY INTO` from an external stage
- Modeling with dbt (staging views + mart table)
- Reproducible local environment via Docker Compose
```
APIs -> S3 (raw landing zone) -> Snowflake (COPY INTO) -> dbt -> Analytics
```
```mermaid
flowchart LR
    GH["GitHub API"] --> AF["Airflow DAG"]
    CG["CoinGecko API"] --> AF
    HN["Hacker News API"] --> AF
    AF --> S3["S3 bucket (raw landing zone)"]
    S3 --> SFRAW["Snowflake RAW schema (COPY INTO)"]
    SFRAW --> DBT["dbt transformations"]
    DBT --> SFA["Snowflake ANALYTICS schema"]
```
- Airflow extracts from APIs in parallel.
- Each extractor uploads raw JSON to S3, partitioned by run date.
- A load task uses Snowflake `COPY INTO` from an external S3 stage into the `RAW` schema.
- dbt transforms raw data into staging and mart models.
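The parallel-extract step can be sketched in plain Python. In the repo the three extractors run as parallel Airflow tasks; the extractor functions below are illustrative stand-ins, not the project's actual function names, and they return canned payloads instead of calling the live APIs.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Illustrative extractors -- in the real DAG each of these would call the
# corresponding live API and return its raw JSON payload for the run date.
def extract_github():
    return [{"repo": "example/repo", "stars": 1}]

def extract_coingecko():
    return [{"coin": "bitcoin", "price_usd": 1.0}]

def extract_hackernews():
    return [{"title": "Example post", "score": 1}]

def run_extracts():
    """Run all extractors concurrently and collect raw payloads per source."""
    extractors = {
        "github": extract_github,
        "coingecko": extract_coingecko,
        "hackernews": extract_hackernews,
    }
    with ThreadPoolExecutor(max_workers=len(extractors)) as pool:
        futures = {name: pool.submit(fn) for name, fn in extractors.items()}
        return {name: f.result() for name, f in futures.items()}

if __name__ == "__main__":
    for source, payload in run_extracts().items():
        print(source, json.dumps(payload))
```

Airflow achieves the same fan-out declaratively (three extract tasks feeding one load task); the thread pool here just makes the concurrency pattern concrete.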
```
s3://<bucket>/
  raw/
    github/YYYY-MM-DD/repos.json
    coingecko/YYYY-MM-DD/markets.json
    coingecko/YYYY-MM-DD/price_history.json
    hackernews/YYYY-MM-DD/posts.json
```
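A minimal sketch of how the date-partitioned keys above could be built; the helper name `raw_key` is hypothetical, not taken from this repo.

```python
from datetime import date

def raw_key(source: str, filename: str, run_date: date) -> str:
    """Build a date-partitioned raw-zone key, e.g. raw/github/2024-01-15/repos.json."""
    return f"raw/{source}/{run_date.isoformat()}/{filename}"

if __name__ == "__main__":
    print(raw_key("github", "repos.json", date(2024, 1, 15)))
```

Partitioning by run date keeps each DAG run's output isolated, so re-running a day overwrites only that day's files.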
All credentials are read from environment variables (no hardcoding).
- `GITHUB_TOKEN`
- `SNOWFLAKE_ACCOUNT`, `SNOWFLAKE_USER`, `SNOWFLAKE_PASSWORD`, `SNOWFLAKE_DATABASE`, `SNOWFLAKE_WAREHOUSE`
- `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`, `S3_BUCKET_NAME`
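One way to enforce the no-hardcoding rule is to fail fast when a variable is unset. A sketch, assuming nothing about the repo's actual config code; the helper name `require_env` and the dummy values are purely illustrative.

```python
import os

def require_env(*names: str) -> dict:
    """Return the named environment variables, raising if any are missing or empty."""
    missing = [n for n in names if not os.environ.get(n)]
    if missing:
        raise RuntimeError(
            f"Missing required environment variables: {', '.join(missing)}"
        )
    return {n: os.environ[n] for n in names}

if __name__ == "__main__":
    # Dummy values for illustration only -- in practice these come from .env.
    os.environ.setdefault("SNOWFLAKE_ACCOUNT", "xy12345")
    os.environ.setdefault("SNOWFLAKE_USER", "loader")
    cfg = require_env("SNOWFLAKE_ACCOUNT", "SNOWFLAKE_USER")
    print(sorted(cfg))
```

Failing at startup with a list of every missing variable beats a cryptic connection error halfway through a DAG run.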
- Create `.env` based on `.env.example` and fill in your values.
- Create `dbt_project/profiles.yml` based on `dbt_project/profiles.yml.example` (uses env vars for auth).
- Start Airflow with Docker Compose:

  ```bash
  docker compose up -d --build
  ```

- Open the Airflow UI at `http://localhost:8080` and log in with `airflow` / `airflow`.
- Trigger the DAG `dev_ecosystem_elt`.
Snowflake (RAW counts):

```sql
select 'github' as source, count(*) from dev_ecosystem.raw.raw_github_repos
union all
select 'coingecko_markets', count(*) from dev_ecosystem.raw.raw_coingecko_markets
union all
select 'coingecko_price_history', count(*) from dev_ecosystem.raw.raw_coingecko_price_history
union all
select 'community_posts', count(*) from dev_ecosystem.raw.raw_reddit_posts;
```

Snowflake (mart table):

```sql
select * from dev_ecosystem.analytics.fct_ingestion_counts order by source_name;
```

dbt is preinstalled in the Airflow container image for this repo, so you can run dbt via Docker:
```bash
docker compose exec airflow-webserver bash -lc "cd /opt/airflow/dbt_project && dbt debug"
docker compose exec airflow-webserver bash -lc "cd /opt/airflow/dbt_project && dbt run --select path:models/staging"
docker compose exec airflow-webserver bash -lc "cd /opt/airflow/dbt_project && dbt test"
docker compose exec airflow-webserver bash -lc "cd /opt/airflow/dbt_project && dbt run --select path:models/marts"
```

Note: if `dbt test` prints "Nothing to do", it means no tests are defined in this project yet.
- Extractors log each S3 upload and the S3 key written.
- Snowflake loading uses `COPY INTO` with a JSON file format that sets `STRIP_OUTER_ARRAY = TRUE`.
- This project keeps the Snowflake table name `RAW.RAW_REDDIT_POSTS` and the dbt model `stg_reddit_posts` for compatibility, but the source data is currently Hacker News posts (open API).
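The load step's statement can be sketched as a small template. The stage path `@raw_stage/...` and the helper name `copy_into_sql` are assumptions for illustration; only the `STRIP_OUTER_ARRAY = TRUE` JSON option is taken from this project's notes.

```python
def copy_into_sql(table: str, stage_path: str) -> str:
    """Render a COPY INTO statement that loads raw JSON from an external stage."""
    return (
        f"COPY INTO {table}\n"
        f"FROM {stage_path}\n"
        "FILE_FORMAT = (TYPE = 'JSON' STRIP_OUTER_ARRAY = TRUE)"
    )

if __name__ == "__main__":
    print(copy_into_sql("RAW.RAW_GITHUB_REPOS", "@raw_stage/github/2024-01-15/"))
```

`STRIP_OUTER_ARRAY = TRUE` makes Snowflake load each element of a top-level JSON array as its own row, which matches API responses that return a list of records.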
See docs/SCREENSHOTS.md for exactly what to capture and commit under docs/screenshots/.
Deep dive docs:
- `docs/PROJECT_WALKTHROUGH.md` (how everything connects)
- `docs/GLOSSARY.md` (definitions of key terms)
- `docs/INTERVIEW_PREP.md` (question bank + talking points)
Airflow DAG (end-to-end run):
Airflow log proof (S3 upload):
S3 landing zone layout:
S3 date partitions (example source):
Snowflake RAW counts:
Snowflake analytics mart (dbt output):





