
ivszhuravlev/based-data-engineering-project


Building a simple End-to-End Data Engineering System

This project is based on the original repository: https://github.com/HamzaG737/data-engineering-project

I adapted and updated it to run reliably on Apple Silicon and with current Docker & dependency versions (Nov 2025). The original Airflow pipeline logic (API -> Kafka -> Spark -> Postgres) remains unchanged.

Key adjustments

Kafka

  • Uses the same KRaft architecture as the original repo.
  • Updated to ARM-compatible settings.
  • Added persistent volume kafka_data.
  • Minor config adjustments required for new Docker images (2025).

Spark

  • Added the required Kafka-to-Spark integration JAR: spark-sql-kafka-0-10_2.12-3.4.1.jar.
  • Added Postgres JDBC driver and psycopg2-binary.
  • Updated Dockerfile paths to match the official Apache Spark image.
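
As a rough illustration of how the extra JARs might be passed to Spark, here is a minimal sketch that assembles a spark-submit command. The JAR directory, the Postgres driver file name, the application path, and the helper name build_spark_submit are assumptions made for the example; only the Kafka connector file name comes from the list above.

```python
from pathlib import PurePosixPath

# Assumed location of the JARs inside the Spark container (not stated in this README).
JARS_DIR = PurePosixPath("/opt/spark/jars")

# Connector JAR named in the list above.
KAFKA_JAR = "spark-sql-kafka-0-10_2.12-3.4.1.jar"

# Placeholder file name: the README does not say which JDBC driver version is used.
POSTGRES_JAR = "postgresql.jar"


def build_spark_submit(app_path: str, master: str = "local[*]") -> list[str]:
    """Return a spark-submit argument list that puts both JARs on the classpath."""
    jars = ",".join(str(JARS_DIR / name) for name in (KAFKA_JAR, POSTGRES_JAR))
    return ["spark-submit", "--master", master, "--jars", jars, app_path]


if __name__ == "__main__":
    # Hypothetical application path, used only to show the resulting command line.
    print(" ".join(build_spark_submit("spark_streaming.py")))
```

When JARs are listed by hand with --jars, the Kafka connector may also need its own dependencies on the classpath, so the real command in this repo could look different.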

Airflow

  • Updated Spark submit command.
  • Environment variables injected through .env file.
  • Set auto_remove=False so that container logs stay available for debugging.
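
The Airflow changes above could be wired together roughly as follows. This is a sketch under stated assumptions, not the code from this repo: parse_env and docker_task_kwargs are hypothetical helpers, and the task ID and image name are placeholders. The resulting dictionary is meant to be unpacked into Airflow's DockerOperator, which accepts task_id, image, command, environment, and auto_remove.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines from .env content, skipping blank lines and # comments."""
    env: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def docker_task_kwargs(env: dict[str, str], command: list[str]) -> dict:
    """Build keyword arguments intended for DockerOperator(**kwargs)."""
    return {
        "task_id": "spark_job",       # hypothetical task ID
        "image": "my-spark-image",    # hypothetical image name
        "command": command,
        "environment": env,           # variables loaded from the .env file
        # Keep the finished container so its logs can still be inspected.
        # Assumption: some newer Docker provider releases expect a string
        # such as "never" here instead of a boolean.
        "auto_remove": False,
    }


if __name__ == "__main__":
    sample = '# sample .env content\nPOSTGRES_USER=airflow\nPOSTGRES_DB="demo"\n'
    print(docker_task_kwargs(parse_env(sample), ["spark-submit", "job.py"]))
```

Keeping the container around costs some disk space, but it makes a failed Spark job much easier to diagnose with docker logs.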

Docker Networking

  • Fixed my own fixes 😂 Airflow now communicates with Docker through the shared network and the Docker proxy, as originally designed.

Postgres

  • Replaced pgAdmin with TablePlus for better Mac compatibility.

Apple Silicon Support

  • Replaced incompatible Kafka & Spark images with ARM-compatible versions.
  • Skipped pgAdmin4 (too slow on M1) and switched to TablePlus for database management.

About

Based on the original data-engineering project by HamzaG737. Adapted and fixed for Apple M1 and updated to work with current Docker and dependency versions (Nov 2025).
