This project is based on the original repository: https://github.com/HamzaG737/data-engineering-project
I adapted and updated it to run reliably on Apple Silicon and with current Docker & dependency versions (Nov 2025). The original Airflow pipeline logic (API -> Kafka -> Spark -> Postgres) remains unchanged.
- Uses the same KRaft architecture as the original repo.
- Updated to ARM-compatible settings.
- Added persistent volume kafka_data.
- Minor config adjustments required for new Docker images (2025).
- Added the required Kafka-to-Spark integration JAR: spark-sql-kafka-0-10_2.12-3.4.1.jar.
- Added Postgres JDBC driver and psycopg2-binary.
- Updated Dockerfile paths to match the official Apache Spark image.
- Updated Spark submit command.
- Environment variables are now injected through a .env file.
- Fixed container log access (auto_remove=False) for debugging.
- Fixed my own fixes 😂: reverted my earlier workarounds, so Airflow communicates with Docker through the shared network and the Docker proxy, as originally designed.
- Replaced incompatible Kafka & Spark images with ARM-compatible versions.
- Replaced pgAdmin 4 (too slow on M1) with TablePlus for database management and better Mac compatibility.
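
To show how the .env file, the extra JARs, and the Spark submit command fit together, here is a minimal Python sketch. It is not the code used in this repo. The helper names (`parse_env_file`, `build_spark_submit`), the `/opt/spark/jars` directory, the `postgresql.jar` file name, the `spark_streaming.py` script name, and the `POSTGRES_*` keys are all assumptions chosen for illustration. Only the Kafka integration JAR name comes from the list above.

```python
import shlex


def parse_env_file(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    env: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def build_spark_submit(app_path: str, jars: list[str], master: str = "local[*]") -> list[str]:
    """Build a spark-submit argument list; --jars takes a comma-separated list."""
    return ["spark-submit", "--master", master, "--jars", ",".join(jars), app_path]


if __name__ == "__main__":
    # Hypothetical .env content; the real keys depend on your setup.
    env = parse_env_file("POSTGRES_USER=admin\nPOSTGRES_PASSWORD='change-me'\n")

    # Assumed JAR directory and Postgres driver file name (illustrative only).
    jars = [
        "/opt/spark/jars/spark-sql-kafka-0-10_2.12-3.4.1.jar",
        "/opt/spark/jars/postgresql.jar",
    ]
    cmd = build_spark_submit("spark_streaming.py", jars)

    print(sorted(env))       # key names only, so no secrets are printed
    print(shlex.join(cmd))   # shell-safe rendering of the command
```

Building the command as a list rather than a single string avoids quoting problems if it is later handed to `subprocess.run` or to a container's `command` argument.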