# Karchi

Karchi is a robust, scalable data pipeline project for handling complex data workflows with modern data engineering tools: Apache Airflow, Apache Kafka, Apache Spark, and Apache Superset. The project is designed to be deployed on a Kubernetes cluster, ensuring high availability and scalability. This README provides an in-depth guide to the project's structure, functionality, and deployment process.
## Table of Contents

- [Project Structure](#project-structure)
- [Components Overview](#components-overview)
- [Installation and Deployment](#installation-and-deployment)
- [Usage](#usage)
- [Directory Structure](#directory-structure)
- [Contributing](#contributing)
- [License](#license)
## Project Structure

The project is organized into several key directories and scripts, each serving a specific role in the data pipeline:
- `data-pipeline-project/`: Contains all the components and configurations necessary for the data pipeline.
  - `dags/`: Airflow DAGs (Directed Acyclic Graphs) for orchestrating data workflows.
  - `kafka/`: Configuration and scripts for Apache Kafka.
  - `kubernetes/`: Kubernetes manifests for deploying the data pipeline components.
  - `spark/`: Apache Spark jobs and related configurations.
  - `superset/`: Apache Superset dashboards and configurations.
- `deploy.sh`: A shell script for automating the deployment of the project on a Kubernetes cluster.
- `setup.ps1`: A PowerShell script for setting up the project environment on Windows.
## Components Overview

### Apache Airflow

Apache Airflow is used to manage and automate the workflows of the data pipeline. The `dags/` directory contains the DAGs that define the tasks and their dependencies.
### Apache Kafka

Apache Kafka is a distributed streaming platform used to build real-time data pipelines and streaming applications. The `kafka/` directory contains the configurations needed to set up Kafka brokers, topics, and consumers.
### Apache Spark

Apache Spark is a powerful engine for large-scale data processing. The `spark/` directory includes scripts and configurations for running Spark jobs that process data at scale.
### Apache Superset

Apache Superset is an open-source data exploration and visualization platform. The `superset/` directory contains dashboards and configurations for visualizing the processed data.
### Kubernetes

The project is designed to run on a Kubernetes cluster, which provides orchestration and management of containerized applications. The `kubernetes/` directory includes the Kubernetes manifests for deploying the data pipeline components.
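As a reference point, the manifests can also be applied by hand with `kubectl`; a minimal sketch, assuming your kubeconfig already points at the target cluster:

```bash
# Apply every manifest under kubernetes/ (path relative to the repository root).
kubectl apply -f data-pipeline-project/kubernetes/

# Watch the pipeline pods come up; pod names depend on the manifests.
kubectl get pods --watch
```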
## Installation and Deployment

### Prerequisites

Before deploying the project, ensure that you have the following tools installed:
- Git: To clone the project repository.
- Docker: To build and run containerized applications.
- kubectl: Kubernetes command-line tool for managing the cluster.
- Docker Compose: To manage multi-container Docker applications.
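You can confirm each prerequisite is installed and on your `PATH` before proceeding:

```bash
# Each command should print a version string; "command not found" means
# the corresponding tool still needs to be installed.
git --version
docker --version
kubectl version --client
docker compose version   # use `docker-compose --version` for the standalone v1 binary
```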
To install and deploy everything in one step, run the deployment script directly from the repository:

```bash
bash <(curl -s https://raw.githubusercontent.com/AliAzimiD/karchi/master/deploy.sh)
```

Alternatively, to deploy the project step by step:
1. **Clone the repository:**

   ```bash
   git clone https://github.com/AliAzimiD/karchi.git
   cd karchi
   ```

2. **Run the deployment script:**

   The `deploy.sh` script automates the installation of required tools, the cloning of the repository, and the setup of Kubernetes and Superset. Run it as follows:

   ```bash
   chmod +x deploy.sh
   ./deploy.sh
   ```
   This script will:

   - Install `kubectl` and Docker Compose if they are not already installed.
   - Clone the repository and set up the project directory.
   - Apply Kubernetes configurations to deploy the pipeline components.
   - Set up Apache Superset using Docker Compose.
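   Once the script finishes, you can sanity-check the result; a quick sketch (pod and container names depend on the manifests and compose file shipped in the repo):

   ```bash
   # Kubernetes side: the pipeline pods should reach the Running state.
   kubectl get pods

   # Docker Compose side: run from the directory holding Superset's compose file;
   # the Superset containers should be listed as "Up".
   docker compose ps
   ```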
3. **Access the services:**

   After the deployment is complete, you can access the services via their respective ports. Superset, for example, will be accessible at `http://localhost:8088`.
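   If a component runs inside the cluster without a port published locally, `kubectl port-forward` can expose it; a sketch with an assumed service name (check `kubectl get svc` for the real ones):

   ```bash
   # List the services created by the manifests, with their ports.
   kubectl get svc

   # Forward local port 8080 to an in-cluster Airflow webserver service;
   # "airflow-webserver" is an assumed name, substitute the one from `kubectl get svc`.
   kubectl port-forward svc/airflow-webserver 8080:8080
   ```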
## Usage

### Airflow DAGs

Airflow DAGs are located in the `data-pipeline-project/dags/` directory. To trigger a DAG from the web UI:

1. Access the Airflow web interface (typically available at `http://localhost:8080`).
2. Navigate to the "DAGs" tab and trigger the desired DAG.
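DAGs can also be triggered from the Airflow CLI; a minimal sketch, where `example_pipeline` stands in for a real DAG id from `dags/` (run inside the Airflow container or any environment with Airflow installed):

```bash
# List the DAGs Airflow has parsed from the dags/ folder.
airflow dags list

# Unpause and trigger a DAG by its dag_id ("example_pipeline" is a placeholder).
airflow dags unpause example_pipeline
airflow dags trigger example_pipeline
```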
### Kafka Topics

Kafka configurations are located in the `data-pipeline-project/kafka/` directory. To create a new topic or manage existing ones, use the Kafka CLI or integrate topic management into your data pipeline workflows.
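For example, creating and listing topics with the Kafka CLI looks like the sketch below; the broker address, topic name, and sizing are placeholders to adapt to your setup (on some installations the tool is named `kafka-topics` without the `.sh` suffix):

```bash
# Create a topic on the broker (address and name are illustrative).
kafka-topics.sh --create \
  --bootstrap-server localhost:9092 \
  --topic raw-events \
  --partitions 3 \
  --replication-factor 1

# Confirm the topic exists.
kafka-topics.sh --list --bootstrap-server localhost:9092
```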
### Spark Jobs

Spark jobs are defined in the `data-pipeline-project/spark/` directory. You can submit these jobs to the Spark cluster either from the command line or by integrating them into Airflow DAGs.
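For a command-line submission, `spark-submit` is the entry point; a sketch where `your_job.py` is a placeholder for one of the scripts in `spark/`:

```bash
# Local test run of a PySpark job ("your_job.py" is a placeholder).
spark-submit --master "local[*]" data-pipeline-project/spark/your_job.py

# Against a real cluster, point --master at your Spark master or the
# Kubernetes API server instead, e.g. spark://<host>:7077 or k8s://https://<host>:6443.
```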
### Superset Dashboards

Superset dashboards and configurations are stored in the `data-pipeline-project/superset/` directory. After setting up Superset, access the web interface at `http://localhost:8088` to create and manage your data visualizations.
## Directory Structure

- `data-pipeline-project/`
  - `dags/`: Airflow DAGs
  - `kafka/`: Kafka configurations
  - `kubernetes/`: Kubernetes manifests
  - `spark/`: Spark jobs
  - `superset/`: Superset dashboards and configurations
- `deploy.sh`: Deployment automation script
- `setup.ps1`: Windows setup script
## Contributing

Contributions are welcome! If you have any ideas for improvements or new features, please open an issue or submit a pull request. Make sure to follow the project's coding standards and include relevant tests for new features.
## License

This project is licensed under the MIT License. See the LICENSE file for more details.