Skip to content

AliAzimiD/karchi

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

58 Commits
 
 
 
 
 
 
 
 

Repository files navigation


Karchi: A Comprehensive Data Pipeline Project

Overview

Karchi is a robust and scalable data pipeline project designed to handle complex data workflows using modern data engineering tools such as Apache Airflow, Apache Kafka, Apache Spark, and Apache Superset. The project is designed to be deployed on a Kubernetes cluster, ensuring high availability and scalability. This README provides an in-depth guide on the project's structure, functionality, and deployment process.

Table of Contents

  1. Project Structure
  2. Components Overview
  3. Installation and Deployment
  4. Usage
  5. Directory Structure
  6. Contributing
  7. License

Project Structure

The project is organized into several key directories and scripts, each serving a specific role in the data pipeline:

  • data-pipeline-project/ : Contains all the components and configurations necessary for the data pipeline.
    • dags/ : Airflow DAGs (Directed Acyclic Graphs) for orchestrating data workflows.
    • kafka/ : Configuration and scripts for Apache Kafka.
    • kubernetes/ : Kubernetes manifests for deploying the data pipeline components.
    • spark/ : Apache Spark jobs and related configurations.
    • superset/ : Apache Superset dashboards and configurations.
  • deploy.sh : A shell script for automating the deployment of the project on a Kubernetes cluster.
  • setup.ps1 : A PowerShell script for setting up the project environment on Windows.

Components Overview

Apache Airflow

Apache Airflow is used to manage and automate the workflows of the data pipeline. The dags/ directory contains the DAGs that define the tasks and their dependencies.

Apache Kafka

Apache Kafka is a distributed streaming platform used to build real-time data pipelines and streaming applications. The kafka/ directory contains the necessary configurations to set up Kafka brokers, topics, and consumers.

Apache Spark

Apache Spark is a powerful engine for large-scale data processing. The spark/ directory includes scripts and configurations for running Spark jobs that process data at scale.

Apache Superset

Apache Superset is an open-source data exploration and visualization platform. The superset/ directory contains dashboards and configurations for visualizing the processed data.

Kubernetes

The project is designed to run on a Kubernetes cluster, which provides orchestration and management of containerized applications. The kubernetes/ directory includes Kubernetes manifests for deploying the data pipeline components.

Installation and Deployment

Prerequisites

Before deploying the project, ensure that you have the following tools installed:

  • Git: To clone the project repository.
  • Docker: To build and run containerized applications.
  • kubectl: Kubernetes command-line tool for managing the cluster.
  • Docker Compose: To manage multi-container Docker applications.

Deploying the Project

USE THIS LINE:

```bash
run bash <(curl -s https://raw.githubusercontent.com/AliAzimiD/karchi/master/deploy.sh)
```

OR

To deploy the project, follow these steps:

  1. Clone the Repository:

    git clone https://github.com/AliAzimiD/karchi.git
    cd karchi
  2. Run the Deployment Script:

    The deploy.sh script automates the installation of required tools, the cloning of the repository, and the setup of Kubernetes and Superset. Run the script as follows:

    chmod +x deploy.sh
    ./deploy.sh

    This script will:

    • Install kubectl and Docker Compose if they are not already installed.
    • Clone the repository and set up the project directory.
    • Apply Kubernetes configurations to deploy the pipeline components.
    • Set up Apache Superset using Docker Compose.
  3. Accessing Services:

    After the deployment is complete, you can access the services via their respective ports. Superset, for example, will be accessible at http://localhost:8088.

Usage

Running Airflow DAGs

Airflow DAGs are located in the data-pipeline-project/dags/ directory. To trigger a DAG:

  1. Access the Airflow web interface (typically available at http://localhost:8080).
  2. Navigate to the "DAGs" tab and trigger the desired DAG.

Managing Kafka Topics

Kafka configurations are located in the data-pipeline-project/kafka/ directory. To create a new topic or manage existing ones, use the Kafka CLI or integrate it within your data pipeline workflows.

Running Spark Jobs

Spark jobs are defined in the data-pipeline-project/spark/ directory. You can submit these jobs to the Spark cluster either through the command line or by integrating them into Airflow DAGs.

Visualizing Data with Superset

Superset dashboards and configurations are stored in the data-pipeline-project/superset/ directory. After setting up Superset, access the web interface at http://localhost:8088 to create and manage your data visualizations.

Directory Structure

  • data-pipeline-project/
    • dags/ : Airflow DAGs
    • kafka/ : Kafka configurations
    • kubernetes/ : Kubernetes manifests
    • spark/ : Spark jobs
    • superset/ : Superset dashboards and configurations
  • deploy.sh : Deployment automation script
  • setup.ps1 : Windows setup script

Contributing

Contributions are welcome! If you have any ideas for improvements or new features, please open an issue or submit a pull request. Make sure to follow the project's coding standards and include relevant tests for new features.

License

This project is licensed under the MIT License. See the LICENSE file for more details.


About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors