DataFlow Studio

A no-code ETL (Extract, Transform, Load) framework with a visual pipeline designer that simulates the capabilities of enterprise tools such as PySpark, Apache Airflow, and Apache Flink.


πŸš€ Features

Visual Pipeline Builder

  • Drag & Drop Interface: Intuitive visual pipeline designer
  • Component Toolbox: Pre-built data sources, transformations, and destinations
  • Real-time Canvas: Interactive pipeline visualization with connections
  • YAML Configuration: Advanced configuration through YAML editor

Data Connectors

  • Oracle Database: Enterprise database connectivity
  • MongoDB: NoSQL document database support
  • Apache Hive: Big data warehouse integration
  • PostgreSQL & MySQL: Relational database support
  • Connection Testing: Built-in connectivity validation

Job Monitoring & Execution

  • Real-time Monitoring: Live job status and progress tracking
  • Execution Logs: Detailed logging and error reporting
  • Performance Metrics: Job duration and success rate analytics
  • Status Management: Queue, run, pause, and stop operations

Pipeline Scheduler

  • Cron-based Scheduling: Flexible time-based automation
  • DAG Visualization: Directed Acyclic Graph representation
  • Schedule Management: Enable, disable, and modify schedules
  • Dependency Tracking: Task dependency management
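Dependency tracking implies the scheduler must be able to order tasks and reject cyclic graphs before running them. A minimal sketch of such a check using Kahn's algorithm (the `TaskGraph` shape here is hypothetical, not the project's actual schema):

```typescript
// Hypothetical adjacency-list representation of task dependencies:
// graph[task] lists the tasks that may only run after it completes.
type TaskGraph = Record<string, string[]>;

// Kahn's algorithm: returns a valid execution order, or null if the
// graph contains a cycle (i.e. it is not a DAG).
function topologicalOrder(graph: TaskGraph): string[] | null {
  const inDegree: Record<string, number> = {};
  for (const node of Object.keys(graph)) inDegree[node] ??= 0;
  for (const targets of Object.values(graph)) {
    for (const t of targets) inDegree[t] = (inDegree[t] ?? 0) + 1;
  }
  // Start from tasks with no unmet dependencies.
  const queue = Object.keys(inDegree).filter((n) => inDegree[n] === 0);
  const order: string[] = [];
  while (queue.length > 0) {
    const node = queue.shift()!;
    order.push(node);
    for (const t of graph[node] ?? []) {
      if (--inDegree[t] === 0) queue.push(t);
    }
  }
  // If some task never reached in-degree 0, a cycle exists.
  return order.length === Object.keys(inDegree).length ? order : null;
}
```

For a linear extract → transform → load graph this yields the expected order; for a cyclic graph it returns `null`, which the scheduler can surface as a configuration error.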

Dashboard & Analytics

  • System Overview: Key metrics and statistics
  • Active Jobs: Real-time pipeline execution status
  • Data Processing: Volume and performance tracking
  • Historical Reports: Success rates and trends
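The success-rate and duration metrics above can be derived from completed job records. A hedged sketch (the `JobRecord` shape is illustrative; the real schema lives in `shared/schema.ts`):

```typescript
// Illustrative job record shape, not the project's actual schema.
interface JobRecord {
  status: "success" | "failed" | "running";
  durationMs: number;
}

// Success rate and average duration over completed jobs, as a dashboard
// might display them. Still-running jobs are excluded from both metrics.
function dashboardMetrics(jobs: JobRecord[]) {
  const done = jobs.filter((j) => j.status !== "running");
  const succeeded = done.filter((j) => j.status === "success").length;
  const avgDurationMs = done.length
    ? done.reduce((sum, j) => sum + j.durationMs, 0) / done.length
    : 0;
  return {
    successRate: done.length ? succeeded / done.length : 0,
    avgDurationMs,
  };
}
```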

πŸ› οΈ Technology Stack

  • Frontend: React 18, TypeScript, Tailwind CSS
  • Backend: Node.js, Express, TypeScript
  • Database: In-memory storage (extensible to PostgreSQL)
  • Build Tools: Vite, ESBuild
  • UI Components: Radix UI, Shadcn/ui
  • State Management: TanStack Query

πŸ“¦ Installation

Prerequisites

  • Node.js 20.x or higher
  • npm or yarn package manager

Quick Start

  1. Clone the repository

    git clone https://github.com/your-username/dataflow-studio.git
    cd dataflow-studio
  2. Install dependencies

    npm install
  3. Start the development server

    npm run dev
  4. Access the application

    Open your browser to http://localhost:5000

πŸ—οΈ Project Structure

dataflow-studio/
β”œβ”€β”€ client/                 # React frontend application
β”‚   β”œβ”€β”€ src/
β”‚   β”‚   β”œβ”€β”€ components/     # Reusable UI components
β”‚   β”‚   β”‚   β”œβ”€β”€ connectors/ # Data connector components
β”‚   β”‚   β”‚   β”œβ”€β”€ jobs/       # Job monitoring components
β”‚   β”‚   β”‚   β”œβ”€β”€ layout/     # Layout components
β”‚   β”‚   β”‚   β”œβ”€β”€ pipeline/   # Pipeline builder components
β”‚   β”‚   β”‚   β”œβ”€β”€ scheduler/  # Scheduler components
β”‚   β”‚   β”‚   └── ui/         # Base UI components
β”‚   β”‚   β”œβ”€β”€ hooks/          # Custom React hooks
β”‚   β”‚   β”œβ”€β”€ lib/            # Utility libraries
β”‚   β”‚   └── pages/          # Application pages
β”œβ”€β”€ server/                 # Node.js backend
β”‚   β”œβ”€β”€ index.ts           # Server entry point
β”‚   β”œβ”€β”€ routes.ts          # API routes
β”‚   β”œβ”€β”€ storage.ts         # Data storage layer
β”‚   └── vite.ts            # Vite integration
β”œβ”€β”€ shared/                # Shared types and schemas
β”‚   └── schema.ts          # Database schema definitions
└── docs/                  # Documentation
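The backend keeps state in an in-memory storage layer (`server/storage.ts`), noted above as extensible to PostgreSQL. A stripped-down sketch of how such a layer might look (the `Pipeline` shape and method names here are illustrative, not the project's actual API):

```typescript
import { randomUUID } from "node:crypto";

// Illustrative pipeline shape; the real one is defined in shared/schema.ts.
interface Pipeline {
  id: string;
  name: string;
  yamlConfig: string;
}

// Minimal in-memory storage layer. Swapping this for a PostgreSQL-backed
// implementation only requires keeping the same method signatures.
class MemStorage {
  private pipelines = new Map<string, Pipeline>();

  create(name: string, yamlConfig: string): Pipeline {
    const pipeline: Pipeline = { id: randomUUID(), name, yamlConfig };
    this.pipelines.set(pipeline.id, pipeline);
    return pipeline;
  }

  get(id: string): Pipeline | undefined {
    return this.pipelines.get(id);
  }

  list(): Pipeline[] {
    return [...this.pipelines.values()];
  }
}
```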

🎯 Usage Guide

Creating Your First Pipeline

  1. Navigate to Pipeline Builder

    • Click on "Pipeline Builder" in the sidebar
    • Start with a blank canvas
  2. Add Data Sources

    • Drag Oracle, MongoDB, or Hive components from the toolbox
    • Configure connection parameters
  3. Add Transformations

    • Add Filter, Join, or Aggregate components
    • Define transformation logic in YAML
  4. Add Destinations

    • Configure data warehouse or file export targets
    • Set up output parameters
  5. Save and Execute

    • Save your pipeline configuration
    • Click "Run" to execute immediately or schedule for later
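Before saving, a pipeline assembled from the steps above should be structurally valid. A minimal sketch of what such a check might look like (the `PipelineNode` shape is hypothetical; real canvas components also carry connector-specific configuration):

```typescript
// Hypothetical node shape for a canvas pipeline.
interface PipelineNode {
  id: string;
  kind: "source" | "transform" | "destination";
}

// A pipeline is runnable only if it has at least one source and one
// destination; transformations are optional.
function validationErrors(nodes: PipelineNode[]): string[] {
  const errors: string[] = [];
  if (!nodes.some((n) => n.kind === "source"))
    errors.push("pipeline needs at least one data source");
  if (!nodes.some((n) => n.kind === "destination"))
    errors.push("pipeline needs at least one destination");
  return errors;
}
```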

Setting Up Data Connectors

  1. Go to Connectors Page

    • Click "Add Connector" button
    • Select your database type
  2. Configure Connection

    • Enter host, port, database credentials
    • Test the connection
  3. Use in Pipelines

    • Reference connectors in your pipeline configurations
    • Data sources automatically use configured connections
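Connection testing can begin with a cheap local check before any network round trip. A hedged sketch (field names are illustrative and vary by database type):

```typescript
// Illustrative connector settings; not the project's actual schema.
interface ConnectorConfig {
  type: "oracle" | "mongodb" | "hive" | "postgresql" | "mysql";
  host: string;
  port: number;
  database: string;
}

// Lightweight pre-flight check run before attempting a real connection:
// catches empty hosts and out-of-range ports early.
function preflightCheck(cfg: ConnectorConfig): string[] {
  const errors: string[] = [];
  if (cfg.host.trim() === "") errors.push("host is required");
  if (!Number.isInteger(cfg.port) || cfg.port < 1 || cfg.port > 65535)
    errors.push("port must be between 1 and 65535");
  if (cfg.database.trim() === "") errors.push("database name is required");
  return errors;
}
```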

Monitoring Jobs

  1. Job Monitor Dashboard

    • View all running and completed jobs
    • Monitor progress and performance
  2. Real-time Updates

    • Jobs refresh automatically every 5 seconds
    • View detailed logs and error messages

Scheduling Pipelines

  1. Create Schedule

    • Select a pipeline to schedule
    • Choose from predefined cron patterns
  2. Manage Schedules

    • Enable/disable schedules
    • View next run times and history
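The predefined cron patterns mentioned above can be kept in a simple lookup from UI label to a standard five-field cron expression. A sketch (the labels are hypothetical; only the cron syntax is standard):

```typescript
// Hypothetical mapping from schedule labels to five-field cron
// expressions (minute hour day-of-month month day-of-week).
const CRON_PRESETS: Record<string, string> = {
  "every 5 minutes": "*/5 * * * *",
  "hourly": "0 * * * *",
  "daily at midnight": "0 0 * * *",
  "weekly on Monday": "0 0 * * 1",
};

function cronForPreset(label: string): string {
  const expr = CRON_PRESETS[label];
  if (!expr) throw new Error(`unknown schedule preset: ${label}`);
  return expr;
}
```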

πŸ”§ Configuration

Environment Variables

Create a .env file in the root directory:

NODE_ENV=development
PORT=5000
DATABASE_URL=your_database_connection_string
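On the server side, these variables can be read with defaults matching the values above. A hedged sketch of such a bootstrap helper (the function name and shape are illustrative, not the project's actual code):

```typescript
// Illustrative config loader: reads environment variables with safe
// defaults matching the .env example in the README.
function loadConfig(env: Record<string, string | undefined>) {
  return {
    nodeEnv: env.NODE_ENV ?? "development",
    port: Number(env.PORT ?? 5000),
    // DATABASE_URL is optional while the in-memory store is used.
    databaseUrl: env.DATABASE_URL,
  };
}
```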

YAML Pipeline Configuration

Example pipeline configuration:

sources:
  oracle_orders:
    connection: "prod_oracle"
    query: "SELECT * FROM customers WHERE created_date >= '2024-01-01'"

transformations:
  - name: "customer_cleansing"
    type: "data_quality"
    rules:
      - field: "email"
        validation: "email_format"
      - field: "phone"
        standardize: "e164_format"

targets:
  hive_warehouse:
    table: "analytics.customers_clean"
    mode: "append"
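The YAML example maps naturally onto TypeScript types shared between client and server. A hedged sketch of what the parsed structure might look like (names mirror the example above, but the project's real schema in `shared/schema.ts` may differ):

```typescript
// Types mirroring the example YAML once parsed (e.g. with a YAML library).
interface PipelineConfig {
  sources: Record<string, { connection: string; query: string }>;
  transformations: { name: string; type: string; rules: object[] }[];
  targets: Record<string, { table: string; mode: "append" | "overwrite" }>;
}

// The parsed form of the example configuration above.
const example: PipelineConfig = {
  sources: {
    oracle_orders: {
      connection: "prod_oracle",
      query: "SELECT * FROM customers WHERE created_date >= '2024-01-01'",
    },
  },
  transformations: [
    {
      name: "customer_cleansing",
      type: "data_quality",
      rules: [
        { field: "email", validation: "email_format" },
        { field: "phone", standardize: "e164_format" },
      ],
    },
  ],
  targets: {
    hive_warehouse: { table: "analytics.customers_clean", mode: "append" },
  },
};
```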

🀝 Contributing

We welcome contributions! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Guidelines

  • Follow TypeScript best practices
  • Write comprehensive tests
  • Update documentation for new features
  • Follow the existing code style

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸŽ–οΈ Acknowledgments

  • Inspired by enterprise ETL tools like Talend and Ab Initio
  • Built with modern web technologies
  • Designed for ease of use and scalability

πŸ“ž Support

  • Create an issue for bug reports
  • Start a discussion for feature requests
  • Check the documentation for common questions

DataFlow Studio - Making ETL accessible to everyone, from data engineers to business analysts.
