A no-code ETL (Extract, Transform, Load) framework with a visual pipeline designer that simulates the capabilities of enterprise tools such as PySpark, Apache Airflow, and Apache Flink.
- Drag & Drop Interface: Intuitive visual pipeline designer
- Component Toolbox: Pre-built data sources, transformations, and destinations
- Real-time Canvas: Interactive pipeline visualization with connections
- YAML Configuration: Advanced configuration through YAML editor
- Oracle Database: Enterprise database connectivity
- MongoDB: NoSQL document database support
- Apache Hive: Big data warehouse integration
- PostgreSQL & MySQL: Relational database support
- Connection Testing: Built-in connectivity validation
- Real-time Monitoring: Live job status and progress tracking
- Execution Logs: Detailed logging and error reporting
- Performance Metrics: Job duration and success rate analytics
- Status Management: Queue, run, pause, and stop operations
- Cron-based Scheduling: Flexible time-based automation
- DAG Visualization: Directed Acyclic Graph representation
- Schedule Management: Enable, disable, and modify schedules
- Dependency Tracking: Task dependency management
- System Overview: Key metrics and statistics
- Active Jobs: Real-time pipeline execution status
- Data Processing: Volume and performance tracking
- Historical Reports: Success rates and trends
- Frontend: React 18, TypeScript, Tailwind CSS
- Backend: Node.js, Express, TypeScript
- Database: In-memory storage (extensible to PostgreSQL)
- Build Tools: Vite, ESBuild
- UI Components: Radix UI, Shadcn/ui
- State Management: TanStack Query
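The DAG visualization and dependency tracking listed above imply that scheduled tasks must be ordered so every task runs after its upstream dependencies. As an illustrative sketch (not DataFlow Studio's actual implementation — the `Task` shape and names here are assumptions), a topological sort with Kahn's algorithm captures that ordering:

```typescript
// Illustrative only: orders pipeline tasks so each runs after its
// dependencies (Kahn's algorithm). The `dependsOn` shape is a
// hypothetical stand-in for the app's real task schema.
type Task = { name: string; dependsOn: string[] };

function topologicalOrder(tasks: Task[]): string[] {
  const indegree = new Map<string, number>();
  const dependents = new Map<string, string[]>();
  for (const t of tasks) {
    indegree.set(t.name, t.dependsOn.length);
    for (const dep of t.dependsOn) {
      dependents.set(dep, [...(dependents.get(dep) ?? []), t.name]);
    }
  }
  // Start from tasks with no upstream dependencies
  const queue = tasks.filter(t => t.dependsOn.length === 0).map(t => t.name);
  const order: string[] = [];
  while (queue.length > 0) {
    const name = queue.shift()!;
    order.push(name);
    for (const next of dependents.get(name) ?? []) {
      const remaining = indegree.get(next)! - 1;
      indegree.set(next, remaining);
      if (remaining === 0) queue.push(next);
    }
  }
  // Missing tasks in the output mean the graph has a cycle,
  // which a valid DAG must not contain.
  if (order.length !== tasks.length) throw new Error("cycle detected");
  return order;
}
```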
- Node.js 20.x or higher
- npm or yarn package manager
1. Clone the repository

   ```bash
   git clone https://github.com/your-username/dataflow-studio.git
   cd dataflow-studio
   ```

2. Install dependencies

   ```bash
   npm install
   ```

3. Start the development server

   ```bash
   npm run dev
   ```

4. Access the application

   Open your browser to `http://localhost:5000`
```
dataflow-studio/
├── client/                 # React frontend application
│   └── src/
│       ├── components/     # Reusable UI components
│       │   ├── connectors/ # Data connector components
│       │   ├── jobs/       # Job monitoring components
│       │   ├── layout/     # Layout components
│       │   ├── pipeline/   # Pipeline builder components
│       │   ├── scheduler/  # Scheduler components
│       │   └── ui/         # Base UI components
│       ├── hooks/          # Custom React hooks
│       ├── lib/            # Utility libraries
│       └── pages/          # Application pages
├── server/                 # Node.js backend
│   ├── index.ts            # Server entry point
│   ├── routes.ts           # API routes
│   ├── storage.ts          # Data storage layer
│   └── vite.ts             # Vite integration
├── shared/                 # Shared types and schemas
│   └── schema.ts           # Database schema definitions
└── docs/                   # Documentation
```
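The `shared/` folder lets the client and server agree on one set of types. As a hedged sketch of the idea (the real `shared/schema.ts` defines its own shapes, which may differ), a shared pipeline type plus a runtime guard the API layer could use might look like:

```typescript
// Illustrative only: the actual shared/schema.ts defines its own shapes.
// A type shared by client and server, plus a runtime guard for
// validating incoming request payloads.
export interface PipelineConfig {
  name: string;
  sources: string[];         // connector ids referenced by the pipeline
  transformations: string[];
  targets: string[];
}

export function isPipelineConfig(value: unknown): value is PipelineConfig {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const isStringArray = (x: unknown): x is string[] =>
    Array.isArray(x) && x.every(item => typeof item === "string");
  return (
    typeof v.name === "string" &&
    isStringArray(v.sources) &&
    isStringArray(v.transformations) &&
    isStringArray(v.targets)
  );
}
```

Because both sides import from `shared/`, a payload the client builds is, by construction, the payload the server expects.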
1. Navigate to Pipeline Builder
   - Click on "Pipeline Builder" in the sidebar
   - Start with a blank canvas
2. Add Data Sources
   - Drag Oracle, MongoDB, or Hive components from the toolbox
   - Configure connection parameters
3. Add Transformations
   - Add Filter, Join, or Aggregate components
   - Define transformation logic in YAML
4. Add Destinations
   - Configure data warehouse or file export targets
   - Set up output parameters
5. Save and Execute
   - Save your pipeline configuration
   - Click "Run" to execute immediately or schedule for later
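To make the run-or-schedule choice in the last step concrete, here is a hedged sketch of the request a "Run" click might produce. The `/api/pipelines/:id/run` endpoint and the payload shape are assumptions for illustration, not the app's documented API:

```typescript
// Hypothetical sketch: the endpoint path and payload shape are
// assumptions, not DataFlow Studio's documented API.
interface RunRequest {
  url: string;
  method: "POST";
  body: string; // JSON-encoded payload
}

function buildRunRequest(pipelineId: number, runNow: boolean, cron?: string): RunRequest {
  const payload = runNow
    ? { mode: "immediate" }
    : { mode: "scheduled", cron: cron ?? "0 0 * * *" }; // assumed default: daily at midnight
  return {
    url: `/api/pipelines/${pipelineId}/run`,
    method: "POST",
    body: JSON.stringify(payload),
  };
}
```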
1. Go to Connectors Page
   - Click the "Add Connector" button
   - Select your database type
2. Configure Connection
   - Enter host, port, and database credentials
   - Test the connection
3. Use in Pipelines
   - Reference connectors in your pipeline configurations
   - Data sources automatically use configured connections
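Before hitting "Test the connection", a form like the one above is usually sanity-checked locally. The sketch below shows what such pre-flight validation could look like; the field names mirror the form described above, but the exact shape is an assumption, not DataFlow Studio's actual code:

```typescript
// Illustrative pre-flight check before a live connection test.
// The ConnectorForm shape is assumed for this sketch.
interface ConnectorForm {
  type: "oracle" | "mongodb" | "hive" | "postgresql" | "mysql";
  host: string;
  port: number;
  database: string;
  username: string;
}

function validateConnector(form: ConnectorForm): string[] {
  const errors: string[] = [];
  if (form.host.trim() === "") errors.push("host is required");
  if (!Number.isInteger(form.port) || form.port < 1 || form.port > 65535)
    errors.push("port must be between 1 and 65535");
  if (form.database.trim() === "") errors.push("database is required");
  if (form.username.trim() === "") errors.push("username is required");
  return errors; // empty array: the form is ready for a live test
}
```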
1. Job Monitor Dashboard
   - View all running and completed jobs
   - Monitor progress and performance
2. Real-time Updates
   - Jobs refresh automatically every 5 seconds
   - View detailed logs and error messages
1. Create Schedule
   - Select a pipeline to schedule
   - Choose from predefined cron patterns
2. Manage Schedules
   - Enable/disable schedules
   - View next run times and history
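The predefined cron patterns mentioned above follow standard five-field cron syntax: minute, hour, day of month, month, day of week. A sketch of such a preset table (the preset names here are illustrative, not the exact labels in the UI):

```typescript
// Illustrative preset table; the actual names shown in the scheduler
// UI may differ. Standard five-field cron:
// minute hour day-of-month month day-of-week
const CRON_PRESETS: Record<string, string> = {
  "every 5 minutes": "*/5 * * * *",
  "hourly": "0 * * * *",
  "daily at midnight": "0 0 * * *",
  "weekly on Sunday": "0 0 * * 0",
  "monthly on the 1st": "0 0 1 * *",
};

// Resolve a preset name to its cron expression; a raw expression
// passes through unchanged.
function resolveCron(patternOrName: string): string {
  return CRON_PRESETS[patternOrName] ?? patternOrName;
}
```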
Create a `.env` file in the root directory:

```env
NODE_ENV=development
PORT=5000
DATABASE_URL=your_database_connection_string
```

Example pipeline configuration:
```yaml
transformations:
  - name: "customer_cleansing"
    type: "data_quality"
    rules:
      - field: "email"
        validation: "email_format"
      - field: "phone"
        standardize: "e164_format"

sources:
  oracle_orders:
    connection: "prod_oracle"
    query: "SELECT * FROM customers WHERE created_date >= '2024-01-01'"

targets:
  hive_warehouse:
    table: "analytics.customers_clean"
    mode: "append"
```

We welcome contributions! Please follow these steps:
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
- Follow TypeScript best practices
- Write comprehensive tests
- Update documentation for new features
- Follow the existing code style
This project is licensed under the MIT License - see the LICENSE file for details.
- Inspired by enterprise ETL tools like Talend and AbInitio
- Built with modern web technologies
- Designed for ease of use and scalability
- Create an issue for bug reports
- Start a discussion for feature requests
- Check the documentation for common questions
DataFlow Studio - Making ETL accessible to everyone, from data engineers to business analysts.