Skip to content

Latest commit

 

History

History
134 lines (99 loc) · 5.83 KB

File metadata and controls

134 lines (99 loc) · 5.83 KB

About Mini Data Warehouse

Project Description

Mini Data Warehouse is a comprehensive, production-ready data warehouse demonstration project that showcases modern data engineering practices and technologies. Built with PostgreSQL, Docker, and Python, this project provides a complete end-to-end data warehousing solution with synthetic e-commerce data.

What is This Project?

This is an educational and practical implementation of a data warehouse system that demonstrates:

  • Data Warehouse Architecture: Traditional star schema design with fact and dimension tables
  • ETL Pipeline Development: Complete Extract, Transform, Load processes with data quality controls
  • Synthetic Data Generation: Realistic e-commerce dataset using Python Faker library
  • Analytics & Reporting: Interactive dashboards and pre-built analytical queries
  • Data Governance: Quality checks, retention policies, and incremental loading strategies

Key Features

🏗️ Complete Data Warehouse Infrastructure

  • PostgreSQL 15 database as the analytical data store
  • Docker containerization for easy deployment and portability
  • PgAdmin web interface for database administration
  • Star schema implementation with optimized fact and dimension tables

📊 Synthetic E-Commerce Data

  • 200+ customers with realistic profiles and geographic distribution
  • 50+ products across 5 categories (Electronics, Furniture, Clothing, Books, Sports)
  • 1000+ orders with variable order items and realistic date ranges
  • Global customer base with diverse countries and cities

🔄 Advanced ETL Capabilities

  • Data Transformation Pipelines: Customer enrichment, product performance analytics
  • Incremental Loading: Support for ongoing data updates and change data capture
  • Data Quality Management: Automated checks for completeness, integrity, and consistency
  • Retention Policies: Configurable data lifecycle management with archival

📈 Analytics & Visualization

  • Interactive Dashboard: HTML-based dashboard with real-time metrics and charts
  • Star Schema Queries: Pre-built analytical queries for common business questions
  • Performance Metrics: Customer segmentation, product performance, revenue trends
  • Time-series Analysis: Daily, monthly, and quarterly aggregations

🛠️ Management & Operations

  • Unified CLI: warehouse_manager.py for all warehouse operations
  • Pipeline Automation: One-command execution of complete ETL workflows
  • Status Monitoring: Real-time health checks and data statistics
  • Maintenance Tools: Cleanup, backup, and service management utilities

Technology Stack

Core Technologies

  • Database: PostgreSQL 15
  • Container Platform: Docker & Docker Compose
  • Programming Language: Python 3.x
  • Data Processing: Pandas library
  • Data Generation: Faker library

Development Tools

  • Package Manager: Nix (with flake support)
  • Environment Management: direnv
  • Database Admin: PgAdmin 4
  • Version Control: Git

Use Cases

This project is ideal for:

  1. Learning Data Warehousing: Understand star schema design, ETL processes, and analytical queries
  2. Prototyping Analytics Solutions: Quick setup for testing BI tools and data visualization platforms
  3. Teaching Data Engineering: Comprehensive example for educational purposes
  4. Development & Testing: Realistic dataset for application development and testing
  5. Portfolio Projects: Demonstrate data engineering skills and best practices

Project Architecture

Data Flow

  1. Data Generation → Synthetic CSV files created with realistic patterns
  2. Data Loading → CSV files imported into PostgreSQL staging tables
  3. Data Transformation → ETL pipelines enrich and validate data
  4. Star Schema Population → Dimension and fact tables populated
  5. Analytics & Reporting → Dashboards and queries consume warehouse data

Database Schema

  • Raw Tables: customers, products, orders, order_items
  • Dimension Tables: dim_customer, dim_product, dim_date
  • Fact Tables: fact_sales with comprehensive sales metrics
  • Aggregation Views: Pre-computed metrics for performance

Getting Started

The quickest way to get started:

# Run complete pipeline in one command
python warehouse_manager.py pipeline

This single command will:

  • Generate synthetic data
  • Start database services
  • Create star schema
  • Run transformations
  • Execute quality checks
  • Generate dashboard

Project Goals

This project aims to:

  • Demonstrate Best Practices: Show modern data warehouse implementation patterns
  • Provide Hands-on Learning: Offer a complete, working example for study and experimentation
  • Enable Quick Prototyping: Allow rapid setup of analytical environments
  • Showcase Technologies: Integrate popular data engineering tools and frameworks
  • Promote Data Quality: Emphasize data governance and quality management

Target Audience

  • Data Engineers: Learn warehouse design and ETL implementation
  • Data Analysts: Understand data structures and analytical patterns
  • Students: Study real-world data engineering projects
  • Developers: Quick analytics setup for application development
  • Data Scientists: Realistic dataset for algorithm testing and experimentation

Open Source

This project is open source and available under the MIT License, making it free to use, modify, and distribute for both personal and commercial purposes.

Contributing

Contributions are welcome! This project serves as both a learning tool and a practical implementation, and improvements or new features that enhance either aspect are encouraged.


Mini Data Warehouse - A complete, containerized data warehouse solution demonstrating modern data engineering practices with PostgreSQL, Docker, and Python.