Mini Data Warehouse is a data warehouse demonstration project that showcases modern data engineering practices and technologies. Built with PostgreSQL, Docker, and Python, it provides a complete end-to-end data warehousing solution populated with synthetic e-commerce data.
This is an educational and practical implementation of a data warehouse system that demonstrates:
- Data Warehouse Architecture: Traditional star schema design with fact and dimension tables
- ETL Pipeline Development: Complete Extract, Transform, Load processes with data quality controls
- Synthetic Data Generation: Realistic e-commerce dataset using Python Faker library
- Analytics & Reporting: Interactive dashboards and pre-built analytical queries
- Data Governance: Quality checks, retention policies, and incremental loading strategies
- PostgreSQL 15 database as the analytical data store
- Docker containerization for easy deployment and portability
- PgAdmin web interface for database administration
- Star schema implementation with optimized fact and dimension tables
- 200+ customers with realistic profiles and geographic distribution
- 50+ products across 5 categories (Electronics, Furniture, Clothing, Books, Sports)
- 1000+ orders with variable order items and realistic date ranges
- Global customer base with diverse countries and cities
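As a rough illustration of the dataset shape described above, the sketch below generates customers, products, orders, and order items at the listed sizes. It is not the project's generator: the project uses Faker, while this version uses only the standard library so that it runs without extra dependencies. The function name `generate_dataset`, the field names, the country list, the price range, and the date range are all assumptions made for illustration.

```python
import random
from datetime import date, timedelta

CATEGORIES = ["Electronics", "Furniture", "Clothing", "Books", "Sports"]


def generate_dataset(n_customers=200, n_products=50, n_orders=1000, seed=42):
    """Return (customers, products, orders, order_items) as lists of dicts."""
    rng = random.Random(seed)  # seeded so repeated runs produce the same data
    # Hypothetical country list; the real project draws locations from Faker.
    countries = ["US", "DE", "BR", "JP", "IN"]
    customers = [
        {"customer_id": i, "name": f"Customer {i}", "country": rng.choice(countries)}
        for i in range(1, n_customers + 1)
    ]
    products = [
        {
            "product_id": i,
            # Cycle through the five categories so each one is represented
            "category": CATEGORIES[(i - 1) % len(CATEGORIES)],
            "price": round(rng.uniform(5, 500), 2),  # assumed price range
        }
        for i in range(1, n_products + 1)
    ]
    start = date(2023, 1, 1)  # assumed one-year date range
    orders, order_items = [], []
    for order_id in range(1, n_orders + 1):
        orders.append({
            "order_id": order_id,
            "customer_id": rng.randint(1, n_customers),
            "order_date": start + timedelta(days=rng.randint(0, 364)),
        })
        # Each order gets a variable number of distinct products
        for product in rng.sample(products, k=rng.randint(1, 4)):
            order_items.append({
                "order_id": order_id,
                "product_id": product["product_id"],
                "quantity": rng.randint(1, 5),
                "unit_price": product["price"],
            })
    return customers, products, orders, order_items


customers, products, orders, order_items = generate_dataset()
```

Seeding the random generator keeps the output reproducible, which makes downstream row counts and query results comparable between runs.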
- Data Transformation Pipelines: Customer enrichment, product performance analytics
- Incremental Loading: Support for ongoing data updates and change data capture
- Data Quality Management: Automated checks for completeness, integrity, and consistency
- Retention Policies: Configurable data lifecycle management with archival
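The quality checks and incremental loading listed above could look roughly like the following. This is a minimal sketch on in-memory rows, not the project's ETL code: the function names `run_quality_checks` and `incremental_batch`, the row layout, and the choice of `order_date` as the watermark column are assumptions.

```python
from datetime import date


def run_quality_checks(orders, customer_ids):
    """Return a dict mapping each check name to its number of failing rows."""
    seen, duplicates = set(), 0
    for row in orders:
        if row["order_id"] in seen:
            duplicates += 1
        seen.add(row["order_id"])
    return {
        # Completeness: required fields must be present
        "missing_order_date": sum(1 for r in orders if r.get("order_date") is None),
        # Integrity: every order must reference a known customer
        "unknown_customer": sum(1 for r in orders if r["customer_id"] not in customer_ids),
        # Consistency: order_id must be unique
        "duplicate_order_id": duplicates,
    }


def incremental_batch(orders, watermark):
    """Select rows dated after the watermark and return the advanced watermark."""
    batch = [
        r for r in orders
        if r["order_date"] is not None and r["order_date"] > watermark
    ]
    new_watermark = max((r["order_date"] for r in batch), default=watermark)
    return batch, new_watermark


orders = [
    {"order_id": 1, "customer_id": 10, "order_date": date(2024, 1, 1)},
    {"order_id": 2, "customer_id": 11, "order_date": date(2024, 1, 3)},
    {"order_id": 2, "customer_id": 99, "order_date": None},  # deliberately bad row
]
report = run_quality_checks(orders, customer_ids={10, 11})
batch, watermark = incremental_batch(orders, watermark=date(2024, 1, 1))
```

Here the deliberately bad third row fails all three checks once each, and only the order dated after the previous watermark is selected for loading.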
- Interactive Dashboard: HTML-based dashboard with real-time metrics and charts
- Star Schema Queries: Pre-built analytical queries for common business questions
- Performance Metrics: Customer segmentation, product performance, revenue trends
- Time-series Analysis: Daily, monthly, and quarterly aggregations
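To make the daily, monthly, and quarterly aggregations concrete, here is a small sketch that buckets revenue by period. The project lists Pandas for data processing; this version uses only the standard library to stay dependency-free. The function name `aggregate_revenue`, the period labels, and the sample rows are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date


def aggregate_revenue(sales, period):
    """Sum revenue per period; period is 'daily', 'monthly', or 'quarterly'."""
    def key(d):
        if period == "daily":
            return d.isoformat()
        if period == "monthly":
            return f"{d.year}-{d.month:02d}"
        if period == "quarterly":
            return f"{d.year}-Q{(d.month - 1) // 3 + 1}"
        raise ValueError(f"unknown period: {period}")

    totals = defaultdict(float)
    for sale_date, amount in sales:
        totals[key(sale_date)] += amount
    return dict(totals)


# Sample (date, revenue) pairs, made up for the example
sales = [
    (date(2024, 1, 15), 100.0),
    (date(2024, 1, 20), 50.0),
    (date(2024, 2, 1), 25.0),
    (date(2024, 4, 2), 10.0),
]
monthly = aggregate_revenue(sales, "monthly")
quarterly = aggregate_revenue(sales, "quarterly")
```

In a warehouse, the same grouping would typically be done in SQL against the date dimension rather than in application code.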
- Unified CLI: `warehouse_manager.py` for all warehouse operations
- Pipeline Automation: One-command execution of complete ETL workflows
- Status Monitoring: Real-time health checks and data statistics
- Maintenance Tools: Cleanup, backup, and service management utilities
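A unified CLI of this kind is commonly built with `argparse` subcommands. The sketch below shows one possible layout and is not the actual `warehouse_manager.py`. Only the `pipeline` subcommand appears in this README; `status` and `cleanup` are hypothetical names chosen to mirror the features listed above, and the handlers are placeholders.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        prog="warehouse_manager.py",
        description="Unified entry point for warehouse operations (sketch).",
    )
    sub = parser.add_subparsers(dest="command", required=True)
    # "pipeline" is the only subcommand named in this README.
    sub.add_parser("pipeline", help="run the complete ETL workflow")
    # Hypothetical subcommands, named after the features listed above.
    sub.add_parser("status", help="print health checks and row counts")
    sub.add_parser("cleanup", help="remove generated files and containers")
    return parser


def dispatch(command):
    """Map a subcommand name to a handler; the handlers are placeholders."""
    handlers = {
        "pipeline": lambda: "running full pipeline",
        "status": lambda: "reporting status",
        "cleanup": lambda: "cleaning up",
    }
    return handlers[command]()


# Parse an explicit argument list instead of sys.argv so the example is self-contained
args = build_parser().parse_args(["pipeline"])
message = dispatch(args.command)
```

Keeping the parser and the dispatch table separate makes it easy to add a subcommand without touching the existing handlers.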
- Database: PostgreSQL 15
- Container Platform: Docker & Docker Compose
- Programming Language: Python 3.x
- Data Processing: Pandas library
- Data Generation: Faker library
- Package Manager: Nix (with flake support)
- Environment Management: direnv
- Database Admin: PgAdmin 4
- Version Control: Git
This project is ideal for:
- Learning Data Warehousing: Understand star schema design, ETL processes, and analytical queries
- Prototyping Analytics Solutions: Quick setup for testing BI tools and data visualization platforms
- Teaching Data Engineering: Comprehensive example for educational purposes
- Development & Testing: Realistic dataset for application development and testing
- Portfolio Projects: Demonstrate data engineering skills and best practices
- Data Generation → Synthetic CSV files created with realistic patterns
- Data Loading → CSV files imported into PostgreSQL staging tables
- Data Transformation → ETL pipelines enrich and validate data
- Star Schema Population → Dimension and fact tables populated
- Analytics & Reporting → Dashboards and queries consume warehouse data
- Raw Tables: `customers`, `products`, `orders`, `order_items`
- Dimension Tables: `dim_customer`, `dim_product`, `dim_date`
- Fact Tables: `fact_sales` with comprehensive sales metrics
- Aggregation Views: Pre-computed metrics for performance
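The relationship between the dimension tables and `fact_sales` can be sketched as follows. The example uses Python's built-in `sqlite3` with an in-memory database as a stand-in for PostgreSQL so that it runs anywhere. The table names come from the list above, but every column name and sample row is an assumption, not the project's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL warehouse
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date     (date_key     INTEGER PRIMARY KEY, full_date TEXT,
                           year INTEGER, month INTEGER);
-- The fact table holds measures plus one foreign key per dimension
CREATE TABLE fact_sales (
    sale_id      INTEGER PRIMARY KEY,
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    quantity     INTEGER,
    revenue      REAL
);
""")

# Sample rows, made up for the example
conn.execute("INSERT INTO dim_customer VALUES (1, 'Ada', 'UK')")
conn.executemany(
    "INSERT INTO dim_product VALUES (?, ?, ?)",
    [(1, "Laptop", "Electronics"), (2, "Desk", "Furniture")],
)
conn.execute("INSERT INTO dim_date VALUES (20240115, '2024-01-15', 2024, 1)")
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 1, 1, 20240115, 1, 1200.0),
        (2, 1, 2, 20240115, 2, 600.0),
        (3, 1, 1, 20240115, 1, 1200.0),
    ],
)

# A typical star schema query: join the fact table to a dimension and aggregate
rows = conn.execute("""
    SELECT p.category, SUM(f.revenue) AS revenue
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY revenue DESC
""").fetchall()
```

Analytical queries then follow the same pattern: filter and group on dimension attributes, and aggregate the measures stored in the fact table.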
The quickest way to get started:

    # Run complete pipeline in one command
    python warehouse_manager.py pipeline

This single command will:
- Generate synthetic data
- Start database services
- Create star schema
- Run transformations
- Execute quality checks
- Generate dashboard
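One plausible way to chain the six steps above is a simple runner that executes them in order and stops at the first failure, so that later steps never run against incomplete data. This is a sketch under that assumption, not the project's implementation. `run_pipeline` is a hypothetical name, and the step bodies are no-op placeholders.

```python
def run_pipeline(steps):
    """Run (name, callable) steps in order and stop at the first failure.

    Returns (completed_step_names, failed_step_name_or_None).
    """
    completed = []
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            print(f"[FAILED] {name}: {exc}")
            return completed, name
        print(f"[ok] {name}")
        completed.append(name)
    return completed, None


STEP_NAMES = [
    "Generate synthetic data",
    "Start database services",
    "Create star schema",
    "Run transformations",
    "Execute quality checks",
    "Generate dashboard",
]

# Placeholder steps; a real pipeline would call into the project's modules here
steps = [(name, lambda: None) for name in STEP_NAMES]
completed, failed = run_pipeline(steps)
```

Returning the name of the failed step lets the caller report exactly where the run stopped.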
This project aims to:
- Demonstrate Best Practices: Show modern data warehouse implementation patterns
- Provide Hands-on Learning: Offer a complete, working example for study and experimentation
- Enable Quick Prototyping: Allow rapid setup of analytical environments
- Showcase Technologies: Integrate popular data engineering tools and frameworks
- Promote Data Quality: Emphasize data governance and quality management
- Data Engineers: Learn warehouse design and ETL implementation
- Data Analysts: Understand data structures and analytical patterns
- Students: Study real-world data engineering projects
- Developers: Quick analytics setup for application development
- Data Scientists: Realistic dataset for algorithm testing and experimentation
This project is open source and available under the MIT License, making it free to use, modify, and distribute for both personal and commercial purposes.
Contributions are welcome! This project serves as both a learning tool and a practical implementation, and improvements or new features that enhance either aspect are encouraged.
Mini Data Warehouse - A complete, containerized data warehouse solution demonstrating modern data engineering practices with PostgreSQL, Docker, and Python.