# Learning Path - Recommended Order for Labs

This guide suggests a structured learning path through the Apache Iceberg Practice Environment labs. While you can skip around, following this path ensures you build skills progressively.

## Overview

The learning path is divided into three levels:

- **Beginner** (0-6 months experience): Foundation skills
- **Intermediate** (6-18 months experience): Production patterns
- **Advanced** (18+ months experience): System design and optimization

## Beginner Path

### Goal: Build foundation in Iceberg and basic data engineering

### Week 1: Environment Setup and Fundamentals

#### Lab 0: Sample Database Setup (30-45 min)

- Generate and load realistic business data
- Explore sample database schema and relationships
- Practice basic SQL queries
- Understand data distribution and patterns

**Why start here**: Provides realistic data for all subsequent labs. No Iceberg knowledge required.

#### Lab 1: Environment Setup (30-45 min)

- Verify all components are running
- Test catalog connectivity
- Validate storage access
- Perform first Iceberg operation

**Why it's important**: Ensures your environment works before diving into complex concepts.

#### Lab 2: Basic Iceberg Operations (30-45 min)

- Create Iceberg tables
- Insert and query data
- Understand schema evolution basics
- Practice CRUD operations

**Why it matters**: Core Iceberg concepts you'll use in every lab.

**Beginner Milestone**: Complete Labs 0-2 (approximately 2 hours)

## Intermediate Path

### Goal: Master production patterns and optimization

### Week 2: Advanced Iceberg Features

#### Lab 3: Advanced Features (45-60 min)

- Partitioning strategies
- Time travel queries
- Schema evolution with migrations
- Understanding snapshots and metadata

**Why it's important**: These features distinguish Iceberg from traditional data lakes.

#### Lab 4: Spark Optimizations (45-60 min)

- File compaction
- Snapshot management
- Query planning optimization
- Understanding metadata-only queries

**Why it matters**: Performance is critical in production environments.

### Week 3: Real-World Patterns

#### Lab 5: Real-World Patterns (45-60 min)

- Slowly Changing Dimensions (SCD)
- Upsert operations
- Batch and streaming patterns
- Star schema implementation

**Why it's important**: These patterns are used in real data engineering projects.

**Intermediate Milestone**: Complete Labs 3-5 (approximately 3 hours)

## Advanced Path

### Goal: System design, streaming, and multi-engine architecture

### Week 4: Performance and Operations

#### Lab 6: Performance & UI (60-90 min)

- Complex Iceberg join operations
- Spark History Server UI exploration
- DAG inspection and metadata-only filtering
- Performance analysis and optimization

**Why it's important**: Understanding query execution helps optimize production workloads.

#### Lab 7: Table Maintenance (60-90 min)

- File compaction and optimization strategies
- Snapshot management and expiration
- Orphan file cleanup and storage reclamation
- Table statistics collection and analysis
- Metadata optimization
- Monitoring and alerting setup

**Why it's important**: Maintenance is crucial for long-term production systems.

### Week 5: Streaming and CDC

#### Lab 8: Kafka Integration (60-90 min)

- Set up Apache Kafka for real-time data streaming
- Produce and consume events with Kafka
- Integrate Spark Structured Streaming with Iceberg
- Implement real-time analytics on streaming data
- Handle exactly-once processing semantics

**Why it's important**: Streaming is essential for modern data architectures.

#### Lab 9: Real CDC with Debezium (60-90 min)

- Configure Debezium for MySQL CDC
- Set up MySQL for change data capture
- Create and manage Debezium connectors
- Stream CDC events to Kafka topics
- Consume CDC events with Spark Structured Streaming
- Apply CDC changes to Iceberg tables

**Why it's important**: CDC enables real-time data synchronization across systems.

### Week 6: Application Integration and Multi-Engine

#### Lab 10: Spring Boot with Iceberg (60-90 min)

- Create Spring Boot applications with Iceberg integration
- Configure Iceberg catalog and table access
- Implement CRUD operations on Iceberg tables
- Build REST APIs for Iceberg data access
- Implement transaction handling and error management

**Why it's important**: Applications need to interact with data lakes efficiently.

#### Lab 11: Multi-Engine Lakehouse (60-90 min)

- Configure multiple query engines (Spark, Trino, DuckDB)
- Ensure schema consistency across engines
- Implement engine-specific optimizations
- Handle data type conversions between engines
- Monitor and optimize multi-engine workloads

**Why it's important**: Modern lakehouses use multiple engines for different use cases.

**Advanced Milestone**: Complete Labs 6-11 (approximately 8 hours)

## Alternative Learning Paths

### Fast Track for Experienced Engineers

If you have 2+ years of data engineering experience:

1. **Skip Labs 0-1**: Assume environment works
2. **Lab 2**: Quick refresher on Iceberg basics (30 min)
3. **Labs 3-5**: Focus on advanced features (2 hours)
4. **Labs 6-7**: Performance and operations (2 hours)
5. **Choose specialization**: Either streaming (8-9) or applications (10-11)

**Total time**: 5-6 hours

### Streaming Specialist Path

Focus on real-time data processing:

1. **Labs 0-2**: Foundation (2 hours)
2. **Lab 5**: Real-world patterns (1 hour)
3. **Lab 8**: Kafka integration (1.5 hours)
4. **Lab 9**: CDC with Debezium (1.5 hours)
5. **Lab 11**: Multi-engine considerations (1.5 hours)

**Total time**: 7.5 hours

### Performance Engineer Path

Focus on optimization and operations:

1. **Labs 0-2**: Foundation (2 hours)
2. **Lab 3**: Advanced features (1 hour)
3. **Lab 4**: Spark optimizations (1 hour)
4. **Lab 6**: Performance & UI (1.5 hours)
5. **Lab 7**: Table maintenance (1.5 hours)
6. **Lab 11**: Multi-engine optimization (1.5 hours)

**Total time**: 8.5 hours

### Application Developer Path

Focus on building applications with Iceberg:

1. **Labs 0-2**: Foundation (2 hours)
2. **Lab 5**: Real-world patterns (1 hour)
3. **Lab 10**: Spring Boot integration (1.5 hours)
4. **Lab 11**: Multi-engine considerations (1.5 hours)

**Total time**: 6 hours

## Progress Tracking

Use this checklist to track your progress:

### Beginner

- [ ] Lab 0: Sample Database Setup
- [ ] Lab 1: Environment Setup
- [ ] Lab 2: Basic Iceberg Operations

### Intermediate

- [ ] Lab 3: Advanced Features
- [ ] Lab 4: Spark Optimizations
- [ ] Lab 5: Real-World Patterns

### Advanced

- [ ] Lab 6: Performance & UI
- [ ] Lab 7: Table Maintenance
- [ ] Lab 8: Kafka Integration
- [ ] Lab 9: CDC with Debezium
- [ ] Lab 10: Spring Boot with Iceberg
- [ ] Lab 11: Multi-Engine Lakehouse

## Time Estimates

- **Beginner Path**: 2 hours
- **Intermediate Path**: 3 hours
- **Advanced Path**: 8 hours
- **Complete Learning Path**: 13 hours

## Tips for Following the Path

### 1. Don't Rush

- Understanding is more important than speed
- Take time to read the conceptual guides
- Review solutions even when you succeed

### 2. Practice Regularly

- 30-60 minutes daily is better than 4 hours weekly
- Consistency builds muscle memory
- Revisit labs after breaks

### 3. Use Multiple Engines

- Try the same operations in Spark, Trino, and DuckDB
- Understand engine-specific behaviors
- Learn which engine is best for which task

### 4. Learn from Mistakes

- Understand why your solution failed
- Read error messages carefully
- Check the solution notebooks for patterns

### 5. Build on Previous Knowledge

- Each lab builds on earlier concepts
- Don't skip foundational labs
- Reference completed labs when stuck

### 6. Apply to Real Work

- Try patterns in your actual projects
- Adapt labs to your use cases
- Share learnings with your team

## Before Starting Each Lab

1. **Review Prerequisites**: Ensure you've completed required previous labs
2. **Read the Lab Guide**: Understand objectives and requirements
3. **Check Environment**: Verify all services are running
4. **Allocate Time**: Ensure you have the estimated time available
5. **Have Resources Ready**: Keep documentation and solution notebooks accessible

## After Completing Each Lab

1. **Review Solutions**: Compare with solution notebooks
2. **Document Learnings**: Note key concepts and patterns
3. **Practice Again**: Try variations of the exercises
4. **Teach Someone**: Explain concepts to reinforce learning
5. **Apply to Projects**: Use patterns in real scenarios

## Customizing Your Path

Feel free to customize based on your needs:

### Focus on Specific Topics

- **Iceberg Deep Dive**: Labs 2-4, 6-7
- **Streaming Focus**: Labs 5, 8-9
- **Multi-Engine Focus**: Labs 4, 6, 11
- **Application Development**: Labs 2, 5, 10

### Time Constraints

- **1 Hour**: Lab 2 only
- **Half Day**: Labs 0-5
- **Full Day**: Labs 0-9
- **Two Days**: Complete all labs

### Skill Level Adjustment

- **Beginner**: Complete all beginner labs, skip advanced
- **Intermediate**: Complete beginner and intermediate, sample advanced
- **Advanced**: Skip beginner, focus on intermediate and advanced

## Next Steps

After completing the learning path:

1. **Build a Project**: Apply patterns to a real project
2. **Contribute**: Add new labs or improve existing ones
3. **Specialize**: Deep dive into specific areas (streaming, performance, etc.)
4. **Teach**: Share knowledge with your team or community
5. **Certify**: Pursue Apache Iceberg or related certifications

## Additional Resources

- [Getting Started Guide](Getting-Started.md)
- [Best Practices](Best-Practices.md)
- [Troubleshooting](Troubleshooting.md)
- [Iceberg Fundamentals](Iceberg-Fundamentals.md)
- [Main Repository](https://github.com/nellaivijay/iceberg-code-practice)
- Open an issue on GitHub for questions

---

**Ready to start?** Begin with [Lab 0: Sample Database Setup](https://github.com/nellaivijay/iceberg-code-practice/blob/main/labs/lab-00-sample-database.md) 🚀