Learning Path
This guide suggests a structured learning path through the Apache Iceberg Practice Environment labs. While you can skip around, following this path ensures you build skills progressively.
The learning path is divided into three levels:
- Beginner (0-6 months experience): Foundation skills
- Intermediate (6-18 months experience): Production patterns
- Advanced (18+ months experience): System design and optimization
Lab 0: Sample Database Setup
- Generate and load realistic business data
- Explore sample database schema and relationships
- Practice basic SQL queries
- Understand data distribution and patterns
Why start here: Provides realistic data for all subsequent labs. No Iceberg knowledge required.
Lab 1: Environment Setup
- Verify all components are running
- Test catalog connectivity
- Validate storage access
- Perform first Iceberg operation
Why it's important: Ensures your environment works before diving into complex concepts.
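A connectivity check like the one described above can be sketched in a few lines of Python. This is only an illustration: the service names, hosts, and ports below are hypothetical placeholders, not the values this environment necessarily uses, and a successful TCP connection shows only that a port is open, not that the service behind it is healthy.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_services(services: dict[str, tuple[str, int]]) -> dict[str, bool]:
    """Map each service name to whether its TCP endpoint accepts connections."""
    return {name: is_port_open(host, port) for name, (host, port) in services.items()}


if __name__ == "__main__":
    # Hypothetical endpoints: replace with the hosts and ports of your setup.
    services = {
        "catalog": ("localhost", 8181),
        "object-storage": ("localhost", 9000),
        "spark-ui": ("localhost", 4040),
    }
    for name, reachable in check_services(services).items():
        print(f"{name}: {'reachable' if reachable else 'unreachable'}")
```

Running a check like this before a lab turns "is everything up?" into a quick, repeatable step.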
Lab 2: Basic Iceberg Operations
- Create Iceberg tables
- Insert and query data
- Understand schema evolution basics
- Practice CRUD operations
Why it matters: Core Iceberg concepts you'll use in every lab.
Beginner Milestone: Complete Labs 0-2 (approximately 2 hours)
Lab 3: Advanced Features
- Partitioning strategies
- Time travel queries
- Schema evolution with migrations
- Understanding snapshots and metadata
Why it's important: These features distinguish Iceberg from traditional data lakes.
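To build intuition for time travel before running it against a real table, the snapshot lookup can be modelled in plain Python. This is a conceptual sketch, not the Iceberg API: the `Snapshot` type, the `snapshot_as_of` helper, and the sample history are all made up for illustration. The idea it shows is that a time travel query resolves to the most recent snapshot committed at or before the requested time.

```python
from bisect import bisect_right
from typing import NamedTuple


class Snapshot(NamedTuple):
    snapshot_id: int
    timestamp_ms: int  # commit time in milliseconds since the epoch


def snapshot_as_of(history: list[Snapshot], timestamp_ms: int) -> Snapshot:
    """Return the snapshot that was current at timestamp_ms.

    history must be sorted by timestamp_ms in ascending order.
    """
    commit_times = [s.timestamp_ms for s in history]
    # Number of snapshots committed at or before the requested time.
    idx = bisect_right(commit_times, timestamp_ms)
    if idx == 0:
        raise ValueError("no snapshot exists at or before the requested time")
    return history[idx - 1]


# Illustrative history with made-up ids and timestamps.
history = [
    Snapshot(snapshot_id=101, timestamp_ms=1_000),
    Snapshot(snapshot_id=102, timestamp_ms=2_000),
    Snapshot(snapshot_id=103, timestamp_ms=3_000),
]

print(snapshot_as_of(history, 2_500).snapshot_id)  # 102
```

Each snapshot is an immutable view of the table, so reading "as of" an earlier time means picking an older entry from this history instead of the latest one.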
Lab 4: Spark Optimizations
- File compaction
- Snapshot management
- Query planning optimization
- Understanding metadata-only queries
Why it matters: Performance is critical in production environments.
Lab 5: Real-World Patterns
- Slowly Changing Dimensions (SCD)
- Upsert operations
- Batch and streaming patterns
- Star schema implementation
Why it's important: These patterns are used in real data engineering projects.
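The Slowly Changing Dimension pattern is easier to follow with a small worked example. The sketch below shows Type 2 logic on in-memory rows. The column names (`valid_from`, `valid_to`, `is_current`) and the `apply_scd2` helper are illustrative assumptions; on an Iceberg table the same idea would typically be expressed as a merge or upsert rather than a Python loop.

```python
from datetime import date

META_COLUMNS = ("valid_from", "valid_to", "is_current")


def apply_scd2(dimension: list[dict], key: str, incoming: dict, effective: date) -> list[dict]:
    """Apply one incoming record to a Type 2 dimension and return the new rows.

    Each row carries valid_from, valid_to (None while current), and is_current.
    """
    result = []
    needs_new_version = True
    for row in dimension:
        if row["is_current"] and row[key] == incoming[key]:
            attributes = {k: v for k, v in row.items() if k not in META_COLUMNS}
            if attributes == incoming:
                # Nothing changed: keep the current version as it is.
                needs_new_version = False
                result.append(row)
            else:
                # Close the old version instead of overwriting it.
                result.append({**row, "valid_to": effective, "is_current": False})
        else:
            result.append(row)
    if needs_new_version:
        result.append({**incoming, "valid_from": effective, "valid_to": None, "is_current": True})
    return result


# Illustrative data: one customer who moves from Berlin to Munich.
dimension = [
    {"customer_id": 1, "city": "Berlin", "valid_from": date(2024, 1, 1),
     "valid_to": None, "is_current": True},
]
updated = apply_scd2(dimension, "customer_id",
                     {"customer_id": 1, "city": "Munich"}, date(2024, 6, 1))
for row in updated:
    print(row)
```

After the update the dimension holds two rows for the customer: the closed Berlin version and the current Munich version, so history stays queryable.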
Intermediate Milestone: Complete Labs 3-5 (approximately 3 hours)
Lab 6: Performance & UI
- Complex Iceberg join operations
- Spark History Server UI exploration
- DAG inspection and metadata-only filtering
- Performance analysis and optimization
Why it's important: Understanding query execution helps optimize production workloads.
Lab 7: Table Maintenance
- File compaction and optimization strategies
- Snapshot management and expiration
- Orphan file cleanup and storage reclamation
- Table statistics collection and analysis
- Metadata optimization
- Monitoring and alerting setup
Why it's important: Maintenance is crucial for long-term production systems.
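As a rough preview of the maintenance tasks listed above, the helper below builds the Spark SQL `CALL` statements for three of Iceberg's built-in Spark procedures: `rewrite_data_files`, `expire_snapshots`, and `remove_orphan_files`. The catalog name `demo`, the table `db.orders`, and the retention values are hypothetical. The statements are only generated here; running them requires a Spark session with the Iceberg SQL extensions enabled.

```python
from datetime import datetime, timedelta


def maintenance_statements(catalog: str, table: str, now: datetime,
                           retention_days: int = 7, retain_last: int = 5) -> list[str]:
    """Build CALL statements for compaction, snapshot expiration, and orphan cleanup."""
    cutoff = (now - timedelta(days=retention_days)).strftime("%Y-%m-%d %H:%M:%S")
    return [
        # 1. Compact small data files into larger ones.
        f"CALL {catalog}.system.rewrite_data_files(table => '{table}')",
        # 2. Expire snapshots older than the cutoff, keeping the most recent ones.
        f"CALL {catalog}.system.expire_snapshots(table => '{table}', "
        f"older_than => TIMESTAMP '{cutoff}', retain_last => {retain_last})",
        # 3. Delete files that no table metadata references any more.
        f"CALL {catalog}.system.remove_orphan_files(table => '{table}', "
        f"older_than => TIMESTAMP '{cutoff}')",
    ]


if __name__ == "__main__":
    # Hypothetical catalog and table names, with a fixed "now" for reproducibility.
    for statement in maintenance_statements("demo", "db.orders", datetime(2024, 6, 15, 12, 0, 0)):
        print(statement)
```

The order matters in practice: expiring snapshots releases old files, and orphan cleanup then removes whatever is no longer referenced. Keep the orphan cutoff well in the past so files from in-progress writes are not deleted.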
Lab 8: Kafka Integration
- Set up Apache Kafka for real-time data streaming
- Produce and consume events with Kafka
- Integrate Spark Structured Streaming with Iceberg
- Implement real-time analytics on streaming data
- Handle exactly-once processing semantics
Why it's important: Streaming is essential for modern data architectures.
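Exactly-once results usually come from combining a replayable source with a sink that can tell when it has already seen a batch. The toy sink below illustrates that idea with batch ids. The `IdempotentSink` class is a made-up teaching aid, not how the Iceberg or Spark sinks are implemented.

```python
class IdempotentSink:
    """Toy sink that ignores batches it has already committed."""

    def __init__(self) -> None:
        self.rows: list[dict] = []
        self.last_committed_batch = -1

    def write_batch(self, batch_id: int, rows: list[dict]) -> bool:
        """Write the batch once; return False if it is a replay."""
        if batch_id <= self.last_committed_batch:
            # The batch was already committed, so a retry must not duplicate rows.
            return False
        self.rows.extend(rows)
        self.last_committed_batch = batch_id
        return True


sink = IdempotentSink()
sink.write_batch(0, [{"event": "a"}])
sink.write_batch(1, [{"event": "b"}])
# Simulate a retry after a failure: batch 1 is delivered again.
replayed = sink.write_batch(1, [{"event": "b"}])

print(replayed)        # False
print(len(sink.rows))  # 2
```

A real sink must record the rows and the batch id atomically, which is what a transactional table commit provides. This sketch skips that step.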
Lab 9: CDC with Debezium
- Configure Debezium for MySQL CDC
- Set up MySQL for change data capture
- Create and manage Debezium connectors
- Stream CDC events to Kafka topics
- Consume CDC events with Spark Structured Streaming
- Apply CDC changes to Iceberg tables
Why it's important: CDC enables real-time data synchronization across systems.
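The last step above, applying CDC changes to a table, can be sketched without any infrastructure. Debezium change events carry an `op` code (`c` for create, `u` for update, `d` for delete, `r` for a snapshot read) along with `before` and `after` row images. The function below replays simplified events onto an in-memory dictionary. The `apply_cdc_events` helper and the sample rows are illustrative assumptions, and real events also carry schema and source metadata that is omitted here.

```python
def apply_cdc_events(table: dict[int, dict], events: list[dict], key: str = "id") -> dict[int, dict]:
    """Replay Debezium-style change events onto a table keyed by primary key."""
    state = dict(table)  # copy so the input table is left untouched
    for event in events:
        op = event["op"]
        if op in ("c", "r", "u"):
            # Creates, snapshot reads, and updates all upsert the "after" image.
            row = event["after"]
            state[row[key]] = row
        elif op == "d":
            # Deletes identify the row through the "before" image.
            state.pop(event["before"][key], None)
        else:
            raise ValueError(f"unsupported op: {op!r}")
    return state


# Simplified, made-up events for an orders table.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"}, "after": {"id": 1, "status": "paid"}},
    {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
]

print(apply_cdc_events({}, events))  # {1: {'id': 1, 'status': 'paid'}}
```

Against an Iceberg table the same upsert and delete logic would typically be expressed as a merge keyed on the primary key, applied once per micro-batch.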
Lab 10: Spring Boot with Iceberg
- Create Spring Boot applications with Iceberg integration
- Configure Iceberg catalog and table access
- Implement CRUD operations on Iceberg tables
- Build REST APIs for Iceberg data access
- Implement transaction handling and error management
Why it's important: Applications need to interact with data lakes efficiently.
Lab 11: Multi-Engine Lakehouse
- Configure multiple query engines (Spark, Trino, DuckDB)
- Ensure schema consistency across engines
- Implement engine-specific optimizations
- Handle data type conversions between engines
- Monitor and optimize multi-engine workloads
Why it's important: Modern lakehouses use multiple engines for different use cases.
Advanced Milestone: Complete Labs 6-11 (approximately 8 hours)
If you have 2+ years of data engineering experience:
- Skip Labs 0-1: assume your environment already works
- Lab 2: Quick refresher on Iceberg basics (30 min)
- Labs 3-5: Focus on advanced features (2 hours)
- Labs 6-7: Performance and operations (2 hours)
- Choose a specialization: either streaming (Labs 8-9) or applications (Labs 10-11)
Total time: 5-6 hours
Focus on real-time data processing:
- Labs 0-2: Foundation (2 hours)
- Lab 5: Real-world patterns (1 hour)
- Lab 8: Kafka integration (1.5 hours)
- Lab 9: CDC with Debezium (1.5 hours)
- Lab 11: Multi-engine considerations (1.5 hours)
Total time: 7.5 hours
Focus on optimization and operations:
- Labs 0-2: Foundation (2 hours)
- Lab 3: Advanced features (1 hour)
- Lab 4: Spark optimizations (1 hour)
- Lab 6: Performance & UI (1.5 hours)
- Lab 7: Table maintenance (1.5 hours)
- Lab 11: Multi-engine optimization (1.5 hours)
Total time: 8.5 hours
Focus on building applications with Iceberg:
- Labs 0-2: Foundation (2 hours)
- Lab 5: Real-world patterns (1 hour)
- Lab 10: Spring Boot integration (1.5 hours)
- Lab 11: Multi-engine considerations (1.5 hours)
Total time: 6 hours
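The totals for the three focused paths above follow directly from the per-lab estimates, and the same arithmetic is handy when assembling a custom path. In the sketch below the durations are the ones quoted in those lists; the short path names are labels chosen here for illustration.

```python
# Estimated hours per lab (or lab group), as quoted in the focused paths.
LAB_HOURS = {
    "0-2": 2.0,
    "3": 1.0, "4": 1.0, "5": 1.0,
    "6": 1.5, "7": 1.5, "8": 1.5, "9": 1.5, "10": 1.5, "11": 1.5,
}

# Labs included in each focused path; the path names are illustrative labels.
PATHS = {
    "streaming": ["0-2", "5", "8", "9", "11"],
    "performance": ["0-2", "3", "4", "6", "7", "11"],
    "application": ["0-2", "5", "10", "11"],
}


def path_hours(labs: list[str]) -> float:
    """Sum the estimated hours for the given labs."""
    return sum(LAB_HOURS[lab] for lab in labs)


for name, labs in PATHS.items():
    print(f"{name}: {path_hours(labs)} hours")
# streaming: 7.5 hours
# performance: 8.5 hours
# application: 6.0 hours
```

To plan your own route, pass any list of lab keys to `path_hours`.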
Use this checklist to track your progress:
- Lab 0: Sample Database Setup
- Lab 1: Environment Setup
- Lab 2: Basic Iceberg Operations
- Lab 3: Advanced Features
- Lab 4: Spark Optimizations
- Lab 5: Real-World Patterns
- Lab 6: Performance & UI
- Lab 7: Table Maintenance
- Lab 8: Kafka Integration
- Lab 9: CDC with Debezium
- Lab 10: Spring Boot with Iceberg
- Lab 11: Multi-Engine Lakehouse
- Beginner Path: 2 hours
- Intermediate Path: 3 hours
- Advanced Path: 8 hours
- Complete Learning Path: 13 hours
- Understanding is more important than speed
- Take time to read the conceptual guides
- Review solutions even when you succeed
- 30-60 minutes daily is better than 4 hours weekly
- Consistency builds muscle memory
- Revisit labs after breaks
- Try the same operations in Spark, Trino, and DuckDB
- Understand engine-specific behaviors
- Learn which engine is best for which task
- Understand why your solution failed
- Read error messages carefully
- Check the solution notebooks for patterns
- Each lab builds on earlier concepts
- Don't skip foundational labs
- Reference completed labs when stuck
- Try patterns in your actual projects
- Adapt labs to your use cases
- Share learnings with your team
- Review Prerequisites: Ensure you've completed required previous labs
- Read the Lab Guide: Understand objectives and requirements
- Check Environment: Verify all services are running
- Allocate Time: Ensure you have the estimated time available
- Have Resources Ready: Keep documentation and solution notebooks accessible
- Review Solutions: Compare with solution notebooks
- Document Learnings: Note key concepts and patterns
- Practice Again: Try variations of the exercises
- Teach Someone: Explain concepts to reinforce learning
- Apply to Projects: Use patterns in real scenarios
Feel free to customize based on your needs:
- Iceberg Deep Dive: Labs 2-4, 6-7
- Streaming Focus: Labs 5, 8-9
- Multi-Engine Focus: Labs 4, 6, 11
- Application Development: Labs 2, 5, 10
- 1 Hour: Lab 2 only
- Half Day: Labs 0-5
- Full Day: Labs 0-9
- Two Days: Complete all labs
- Beginner: Complete all beginner labs, skip advanced
- Intermediate: Complete beginner and intermediate, sample advanced
- Advanced: Skip beginner, focus on intermediate and advanced
After completing the learning path:
- Build a Project: Apply patterns to a real project
- Contribute: Add new labs or improve existing ones
- Specialize: Deep dive into specific areas (streaming, performance, etc.)
- Teach: Share knowledge with your team or community
- Certify: Pursue Iceberg-related certifications, where available
- Getting Started Guide
- Best Practices
- Troubleshooting
- Iceberg Fundamentals
- Main Repository
- Open an issue on GitHub for questions
Ready to start? Begin with Lab 0: Sample Database Setup 🚀