The definitive, community-driven reference for modern data lakehouse engineering: a comparison of Delta Lake and Apache Iceberg with production-tested recipes, automated freshness tracking, and weekly AI-powered content discovery.
Explore the full site at analytical-guide.github.io/Datalake-Guide for searchable documentation, tutorials, and curated resources, automatically kept up to date.
Choosing between Delta Lake and Apache Iceberg, or deciding how to deploy either in production, is non-trivial. This hub helps by providing:
| Need | What You'll Find |
|---|---|
| Understand the differences | Feature comparison matrix with 60+ criteria across 10 dimensions |
| Get hands-on quickly | CI/CD-validated code recipes for common workloads |
| Move to production | Production readiness guide with advanced compaction, monitoring, and DR patterns |
| Migrate existing systems | Step-by-step migration guide covering Parquet, Hive, and cross-cloud scenarios |
| Stay current | Weekly automated discovery of new articles and blog posts from trusted sources |

    Datalake-Guide/
    ├── docs/
    │   ├── comparisons/feature-matrix.md           # Delta vs Iceberg (60+ features)
    │   ├── tutorials/getting-started.md            # Quickstart for both formats
    │   ├── tutorials/migration-guide.md            # Parquet → Delta/Iceberg migration
    │   ├── best-practices/production-readiness.md  # Production checklist
    │   ├── architecture/system-overview.md         # Hub automation architecture
    │   └── awesome-list.md                         # Curated resources (auto-updated)
    ├── code-recipes/examples/                      # Runnable, validated code recipes
    ├── community/                                  # Contributor stats & processed URLs
    ├── scripts/                                    # Automation scripts
    └── .github/workflows/                          # 8 automated GitHub Actions

| Section | Location | Description |
|---|---|---|
| Feature Matrix | docs/comparisons/feature-matrix.md | 60+ feature comparison across 10 dimensions |
| Code Recipes | code-recipes/ | CI-validated, production-ready examples |
| Getting Started | docs/tutorials/getting-started.md | Hands-on quickstart for both technologies |
| Migration Guide | docs/tutorials/migration-guide.md | Parquet, Hive → Delta/Iceberg with validation scripts |
| Production Readiness | docs/best-practices/production-readiness.md | Advanced patterns for production deployments |
| Architecture | docs/architecture/system-overview.md | Hub automation and workflow architecture |
| Awesome List | docs/awesome-list.md | Curated resources, updated weekly by AI aggregator |
| Knowledge Quiz | quiz.md | Test and track your knowledge |
- Feature Comparison Matrix: 60+ criteria, benchmarks, and decision framework
- Code Recipes: Production-ready examples with CI validation
- Getting Started: First Delta/Iceberg table in minutes
- Migration Guide: Parquet/Hive → modern format
- Production Readiness: Best practices for production
- Contributing Guide: Earn points, join the community
- Code of Conduct: Community standards
- Community Leaderboard: Top contributors
Unlike traditional static documentation, this repository is designed as a living knowledge base that continuously evolves through automation:
| Automation | Trigger | What It Does |
|---|---|---|
| Code Recipe CI | Every PR | Lints Python, runs validate.sh per recipe |
| Documentation CI | Every PR | Markdownlint, link checker, Mermaid diagram validation |
| Stale Content Bot | Weekly (Mon) | Opens issues for docs untouched > 12 months |
| Resource Aggregator | Weekly (Sun) | Discovers new articles from RSS feeds, commits to awesome list |
| Leaderboard Update | Daily | Regenerates top-10 contributor table in README |
| Gamification Engine | PR/Review/Issue | Awards points and updates contributor stats |
| Quiz Leaderboard | Issue comment | Updates quiz scores in the leaderboard issue |
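
As an illustration of the Stale Content Bot row, staleness detection can be reduced to comparing each document's last-modified date against a cutoff. The sketch below is a minimal version under stated assumptions: the function name `find_stale_docs`, the 365-day approximation of "12 months", and the path-to-date input mapping are all hypothetical and are not taken from the repository's scripts. The dates in the sample data are invented for the example.

```python
from datetime import date, timedelta

# Assumption: "12 months" is approximated as 365 days.
STALE_AFTER = timedelta(days=365)


def find_stale_docs(last_modified, today):
    """Return paths last changed more than STALE_AFTER before `today`.

    last_modified: dict mapping a doc path to the date of its last change
    (hypothetical input; a real bot would likely derive this from git history).
    """
    cutoff = today - STALE_AFTER
    return sorted(path for path, changed in last_modified.items() if changed < cutoff)


# Illustrative dates only, not the real history of these files.
docs = {
    "docs/tutorials/getting-started.md": date(2025, 11, 1),
    "docs/tutorials/migration-guide.md": date(2024, 1, 15),
}
stale = find_stale_docs(docs, today=date(2026, 5, 6))
```

With the sample data, only the migration guide falls before the cutoff, so it would be the single candidate for a "stale content" issue.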
All architecture diagrams use Mermaid.js so every diagram is version-controlled and diffable alongside the content it describes.
| Layer | Technologies |
|---|---|
| Data Formats | Delta Lake 3.x, Apache Iceberg 1.5+ |
| Languages | Python 3.8+, SQL, Scala |
| Automation | GitHub Actions (8 workflows) |
| Documentation | Markdown, Mermaid.js, Jekyll |
| Code Quality | black, flake8, markdownlint, typos |
| Link Health | lychee link checker |
| Content Discovery | feedparser, BeautifulSoup, optional LLM APIs |
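
The Content Discovery layer pairs feed parsing with a record of already-processed URLs (kept under community/). A rough sketch of that de-duplication step follows. It uses the standard library's `xml.etree.ElementTree` in place of feedparser so that it runs without extra dependencies, and the function name `new_article_links`, the sample feed, and the shape of `processed_urls` are assumptions made for illustration, not the aggregator's actual code.

```python
import xml.etree.ElementTree as ET


def new_article_links(rss_xml, processed_urls):
    """Return links from an RSS 2.0 document that are not already processed.

    Links are returned in feed order; repeats within the feed are dropped.
    """
    root = ET.fromstring(rss_xml)
    seen = set(processed_urls)
    fresh = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if link and link not in seen:
            seen.add(link)  # also de-duplicates repeats within the same feed
            fresh.append(link)
    return fresh


# Hypothetical feed content for demonstration.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <item><title>A</title><link>https://example.com/a</link></item>
  <item><title>B</title><link>https://example.com/b</link></item>
  <item><title>A again</title><link>https://example.com/a</link></item>
</channel></rss>"""

links = new_article_links(SAMPLE_FEED, processed_urls={"https://example.com/b"})
```

Keeping the processed set in version control, as the repository layout suggests, lets each weekly run start from the previous run's state.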
| Step | Goal | Resource |
|---|---|---|
| 1 | Compare technologies | Feature Matrix |
| 2 | Set up your environment | Getting Started Tutorial |
| 3 | Try runnable examples | Code Recipes |
| 4 | Move to production | Production Readiness Guide |
| 5 | Migrate existing systems | Migration Guide |
| 6 | Test your knowledge | Knowledge Quiz |
- Read our Contributing Guide; contributions earn points on the leaderboard
- Check open issues for areas needing help
- Review the Code of Conduct
- Submit your first pull request; the gamification engine awards points automatically!
- Ruby: 2.7+ (for Jekyll)
- Python: 3.8+ (for scripts and validation)
- Node.js: 16+ (optional, for additional tooling)
- Git: Latest version
- Clone the repository

      git clone https://github.com/Analytical-Guide/Datalake-Guide.git
      cd Datalake-Guide

- Install Jekyll and dependencies

      # Install Bundler if not already installed
      gem install bundler
      # Install project dependencies
      bundle install

- Install Python dependencies (for validation scripts)

      pip install -r requirements-dev.txt

- Start local development server

      # Serve with live reload
      bundle exec jekyll serve --livereload
      # Or build and serve
      bundle exec jekyll build && bundle exec jekyll serve

- Open your browser
  - Navigate to http://localhost:4000/Datalake-Guide/
  - The site will automatically reload when you make changes
- Create a feature branch

      git checkout -b feature/your-feature-name

- Make your changes
  - Edit Markdown files in docs/, code-recipes/, etc.
  - Update styles in assets/css/main.css
  - Modify scripts in scripts/
- Test your changes

      # Run validation tests
      python scripts/validate_site.py
      # Check for broken links
      python scripts/check_internal_links.py
      # Build the site
      bundle exec jekyll build

- Preview changes locally

      bundle exec jekyll serve
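
The three commands in the "Test your changes" step can also be chained from one small script that stops at the first failure. This is a minimal sketch, not part of the repository: `run_checks` and `DEFAULT_CHECKS` are hypothetical names, and the command list simply mirrors the commands shown in that step.

```python
import subprocess
import sys

# Commands mirrored from the "Test your changes" step; run from the repository root.
DEFAULT_CHECKS = [
    [sys.executable, "scripts/validate_site.py"],
    [sys.executable, "scripts/check_internal_links.py"],
    ["bundle", "exec", "jekyll", "build"],
]


def run_checks(commands):
    """Run each command in order and stop at the first non-zero exit code.

    Returns a list of (command, returncode) pairs for the commands that ran.
    """
    results = []
    for command in commands:
        completed = subprocess.run(command)
        results.append((command, completed.returncode))
        if completed.returncode != 0:
            break  # later checks are skipped once one fails
    return results
```

Calling `run_checks(DEFAULT_CHECKS)` from the repository root would run the checks in sequence; the last pair in the returned list shows which command failed, if any.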
- Markdown: Follow the style guide in CONTRIBUTING.md
- CSS: Use the established design system (see Design System Documentation)
- JavaScript: Follow modern ES6+ standards with accessibility in mind
- Python: Use Black for formatting, follow PEP 8
Run the comprehensive test suite:

    # Run all validation tests
    python scripts/validate_site.py
    # Check internal links
    python scripts/check_internal_links.py
    # Validate code recipes
    find code-recipes -name "validate.sh" -exec bash {} \;

The site is automatically deployed to GitHub Pages via GitHub Actions:
- Push to main branch

      git add .
      git commit -m "Your commit message"
      git push origin main

- GitHub Actions will:
  - Build the Jekyll site
  - Run validation tests
  - Deploy to GitHub Pages
  - Report any failures
For manual deployment or custom environments:

    # Build for production
    JEKYLL_ENV=production bundle exec jekyll build
    # Deploy to custom server
    rsync -avz _site/ user@server:/path/to/site/

Key settings in _config.yml:
- url: Site URL for absolute links
- baseurl: Subpath for GitHub Pages
- repository: GitHub repository for links
- plugins: Enabled Jekyll plugins
Available in _config.yml:
- github_url: Full GitHub repository URL
- issues_url: Issues page URL
- discussions_url: Discussions page URL
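
To make the roles of url and baseurl concrete, the sketch below joins them with a page path the way an absolute link is typically assembled for a site served from a subpath. It is a simplified illustration rather than Jekyll's own implementation, and the helper name `absolute_url` as a Python function is hypothetical. The sample values come from the local development address mentioned in this README.

```python
def absolute_url(url, baseurl, path):
    """Join site url, baseurl, and a page path with single slashes between parts."""
    base = baseurl.strip("/")
    page = path.lstrip("/")
    parts = [url.rstrip("/")]
    if base:  # an empty baseurl means the site is served from the domain root
        parts.append(base)
    parts.append(page)
    return "/".join(parts)


home = absolute_url("http://localhost:4000", "/Datalake-Guide", "/")
page = absolute_url("http://localhost:4000", "/Datalake-Guide", "docs/awesome-list.md")
```

Here `home` evaluates to `http://localhost:4000/Datalake-Guide/`, matching the local preview address. Omitting baseurl from a link is a common cause of pages that work locally but break on GitHub Pages project sites.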
- Jekyll build fails

      # Clear Jekyll cache
      rm -rf .jekyll-cache _site
      # Reinstall dependencies
      bundle install
      # Try building again
      bundle exec jekyll build

- Python scripts fail

      # Ensure Python 3.8+
      python --version
      # Install/update dependencies
      pip install -r requirements-dev.txt

- Links are broken

      # Run link checker
      python scripts/check_internal_links.py
      # Fix any reported issues

- Styling issues
  - Check browser developer tools for CSS errors
  - Ensure design system variables are used correctly
  - Test responsive design across breakpoints
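
For the "Links are broken" case, it can help to know roughly what an internal link check involves. The following is a minimal sketch of the general technique, not the logic of scripts/check_internal_links.py, whose behavior is not described here. The function name `broken_internal_links` and the choice to skip site-absolute links (those starting with `/`) are assumptions.

```python
import re
from pathlib import Path

# Matches inline Markdown links such as [text](target).
# Images (![alt](src)) and reference-style links are not covered.
LINK_PATTERN = re.compile(r"(?<!!)\[[^\]]*\]\(([^)\s]+)\)")
SCHEME_PATTERN = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def broken_internal_links(markdown_file):
    """Return relative link targets in markdown_file that do not exist on disk."""
    source = Path(markdown_file)
    broken = []
    for target in LINK_PATTERN.findall(source.read_text(encoding="utf-8")):
        # Skip external URLs (https:, mailto:, ...), same-page anchors,
        # and site-absolute links, which this sketch does not resolve.
        if SCHEME_PATTERN.match(target) or target.startswith(("#", "/")):
            continue
        relative = target.split("#", 1)[0]  # drop any fragment before checking
        if not (source.parent / relative).exists():
            broken.append(target)
    return broken
```

Running this over each Markdown file under docs/ would list targets to fix; links that depend on the site's baseurl would need separate handling.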
- Issues: Report bugs
- Discussions: Ask questions
- Documentation: Check local development docs
The site includes performance optimizations:
- Font loading: Optimized with font-display: swap
- CSS: Minified and optimized
- Images: Lazy loading support
- JavaScript: Progressive enhancement
Monitor performance using:
- Lighthouse: Browser dev tools
- WebPageTest: External performance testing
- GitHub Actions: Automated performance checks
Thank you to our amazing community members who make this knowledge hub possible!
| Rank | Contributor | Points | PRs | Reviews | Issues |
|---|---|---|---|---|---|
| 🥇 #1 | @Copilot | 75 | 2 | 0 | 0 |
| 🥈 #2 | @moshesham | 13 | 1 | 0 | 1 |
Last updated: 2026-05-06 13:51 UTC
Want to see your name here? Check out our Contributing Guide to get started!
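
The Leaderboard Update automation regenerates a top-10 table like the one above. As a rough sketch of that step, the function below sorts contributor records by points and renders Markdown rows. The record field names (`login`, `points`, `prs`, `reviews`, `issues`), the tie-break on login, and the function name `render_leaderboard` are assumptions; the sample values are copied from the table above, and medal icons are omitted.

```python
def render_leaderboard(stats, limit=10):
    """Render contributor records as a Markdown table, highest points first."""
    # Ties are broken alphabetically by login (an assumption of this sketch).
    ranked = sorted(stats, key=lambda row: (-row["points"], row["login"].lower()))[:limit]
    lines = [
        "| Rank | Contributor | Points | PRs | Reviews | Issues |",
        "|---|---|---|---|---|---|",
    ]
    for rank, row in enumerate(ranked, start=1):
        lines.append(
            f"| #{rank} | @{row['login']} | {row['points']} "
            f"| {row['prs']} | {row['reviews']} | {row['issues']} |"
        )
    return "\n".join(lines)


stats = [
    {"login": "moshesham", "points": 13, "prs": 1, "reviews": 0, "issues": 1},
    {"login": "Copilot", "points": 75, "prs": 2, "reviews": 0, "issues": 0},
]
table = render_leaderboard(stats)
```

Generating the table from structured stats, rather than editing it by hand, keeps the README consistent with whatever the gamification engine records.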
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- Issues: Report bugs or request features
- Discussions: Join community discussions
- Pull Requests: Contribute code or documentation
This knowledge hub is made possible by our amazing community of contributors. Thank you to everyone who has helped make this resource valuable for data engineers worldwide!
Built with ❤️ by the data engineering community