Getting Started

Getting Started with Apache Iceberg Code Practice

This guide will walk you through setting up your environment and completing your first coding lab.

Prerequisites

Before you begin, make sure you have:

Docker or Podman installed
16GB RAM minimum (for full environment)
40GB disk space available
Basic knowledge of SQL and data concepts
Familiarity with command line interface
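
The prerequisites above can be checked before you clone anything. The following is a minimal sketch, not part of the repository: the function names are made up for illustration, and the thresholds are copied from the list above. The RAM check relies on `os.sysconf`, which is unavailable on some platforms, so it is skipped when the value cannot be read.

```python
import os
import shutil

MIN_RAM_GB = 16   # from the prerequisites list
MIN_DISK_GB = 40  # from the prerequisites list


def container_runtime():
    """Return the first of docker/podman found on PATH, or None."""
    for name in ("docker", "podman"):
        if shutil.which(name):
            return name
    return None


def total_ram_gb():
    """Total physical memory in GiB, or None if the platform does not expose it."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
    except (AttributeError, ValueError, OSError):
        return None


def free_disk_gb(path="."):
    """Free space in GiB on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1024**3


def problems(runtime, ram_gb, disk_gb):
    """Return human-readable problems; an empty list means every check passed."""
    found = []
    if runtime is None:
        found.append("neither docker nor podman is on PATH")
    if ram_gb is not None and ram_gb < MIN_RAM_GB:
        found.append(f"{ram_gb:.1f} GB RAM is below the {MIN_RAM_GB} GB minimum")
    if disk_gb < MIN_DISK_GB:
        found.append(f"{disk_gb:.1f} GB free disk is below the {MIN_DISK_GB} GB minimum")
    return found


if __name__ == "__main__":
    issues = problems(container_runtime(), total_ram_gb(), free_disk_gb())
    print("\n".join(issues) if issues else "all prerequisite checks passed")
```

Treat the RAM result as a hint rather than a verdict: a machine sold as 16 GB may report slightly less usable memory than the threshold.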

Step 1: Clone the Repository

Choose one of these methods:

Option A: Using Git (Recommended)

git clone https://github.com/nellaivijay/iceberg-code-practice.git
cd iceberg-code-practice

Option B: Download ZIP

Go to https://github.com/nellaivijay/iceberg-code-practice
Click the green "Code" button
Select "Download ZIP"
Extract the files to your computer

Step 2: Choose Your Setup Method

Option 1: Kubernetes with k3s (Recommended)

Pros:

Production-like environment
Better resource isolation
Full feature set
Suitable for advanced labs

Cons:

More complex setup
Higher resource requirements
Requires k3s installation

Install k3s

curl -sfL https://get.k3s.io | sh -
# Verify installation: the kubectl client is present and the node reports Ready
kubectl version --client
sudo k3s kubectl get nodes

Set Up the Environment

# Run setup script
./scripts/setup.sh

# Apply Kubernetes manifests
kubectl apply -f k8s/

# Wait for pods to be ready
kubectl get pods -w

Verify Setup

# Check all services are running
kubectl get pods

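# Access Spark UI at http://localhost:8080
# (port-forward runs in the foreground; use a separate terminal for each forward)
kubectl port-forward svc/spark-master 8080:8080

# Access Trino UI at http://localhost:8081
kubectl port-forward svc/trino 8081:8080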

Option 2: Docker Compose (Lightweight)

Pros:

Quick to set up
Lower resource requirements
Easier to troubleshoot
Good for initial learning

Cons:

Limited resource isolation
Some advanced features may not work
Less production-like

Start the Environment

# Start all services
docker-compose up -d

# Check services are running
docker-compose ps

# View logs
docker-compose logs -f

Verify Setup

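# Access Spark UI (open is the macOS command; on Linux use xdg-open, or paste the URL into a browser)
open http://localhost:8080

# Access Trino UI
open http://localhost:8081

# Open a shell in the Spark container to inspect the Iceberg catalog
docker-compose exec spark bash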

Step 3: Load Sample Data

The environment includes a comprehensive sample database for hands-on learning.

Generate Sample Data

python3 scripts/generate_sample_data.py

This creates:

1,000 customer records
200 product records
5,000 order records
10,000 transaction records
20,000 web event records

Load Sample Data into Iceberg

./scripts/load_sample_data.sh

This loads the generated data into Iceberg tables using Spark.

Verify Sample Data

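# Open a shell in the Spark container, then start the Spark SQL CLI
docker-compose exec spark bash
spark-sql

-- At the spark-sql prompt
SHOW DATABASES;
USE sample_db;
SHOW TABLES;
-- Should return 1000, matching the number of customer records generated
SELECT COUNT(*) FROM sample_customers;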
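
To check every table rather than only the customers, you can compare the counts you observe with the numbers listed under "Generate Sample Data". The sketch below is an illustration only: the dictionary keys are descriptive labels chosen here, not table names from the repository (this guide names only `sample_customers`), and you supply the observed counts yourself after running `SELECT COUNT(*)` on each table.

```python
# Expected row counts, copied from the "This creates" list in this guide.
# The keys are illustrative labels, not confirmed table names.
EXPECTED_COUNTS = {
    "customers": 1_000,
    "products": 200,
    "orders": 5_000,
    "transactions": 10_000,
    "web_events": 20_000,
}


def count_mismatches(actual):
    """Compare observed counts with EXPECTED_COUNTS.

    `actual` maps the same labels to the numbers read from SELECT COUNT(*).
    Returns one message per missing or differing entry.
    """
    messages = []
    for label, expected in EXPECTED_COUNTS.items():
        observed = actual.get(label)
        if observed is None:
            messages.append(f"{label}: no count provided (expected {expected})")
        elif observed != expected:
            messages.append(f"{label}: got {observed}, expected {expected}")
    return messages
```

An empty list means the load matches the generator's stated output; any message points at a table worth reloading or inspecting.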

Step 4: Complete Your First Lab

Let's start with Lab 0: Sample Database Setup

Lab 0: Sample Database Setup

Objective: Explore the sample database and practice basic queries

Prerequisites: Environment setup complete, sample data loaded

Estimated Time: 30-45 minutes

Step 1: Access the Lab

# Open the lab markdown file
cat labs/lab-00-sample-database.md

# Or open in your preferred editor

Step 2: Follow the Instructions

The lab will guide you through:

Understanding the sample database schema
Exploring table relationships
Writing queries to answer business questions
Understanding data distribution and patterns

Step 3: Use the Jupyter Notebook (Optional)

# Start Jupyter (if using Docker Compose)
docker-compose exec spark jupyter notebook

# Navigate to notebooks/lab-00-sample-database.ipynb

Step 4: Check Your Work

Compare your results with the solution:

# Print the solution notebook (cat shows raw JSON; open the file in Jupyter for a rendered view)
cat solutions/lab-00-sample-database-solution.ipynb

Step 5: Move to Lab 1

After completing Lab 0, proceed to Lab 1: Environment Setup

Lab 1: Environment Setup

Objective: Verify all components and perform your first Iceberg operation

Estimated Time: 30-45 minutes

This lab will:

Verify catalog connectivity
Test storage access
Create your first Iceberg table
Perform basic read/write operations
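
As a preview of the kind of operations Lab 1 covers, the sketch below builds the Spark SQL for creating, writing to, and reading an Iceberg table. It is an assumption-laden illustration, not the lab's actual content: the table name `first_table` and its columns are hypothetical, and `run` only works inside an environment where PySpark is installed and the current catalog is configured for Iceberg.

```python
# Hypothetical identifier: sample_db appears in this guide, first_table does not.
TABLE = "sample_db.first_table"


def first_table_statements(table=TABLE):
    """Return Spark SQL statements that create, write to, and read an Iceberg table."""
    return [
        f"CREATE TABLE IF NOT EXISTS {table} (id BIGINT, name STRING) USING iceberg",
        f"INSERT INTO {table} VALUES (1, 'alpha'), (2, 'beta')",
        f"SELECT * FROM {table} ORDER BY id",
    ]


def run(statements):
    """Execute the statements on a Spark session.

    Requires PySpark and an Iceberg-enabled catalog; the import is lazy so
    this module still loads on machines without PySpark.
    """
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lab1-sketch").getOrCreate()
    for statement in statements:
        spark.sql(statement).show()
```

Inside the Spark container or a notebook, `run(first_table_statements())` would execute all three statements in order. Keeping the SQL separate from the session makes it easy to paste the same statements into `spark-sql` instead.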

Common Setup Issues

Issue: Docker containers won't start

Solution:

# Check Docker is running
docker ps

# Check available disk space
df -h

# Check available memory (Linux)
free -h

# Restart Docker (Linux with systemd; on macOS or Windows, restart Docker Desktop instead)
sudo systemctl restart docker

Issue: k3s pods stuck in Pending state

Solution:

# Check pod status
kubectl describe pod <pod-name>

# Check node resources
kubectl top nodes

# Check how much of each node's CPU and memory is already requested
kubectl describe nodes | grep -A 8 "Allocated resources"
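
When several pods are affected, it helps to list the Pending ones before describing each. The sketch below is one way to do that, assuming `kubectl` is on your PATH and can reach the cluster. The helper names are illustrative; the JSON fields used (`items`, `metadata.name`, `status.phase`) are the standard shape of `kubectl get pods -o json` output.

```python
import json
import subprocess


def pods_in_phase(pod_list, phase="Pending"):
    """Return names of pods whose status.phase equals `phase`.

    `pod_list` is the parsed output of `kubectl get pods -o json`.
    """
    return [
        item["metadata"]["name"]
        for item in pod_list.get("items", [])
        if item.get("status", {}).get("phase") == phase
    ]


def load_pods():
    """Run kubectl and parse its JSON output (needs a reachable cluster)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling `pods_in_phase(load_pods())` returns the names to pass to `kubectl describe pod`. The parsing is kept apart from the `kubectl` call so it can be exercised on saved JSON without a cluster.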

Issue: Out of memory errors

Solution:

Reduce the number of running services
Increase system RAM or swap space
Use Docker Compose instead of k3s for lighter footprint
Adjust memory limits in docker-compose.yaml

Issue: Storage backend connection errors

Solution:

# Check storage service is running
docker-compose ps objectscale

# Check storage logs
docker-compose logs objectscale

# Verify environment variables
env | grep STORAGE

Tips for Success

Start with Docker Compose

If you're new to Docker or have limited resources, start with Docker Compose. It's easier to troubleshoot and requires fewer resources.

Allocate Enough Resources

Minimum 16GB RAM for full environment
8GB RAM may work with reduced services
At least 40GB disk space for data and logs

Use the Solution Notebooks

If you get stuck, check the solution notebooks in the solutions/ folder. They provide complete working examples.

Take Notes

Document your setup steps and any issues you encounter. This will help you troubleshoot later and contribute improvements.

Join the Community

Open GitHub Issues for problems
Share your solutions and insights
Contribute improvements to the labs

Next Steps

After completing Lab 0 and Lab 1:

Lab 2: Basic Iceberg Operations - Learn core table operations
Lab 3: Advanced Features - Partitioning, time travel, schema evolution
Lab 4: Spark Optimizations - Performance tuning
Follow the Learning Path: see the Learning Path page for the recommended order

Environment URLs (Default)

Once setup is complete, you can access:

Spark UI: http://localhost:8080
Spark History Server: http://localhost:18080
Trino UI: http://localhost:8081
Grafana (if configured): http://localhost:3000
MinIO Console (if using MinIO): http://localhost:9000
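
A quick way to see which of these endpoints are up is to test whether each port accepts a TCP connection. The sketch below assumes the default ports from the list above and that the services are published on localhost. It only confirms that something is listening, not that the service behind the port is healthy.

```python
import socket

# Default ports taken from the list above; adjust if you remapped them.
DEFAULT_PORTS = {
    "Spark UI": 8080,
    "Spark History Server": 18080,
    "Trino UI": 8081,
    "Grafana": 3000,
    "MinIO Console": 9000,
}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, port in DEFAULT_PORTS.items():
        state = "reachable" if port_open("localhost", port) else "not reachable"
        print(f"{name:22} localhost:{port:<6} {state}")
```

Optional services such as Grafana will show as not reachable unless you configured them, which is expected.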

Stopping the Environment

Docker Compose

docker-compose down
# Or, to also remove the volumes (this deletes any data stored in them)
docker-compose down -v

Kubernetes

kubectl delete -f k8s/
# Or delete specific resources
kubectl delete deployment spark-master

Cleaning Up

To completely remove the environment:

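# Docker Compose
docker-compose down -v
# Warning: this removes ALL unused images, stopped containers, and unused networks on the machine, not only this project's
docker system prune -a

# Kubernetes
kubectl delete -f k8s/
# Uninstall k3s (the install script places this script in /usr/local/bin)
/usr/local/bin/k3s-uninstall.sh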

Need Help?

Check the Troubleshooting page
Review Best Practices
Open an issue on GitHub
Start a discussion in GitHub Discussions

Ready to start learning? Begin with Lab 0 🚀
