GCP Data Engineering Learning Repository

A comprehensive learning repository for Google Cloud Platform (GCP) Data Engineering that demonstrates modern data pipeline development using Apache Beam, BigQuery, Pub/Sub, and other GCP services. This project covers both Python and Java implementations with real-world examples and hands-on exercises.

🎯 Project Overview

This repository serves as a complete learning platform for data engineering concepts, focusing on:

  • Batch and Stream Processing with Apache Beam
  • Data Pipeline Development in Python and Java
  • GCP Services Integration (BigQuery, Pub/Sub, Cloud Storage)
  • Real-world Data Processing scenarios
  • Best Practices for scalable data engineering

📁 Repository Structure

GCP-DataEngineering/
├── 📂 Apache Beam/                    # Python-based Apache Beam examples
├── 📂 ApacheBeam-Java/               # Java-based Apache Beam examples
├── 📂 GCP BigQuery/                  # BigQuery integration examples
├── 📂 GCP PubSub/                    # Pub/Sub messaging examples
├── 📂 Datasets/                      # Sample datasets for practice
└── 📄 README.md                      # This file

🐍 Apache Beam - Python

Location: Apache Beam/

Examples Directory: Examples/

Progressive learning examples covering core Beam concepts:

| File | Concept | Description |
| --- | --- | --- |
| 01_Beam_create_integers.py | Create | Creating PCollections from integers |
| 02_Beam Create Key-Value Pairs.py | Key-Value | Working with key-value pair data |
| 03_Beam Create objects.py | Objects | Creating PCollections from custom objects |
| 04_Beam Create String.py | Strings | String data processing |
| 05_Beam Filter.py | Filter | Filtering data based on conditions |
| 06_Beam Map Elements to Formatted String.py | Map | Transforming elements to formatted strings |
| 07_Beam Map Elements.py | Map | Basic element transformation |
| 08_Beam FlatMap.py | FlatMap | Flattening nested collections |
| 09_Beam FlatMap Elements from List to Integer.py | FlatMap | Converting lists to individual integers |
| 10_Beam Group by Key and Sum.py | GroupByKey | Grouping and aggregating data |
| 11_Beam Group by Key.py | GroupByKey | Basic grouping operations |
| 12_Beam ParDo (Parallel Do).py | ParDo | Parallel data processing |
| 13_Beam ParDo with Key-Value.py | ParDo | ParDo with key-value pairs |
| 14_WordCount.ipynb | WordCount | Classic word counting example |
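
As a rough illustration of how several of these concepts chain together, the sketch below creates an in-memory PCollection, keys each integer by parity, groups by key, and sums each group. It is not taken from the repository's files: the helper names (`to_parity_pair`, `sum_group`, `run`) and the sample data are assumptions made for this example, and it assumes `apache-beam` is installed and the pipeline runs on the default DirectRunner.

```python
def to_parity_pair(n):
    """Key an integer by parity, e.g. 3 -> ("odd", 3)."""
    return ("even" if n % 2 == 0 else "odd", n)


def sum_group(kv):
    """Collapse a grouped (key, iterable_of_values) pair into (key, total)."""
    key, values = kv
    return key, sum(values)


def run():
    # Imported inside run() so the helpers above can be reused and tested
    # without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as pipeline:  # DirectRunner unless options say otherwise
        (
            pipeline
            | "Create" >> beam.Create([1, 2, 3, 4, 5, 6])
            | "KeyByParity" >> beam.Map(to_parity_pair)
            | "GroupByKey" >> beam.GroupByKey()
            | "SumPerKey" >> beam.Map(sum_group)
            | "Print" >> beam.Map(print)
        )
```

Calling `run()` should print one `(key, total)` pair per parity; the output order is not guaranteed. For sums specifically, `beam.CombinePerKey(sum)` performs the grouping and aggregation in a single step.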

Main Functions Directory: Main Functions/

In-depth Jupyter notebooks covering essential Beam transformations:

| Notebook | Transform | Description |
| --- | --- | --- |
| 01_Create.ipynb | Create | Creating PCollections from various sources |
| 02_ReadTransform.ipynb | Read | Reading data from files and external sources |
| 03_WriteTransform.ipynb | Write | Writing data to various sinks |
| 04_FlatMap.ipynb | FlatMap | Advanced flattening operations |
| 05_Map.ipynb | Map | Element-wise transformations |
| 06_FilterLambda.ipynb | Filter | Lambda-based filtering |
| 07_Filter.ipynb | Filter | Advanced filtering techniques |
| 08_Flatten.ipynb | Flatten | Combining multiple PCollections |
| 09_CombinePerKey.ipynb | CombinePerKey | Aggregation operations |
| 10_CountPerKey.ipynb | Count | Counting elements per key |
| 11_CogroupByKey.ipynb | CoGroupByKey | Joining multiple PCollections |
Datasets: datasets/


☕ Apache Beam - Java

Location: ApacheBeam-Java/

Project Structure

| File | Concept | Description |
| --- | --- | --- |
| BeamExample.java | Basic Pipeline | Simple Beam pipeline example |
| code_01_BeamCreate_Integer.java | Create | Creating integer PCollections |
| code_01_BeamCreate_KV.java | Key-Value | Key-value pair creation |
| code_01_BeamCreate_Objects.java | Objects | Custom object processing |
| code_01_BeamCreate_String.java | Strings | String data handling |
| code_02_BeamFilter.java | Filter | Data filtering operations |
| code_03_BeamMapElements_formattedStringOutput.java | Map | Formatted string output |
| code_04_BeamFlatMap.java | FlatMap | Collection flattening |
| code_05_BeamGroupByKey.java | GroupByKey | Data grouping |
| code_05_BeamGroupByKey_Sum.java | GroupByKey | Grouping with summation |
| code_06_BeamParDo.java | ParDo | Parallel processing |
| code_07_BeamParDo_KeyValue.java | ParDo | ParDo with key-value data |

| File | Concept | Description |
| --- | --- | --- |
| code_01_BeamWindowing.java | Windowing | Time-based data windowing |
| code_01_BeamWindowing_Demo.java | Windowing | Advanced windowing demo |
| code_02_BeamSideInputs.java | Side Inputs | Data enrichment patterns |
| code_02_BeamStatefulProcessing.java | Stateful | Stateful data processing |
| code_03_BeamPipeline.java | Pipeline | Complex pipeline configurations |
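
For readers new to windowing, the arithmetic behind fixed (tumbling) windows can be modelled in a few lines of plain Python. This is a conceptual sketch, not Beam or Java code and not taken from the files above: it assumes fixed windows are the strategy of interest, and the names `fixed_window` and `bucket_by_window` are hypothetical helpers for this example.

```python
from collections import defaultdict


def fixed_window(timestamp, size, offset=0):
    """Return the [start, end) fixed window that contains `timestamp`.

    All values are in the same unit (for example, seconds).
    """
    start = timestamp - (timestamp - offset) % size
    return start, start + size


def bucket_by_window(events, size):
    """Group (timestamp, value) events by the fixed window they fall into."""
    windows = defaultdict(list)
    for timestamp, value in events:
        windows[fixed_window(timestamp, size)].append(value)
    return dict(windows)


# Three events, 60-second windows: the first two share the window [0, 60).
events = [(5, "a"), (59, "b"), (125, "c")]
buckets = bucket_by_window(events, size=60)
```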

| File | Use Case | Description |
| --- | --- | --- |
| code_01_wordcount.java | Word Count | Classic word counting implementation |
| code_02_even_odd.java | Classification | Even/odd number classification |
| code_03_average_numbers.java | Aggregation | Numerical average calculation |
| code_03_average_numbers_combineApproach.java | Combine | Average using Combine transforms |
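
The Combine approach to averaging rests on an accumulator that can be built per worker and then merged. The plain-Python sketch below models that idea; it is written in Python rather than Java, is not the repository's implementation, and uses hypothetical names (`AverageFn`, `combine_in_shards`). The four method names mirror those of `CombineFn` in the Beam Python SDK.

```python
class AverageFn:
    """Models the CombineFn accumulator protocol for computing a mean."""

    def create_accumulator(self):
        # (running total, element count)
        return (0.0, 0)

    def add_input(self, accumulator, value):
        total, count = accumulator
        return total + value, count + 1

    def merge_accumulators(self, accumulators):
        accumulators = list(accumulators)
        total = sum(t for t, _ in accumulators)
        count = sum(c for _, c in accumulators)
        return total, count

    def extract_output(self, accumulator):
        total, count = accumulator
        return total / count if count else float("nan")


def combine_in_shards(fn, shards):
    """Simulate workers that each fold one shard, then merge the partials."""
    partials = []
    for shard in shards:
        acc = fn.create_accumulator()
        for value in shard:
            acc = fn.add_input(acc, value)
        partials.append(acc)
    return fn.extract_output(fn.merge_accumulators(partials))


average = combine_in_shards(AverageFn(), [[1, 2, 3], [4, 5]])
```

In a real Python pipeline the class would subclass `beam.CombineFn` and be passed to `beam.CombineGlobally`. Keeping the total and count separate until `extract_output` is what lets partial results be merged correctly.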

☁️ GCP Services Integration

BigQuery Integration: GCP BigQuery/

| Notebook | Focus | Description |
| --- | --- | --- |
| 01_Load_from_StorageBucket.ipynb | Data Loading | Loading data from Cloud Storage to BigQuery |
| 02_BigQuery_Datasets_Python.ipynb | Dataset Management | Creating and managing BigQuery datasets |
| 03_BigQuery_Tables_Python.ipynb | Table Operations | BigQuery table creation and manipulation |
| 04_Load_to_StorageBucket.ipynb | Data Export | Exporting BigQuery data to Cloud Storage |
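
As a hedged sketch of what loading from Cloud Storage into BigQuery can look like with the `google-cloud-bigquery` client, the example below starts a CSV load job and waits for it. It is not the notebook's code: the helper names (`gcs_uri`, `load_csv_to_bigquery`) are hypothetical, and it assumes a CSV file with a header row, schema autodetection, and application default credentials.

```python
def gcs_uri(bucket, blob_name):
    """Build a gs:// URI from a bucket name and an object path."""
    return f"gs://{bucket}/{blob_name.lstrip('/')}"


def load_csv_to_bigquery(uri, table_id):
    """Load a CSV file from Cloud Storage into a BigQuery table.

    table_id has the form "project.dataset.table".
    Returns the number of rows in the destination table after the load.
    """
    # Imported here so gcs_uri stays usable without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()  # picks up application default credentials
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # assumes the first row is a header
        autodetect=True,      # let BigQuery infer the schema
    )
    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # blocks until the job finishes; raises on failure
    return client.get_table(table_id).num_rows
```

A call might look like `load_csv_to_bigquery(gcs_uri("my-bucket", "titanic_dataset.csv"), "my-project.my_dataset.titanic")`, where the bucket, project, and dataset names are placeholders.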

Pub/Sub Messaging: GCP PubSub/

| Notebook | Focus | Description |
| --- | --- | --- |
| 01_PubSub_messaging.ipynb | Messaging | Complete Pub/Sub messaging implementation |
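
A minimal publishing sketch with the `google-cloud-pubsub` client might look like the following. It is illustrative only and not the notebook's code: the helper names (`encode_message`, `publish`) and the choice of JSON payloads are assumptions, and it presumes the topic already exists and credentials are configured.

```python
import json


def encode_message(payload):
    """Serialize a dict to UTF-8 JSON bytes; Pub/Sub message data must be bytes."""
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def publish(project_id, topic_id, payload):
    """Publish one JSON message and return the server-assigned message ID."""
    # Imported here so encode_message stays usable without the client library.
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_id)
    future = publisher.publish(topic_path, encode_message(payload))
    return future.result()  # blocks until the publish is acknowledged
```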

📊 Sample Datasets

Location: Datasets/

| File | Type | Description |
| --- | --- | --- |
| employee.txt | Employee Data | Sample employee records |
| titanic_dataset.csv | Historical Data | Famous Titanic passenger dataset |

🔧 Prerequisites

Software Requirements

  • Python 3.7+ with pip
  • Java 8+ with Maven
  • Google Cloud SDK (gcloud CLI)
  • Git for version control

GCP Setup

  1. GCP Project: Create a GCP project or use an existing one
  2. Authentication: Configure gcloud authentication
  3. APIs: Enable required APIs (BigQuery, Pub/Sub, Cloud Storage)
  4. Service Account: Create service account with appropriate permissions

Python Dependencies

pip install "apache-beam[gcp]"  # quotes keep shells such as zsh from expanding the brackets
pip install google-cloud-bigquery
pip install google-cloud-pubsub
pip install mysql-connector-python

Java Dependencies

Maven dependencies are configured in pom.xml


🚀 Getting Started

1. Clone the Repository

git clone https://github.com/TheDataArtisanDev/GCP-DataEngineering.git
cd GCP-DataEngineering

2. Set Up Python Environment

# Create virtual environment
python -m venv beam-env
source beam-env/bin/activate  # On Windows: beam-env\Scripts\activate

# Install dependencies
pip install -r requirements.txt  # If this file does not exist yet, create it from the packages listed under Python Dependencies

3. Configure GCP

# Authenticate with GCP
gcloud auth login

# Set your project
gcloud config set project YOUR_PROJECT_ID

# Create application default credentials
gcloud auth application-default login

4. Run Your First Example

# Python example
cd "Apache Beam/Examples"
python 01_Beam_create_integers.py

# Java example
cd ApacheBeam-Java
mvn compile exec:java -Dexec.mainClass="com.example.BeamExample"

🎓 Key Learning Outcomes

After working through this repository, you will understand:

  • Apache Beam fundamentals in Python and Java
  • Data pipeline design patterns and best practices
  • GCP service integration for end-to-end data workflows
  • Batch and stream processing concepts
  • Scalable data transformation techniques
  • Real-world data engineering scenarios

🤝 Contributing

Contributions are welcome! This is a learning project, so feel free to:

  • Report issues or bugs you find
  • Suggest improvements to examples
  • Add more use cases or examples
  • Fix documentation or code issues

For major changes, please open an issue first to discuss what you would like to change.



Happy Learning! 🚀

This repository is a comprehensive data engineering learning journey. If you find it helpful, feel free to use it for your own learning!