This repo hosts minimum viable, self-contained, end-to-end solutions that showcase Spark on GCP and its integration with other GCP services. They are intended to demystify our products and how they integrate.
| # | About | Tags |
|---|---|---|
| 01 | Orchestration of Spark jobs on Dataproc Serverless with Airflow on Cloud Composer 2 | #Spark-On-Dataproc-Serverless #Spark #Airflow #Composer |
| 02 | Orchestration of Spark jobs on Dataproc GCE cluster with Airflow on Cloud Composer 2 | #Spark-On-Dataproc-GCE #Spark #Airflow #Composer |
| 03 | Just enough Dataproc on GKE | #Spark-On-Dataproc-GKE #Spark #GKE #Kubernetes #Spark-on-Kubernetes |
| 04 | Just enough Dataproc on GCE with GPU acceleration | #Spark-On-Dataproc-GCE #Spark #GPU #spark-rapids #Nvidia |
| 05 | Just enough Dataproc Serverless Spark with GPU acceleration | #Spark-On-Dataproc-Serverless #Spark #Airflow #Composer |
| 06 | Just enough Dataproc Workspaces for Data Scientists and Data Engineers | #Spark-On-Dataproc-GCE #Workspaces |
| 07 | BYO Jupyter for Dataproc GCE clusters and Dataproc Serverless with Dataproc Jupyter Plugin | #Spark-On-Dataproc-Serverless #Spark-On-Dataproc-GCE #Spark #Airflow #Composer #Jupyter #BYO-Jupyter |
| 08 | Just enough Delta Lake on GCP | #Spark-On-Dataproc-Serverless #Spark #Airflow #Composer #DeltaLake #TableFormat #BigLake #BigLake-Manifest |
| 09 | Just enough Apache Hudi on GCP | #Spark-On-Dataproc-GCE #Spark #Airflow #Composer #Hudi #TableFormat |
| 10 | Scalable Machine Learning with Spark on GCP and Vertex AI | #Spark-On-Dataproc-Serverless #Spark #MLOps #Composer #VertexAI #Kubeflow |
| 11 | Just enough Terraform for Data Analytics on Google Cloud | #Spark-On-Dataproc #Composer #VertexAI #BigQuery #IaaC #Terraform |
| 12 | Near Real Time processing with Spark on Google Cloud and Confluent Cloud | #Spark-On-Dataproc-Serverless #Spark #Kafka #Streaming #ConfluentCloud |
| 13 | Code free integration with Dataproc Templates powered by Dataproc Serverless Spark | #Spark-On-Dataproc-Serverless #Spark #Dataproc-Templates #Codefree-Serverless-Spark-Integration |
| 14 | Data Governance on Google Cloud for OSS Analytics | #DataGovernance #Dataplex #DataCatalog #DataLineage |
| 15 | Lineage for Dataproc Spark jobs | #Spark-On-Dataproc-GCE #Spark #Airflow #Composer #DataGovernance #Lineage #Dataplex |
Google Cloud’s Assured Workloads helps regulated organizations across the public and private sectors accelerate AI innovation while meeting their compliance and security requirements. Assured Workloads provides control packages to support the creation of compliant boundaries in Google Cloud. A control package is a set of controls that, when combined, supports the regulatory baseline for a compliance statute or regulation. These controls include mechanisms to enforce data residency, data sovereignty, personnel access, and more.
We encourage you to evaluate Assured Workloads' control packages and decide whether your organization requires one to meet its regulatory and compliance requirements. If so, we recommend that you first deploy Assured Workloads using this repository, so that you maintain those requirements, before running these labs.
Note that unsupported products are not recommended for use by Assured Workloads customers without due diligence and waivers from their regulatory agencies or divisions.
| # | Google Cloud Collaborators | Contribution |
|---|---|---|
| 1. | Anagha Khanolkar | Author of all labs - vision, architecture, design, diagrams, and source code |
| 2. | Rick (Rugui) Chen | (Google Kubernetes Specialist) Support for GKE aspects for the lab on Dataproc on GKE |
| 3. | Dagang Wei | (Google Engineering) Dataproc image support for Apache Hudi on GCP |
| 4. | Nvidia | Dataproc with GPUs - base example & tuning |
| 5. | Jay O'Leary | Testing & feedback |