Pipeline Zen Jobs Scheduler

Pipeline Zen Jobs Scheduler is a robust system designed to manage and scale compute resources for job processing across multiple clusters and regions in Google Cloud Platform (GCP). This system optimizes the allocation and utilization of GPU resources, significantly reducing wait times for computational jobs and ensuring efficient resource usage.

Project Purpose

The Pipeline Zen Jobs Scheduler addresses the challenges of GPU resource management in cloud environments where availability can be limited or unpredictable. By implementing a multi-region approach, the system:

Enables efficient sourcing of GPUs from various geographical locations
Reduces wait times for GPU resources, ensuring prompt job starts
Implements automatic scaling of Managed Instance Groups (MIGs)
Dynamically adjusts the number of GPU instances based on workload demands
Optimizes resource usage and cost-efficiency

This intelligent scaling mechanism provides the right amount of computational power when and where it's needed most, offering a robust solution for organizations requiring flexible, responsive, and efficient management of GPU resources for their computational workloads.

System Architecture

The system is composed of several key components that work together to manage jobs and scale resources:

Scheduler: The central component that manages the overall job scheduling process, interacts with the database, coordinates with the ClusterOrchestrator, uses PubSub for messaging, and periodically monitors and updates the system state.
ClusterOrchestrator: Oversees all clusters in the system, maintaining a collection of ClusterManagers, coordinating scaling operations across all clusters, and aggregating status information.
ClusterManager: Responsible for a specific cluster configuration, managing operations across multiple regions, handling scaling decisions, and interacting with the MigManager.
MigManager: Directly interacts with Google Cloud's Managed Instance Groups (MIGs), performing API calls to scale MIGs, get MIG information, and list VMs.
PubSub Client: Facilitates asynchronous communication within the system, publishing messages about new jobs, listening for heartbeat messages, and sending stop signals to running jobs.
Database: Manages job tracking and persistence using PostgreSQL.

High Level System Design Diagram

High Level System Design

Key Terminology

Cluster: A logical grouping of compute resources with similar specifications (e.g., "4xa100-40gb" for a cluster of machines with 4 A100 40GB GPUs each).
Region: A geographic area where GCP resources can be located (e.g., "us-central1").
MIG (Managed Instance Group): A GCP resource that maintains a group of identical VM instances, allowing for easy scaling and management.
VM (Virtual Machine): An individual compute instance within a MIG.
Job: A unit of work that needs to be processed on the compute resources.
Scaling: The process of adjusting the number of VMs in a MIG based on workload demands.
PubSub: Google Cloud Pub/Sub, a messaging service used for sending and receiving messages between components.

Workflow

The Scheduler receives a new job request.
The job is added to the database and a message is published via PubSub to start the job.
The Scheduler determines which cluster should handle the job based on its requirements.
The ClusterOrchestrator is notified to check if scaling is necessary.
If scaling is needed, the appropriate ClusterManager uses the MigManager to adjust the size of the relevant MIGs.
The job starts running on an available VM.
The running job sends periodic heartbeat messages via PubSub, which the Scheduler uses to update the job status.
The Scheduler continues to monitor running jobs and pending workloads, adjusting resources as needed.
When a job completes or needs to be stopped, the Scheduler sends a message via PubSub to the relevant VM.

Configuration

The system is configured with:

A list of available clusters and their specifications.
The regions associated with each cluster.
Maximum scale limits for each cluster.
GCP project information and credentials.
PubSub topic and subscription names for various message types.

This configuration allows the system to make informed decisions about resource allocation and scaling across different types of compute resources and geographic regions, while maintaining efficient communication between components.

Data Flow

Job Requests → Scheduler → Database
Scheduler → PubSub → Job Start Messages
Running Jobs → PubSub → Heartbeat Messages → Scheduler
Scheduler → ClusterOrchestrator → ClusterManagers → MigManager → GCP (for scaling)
Scheduler → PubSub → Job Stop Messages (when needed)

Getting Started

For instructions on how to set up and run the Pipeline Zen Jobs Scheduler, please refer to the RUNME.md file.

For detailed usage instructions, including how to schedule jobs, monitor status, and interact with the system, please see the USAGE.md file.

Name		Name	Last commit message	Last commit date
Latest commit History 90 Commits
.run		.run
app-configs		app-configs
assets		assets
scripts/vm-runtime		scripts/vm-runtime
src/app		src/app
terraform		terraform
tests		tests
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE.md		LICENSE.md
README.md		README.md
RUNME.md		RUNME.md
USAGE.md		USAGE.md
docker-compose.yml		docker-compose.yml
example.env		example.env
pytest.ini		pytest.ini
repomix.config.json		repomix.config.json
requirements-test.txt		requirements-test.txt
requirements.txt		requirements.txt
scheduler-zen.iml		scheduler-zen.iml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Pipeline Zen Jobs Scheduler

Project Purpose

System Architecture

High Level System Design Diagram

Key Terminology

Workflow

Configuration

Data Flow

Getting Started

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Pipeline Zen Jobs Scheduler

Project Purpose

System Architecture

High Level System Design Diagram

Key Terminology

Workflow

Configuration

Data Flow

Getting Started

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages