A modified version of the blessed Metaflow Cloudformation template that enables a GPU queue by default.
There are three changes to pay attention to as you set up your cluster.
This section in the Parameters:
GPUComputeEnvInstanceTypes:
Type: CommaDelimitedList
Default: "p3.2xlarge,p3.8xlarge,p3.16xlarge"
Description: "The instance types for the p3 GPU compute environment as a comma-separated list"
This section in the Resources:
ComputeEnvironmentGPU:
Type: AWS::Batch::ComputeEnvironment
Properties:
Type: MANAGED
ServiceRole: !GetAtt 'BatchExecutionRole.Arn'
ComputeResources:
MaxvCpus: !Ref MaxVCPUBatchGPUEnv
SecurityGroupIds:
- !GetAtt VPC.DefaultSecurityGroup
Type: !If [EnableFargateOnBatch, 'FARGATE', 'EC2']
Subnets:
- !Ref Subnet1
- !Ref Subnet2
MinvCpus: !If [EnableFargateOnBatch, !Ref AWS::NoValue, !Ref MinVCPUBatch]
InstanceRole: !If [EnableFargateOnBatch, !Ref AWS::NoValue, !GetAtt 'ECSInstanceProfile.Arn']
InstanceTypes: !If [EnableFargateOnBatch, !Ref AWS::NoValue, !Ref GPUComputeEnvInstanceTypes]
DesiredvCpus: !If [EnableFargateOnBatch, !Ref AWS::NoValue, !Ref DesiredVCPUBatch]
State: ENABLED
The JobQueueGPU section in the Resources:
JobQueueGPU:
DependsOn: ComputeEnvironmentGPU
Type: AWS::Batch::JobQueue
Properties:
ComputeEnvironmentOrder:
- Order: 1
ComputeEnvironment: !Ref ComputeEnvironmentGPU
State: ENABLED
Priority: 1
JobQueueName: !Join [ '-', [ 'job-queue-gpu', !Ref AWS::StackName ] ]
Go to AWS Console, and upload the cfn.yaml file (it is too big for char count using AWS CLI).
pip install metaflow boto3You can either do this manually by setting up the Metaflow config file using the Outputs tab in Cloudformation console, or run this script (assumes your AWS credentials are set).
STACK_NAME=metaflow
python mf_configure.py -s $STACK_NAMEThis will run in the job-queue-metaflow queue (unless you changed the name).
python flow.py runNote: By default, a fresh AWS account that I created in April 2024 had 0 on-demand P instance type quotas.
In order to run this code, go in AWS Console to Service Quotas/AWS services/Amazon Elastic Compute Cloud (Amazon EC2)/Running On-Demand P instances.
This will run in the job-queue-gpu-metaflow queue (look at the start step @batch decorator).
python gpu_flow.py runAWS seems to have all sorts of churn in Postgres versions on RDS. Activate your desired AWS profile, and run this command to find available versions:
aws rds describe-db-engine-versions --engine postgres --query 'DBEngineVersions[].EngineVersion' --output tableEdit the EngineVersion of this section
RDSMasterInstance:
Type: AWS::RDS::DBInstance
Properties:
DBName: 'metaflow'
AllocatedStorage: 20
DBInstanceClass: 'db.t3.small'
DeleteAutomatedBackups: 'true'
StorageType: 'gp2'
Engine: 'postgres'
EngineVersion: '16.4'
MasterUsername: !Join ['', ['{{resolve:secretsmanager:', !Ref MyRDSSecret, ':SecretString:username}}' ]]
MasterUserPassword: !Join ['', ['{{resolve:secretsmanager:', !Ref MyRDSSecret, ':SecretString:password}}' ]]
VPCSecurityGroups:
- !Ref 'RDSSecurityGroup'
DBSubnetGroupName: !Ref 'DBSubnetGroup'