Motivation
When a customer deploys a Quilt CloudFormation stack into a locked-down VPC (no Internet Gateway, no NAT, incomplete VPC endpoints), the stack hangs on ECS service creation because tasks can't pull images from ECR. The failure mode is silent — CloudFormation events don't say "ECR unreachable," and CloudWatch log groups may never even be created.
This came up on a Biogen support call (2026-05-15): we spent most of a 23-minute meeting diagnosing exactly this, with engineers guessing at which VPC endpoints were missing. Alexei committed to sending a reachability-test script, but it doesn't exist as a reusable tool. We should ship it as a quiltx ecs subcommand so support can hand customers a one-liner.
Belongs under ecs because the test must run from inside an ECS task in the customer's VPC — that's the exact network context where the failure occurs, and quiltx/ecs.py already has the task-launching plumbing (run_task, wait_for_task, get_network_config).
Related: docs/advanced-features/private-endpoint-access.md in the quilt repo needs to link to this once it exists.
Proposed UX
quiltx ecs reachability --stack <stack-name> # auto-discover VPC/subnets from stack
quiltx ecs reachability --vpc vpc-xxx --subnet subnet-yyy
quiltx ecs reachability script # emit portable bash script (no deploy needed)
Output: a table of service → endpoint → reachable? → resolved IP (public/private) so the customer can see exactly which AWS services their VPC cannot reach.
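A minimal sketch of how the table could be rendered, assuming each probe result is a dict with `service`, `endpoint`, `reachable`, and `ip` keys (hypothetical names, not an existing quiltx structure). "Private" here follows Python's `ipaddress.is_private`, which covers RFC 1918 ranges plus a few other reserved blocks. A private IP generally suggests the name resolves to an interface endpoint in the VPC. An S3 gateway endpoint, by contrast, still resolves to public IPs because it works through route tables, so "public" alone should not be read as "unreachable".

```python
import ipaddress


def classify_ip(ip: str) -> str:
    """Label a resolved address as 'private' or 'public'."""
    return "private" if ipaddress.ip_address(ip).is_private else "public"


def format_table(rows: list) -> str:
    """Render probe results as a fixed-width text table."""
    headers = ("SERVICE", "ENDPOINT", "REACHABLE", "RESOLVED IP")
    cells = [headers]
    for r in rows:
        # A missing IP means DNS resolution failed, so show a dash instead.
        ip_cell = f'{r["ip"]} ({classify_ip(r["ip"])})' if r["ip"] else "-"
        cells.append(
            (r["service"], r["endpoint"], "yes" if r["reachable"] else "NO", ip_cell)
        )
    # Column width is the longest cell in that column, header included.
    widths = [max(len(row[i]) for row in cells) for i in range(len(headers))]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in cells
    )
```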
What it should check
From the call and a read of the CFN template, at minimum:
- ECR API (api.ecr.<region>.amazonaws.com)
- ECR DKR (*.dkr.ecr.<region>.amazonaws.com), used for image pulls
- S3 (gateway endpoint)
- CloudWatch Logs
- SNS
- SQS
- Secrets Manager
- STS
- API Gateway (interface endpoint, since ApiGatewayVPCEndpointId is a stack param)
The exact list should be derived from the CFN template, not hardcoded — open question below.
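Until the template-derived list exists, a placeholder mapping along these lines could seed the probe. This is only a sketch. It assumes the standard `<service>.<region>.amazonaws.com` hostnames of the commercial AWS partition, and `service_endpoints`, `registry_account_id`, and `api_id` are hypothetical names introduced for illustration. Which registry account and which API ID apply to a given Quilt stack is not settled here.

```python
def service_endpoints(region: str, registry_account_id: str, api_id: str) -> dict[str, str]:
    """Map each service from the list above to a hostname to probe.

    Assumption: commercial-partition hostnames. Other partitions
    (for example China regions) use a different DNS suffix.
    """
    suffix = f"{region}.amazonaws.com"
    return {
        "ECR API": f"api.ecr.{suffix}",
        # Image pulls go to a per-account registry hostname.
        "ECR DKR": f"{registry_account_id}.dkr.ecr.{suffix}",
        "S3": f"s3.{suffix}",
        "CloudWatch Logs": f"logs.{suffix}",
        "SNS": f"sns.{suffix}",
        "SQS": f"sqs.{suffix}",
        "Secrets Manager": f"secretsmanager.{suffix}",
        "STS": f"sts.{suffix}",
        # API Gateway invoke hostnames are per-API.
        "API Gateway": f"{api_id}.execute-api.{suffix}",
    }
```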
Implementation sketch
- Reuse the existing run_task / get_network_config helpers in quiltx/ecs.py to launch a short-lived task in the target VPC/subnet that runs DNS lookups + TCP connect tests against each service endpoint, then returns JSON via logs.
- Chicken-and-egg: if ECR is unreachable, the probe task itself can't pull its image. Options:
  - Use a public ECR image already cached in AWS (e.g., public.ecr.aws/amazonlinux/amazonlinux). This still needs egress, so it doesn't help in the worst case.
  - Fall back to script mode: emit a portable bash script the customer runs from any existing EC2 in the VPC, no Quilt deploy required.
- Recommend supporting both modes; script is the escape hatch when the task-based probe itself can't start.
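The DNS lookup + TCP connect step described above could look roughly like the following. This is a sketch under assumptions, not the script from the call: `probe` and `run_probes` are hypothetical names, port 443 and the 3-second timeout are arbitrary defaults, and only the first resolved address is tried. It uses only the Python standard library, so it could run in any base image that ships Python.

```python
import json
import socket


def probe(host: str, port: int = 443, timeout: float = 3.0) -> dict:
    """Resolve host, then attempt a TCP connection to the first address."""
    result = {"endpoint": host, "port": port, "ip": None,
              "reachable": False, "error": None}
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # DNS failure: no IP to report, and no point attempting a connect.
        result["error"] = f"dns: {exc}"
        return result
    result["ip"] = infos[0][4][0]
    try:
        with socket.create_connection((result["ip"], port), timeout=timeout):
            result["reachable"] = True
    except OSError as exc:
        # Covers refused connections, timeouts, and unreachable networks.
        result["error"] = f"tcp: {exc}"
    return result


def run_probes(endpoints: dict[str, str], port: int = 443,
               timeout: float = 3.0) -> str:
    """Probe every endpoint and return the results as a JSON string."""
    rows = [{"service": name, **probe(host, port, timeout)}
            for name, host in endpoints.items()]
    return json.dumps(rows)
```

Printing the output of `run_probes` to stdout would be one way to return JSON via the task's logs. Keeping DNS and TCP failures distinct in the `error` field should help tell a missing private DNS entry apart from a blocked route or security group.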
Open questions
- Authoritative list of AWS services each Quilt component calls out to — needs confirmation from the platform team, not just grep.
- Best base image for the probe task that maximizes chance of starting in a partially-broken VPC.
- Do we want to also check outbound HTTPS to non-AWS services?
Acceptance criteria
- quiltx ecs reachability --stack <name> runs end-to-end against a real private-VPC deployment
- quiltx ecs reachability script emits a self-contained bash script that runs on any Linux EC2 in the VPC
- docs/advanced-features/private-endpoint-access.md links to this command (separate PR in quilt repo)