Add ecs reachability subcommand to diagnose private-VPC egress failures #52

@drernie

Motivation

When a customer deploys a Quilt CloudFormation stack into a locked-down VPC (no Internet Gateway, no NAT, incomplete VPC endpoints), the stack hangs on ECS service creation because tasks can't pull images from ECR. The failure mode is silent — CloudFormation events don't say "ECR unreachable," and CloudWatch log groups may never even be created.

This came up on a Biogen support call (2026-05-15): we spent most of a 23-minute meeting diagnosing exactly this, with engineers guessing at which VPC endpoints were missing. Alexei committed to sending a reachability-test script, but it doesn't exist as a reusable tool. We should ship it as a quiltx ecs subcommand so support can hand customers a one-liner.

This belongs under ecs because the test must run from inside an ECS task in the customer's VPC, which is the exact network context where the failure occurs, and quiltx/ecs.py already has the task-launching plumbing (run_task, wait_for_task, get_network_config).

Related: docs/advanced-features/private-endpoint-access.md in the quilt repo needs to link to this once it exists.

Proposed UX

quiltx ecs reachability --stack <stack-name>     # auto-discover VPC/subnets from stack
quiltx ecs reachability --vpc vpc-xxx --subnet subnet-yyy
quiltx ecs reachability script                   # emit portable bash script (no deploy needed)

Output: a table of service → endpoint → reachable? → resolved IP (public/private) so the customer can see exactly which AWS services their VPC cannot reach.
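A minimal sketch of how that table could be rendered, assuming each probe yields a service name, an endpoint hostname, a reachability flag, and the first resolved IP. `ProbeResult`, `ip_scope`, and `render_table` are hypothetical names for illustration, not existing quiltx code.

```python
import ipaddress
from typing import List, NamedTuple, Optional


class ProbeResult(NamedTuple):
    """Hypothetical record for one probed endpoint."""
    service: str
    endpoint: str
    reachable: bool
    resolved_ip: Optional[str]  # None when DNS resolution failed


def ip_scope(ip: Optional[str]) -> str:
    """Label an address 'private', 'public', or '-' when there was no DNS answer.

    Note: is_private also covers loopback and link-local ranges, not only RFC 1918.
    """
    if ip is None:
        return "-"
    return "private" if ipaddress.ip_address(ip).is_private else "public"


def render_table(results: List[ProbeResult]) -> str:
    """Render results as a plain-text table with left-aligned columns."""
    header = ("SERVICE", "ENDPOINT", "REACHABLE", "RESOLVED IP")
    rows = [
        (
            r.service,
            r.endpoint,
            "yes" if r.reachable else "NO",
            f"{r.resolved_ip} ({ip_scope(r.resolved_ip)})" if r.resolved_ip else "-",
        )
        for r in results
    ]
    table = [header, *rows]
    widths = [max(len(row[i]) for row in table) for i in range(len(header))]
    return "\n".join(
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in table
    )
```

A public resolved IP on an unreachable row is the signal that the VPC is missing an endpoint (or private DNS) for that service.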

What it should check

From the call and a read of the CFN template, at minimum:

  • ECR API (api.ecr.<region>.amazonaws.com)
  • ECR DKR (*.dkr.ecr.<region>.amazonaws.com) — image pulls
  • S3 (gateway endpoint; ECR image layers are also downloaded from S3, so image pulls depend on it)
  • CloudWatch Logs
  • SNS
  • SQS
  • Secrets Manager
  • STS
  • API Gateway (interface endpoint, since ApiGatewayVPCEndpointId is a stack param)

The exact list should be derived from the CFN template, not hardcoded — open question below.
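Until the list is derived from the template, a sketch like the following could map the services above to regional hostnames. It assumes the standard `aws` partition (`amazonaws.com` suffix). `service_endpoints`, `registry_account_id`, and `api_id` are hypothetical names; which account hosts the images and which API ID to test are not stated in this issue.

```python
from typing import Dict, Optional


def service_endpoints(
    region: str,
    registry_account_id: str,
    api_id: Optional[str] = None,
) -> Dict[str, str]:
    """Map each service in the checklist to a hostname to probe.

    Assumptions: standard `aws` partition; `registry_account_id` is the account
    that hosts the ECR repositories; `api_id` is an API Gateway API ID, if known.
    """
    suffix = f"{region}.amazonaws.com"
    endpoints = {
        "ECR API": f"api.ecr.{suffix}",
        "ECR DKR": f"{registry_account_id}.dkr.ecr.{suffix}",
        "S3": f"s3.{suffix}",
        "CloudWatch Logs": f"logs.{suffix}",
        "SNS": f"sns.{suffix}",
        "SQS": f"sqs.{suffix}",
        "Secrets Manager": f"secretsmanager.{suffix}",
        "STS": f"sts.{suffix}",
    }
    # API Gateway invoke hostnames are per-API, so only add one when an ID is given.
    if api_id is not None:
        endpoints["API Gateway"] = f"{api_id}.execute-api.{suffix}"
    return endpoints
```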

Implementation sketch

  • Reuse the existing run_task / get_network_config helpers in quiltx/ecs.py to launch a short-lived task in the target VPC/subnet that runs DNS lookups + TCP connect tests against each service endpoint, then returns JSON via logs.
  • Chicken-and-egg: if ECR is unreachable, the probe task itself can't pull its image. Options:
    • Use a public ECR image already cached in AWS (e.g., public.ecr.aws/amazonlinux/amazonlinux) — still needs egress, so doesn't help in the worst case
    • Fall back to script mode: emit a portable bash script the customer runs from any existing EC2 in the VPC, no Quilt deploy required
  • Recommend supporting both modes; script mode is the escape hatch when the probe task itself can't start.
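The DNS lookup plus TCP connect test that the probe task would run can be sketched with the standard library alone, which keeps the probe image requirements minimal. `probe` and `probe_all` are hypothetical names, and the JSON field names are an assumption, not an agreed schema.

```python
import json
import socket
from typing import Iterable


def probe(host: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Resolve `host`, then attempt a TCP connect to the first address returned."""
    result = {
        "host": host,
        "port": port,
        "resolved_ip": None,
        "reachable": False,
        "error": None,
    }
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # DNS failed: nothing to connect to.
        result["error"] = f"dns: {exc}"
        return result

    family, socktype, proto, _canonname, sockaddr = infos[0]
    result["resolved_ip"] = sockaddr[0]
    try:
        with socket.socket(family, socktype, proto) as sock:
            sock.settimeout(timeout)
            sock.connect(sockaddr)
        result["reachable"] = True
    except OSError as exc:
        # Covers timeouts, refused connections, and unroutable networks.
        result["error"] = f"tcp: {exc}"
    return result


def probe_all(hosts: Iterable[str], port: int = 443, timeout: float = 5.0) -> str:
    """Return one JSON object per line, suitable for printing to the task's log."""
    return "\n".join(json.dumps(probe(h, port, timeout)) for h in hosts)
```

A TCP connect on port 443 only shows that a route exists. It does not validate TLS or IAM permissions, so a "reachable" result is necessary but not sufficient.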

Open questions

  • Authoritative list of AWS services each Quilt component calls out to — needs confirmation from the platform team, not just grep.
  • Best base image for the probe task that maximizes chance of starting in a partially-broken VPC.
  • Do we want to also check outbound HTTPS to non-AWS services?

Acceptance criteria

  • quiltx ecs reachability --stack <name> runs end-to-end against a real private-VPC deployment
  • quiltx ecs reachability script emits a self-contained bash script that runs on any Linux EC2 in the VPC
  • Output clearly distinguishes "DNS resolves to private IP via endpoint" vs "DNS resolves to public IP (needs IGW/NAT)" vs "unreachable"
  • Documented in docs/advanced-features/private-endpoint-access.md (separate PR in quilt repo)
  • Service list sourced from CFN template, with a test that fails if the template adds a service the checker doesn't know about
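One possible shape for the drift test in the last criterion, sketched under a loud assumption: that the set of services can be approximated by the IAM action prefixes (`ecr:`, `s3:`, ...) appearing in the template's policy statements. How the list should really be derived is still an open question above. `KNOWN_SERVICES`, `services_in_template`, and `unknown_services` are hypothetical names.

```python
from typing import Any, Iterator, Set

# Assumption: IAM action prefixes for the services the checker already probes.
KNOWN_SERVICES = {
    "ecr", "s3", "logs", "sns", "sqs", "secretsmanager", "sts", "execute-api",
}


def iter_actions(node: Any) -> Iterator[str]:
    """Yield every literal IAM action string under an `Action` key, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "Action":
                if isinstance(value, str):
                    yield value
                elif isinstance(value, list):
                    yield from (v for v in value if isinstance(v, str))
                # Intrinsic functions (e.g. Fn::Sub) are skipped, not guessed at.
            else:
                yield from iter_actions(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_actions(item)


def services_in_template(template: dict) -> Set[str]:
    """Collect the lowercase service prefix of every `service:Action` string."""
    return {a.split(":", 1)[0].lower() for a in iter_actions(template) if ":" in a}


def unknown_services(template: dict) -> Set[str]:
    """Services the template references that the checker has no probe for."""
    return services_in_template(template) - KNOWN_SERVICES
```

A test would load the parsed template and assert that `unknown_services(template)` is empty, so a newly added service fails CI until a probe (or an explicit exemption) is added.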
