
Add BATS test for podman-machine-start.sh supervisor loop #123

@smartwatermelon


Background

The PR that fixes the VM-keepalive regression (commit b329667) converted podman-machine-start.sh from a one-shot script into a long-running supervisor that health-checks transmission-vm and the transmission-vpn container on a 300s cadence. The fix was shellchecked and reviewed, but the supervision state machine has no automated regression coverage.

What to build

A BATS test at tests/podman-machine-start.bats that:

  1. Extracts the wrapper heredoc from app-setup/podman-transmission-setup.sh into a real file in $BATS_TMPDIR, rendering deploy-time placeholders (${HOMEBREW_PREFIX}, ${NFS_MOUNT_POINT}, ${HOST_PORT}, etc.) with test values. The tests/plex-watchdog.bats source_watchdog_functions() helper is a good pattern to mirror.
  2. Shims podman in PATH to a scripted mock that responds to podman machine inspect, podman info, podman container exists, podman inspect, podman rm, podman run with fixtured values — and logs every invocation to $BATS_TMPDIR/podman.calls so the test can assert call order.
  3. Runs the wrapper with SUPERVISE_INTERVAL=1 under timeout 5 (so the loop runs a couple of iterations then is killed), then asserts:
    • machine start was called exactly once when the initial state was stopped
    • machine start was not called on subsequent iterations when state is running
    • podman run -d ... transmission-vpn was called exactly once when the container was missing
    • podman run was not called again when the container existed and was running
    • the script did not exit on its own (the exit was caused by timeout)
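
The shim from step 2 could look roughly like the sketch below. It is only an illustration: write_podman_mock, the PODMAN_CALLS variable, the marker files, and the plain "running"/"stopped" fixture outputs are all assumptions. The real wrapper may read podman output in a different shape (JSON or a format template, for example), so the fixtures would need to match whatever the generated script actually parses.

```shell
# write_podman_mock DIR: create DIR/bin/podman, a scripted stand-in for podman.
# Hypothetical helper; names and fixture values are not from the repository.
write_podman_mock() {
  mkdir -p "$1/bin"
  # Quoted delimiter: nothing below is expanded until the mock itself runs.
  cat >"$1/bin/podman" <<'MOCK'
#!/bin/sh
# Log every invocation, one line per call, so tests can assert call order.
printf '%s\n' "$*" >>"$PODMAN_CALLS"
# Marker files next to the call log hold the fake machine/container state.
state_dir="$(dirname "$PODMAN_CALLS")"
case "$*" in
  "machine inspect"*)
    # Report "running" only once "machine start" has been called.
    if [ -e "$state_dir/machine.running" ]; then echo running; else echo stopped; fi ;;
  "machine start"*)
    touch "$state_dir/machine.running" ;;
  "info"*)
    # Succeeds only while the fake machine is up.
    [ -e "$state_dir/machine.running" ] ;;
  "container exists"*)
    # Exit 0 if the fake container exists, non-zero otherwise.
    [ -e "$state_dir/container.exists" ] ;;
  "inspect"*)
    if [ -e "$state_dir/container.exists" ]; then echo running; else exit 1; fi ;;
  "rm"*)
    rm -f "$state_dir/container.exists" ;;
  "run"*)
    touch "$state_dir/container.exists" ;;
esac
MOCK
  chmod +x "$1/bin/podman"
}
```

In a BATS setup() this might be wired up by calling write_podman_mock "$BATS_TMPDIR", exporting PODMAN_CALLS="$BATS_TMPDIR/podman.calls", and prepending "$BATS_TMPDIR/bin" to PATH. Keeping state in marker files lets the second loop iteration see the effect of the first, which is what the "exactly once" assertions depend on.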

Why it's worth it

The regression that prompted the original fix (vfkit dying 5 seconds after LaunchAgent exit) is a macOS launchd behavior we can't unit-test. But the mitigation — the supervisor loop and its state machine — has testable properties. A test that fails if someone accidentally converts this back to a one-shot, or breaks the idempotency of ensure_container, would catch the exact class of regression that caused the 2026-04-18 outage.
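
Both properties can be checked with two small helpers, sketched here under assumptions: count_calls and ran_until_timeout are hypothetical names, and the call log is assumed to hold one podman invocation per line. The second helper relies on GNU timeout reporting exit status 124 when it has to kill the command, which is how a test can tell "still supervising" apart from "exited on its own".

```shell
# count_calls PREFIX LOG: print how many logged invocations begin with PREFIX.
count_calls() {
  # grep -c prints 0 but exits non-zero when nothing matches; mask that.
  grep -c -- "^$1" "$2" || true
}

# ran_until_timeout SECONDS CMD...: succeed only if CMD was still running
# when timeout killed it (GNU timeout exits with 124 in that case).
ran_until_timeout() {
  local limit="$1" status=0
  shift
  timeout "$limit" "$@" || status=$?
  [ "$status" -eq 124 ]
}

# Possible use inside a BATS test (paths and variable names are illustrative):
#   ran_until_timeout 5 env SUPERVISE_INTERVAL=1 "$wrapper"
#   [ "$(count_calls 'machine start' "$BATS_TMPDIR/podman.calls")" -eq 1 ]
#   [ "$(count_calls 'run' "$BATS_TMPDIR/podman.calls")" -eq 1 ]
```

A reverted one-shot script would exit with its own status before the timeout fires, so ran_until_timeout would fail; a non-idempotent ensure_container would push the podman run count above one.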

Why not now

The test is ~80 lines of BATS plus a podman mock. Unlike the config fix in (2), it's not a one-liner and doesn't block the current PR.

Pointers

  • Existing BATS pattern: tests/plex-watchdog.bats
  • Script heredoc: app-setup/podman-transmission-setup.sh:485 (<<WRAPPER)
  • Supervisor loop: generated script's final while true; do ... done block
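
One possible way to pull the heredoc out and render its placeholders is sketched below. It assumes the body sits between a line containing <<WRAPPER and a line that is exactly WRAPPER, and that the setup script is run by bash. extract_heredoc and render_heredoc are hypothetical names; the existing source_watchdog_functions() helper may take a different approach.

```shell
# extract_heredoc SETUP_SCRIPT TAG: print the lines between "<<TAG" and "TAG".
extract_heredoc() {
  awk -v tag="$2" '
    inside && $0 == tag { exit }        # terminator line: stop
    inside              { print }       # body line: emit unchanged
    index($0, "<<" tag) { inside = 1 }  # opener line: start after it
  ' "$1"
}

# render_heredoc SETUP_SCRIPT TAG: re-expand the body the way an unquoted
# heredoc would, taking placeholder values from exported test variables.
# "set -u" makes rendering fail if a placeholder has no test value.
# Caution: any unescaped $(...) in the body runs at render time, as at deploy.
render_heredoc() {
  local body
  body="$(extract_heredoc "$1" "$2")"
  printf 'set -u\ncat <<%s\n%s\n%s\n' "$2" "$body" "$2" | bash
}
```

A test might export HOMEBREW_PREFIX, NFS_MOUNT_POINT, HOST_PORT and the other placeholders with test values, then run render_heredoc app-setup/podman-transmission-setup.sh WRAPPER >"$BATS_TMPDIR/podman-machine-start.sh". Re-expanding through a real heredoc, rather than substituting with sed, also turns escaped \$ sequences back into plain $ the same way deployment does.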

Metadata

Labels: tech-debt (Technical debt to address)
