vmm: keep virtio activation alive in migration #105

Open
Coffeeri wants to merge 1 commit into cyberus-technology:gardenlinux from Coffeeri:fix/migration-virtio-activation-barrier

@Coffeeri Coffeeri commented Mar 10, 2026

This fixes https://github.com/cobaltcore-dev/cobaltcore/issues/411

Live migration can deadlock if the guest triggers a virtio device
activation while the migration worker owns the VM.

The failure shows up while firmware and the guest boot are running, when
the guest can reset and reinitialize virtio devices during precopy. In
the failing case, the source log shows a pending virtio activation that
never completes:

  8.115833s _virtio-pci-net_0: Needs activation; returning barrier
  8.115854s vmm/src/vm.rs:464 -- Waiting for barrier
  24.875452s Entering downtime phase
  24.875481s stopping vcpu throttling thread
  ...
  vCPU thread did not respond in 10ms to signal - retrying
  vCPU thread did not respond in 20ms to signal - retrying
  ...
  thread 'throttle-vcpu' (1029) panicked
  ...
  Pause(Error signalling vCPUs: Timeout when waiting for signal
        to be acknowledged)

The vCPU blocks on the activation barrier and never reaches the normal
pause checkpoint. Later, migration enters downtime and stops the vCPU
throttle thread. In the failing case, that thread is still inside a
CpuManager::pause() call, which waits for every vCPU to acknowledge
the signal. The blocked vCPU never does, so the pause times out.
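The interaction above can be modelled outside the VMM. The following is a minimal sketch, not the cloud-hypervisor code: `simulate_pause`, the atomic pause flag, and the channel-based acknowledgement are hypothetical stand-ins for the activation barrier and the signal handshake in CpuManager::pause(). It only illustrates why a vCPU parked on the barrier cannot acknowledge a pause.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc;
use std::sync::{Arc, Barrier};
use std::thread;
use std::time::Duration;

/// Hypothetical model: returns true if the "vCPU" acknowledged the pause
/// request within the timeout, false if the pause timed out.
fn simulate_pause(release_barrier_first: bool) -> bool {
    // Two parties: the vCPU thread and whoever drains the activation.
    let barrier = Arc::new(Barrier::new(2));
    let pause_requested = Arc::new(AtomicBool::new(false));
    let (ack_tx, ack_rx) = mpsc::channel();

    let vcpu = {
        let barrier = Arc::clone(&barrier);
        let pause_requested = Arc::clone(&pause_requested);
        thread::spawn(move || {
            // "Needs activation; returning barrier": block until released.
            barrier.wait();
            // Normal pause checkpoint, only reachable after the barrier.
            while !pause_requested.load(Ordering::Acquire) {
                thread::sleep(Duration::from_millis(1));
            }
            let _ = ack_tx.send(());
        })
    };

    if release_barrier_first {
        // Someone handles the activation event and joins the barrier.
        barrier.wait();
    }

    // Pause: signal the vCPU and wait for the acknowledgement.
    pause_requested.store(true, Ordering::Release);
    let acked = ack_rx.recv_timeout(Duration::from_millis(100)).is_ok();

    if !release_barrier_first {
        // Release the barrier afterwards only so the model thread can exit.
        barrier.wait();
    }
    vcpu.join().unwrap();
    acked
}
```

Calling `simulate_pause(false)` models the failing case, where nobody joins the barrier before the pause and the acknowledgement times out. `simulate_pause(true)` models the case where the activation is drained first and the pause is acknowledged.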

The VMM already receives ActivateVirtioDevices events during
migration, but it only drains pending activations when self.vm is in
MaybeVmOwnership::Vmm. Once vm_send_migration() moves the Vm into the
migration worker, self.vm becomes MaybeVmOwnership::Migration and the
event handler no longer has a path to call activate_virtio_devices().
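A simplified sketch of the ownership problem, under stated assumptions: the type and variant names follow the description above, but the fields, the `pending_activations` counter, and the `handle_activate_event` function are hypothetical and do not mirror the real signatures in the VMM.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in: only tracks how many devices await activation.
struct DeviceManager {
    pending_activations: usize,
}

impl DeviceManager {
    /// Drains pending activations and reports how many were handled.
    fn activate_virtio_devices(&mut self) -> usize {
        std::mem::take(&mut self.pending_activations)
    }
}

/// Hypothetical stand-in for the Vm; assumed to hold a shared DeviceManager.
struct Vm {
    device_manager: Arc<Mutex<DeviceManager>>,
}

/// Ownership state before the fix: the Migration variant carries nothing.
enum MaybeVmOwnership {
    Vmm(Vm),
    Migration,
}

/// Models the ActivateVirtioDevices event handler on the VMM thread.
/// Returns None when there is no path to the DeviceManager.
fn handle_activate_event(owner: &MaybeVmOwnership) -> Option<usize> {
    match owner {
        MaybeVmOwnership::Vmm(vm) => {
            Some(vm.device_manager.lock().unwrap().activate_virtio_devices())
        }
        // The Vm lives in the migration worker, so the event is dropped
        // and the activation barrier is never released.
        MaybeVmOwnership::Migration => None,
    }
}
```

In this model, an event that arrives while the state is `Migration` returns `None` and leaves `pending_activations` untouched, which corresponds to the vCPU waiting on the barrier indefinitely.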

Fix this by storing the DeviceManager inside
MaybeVmOwnership::Migration. This keeps just enough state on the VMM
thread to drain pending virtio activations while the migration worker
owns the Vm. The barrier logic stays unchanged. The VMM now releases
the same activation barrier during migration that it already released
before migration started.
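The shape of the fix can be sketched as follows. This is an illustration under assumptions, not the actual patch: it assumes the DeviceManager is shared behind an `Arc<Mutex<...>>`, and `begin_migration`, `device_manager`, `handle_activate_event`, and the `pending_activations` counter are hypothetical names.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in: only tracks how many devices await activation.
struct DeviceManager {
    pending_activations: usize,
}

impl DeviceManager {
    /// Drains pending activations and reports how many were handled.
    fn activate_virtio_devices(&mut self) -> usize {
        std::mem::take(&mut self.pending_activations)
    }
}

/// Hypothetical stand-in for the Vm; assumed to hold a shared DeviceManager.
struct Vm {
    device_manager: Arc<Mutex<DeviceManager>>,
}

/// Ownership state after the fix: Migration keeps a DeviceManager handle.
enum MaybeVmOwnership {
    Vmm(Vm),
    Migration(Arc<Mutex<DeviceManager>>),
}

impl MaybeVmOwnership {
    /// Both states now expose a path to the DeviceManager.
    fn device_manager(&self) -> &Arc<Mutex<DeviceManager>> {
        match self {
            MaybeVmOwnership::Vmm(vm) => &vm.device_manager,
            MaybeVmOwnership::Migration(dm) => dm,
        }
    }
}

/// Models handing the Vm to the migration worker while the VMM thread
/// keeps just the DeviceManager handle.
fn begin_migration(vm: Vm) -> (MaybeVmOwnership, Vm) {
    let dm = Arc::clone(&vm.device_manager);
    (MaybeVmOwnership::Migration(dm), vm)
}

/// Models the ActivateVirtioDevices event handler after the fix.
fn handle_activate_event(owner: &MaybeVmOwnership) -> usize {
    owner.device_manager().lock().unwrap().activate_virtio_devices()
}
```

With this layout the event handler drains activations in either state, and the migration worker still owns the Vm itself. Sharing only the DeviceManager handle keeps the amount of state retained on the VMM thread small.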

This keeps the guest from getting stuck in the activation wait and
lets the later pause succeed.

See the reproducer in libvirt-tests.

Steps to un-draft

@tpressure

Very nice debugging!

@Coffeeri Coffeeri force-pushed the fix/migration-virtio-activation-barrier branch from a5d2b4e to 3365d39 Compare March 10, 2026 15:09
@Coffeeri Coffeeri self-assigned this Mar 11, 2026
@tpressure

@Coffeeri, can you please add a test in libvirt-tests that does a couple of migrations directly after the VM boots?

@Coffeeri Coffeeri force-pushed the fix/migration-virtio-activation-barrier branch 3 times, most recently from 9136261 to 4933eb1 Compare March 12, 2026 07:26
@Coffeeri Coffeeri marked this pull request as ready for review March 12, 2026 07:45
@Coffeeri Coffeeri requested a review from arctic-alpaca March 12, 2026 07:45
@Coffeeri Coffeeri force-pushed the fix/migration-virtio-activation-barrier branch 2 times, most recently from 7905117 to 4810202 Compare March 12, 2026 08:42
@Coffeeri Coffeeri requested review from phip1611 and tpressure March 12, 2026 08:42
On-behalf-of: SAP leander.kohler@sap.com
Signed-off-by: Leander Kohler <leander.kohler@cyberus-technology.de>
@Coffeeri Coffeeri force-pushed the fix/migration-virtio-activation-barrier branch from 4810202 to 77d800d Compare March 13, 2026 08:56
@phip1611 phip1611 left a comment

LGTM
