vmm: keep virtio activation alive in migration #105
Open
Coffeeri wants to merge 1 commit into cyberus-technology:gardenlinux
Very nice debugging!
a5d2b4e to 3365d39
@Coffeeri, can you please add a test in libvirt-tests that does a couple of migrations directly after the VM boots?
phip1611
reviewed
Mar 11, 2026
9136261 to 4933eb1
7905117 to 4810202
Live migration can deadlock if the guest triggers a virtio device
activation while the migration worker owns the VM.
The failure shows up during firmware and early boot, when the guest can
reset and reinitialize virtio devices while precopy is running. In the
failing case, the source log shows a pending virtio activation that
never completes:
8.115833s _virtio-pci-net_0: Needs activation; returning barrier
8.115854s vmm/src/vm.rs:464 -- Waiting for barrier
24.875452s Entering downtime phase
24.875481s stopping vcpu throttling thread
...
vCPU thread did not respond in 10ms to signal - retrying
vCPU thread did not respond in 20ms to signal - retrying
...
thread 'throttle-vcpu' (1029) panicked
...
Pause(Error signalling vCPUs: Timeout when waiting for signal
to be acknowledged)
The vCPU blocks on the activation barrier and never reaches the normal
pause checkpoint. Later, migration enters downtime and stops the vCPU
throttle thread. In the failing case, that thread is still inside a
CpuManager::pause() call, which waits for every vCPU to acknowledge
the signal. The blocked vCPU never does, so the pause times out.
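The interaction described above can be reproduced in a toy model. The following is a minimal sketch using only the Rust standard library, not the actual VMM code: `pause_succeeds` and its parameters are hypothetical names, the "vCPU" is a plain thread, and the pause acknowledgement is modeled as a channel message.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Barrier};
use std::thread;
use std::time::Duration;

/// Toy model (hypothetical, not the real CpuManager::pause()).
/// Returns true if the "vCPU" acknowledges the pause request within `timeout`.
/// `release_barrier` models whether the VMM side ever drains the activation.
fn pause_succeeds(release_barrier: bool, timeout: Duration) -> bool {
    // Two parties: the vCPU thread and the VMM thread that activates the device.
    let barrier = Arc::new(Barrier::new(2));
    let (ack_tx, ack_rx) = mpsc::channel::<()>();

    let vcpu_barrier = Arc::clone(&barrier);
    thread::spawn(move || {
        // The guest triggered a device activation: block until the VMM releases us.
        vcpu_barrier.wait();
        // Only now does the vCPU reach its pause checkpoint and acknowledge.
        let _ = ack_tx.send(());
    });

    if release_barrier {
        // VMM side drains the pending activation, which releases the barrier.
        barrier.wait();
    }

    // The pause path waits for the acknowledgement, with a timeout.
    // If the barrier is never released, the vCPU thread stays blocked
    // (it is intentionally leaked in this sketch) and this times out.
    ack_rx.recv_timeout(timeout).is_ok()
}
```

Calling `pause_succeeds(false, ...)` corresponds to the failing case in the log: nobody releases the barrier, so the acknowledgement never arrives and the pause times out.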
The VMM already receives ActivateVirtioDevices events during
migration, but it only drains pending activations when self.vm is in
MaybeVmOwnership::Vmm. Once vm_send_migration() moves the Vm into the
migration worker, self.vm becomes MaybeVmOwnership::Migration and the
event handler no longer has a path to call activate_virtio_devices().
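To illustrate the gap, here is a simplified sketch of the ownership state before the fix. The `Vm` type, its `pending_activations` counter, and `handle_activate_event` are hypothetical stand-ins; only the names `MaybeVmOwnership`, `Vmm`, `Migration`, and `activate_virtio_devices()` come from the description above, and the real types carry far more state.

```rust
/// Hypothetical stand-in for the real Vm: it only counts devices
/// that are waiting for activation.
struct Vm {
    pending_activations: usize,
}

impl Vm {
    /// Drain all pending activations and return how many were handled.
    fn activate_virtio_devices(&mut self) -> usize {
        std::mem::take(&mut self.pending_activations)
    }
}

/// Simplified ownership state as described above (before the fix).
enum MaybeVmOwnership {
    /// The VMM thread owns the Vm.
    Vmm(Vm),
    /// The Vm was moved into the migration worker; nothing is left here.
    Migration,
}

/// Sketch of the ActivateVirtioDevices event handler.
/// Returns the number of activations it was able to drain.
fn handle_activate_event(owner: &mut MaybeVmOwnership) -> usize {
    match owner {
        MaybeVmOwnership::Vmm(vm) => vm.activate_virtio_devices(),
        // No Vm and no DeviceManager reachable: the event is effectively
        // dropped and the waiting vCPU stays blocked on its barrier.
        MaybeVmOwnership::Migration => 0,
    }
}
```

In this model the event still arrives in the `Migration` state, but the handler has nothing to act on.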
Fix this by storing the DeviceManager inside
MaybeVmOwnership::Migration. This keeps just enough state on the VMM
thread to drain pending virtio activations while the migration worker
owns the Vm. The barrier logic stays unchanged. The VMM now releases
the same activation barrier during migration that it already released
before migration started.
This keeps the guest from getting stuck in the activation wait and
lets the later pause succeed.
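The shape of the fix can be sketched as follows. This is an illustration under assumptions, not the code from this commit: the `DeviceManager` here only holds a list of barriers, sharing it through `Arc<Mutex<...>>` is an assumption, and `handle_activate_event` and `demo_migration_drains_activation` are hypothetical helpers.

```rust
use std::sync::{Arc, Barrier, Mutex};
use std::thread;

/// Hypothetical stand-in for the real DeviceManager: it only tracks the
/// barriers of devices that are waiting for activation.
#[derive(Default)]
struct DeviceManager {
    pending_activations: Vec<Arc<Barrier>>,
}

impl DeviceManager {
    /// Activate every pending device and release the vCPU waiting on its
    /// barrier. Returns the number of activations drained.
    fn activate_virtio_devices(&mut self) -> usize {
        let drained: Vec<Arc<Barrier>> = self.pending_activations.drain(..).collect();
        for barrier in &drained {
            barrier.wait();
        }
        drained.len()
    }
}

#[allow(dead_code)]
struct Vm {
    device_manager: Arc<Mutex<DeviceManager>>,
}

#[allow(dead_code)]
enum MaybeVmOwnership {
    Vmm(Vm),
    /// After the fix: the VMM thread keeps a handle to the DeviceManager
    /// while the migration worker owns the Vm.
    Migration(Arc<Mutex<DeviceManager>>),
}

/// Sketch of the event handler: both states can now drain activations.
fn handle_activate_event(owner: &MaybeVmOwnership) -> usize {
    let dm = match owner {
        MaybeVmOwnership::Vmm(vm) => &vm.device_manager,
        MaybeVmOwnership::Migration(dm) => dm,
    };
    dm.lock().unwrap().activate_virtio_devices()
}

/// Returns true if a vCPU blocked on an activation barrier is released
/// while the ownership state is `Migration`.
fn demo_migration_drains_activation() -> bool {
    let barrier = Arc::new(Barrier::new(2));
    let dm = Arc::new(Mutex::new(DeviceManager::default()));
    dm.lock().unwrap().pending_activations.push(Arc::clone(&barrier));

    // The "vCPU" blocks on the activation barrier.
    let vcpu_barrier = Arc::clone(&barrier);
    let vcpu = thread::spawn(move || {
        vcpu_barrier.wait();
    });

    // The VMM thread handles the event during migration.
    let owner = MaybeVmOwnership::Migration(Arc::clone(&dm));
    let drained = handle_activate_event(&owner);

    vcpu.join().is_ok() && drained == 1
}
```

Keeping only a `DeviceManager` handle in the `Migration` variant mirrors the "just enough state" idea: the VMM thread gains no access to the rest of the `Vm`, yet the barrier is released the same way as before migration.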
On-behalf-of: SAP leander.kohler@sap.com
Signed-off-by: Leander Kohler <leander.kohler@cyberus-technology.de>
4810202 to 77d800d
phip1611
reviewed
Mar 13, 2026
This fixes https://github.com/cobaltcore-dev/cobaltcore/issues/411
See reproducer in libvirt-tests
Steps to un-draft