want policy for handling time drift between machines in migration #357

@jordanhendricks

Part of ongoing work for #337

One aspect of handling time across live migrations is accounting for the delta between reading the timing data on the source side and writing it out on the destination side. To compute this delta on the destination, the current implementation of #337 uses the wall clock times of both machines. This rests on one extremely load-bearing assumption: that NTP is working properly and both machines' wall clocks are synchronized.
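To make the failure mode concrete, here is a minimal sketch of a signed wall-clock delta. The function name and the choice of `i128` nanoseconds are assumptions for illustration and are not taken from the #337 implementation. It shows that the result goes negative as soon as the source's clock is ahead of the destination's by more than the real transfer time.

```rust
use std::time::SystemTime;

/// Signed difference `dst_write - src_read` in nanoseconds.
/// Hypothetical helper: the name and return type are illustrative only.
fn wall_clock_delta_ns(src_read: SystemTime, dst_write: SystemTime) -> i128 {
    match dst_write.duration_since(src_read) {
        // Destination timestamp is at or after the source timestamp.
        Ok(elapsed) => elapsed.as_nanos() as i128,
        // Destination timestamp is earlier: the error carries the
        // magnitude of the difference, so negate it.
        Err(err) => -(err.duration().as_nanos() as i128),
    }
}
```

`SystemTime::duration_since` returns an error when the second timestamp is later than the first, which is exactly the case that clock skew between machines can produce here.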

In my testing so far on a lab cluster, I have observed negative migration deltas (magnitude <2000 usecs) between the source read and the target receipt of the data. My current implementation throws up its hands if it sees a negative delta. The lab cluster I'm using syncs against a public NTP server (see also: https://github.com/oxidecomputer/meta/issues/146); in production we will have a local NTP server, presumably with a much tighter window for the wall clocks.

I'm still considering how to handle this properly (without somehow re-implementing NTP). My initial thought for now is to clamp any negative delta to 0 and log a warning when we see one. This is incomplete, as a positive delta can also be far off. I want to do some more thinking about this, and I also want to explore what monitoring we will have elsewhere in the software stack for verifying that NTP is working.
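A rough sketch of the clamp-and-warn idea, assuming the delta has already been computed as a signed nanosecond count. The function name is hypothetical, and returning the warning text to the caller stands in for whatever logging the real code would use.

```rust
use std::time::Duration;

/// Clamp a signed migration delta (nanoseconds) to a non-negative Duration.
/// Returns the delta to apply, plus a warning message if clamping occurred.
/// Hypothetical helper: not the actual #337 code.
fn clamp_migration_delta(delta_ns: i128) -> (Duration, Option<String>) {
    if delta_ns < 0 {
        let warning = format!(
            "negative migration time delta ({} ns); clamping to 0",
            delta_ns
        );
        (Duration::ZERO, Some(warning))
    } else {
        // Saturate instead of wrapping if the value exceeds u64 nanoseconds.
        let ns = delta_ns.min(u64::MAX as i128) as u64;
        (Duration::from_nanos(ns), None)
    }
}
```

This only covers the negative case. An implausibly large positive delta passes through unchanged, which is the part of the policy that is still undecided.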

Labels: migration (Issues related to live migration.)
