want policy for handling time drift between machines in migration #357

Part of ongoing work for #337 

One aspect of handling time across live migrations is accounting for the delta between reading the time data on the source side and writing it out on the destination side. In this step on the destination, the current implementation of #337 calculates that delta from the wall clock times of both machines. Using wall clocks to find the time delta rests on one extremely load-bearing assumption: that NTP is working properly and both machines' wall clocks are synchronized.

In my testing so far on a lab cluster, I have observed negative migration deltas (less than 2000 usecs in magnitude) between the source read and the target receipt of the data. My current implementation throws up its hands if it sees a negative delta. The lab cluster I'm using uses a public NTP server (see also: https://github.com/oxidecomputer/meta/issues/146); in production we will have a local NTP server, which should presumably keep the wall clocks within a much tighter window.

I'm still considering how to handle this properly (without somehow re-implementing NTP). My initial thought for now is to clamp any negative delta to 0 and log a warning when we see one. This is incomplete, as a positive delta can also be far off. I want to do some more thinking about this, and I also want to explore what monitoring we will have elsewhere in the software stack for verifying that NTP is working.
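
To make the clamp-and-warn idea concrete, here is a minimal sketch of what it could look like. This is not the code from #337: the `MigrationDelta` type, the `migration_delta` function, and the use of `eprintln!` in place of a real logger are all hypothetical names chosen for illustration. It assumes both timestamps are available as `std::time::SystemTime` values, and it does nothing about the positive-delta case.

```rust
use std::time::{Duration, SystemTime};

/// Result of comparing the source read time with the destination write
/// time. Hypothetical type, for illustration only.
#[derive(Debug, PartialEq)]
pub enum MigrationDelta {
    /// The destination wall clock was at or ahead of the source wall clock.
    Observed(Duration),
    /// The destination wall clock appeared to be behind the source, so the
    /// delta was clamped to zero. Carries the magnitude of the apparent skew.
    ClampedFromNegative(Duration),
}

impl MigrationDelta {
    /// The delta to actually apply when adjusting guest time.
    pub fn as_duration(&self) -> Duration {
        match self {
            MigrationDelta::Observed(d) => *d,
            MigrationDelta::ClampedFromNegative(_) => Duration::ZERO,
        }
    }
}

/// Compute the wall-clock delta between the source read and the destination
/// write, clamping a negative result to zero and emitting a warning.
pub fn migration_delta(src_read: SystemTime, dst_write: SystemTime) -> MigrationDelta {
    match dst_write.duration_since(src_read) {
        Ok(delta) => MigrationDelta::Observed(delta),
        // `duration_since` returns an error when `dst_write` is earlier than
        // `src_read`; the error carries the magnitude of the difference.
        Err(err) => {
            let skew = err.duration();
            // Stand-in for a real logger call (assumption).
            eprintln!(
                "warning: negative migration time delta ({} usecs); clamping to 0",
                skew.as_micros()
            );
            MigrationDelta::ClampedFromNegative(skew)
        }
    }
}
```

Returning the skew magnitude alongside the clamped value, rather than a bare `Duration`, would let the caller decide later whether a large skew should fail the migration instead of being silently absorbed.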



