Proposal
Today the tsdb/agent truncates the WAL segments based on the following assumption (prometheus/tsdb/agent/db.go, line 676 at f50ff0a):

    // The lower two-thirds of segments should contain mostly obsolete samples.

This assumption is not verified before truncation occurs.
Because of this, it's really hard to determine how much downtime can be tolerated in a remote write configuration, since it depends on TruncateFrequency and the rate of incoming data. This leads to a much higher TruncateFrequency than is really necessary and much larger WALs that must be fully replayed on startup. Internally we run with a 15 minute interval, as we are okay with trading off downtime tolerance for less memory usage.
I would propose that remote.Storage expose the ability to subscribe to be notified when a segment changes. Subscribers would be called after all current queues have read past a segment (sample implementation).
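To make the idea concrete, here is a minimal sketch of what such a subscription could look like. It is not the sample implementation referenced above and not an existing Prometheus API: SegmentNotifier, Subscribe, and SetSegment are hypothetical names. It assumes each queue reports the segment it is currently reading, and it notifies subscribers whenever the lowest segment across all queues moves forward.

```go
package main

import "sync"

// SegmentNotifier is a hypothetical helper, not part of Prometheus. It tracks
// the WAL segment each remote-write queue is currently reading and calls
// subscribers when the lowest segment across all queues advances.
type SegmentNotifier struct {
	mtx       sync.Mutex
	positions map[string]int // queue name -> segment currently being read
	lowest    int            // lowest segment still in use; -1 until known
	subs      []func(segment int)
}

// NewSegmentNotifier returns a notifier with no queues and no subscribers.
func NewSegmentNotifier() *SegmentNotifier {
	return &SegmentNotifier{positions: map[string]int{}, lowest: -1}
}

// Subscribe registers fn to be called with the new lowest in-use segment.
// Every segment below that value has been read past by all current queues.
func (n *SegmentNotifier) Subscribe(fn func(segment int)) {
	n.mtx.Lock()
	defer n.mtx.Unlock()
	n.subs = append(n.subs, fn)
}

// SetSegment records that queue is now reading segment and notifies
// subscribers if the minimum across all queues moved forward. A queue that
// joins later at an older segment does not trigger a notification, which
// matches the "all current queues" wording of the proposal.
func (n *SegmentNotifier) SetSegment(queue string, segment int) {
	n.mtx.Lock()
	n.positions[queue] = segment
	low := -1
	for _, s := range n.positions {
		if low == -1 || s < low {
			low = s
		}
	}
	var subs []func(int)
	if low > n.lowest {
		n.lowest = low
		subs = append(subs, n.subs...)
	}
	n.mtx.Unlock()

	// Call subscribers outside the lock so they can call back in safely.
	for _, fn := range subs {
		fn(low)
	}
}
```

With two queues "a" and "b", subscribers only hear about a segment once the slower of the two has moved past it, which is the property the truncation logic would rely on.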
Once segment notifications are working, the tsdb/agent could subscribe and truncate based on segments that have been fully read. At this point we could consider dropping the default TruncateFrequency, allowing for smaller WALs.
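On the agent side, the handler for such a notification could be as small as the sketch below. The segmentTruncator type and its truncate hook are hypothetical stand-ins for the tsdb/agent; a real implementation would also have to deal with checkpointing and series garbage collection, which are left out here.

```go
package main

// segmentTruncator is a hypothetical stand-in for the tsdb/agent side of the
// proposal. It truncates only segments that every queue has finished reading.
type segmentTruncator struct {
	first    int                  // first segment still on disk
	truncate func(upTo int) error // hypothetical hook: removes all segments below upTo
}

// onSegmentRead handles a notification that all queues have read past every
// segment below lowestInUse.
func (t *segmentTruncator) onSegmentRead(lowestInUse int) error {
	if lowestInUse <= t.first {
		return nil // nothing new has been fully read
	}
	if err := t.truncate(lowestInUse); err != nil {
		return err // keep t.first unchanged so the next notification retries
	}
	t.first = lowestInUse
	return nil
}
```

Because truncation is driven by what has actually been read rather than by a timer, the WAL in this sketch never drops a segment that a queue still needs, and it never keeps one longer than the slowest queue requires.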
I'm not sure how much of this, if any, is applicable to the tsdb proper.