
Metrics for crucible disks #1073

Open
leftwo wants to merge 3 commits into master from alan/cru-wants-oximeter

@leftwo
Contributor

@leftwo leftwo commented Mar 9, 2026

Crucible disk metrics were missing because set_metric_consumer was called on a DeviceAttachment before Crucible's queues were associated with it. The original implementation only walked the already-associated queues and handed the consumer to each one's QueueMinder. Any queue that arrived late simply never received it.

The fix stores the MetricConsumer in QueueColState (the collection-level state guarded by the collection's mutex), then in queue_associate propagates it to any newly-arriving queue's minder before the minder is installed into the slot. This mirrors exactly how the paused flag is already propagated to late-associating queues.
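A minimal sketch of that propagation pattern, for readers who want to see the shape of the fix. This is a simplified model, not the actual propolis code: the type names mirror the ones mentioned above, but every field, method body, and the `QueueCollection` wrapper and `demo_late_queue` helper are assumptions made for illustration.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the metric consumer: it only counts completions.
#[derive(Default)]
pub struct MetricConsumer {
    completed: Mutex<u64>,
}

impl MetricConsumer {
    pub fn record(&self) {
        *self.completed.lock().unwrap() += 1;
    }
    pub fn completed(&self) -> u64 {
        *self.completed.lock().unwrap()
    }
}

/// Hypothetical stand-in for the per-queue minder.
#[derive(Default)]
pub struct QueueMinder {
    consumer: Mutex<Option<Arc<MetricConsumer>>>,
    paused: Mutex<bool>,
}

impl QueueMinder {
    pub fn set_metric_consumer(&self, consumer: Arc<MetricConsumer>) {
        *self.consumer.lock().unwrap() = Some(consumer);
    }
    pub fn set_paused(&self, paused: bool) {
        *self.paused.lock().unwrap() = paused;
    }
    pub fn is_paused(&self) -> bool {
        *self.paused.lock().unwrap()
    }
    /// Returns true if the completion was reported to a consumer.
    pub fn complete_io(&self) -> bool {
        let guard = self.consumer.lock().unwrap();
        if let Some(consumer) = guard.as_ref() {
            consumer.record();
            return true;
        }
        false
    }
}

/// Collection-level state, guarded by the collection's mutex.
#[derive(Default)]
struct QueueColState {
    queues: HashMap<usize, Arc<QueueMinder>>,
    paused: bool,
    // The stored consumer is what lets late-arriving queues pick it up.
    metric_consumer: Option<Arc<MetricConsumer>>,
}

#[derive(Default)]
pub struct QueueCollection {
    state: Mutex<QueueColState>,
}

impl QueueCollection {
    /// Hand the consumer to every queue already present, then remember it.
    pub fn set_metric_consumer(&self, consumer: Arc<MetricConsumer>) {
        let mut state = self.state.lock().unwrap();
        for minder in state.queues.values() {
            minder.set_metric_consumer(consumer.clone());
        }
        state.metric_consumer = Some(consumer);
    }

    /// Propagate the paused flag and any stored consumer to the new minder
    /// before installing it into its slot.
    pub fn queue_associate(&self, id: usize, minder: Arc<QueueMinder>) {
        let mut state = self.state.lock().unwrap();
        minder.set_paused(state.paused);
        if let Some(consumer) = state.metric_consumer.as_ref() {
            minder.set_metric_consumer(consumer.clone());
        }
        state.queues.insert(id, minder);
    }
}

/// Consumer is set first, the queue associates afterwards.
/// Returns (whether the late queue reported, completions recorded).
pub fn demo_late_queue() -> (bool, u64) {
    let collection = QueueCollection::default();
    let consumer = Arc::new(MetricConsumer::default());
    collection.set_metric_consumer(consumer.clone());

    let late = Arc::new(QueueMinder::default());
    collection.queue_associate(0, late.clone());

    let counted = late.complete_io();
    (counted, consumer.completed())
}
```

In this model, dropping the `metric_consumer` field and the `if let` in `queue_associate` reproduces the original symptom: a queue associated after `set_metric_consumer` never reports anything.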

A minor correctness fix is also included: the minder Option is now accessed with as_ref() instead of as_mut(), since set_metric_consumer takes &self and does not need a mutable borrow.
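A small sketch of why as_ref() is enough here. The `Slot` and `Minder` types below are hypothetical and only model the borrow shape, not the real propolis structures.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical minder whose setter takes `&self` (interior mutability).
pub struct Minder {
    has_consumer: AtomicBool,
}

impl Minder {
    pub fn set_metric_consumer(&self) {
        self.has_consumer.store(true, Ordering::Relaxed);
    }
}

/// Hypothetical slot holding an optional minder.
pub struct Slot {
    minder: Option<Arc<Minder>>,
}

impl Slot {
    /// `as_ref()` only needs `&self`. Using `as_mut()` here would force this
    /// method to take `&mut self` even though nothing in the Option changes.
    pub fn notify(&self) -> bool {
        if let Some(minder) = self.minder.as_ref() {
            minder.set_metric_consumer();
            true
        } else {
            false
        }
    }
}
```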

@jmcarp

jmcarp commented Mar 11, 2026

I just ran into this myself. Looking forward to this landing so my dashboards will look nicer.

@leftwo
Contributor Author

leftwo commented Mar 11, 2026

Fix for: #1077

@leftwo
Contributor Author

leftwo commented Mar 11, 2026

Crucible disk metrics from Dublin running this repo:
Screenshot 2026-03-11 at 2 05 04 PM

@leftwo leftwo marked this pull request as ready for review March 11, 2026 21:22
@leftwo
Contributor Author

leftwo commented Mar 11, 2026

@iximeow asked whether scrub IO counts toward these metrics.

To answer, here we have a propolis server running a scrub:

23:19:20.856Z INFO propolis-server (vm_state_driver): Scrub check for f04ccd2c-bbc9-4db0-9f3a-9c8bc25ec122               
23:19:20.856Z INFO propolis-server (vm_state_driver): Scrub pause 120 seconds before starting                            
23:21:20.857Z INFO propolis-server (vm_state_driver): Scrub for f04ccd2c-bbc9-4db0-9f3a-9c8bc25ec122 begins              
23:21:20.857Z INFO propolis-server (vm_state_driver): Scrub with total_size:85899345920 block_size:512                   
23:21:20.857Z INFO propolis-server (vm_state_driver): Scrubs from block 0 to 167772160 in (256) 131072 size IOs pm:25 

So the scrub is copying data from a read-only parent onto a new, blank disk.
We can see this in dtrace output, where we read from one volume and write to the other:

  PID     UUID  SESSION DS0 DS1 DS2   NEXT_JOB  DELTA CONN   ELR   ELC   ERR   ERN                                       
16419 1122ef37 de0b7fed ACT ACT ACT      18359     37    3     0     0     0     0                                       
16419 f04ccd2c f6dbfa5c ACT ACT ACT      20817     37    3     0     0     0     0                                       
16419 1122ef37 de0b7fed ACT ACT ACT      18397     38    3     0     0     0     0                                       
16419 f04ccd2c f6dbfa5c ACT ACT ACT      20854     37    3     0     0     0     0           

Looking in the console, we can see this:

Screenshot 2026-03-11 at 4 29 01 PM

We see the traffic from the initial boot, then IOs go to zero.
Given we are doing 1M IOs, we should see something in the metrics if the scrub was counted.
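For a sense of how much scrub traffic that is, the numbers in the scrub log lines above can be unpacked as follows. This is only arithmetic on the logged values; reading `(256)` as blocks per IO is an assumption, though it is consistent with 256 * 512 = 131072. The `ScrubPlan` type and `scrub_plan` function are hypothetical names for this sketch.

```rust
/// Derived quantities for a scrub, computed from the logged parameters.
#[derive(Debug, PartialEq)]
pub struct ScrubPlan {
    pub total_blocks: u64,
    pub io_bytes: u64,
    pub io_count: u64,
}

/// `blocks_per_io` is assumed to be the "(256)" value in the scrub log line.
pub fn scrub_plan(total_size: u64, block_size: u64, blocks_per_io: u64) -> ScrubPlan {
    let total_blocks = total_size / block_size;
    ScrubPlan {
        total_blocks,
        io_bytes: blocks_per_io * block_size,
        // Round up so a partial final IO is still counted.
        io_count: (total_blocks + blocks_per_io - 1) / blocks_per_io,
    }
}
```

With the logged values, `scrub_plan(85899345920, 512, 256)` gives 167772160 blocks (matching the log), 131072 bytes per IO, and 655360 IOs in total.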

A few minutes later, we see a little traffic on the disk, but I suspect this is from whatever background activities the boot disk is logging. I don't see traffic here to indicate the scrub traffic is being counted.

Screenshot 2026-03-11 at 4 39 16 PM

Now I have a question for myself: why are these not counted?
