Add Cosmo sequencer diagnosis engine by mkeeter · Pull Request #2436 · oxidecomputer/hubris

mkeeter · 2026-03-13T14:18:11Z

This roughly implements the A0 Sequencing Fault Tree, with the goal of logging a single actionable line in a ringbuf upon sequencing failure:

 NDX LINE      GEN    COUNT PAYLOAD
   0  552        1        1 Diagnosis { now_ms: 0x54195, details: WaitingForSlpCheckpoint { why: Sp5StuckInS5Sleep } }

It also dumps the full register state at the same time, in a separate RAW ringbuf (for ease of reading).

mkeeter · 2026-03-13T14:18:42Z

drv/cosmo-seq-server/src/diagnose.rs

+
+#[derive(Copy, Clone, PartialEq, counters::Count)]
+pub(crate) enum WhyWaitingForGroupA {
+    RailIssue(#[count(children)] RailIssue, GroupARail),


We can only count one child here; any strong opinions on whether it's better to count the RailIssue or the GroupARail?

mkeeter · 2026-03-13T14:34:24Z

closes #2372 and #2398

labbott · 2026-03-13T19:20:54Z

app/cosmo/base.toml

+max-sizes = {flash = 131072, ram = 32768 }
+stacksize = 3200


I'm having a very strong "why is this so big" reaction but I think the answer is that if we're going to be saving all the registers (which we want!) it's going to take up space.

Yeah, we're unpacking into the SomeRegisterView type, which goes from a u32 to a struct with a bunch of individual bool values – it's much easier to read in the ringbuf, but is necessarily larger!

it looks like it's also generating a pretty huge counters table for counting all the diagnosis types and subtypes...

@hawkw do counters show up in stack size? They're obviously allocated elsewhere, but dunno if they add significant expense to ringbuf_entry!.

They shouldn't make the stacks much bigger; I was referring to the RAM size increase.

drv/cosmo-seq-server/src/diagnose.rs

hawkw · 2026-03-13T20:59:10Z

drv/cosmo-seq-server/src/diagnose.rs

+#[derive(Copy, Clone, PartialEq, counters::Count)]
+#[allow(non_camel_case_types)]
+pub(crate) enum GroupARail {
+    V1P5_RTC,
+    V3P3_SP5_A1,
+    V1P8_SP5_A1,
+}
+
+#[derive(Copy, Clone, PartialEq, counters::Count)]
+#[allow(non_camel_case_types)]
+pub(crate) enum GroupBRail {
+    V1P1_SP5,
+}
+
+#[derive(Copy, Clone, PartialEq, counters::Count)]
+#[allow(non_camel_case_types)]
+pub(crate) enum Ddr5HscRail {
+    DDR5_ABCDEF,
+    DDR5_GHIJKL,
+}
+
+#[derive(Copy, Clone, PartialEq, counters::Count)]
+#[allow(non_camel_case_types)]
+pub(crate) enum GroupCRail {
+    VDDIO_SP5_A0,
+    VDDCR_CPU0,
+    VDDCR_CPU1,
+    VDDCR_SOC,
+}


hmm, elsewhere in the sequencer, i have written out rail names in enum variants in CamelCase, as in:

hubris/drv/cosmo-seq-server/src/vcore.rs

Lines 33 to 39 in 4b54fdb

#[derive(Copy, Clone, PartialEq, microcbor::Encode)]

pub(crate) enum Rail {

#[cbor(rename = "VDDCR_CPU0_A0")]

VddcrCpu0,

#[cbor(rename = "VDDCR_CPU1_A0")]

VddcrCpu1,

}

I have a mild preference against violating Rust's case/capitalization style guide, because I think it's helpful to be able to look at an identifier and know "this is a type/variable/constant", but...it's slightly less upsetting to me since these are enums variants and will be written out as e.g. GroupARail::V1P5_RTC and so on, so it's slightly less confusing. I dunno. In any case, I would have a slight preference for making this consistent with the vcore module...

My perspective is that the target audience is Benchmark techs / Oxide EEs, rather than Rust developers, so matching "what do you see on a schematic" is more valuable than matching Rust's usual style.

I'm willing to accept that argument! And, you might convince me to also rename the types in vcore...

hawkw · 2026-03-13T20:59:40Z

app/cosmo/base.toml

+max-sizes = {flash = 131072, ram = 32768 }
+stacksize = 3200


it looks like it's also generating a pretty huge counters table for counting all the diagnosis types and subtypes...

drv/cosmo-seq-server/src/diagnose.rs

hawkw · 2026-03-13T22:20:35Z

drv/cosmo-seq-server/src/main.rs

        }

        if ifr.thermtrip {
+            diagnose::run(&self.seq);


i can't immediately imagine why we would need to do this if we see a THERMTRIP: this is a pin which the host CPU asserts if T_ctl_ exceeds 115 tctl units. this causes the sequencer FPGA to immediately power down the A0 rails. i don't think we need to run diagnosis in that case, since we already know what has happened pretty unambiguously.

I was on the fence here – my initial thought was "just call diagnose::run if anything weird happens, so it's the one-stop-shop for diagnoses" – but I've since convinced myself otherwise.

hawkw · 2026-03-13T22:23:10Z

drv/cosmo-seq-server/src/main.rs

        }

        if ifr.smerr_assert {
+            diagnose::run(&self.seq);


similarly, i am unconvinced that this would have anything interesting to say if SMERR_L is asserted — this is a bit that the host CPU sets under some error conditions, not an error from the sequencer FPGA. since i don't have a great understanding of what would make the host assert this, i think it makes more sense to run the diagnosis here than it does in the event of a THERMTRIP, but 🤷‍♀️

(same as above)

mkeeter · 2026-03-17T13:51:26Z

@hawkw
I've reworked this to pass in why diagnose::run is being called (the three cases are timeout when sequencing, MAPO detected, or surprise transition out of A0), and log a timestamp to correlate with the ringbufs in main. I think it's ready to go, with a set of ereports coming in a follow-up PR.

hawkw

Overall, this looks good to me, and I'm okay to defer adding ereports for this stuff to a follow-up branch. I had a few last nitpicks.

drv/cosmo-seq-server/src/main.rs

hawkw · 2026-03-17T17:22:01Z

drv/cosmo-seq-server/src/main.rs

        }

        if ifr.a0mapo {
+            let now = sys_get_timer().now;


there are now enough different branches in this function that all record a now timestamp that i kinda wonder if we wanna just always be unconditionally doing it at the top of the function, for code size reasons...not a huge deal either way though.

Turns out we already have a now from the top of the function, so I'll use it instead.

hawkw · 2026-03-17T17:28:45Z

drv/cosmo-seq-server/src/main.rs


                if !okay {
+                    // Log a fault diagnosis in the ringbuf
+                    diagnose::run(


this is nitpicky and perhaps not important but i wonder somewhat if either the module or the function ought to have a name that's like a0_fault_diagnosis() or diagnose::a0_fault() or something? i realize that's quite a bit wordier than run(), but maybe it's worth communicating to a reader that this is specifically walking through the fault tree for going to A0? i was wondering about this because i was imagining we might want to add a similar thing for NIC MAPO faults eventually...

Sure, changed to a0_fault()

hawkw · 2026-03-17T17:31:42Z

drv/cosmo-seq-server/src/diagnose.rs

+    Diagnosis {
+        now_ms: u64,
+        reason: DiagnoseReason,
+        #[count(children)]
+        details: Diagnosis,
+    },


i wondered if we might be able to make the size of the ringbuf entry a bit smaller by putting the timestamp and/or diagnosis reason in a separate entry from the details, which I think is probably big-ish. i feel like i am always a bit paranoid about this as the size of the largest entry is multiplied by the number of entries in the ringbuf...which in this case is just 8, so it's not that huge. but, it might also be useful to record the timestamp and reason separately so that we are also logging it in the case of a diagnosis failure --- i can imagine wishing we knew the timestamp/reason when we see a BadStateCombination or UnknownHwSmState...?

not a big deal anyway.

I'm going to punt on this – now that I've moved register logging to a separate ringbuf, it's nice to have everything in a single line

NDX LINE GEN COUNT PAYLOAD 0 565 1 1 Diagnosis { now: 0x421c, reason: MapoDetected, details: StuckInIdle { why: FanHscNotPg(East), a0_en: true } } 1 565 1 1 Diagnosis { now: 0xd029, reason: FailedToSequence, details: WaitingForSlpCheckpoint { why: Sp5StuckInS5Sleep } }

fine with me, no worries!

mkeeter requested review from hawkw and nathanaelhuffman March 13, 2026 14:18

mkeeter commented Mar 13, 2026

View reviewed changes

mkeeter force-pushed the mkeeter/cosmo-seq-diagnose branch from af9cf85 to d3a78d0 Compare March 13, 2026 14:34

labbott reviewed Mar 13, 2026

View reviewed changes

hawkw reviewed Mar 13, 2026

View reviewed changes

mkeeter added 7 commits March 17, 2026 09:52

Finish plumbing sequencer diagnosis engine

895987a

Add hooks for diagnosis engine

d373966

Check DDR5 PGs in MAPO; reorganize

3c692a2

Log less in diagnose, put everything in RAW

cec222e

Clippy fixes

e535c71

Don't count Trace::None

6f148d1

Review notes

827a71c

mkeeter force-pushed the mkeeter/cosmo-seq-diagnose branch from 695b3d2 to 827a71c Compare March 17, 2026 13:52

hawkw approved these changes Mar 17, 2026

View reviewed changes

More PR fixes

7e9b73c

mkeeter merged commit be9f71b into master Mar 17, 2026
180 checks passed

mkeeter deleted the mkeeter/cosmo-seq-diagnose branch March 17, 2026 19:10

	#[derive(Copy, Clone, PartialEq, microcbor::Encode)]
	pub(crate) enum Rail {
	#[cbor(rename = "VDDCR_CPU0_A0")]
	VddcrCpu0,
	#[cbor(rename = "VDDCR_CPU1_A0")]
	VddcrCpu1,
	}

Conversation

mkeeter commented Mar 13, 2026

Uh oh!

mkeeter Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mkeeter commented Mar 13, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mkeeter commented Mar 17, 2026

Uh oh!

hawkw left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

mkeeter Mar 13, 2026 •

edited

Loading