Skip to content

Add Cosmo sequencer diagnosis engine#2436

Merged
mkeeter merged 8 commits intomasterfrom
mkeeter/cosmo-seq-diagnose
Mar 17, 2026
Merged

Add Cosmo sequencer diagnosis engine#2436
mkeeter merged 8 commits intomasterfrom
mkeeter/cosmo-seq-diagnose

Conversation

@mkeeter
Copy link
Collaborator

@mkeeter mkeeter commented Mar 13, 2026

This roughly implements the A0 Sequencing Fault Tree, with the goal of logging a single actionable line in a ringbuf upon sequencing failure:

 NDX LINE      GEN    COUNT PAYLOAD
   0  552        1        1 Diagnosis { now_ms: 0x54195, details: WaitingForSlpCheckpoint { why: Sp5StuckInS5Sleep } }

It also dumps the full register state at the same time, in a separate RAW ringbuf (for ease of reading).


#[derive(Copy, Clone, PartialEq, counters::Count)]
pub(crate) enum WhyWaitingForGroupA {
RailIssue(#[count(children)] RailIssue, GroupARail),
Copy link
Collaborator Author

@mkeeter mkeeter Mar 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can only count one child here; any strong opinions on whether it's better to count the RailIssue or the GroupARail?

@mkeeter mkeeter force-pushed the mkeeter/cosmo-seq-diagnose branch from af9cf85 to d3a78d0 Compare March 13, 2026 14:34
@mkeeter
Copy link
Collaborator Author

mkeeter commented Mar 13, 2026

closes #2372 and #2398

Comment on lines +182 to +183
max-sizes = {flash = 131072, ram = 32768 }
stacksize = 3200
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having a very strong "why is this so big" reaction but I think the answer is that if we're going to be saving all the registers (which we want!) it's going to take up space.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we're unpacking into the SomeRegisterView type, which goes from a u32 to a struct with a bunch of individual bool values – it's much easier to read in the ringbuf, but is necessarily larger!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like it's also generating a pretty huge counters table for counting all the diagnosis types and subtypes...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hawkw do counters show up in stack size? They're obviously allocated elsewhere, but dunno if they add significant expense to ringbuf_entry!.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They shouldn't make the stacks much bigger; I was referring to the RAM size increase.

Comment on lines +183 to +211
#[derive(Copy, Clone, PartialEq, counters::Count)]
#[allow(non_camel_case_types)]
pub(crate) enum GroupARail {
V1P5_RTC,
V3P3_SP5_A1,
V1P8_SP5_A1,
}

#[derive(Copy, Clone, PartialEq, counters::Count)]
#[allow(non_camel_case_types)]
pub(crate) enum GroupBRail {
V1P1_SP5,
}

#[derive(Copy, Clone, PartialEq, counters::Count)]
#[allow(non_camel_case_types)]
pub(crate) enum Ddr5HscRail {
DDR5_ABCDEF,
DDR5_GHIJKL,
}

#[derive(Copy, Clone, PartialEq, counters::Count)]
#[allow(non_camel_case_types)]
pub(crate) enum GroupCRail {
VDDIO_SP5_A0,
VDDCR_CPU0,
VDDCR_CPU1,
VDDCR_SOC,
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, elsewhere in the sequencer, i have written out rail names in enum variants in CamelCase, as in:

#[derive(Copy, Clone, PartialEq, microcbor::Encode)]
pub(crate) enum Rail {
#[cbor(rename = "VDDCR_CPU0_A0")]
VddcrCpu0,
#[cbor(rename = "VDDCR_CPU1_A0")]
VddcrCpu1,
}

I have a mild preference against violating Rust's case/capitalization style guide, because I think it's helpful to be able to look at an identifier and know "this is a type/variable/constant", but...it's slightly less upsetting to me since these are enums variants and will be written out as e.g. GroupARail::V1P5_RTC and so on, so it's slightly less confusing. I dunno. In any case, I would have a slight preference for making this consistent with the vcore module...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My perspective is that the target audience is Benchmark techs / Oxide EEs, rather than Rust developers, so matching "what do you see on a schematic" is more valuable than matching Rust's usual style.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm willing to accept that argument! And, you might convince me to also rename the types in vcore...

Comment on lines +182 to +183
max-sizes = {flash = 131072, ram = 32768 }
stacksize = 3200
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like it's also generating a pretty huge counters table for counting all the diagnosis types and subtypes...

}

if ifr.thermtrip {
diagnose::run(&self.seq);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i can't immediately imagine why we would need to do this if we see a THERMTRIP: this is a pin which the host CPU asserts if T_ctl_ exceeds 115 tctl units. this causes the sequencer FPGA to immediately power down the A0 rails. i don't think we need to run diagnosis in that case, since we already know what has happened pretty unambiguously.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was on the fence here – my initial thought was "just call diagnose::run if anything weird happens, so it's the one-stop-shop for diagnoses" – but I've since convinced myself otherwise.

}

if ifr.smerr_assert {
diagnose::run(&self.seq);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similarly, i am unconvinced that this would have anything interesting to say if SMERR_L is asserted — this is a bit that the host CPU sets under some error conditions, not an error from the sequencer FPGA. since i don't have a great understanding of what would make the host assert this, i think it makes more sense to run the diagnosis here than it does in the event of a THERMTRIP, but 🤷‍♀️

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(same as above)

@mkeeter
Copy link
Collaborator Author

mkeeter commented Mar 17, 2026

@hawkw
I've reworked this to pass in why diagnose::run is being called (the three cases are timeout when sequencing, MAPO detected, or surprise transition out of A0), and log a timestamp to correlate with the ringbufs in main. I think it's ready to go, with a set of ereports coming in a follow-up PR.

@mkeeter mkeeter force-pushed the mkeeter/cosmo-seq-diagnose branch from 695b3d2 to 827a71c Compare March 17, 2026 13:52
Copy link
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this looks good to me, and I'm okay to defer adding ereports for this stuff to a follow-up branch. I had a few last nitpicks.

}

if ifr.a0mapo {
let now = sys_get_timer().now;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are now enough different branches in this function that all record a now timestamp that i kinda wonder if we wanna just always be unconditionally doing it at the top of the function, for code size reasons...not a huge deal either way though.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out we already have a now from the top of the function, so I'll use it instead.


if !okay {
// Log a fault diagnosis in the ringbuf
diagnose::run(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is nitpicky and perhaps not important but i wonder somewhat if either the module or the function ought to have a name that's like a0_fault_diagnosis() or diagnose::a0_fault() or something? i realize that's quite a bit wordier than run(), but maybe it's worth communicating to a reader that this is specifically walking through the fault tree for going to A0? i was wondering about this because i was imagining we might want to add a similar thing for NIC MAPO faults eventually...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, changed to a0_fault()

Comment on lines +25 to +30
Diagnosis {
now_ms: u64,
reason: DiagnoseReason,
#[count(children)]
details: Diagnosis,
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wondered if we might be able to make the size of the ringbuf entry a bit smaller by putting the timestamp and/or diagnosis reason in a separate entry from the details, which I think is probably big-ish. i feel like i am always a bit paranoid about this as the size of the largest entry is multiplied by the number of entries in the ringbuf...which in this case is just 8, so it's not that huge. but, it might also be useful to record the timestamp and reason separately so that we are also logging it in the case of a diagnosis failure --- i can imagine wishing we knew the timestamp/reason when we see a BadStateCombination or UnknownHwSmState...?

not a big deal anyway.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to punt on this – now that I've moved register logging to a separate ringbuf, it's nice to have everything in a single line

 NDX LINE      GEN    COUNT PAYLOAD
   0  565        1        1 Diagnosis { now: 0x421c, reason: MapoDetected, details: StuckInIdle { why: FanHscNotPg(East), a0_en: true } }
   1  565        1        1 Diagnosis { now: 0xd029, reason: FailedToSequence, details: WaitingForSlpCheckpoint { why: Sp5StuckInS5Sleep } }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fine with me, no worries!

@mkeeter mkeeter merged commit be9f71b into master Mar 17, 2026
180 checks passed
@mkeeter mkeeter deleted the mkeeter/cosmo-seq-diagnose branch March 17, 2026 19:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants