Conversation
|
|
||
| #[derive(Copy, Clone, PartialEq, counters::Count)] | ||
| pub(crate) enum WhyWaitingForGroupA { | ||
| RailIssue(#[count(children)] RailIssue, GroupARail), |
There was a problem hiding this comment.
We can only count one child here; any strong opinions on whether it's better to count the RailIssue or the GroupARail?
af9cf85 to
d3a78d0
Compare
| max-sizes = {flash = 131072, ram = 32768 } | ||
| stacksize = 3200 |
There was a problem hiding this comment.
I'm having a very strong "why is this so big" reaction but I think the answer is that if we're going to be saving all the registers (which we want!) it's going to take up space.
There was a problem hiding this comment.
Yeah, we're unpacking into the SomeRegisterView type, which goes from a u32 to a struct with a bunch of individual bool values – it's much easier to read in the ringbuf, but is necessarily larger!
There was a problem hiding this comment.
it looks like it's also generating a pretty huge counters table for counting all the diagnosis types and subtypes...
There was a problem hiding this comment.
@hawkw do counters show up in stack size? They're obviously allocated elsewhere, but dunno if they add significant expense to ringbuf_entry!.
There was a problem hiding this comment.
They shouldn't make the stacks much bigger; I was referring to the RAM size increase.
| #[derive(Copy, Clone, PartialEq, counters::Count)] | ||
| #[allow(non_camel_case_types)] | ||
| pub(crate) enum GroupARail { | ||
| V1P5_RTC, | ||
| V3P3_SP5_A1, | ||
| V1P8_SP5_A1, | ||
| } | ||
|
|
||
| #[derive(Copy, Clone, PartialEq, counters::Count)] | ||
| #[allow(non_camel_case_types)] | ||
| pub(crate) enum GroupBRail { | ||
| V1P1_SP5, | ||
| } | ||
|
|
||
| #[derive(Copy, Clone, PartialEq, counters::Count)] | ||
| #[allow(non_camel_case_types)] | ||
| pub(crate) enum Ddr5HscRail { | ||
| DDR5_ABCDEF, | ||
| DDR5_GHIJKL, | ||
| } | ||
|
|
||
| #[derive(Copy, Clone, PartialEq, counters::Count)] | ||
| #[allow(non_camel_case_types)] | ||
| pub(crate) enum GroupCRail { | ||
| VDDIO_SP5_A0, | ||
| VDDCR_CPU0, | ||
| VDDCR_CPU1, | ||
| VDDCR_SOC, | ||
| } |
There was a problem hiding this comment.
hmm, elsewhere in the sequencer, i have written out rail names in enum variants in CamelCase, as in:
hubris/drv/cosmo-seq-server/src/vcore.rs
Lines 33 to 39 in 4b54fdb
I have a mild preference against violating Rust's case/capitalization style guide, because I think it's helpful to be able to look at an identifier and know "this is a type/variable/constant", but...it's slightly less upsetting to me since these are enums variants and will be written out as e.g. GroupARail::V1P5_RTC and so on, so it's slightly less confusing. I dunno. In any case, I would have a slight preference for making this consistent with the vcore module...
There was a problem hiding this comment.
My perspective is that the target audience is Benchmark techs / Oxide EEs, rather than Rust developers, so matching "what do you see on a schematic" is more valuable than matching Rust's usual style.
There was a problem hiding this comment.
I'm willing to accept that argument! And, you might convince me to also rename the types in vcore...
| max-sizes = {flash = 131072, ram = 32768 } | ||
| stacksize = 3200 |
There was a problem hiding this comment.
it looks like it's also generating a pretty huge counters table for counting all the diagnosis types and subtypes...
drv/cosmo-seq-server/src/main.rs
Outdated
| } | ||
|
|
||
| if ifr.thermtrip { | ||
| diagnose::run(&self.seq); |
There was a problem hiding this comment.
i can't immediately imagine why we would need to do this if we see a THERMTRIP: this is a pin which the host CPU asserts if T_ctl_ exceeds 115 tctl units. this causes the sequencer FPGA to immediately power down the A0 rails. i don't think we need to run diagnosis in that case, since we already know what has happened pretty unambiguously.
There was a problem hiding this comment.
I was on the fence here – my initial thought was "just call diagnose::run if anything weird happens, so it's the one-stop-shop for diagnoses" – but I've since convinced myself otherwise.
drv/cosmo-seq-server/src/main.rs
Outdated
| } | ||
|
|
||
| if ifr.smerr_assert { | ||
| diagnose::run(&self.seq); |
There was a problem hiding this comment.
similarly, i am unconvinced that this would have anything interesting to say if SMERR_L is asserted — this is a bit that the host CPU sets under some error conditions, not an error from the sequencer FPGA. since i don't have a great understanding of what would make the host assert this, i think it makes more sense to run the diagnosis here than it does in the event of a THERMTRIP, but 🤷♀️
|
@hawkw |
695b3d2 to
827a71c
Compare
hawkw
left a comment
There was a problem hiding this comment.
Overall, this looks good to me, and I'm okay to defer adding ereports for this stuff to a follow-up branch. I had a few last nitpicks.
drv/cosmo-seq-server/src/main.rs
Outdated
| } | ||
|
|
||
| if ifr.a0mapo { | ||
| let now = sys_get_timer().now; |
There was a problem hiding this comment.
there are now enough different branches in this function that all record a now timestamp that i kinda wonder if we wanna just always be unconditionally doing it at the top of the function, for code size reasons...not a huge deal either way though.
There was a problem hiding this comment.
Turns out we already have a now from the top of the function, so I'll use it instead.
drv/cosmo-seq-server/src/main.rs
Outdated
|
|
||
| if !okay { | ||
| // Log a fault diagnosis in the ringbuf | ||
| diagnose::run( |
There was a problem hiding this comment.
this is nitpicky and perhaps not important but i wonder somewhat if either the module or the function ought to have a name that's like a0_fault_diagnosis() or diagnose::a0_fault() or something? i realize that's quite a bit wordier than run(), but maybe it's worth communicating to a reader that this is specifically walking through the fault tree for going to A0? i was wondering about this because i was imagining we might want to add a similar thing for NIC MAPO faults eventually...
There was a problem hiding this comment.
Sure, changed to a0_fault()
| Diagnosis { | ||
| now_ms: u64, | ||
| reason: DiagnoseReason, | ||
| #[count(children)] | ||
| details: Diagnosis, | ||
| }, |
There was a problem hiding this comment.
i wondered if we might be able to make the size of the ringbuf entry a bit smaller by putting the timestamp and/or diagnosis reason in a separate entry from the details, which I think is probably big-ish. i feel like i am always a bit paranoid about this as the size of the largest entry is multiplied by the number of entries in the ringbuf...which in this case is just 8, so it's not that huge. but, it might also be useful to record the timestamp and reason separately so that we are also logging it in the case of a diagnosis failure --- i can imagine wishing we knew the timestamp/reason when we see a BadStateCombination or UnknownHwSmState...?
not a big deal anyway.
There was a problem hiding this comment.
I'm going to punt on this – now that I've moved register logging to a separate ringbuf, it's nice to have everything in a single line
NDX LINE GEN COUNT PAYLOAD
0 565 1 1 Diagnosis { now: 0x421c, reason: MapoDetected, details: StuckInIdle { why: FanHscNotPg(East), a0_en: true } }
1 565 1 1 Diagnosis { now: 0xd029, reason: FailedToSequence, details: WaitingForSlpCheckpoint { why: Sp5StuckInS5Sleep } }
This roughly implements the A0 Sequencing Fault Tree, with the goal of logging a single actionable line in a ringbuf upon sequencing failure:
It also dumps the full register state at the same time, in a separate
RAWringbuf (for ease of reading).