Skip to content

Fix timing issue in process routing function#122

Open
mohamed-dek1 wants to merge 7 commits intofeature/dcp-amds-firmwarefrom
feature/dcp-amds-firmware-routing-timing
Open

Fix timing issue in process routing function#122
mohamed-dek1 wants to merge 7 commits intofeature/dcp-amds-firmwarefrom
feature/dcp-amds-firmware-routing-timing

Conversation

@mohamed-dek1
Copy link
Copy Markdown
Contributor

Closes #121

Notes

Anything reviewers should be aware of when reviewing? Other related issues? Known problems? Future work?

Self-Review

In this section, please self-review (answer all questions) on a suitable review checklist prior to requesting review from others. Select a review checklist based on what content is being merged in; see the Review Checklists section.

Reviewer Instructions

  1. Are all files under 300 kB (if not, please carefully assess whether it is worth committing them)? Yes
  2. Are all files named according to the appropriate naming convention, i.e., dash-case, camelCase, snake case? Yes
  3. Do all Markdown files follow the CONTRIBUTING article template? Yes
  4. Do all links work in the material that the PR is adding? Yes
  5. Is the PR configured to close the correct issue(s)? Yes
  6. Did the PR fully address the Approach section of the issue(s) it is closing? Yes

Appendix

This section should be the same for all PRs. Do not edit this section when creating a PR.

Review Checklists

Checklists maintained by the eLev lab for research repositories include:

Standard checklist

1. Are all files under 300 kB (if not, please carefully assess whether it is worth committing them)? **Yes or No**
2. Are all files named according to the appropriate [naming convention](https://github.com/Severson-Group/research-repo-template?tab=readme-ov-file#file-naming), i.e., dash-case, camelCase, snake case? **Yes or No**
3. Do all Markdown files follow the [CONTRIBUTING article template](https://github.com/Severson-Group/.github/blob/main/CONTRIBUTING.md#markdown-documentation-template)? **Yes or No**
4. Do all links work in the material that the PR is adding? **Yes or No**
5. Is the PR configured to close the correct issue(s)? **Yes or No**
6. Did the PR fully address the `Approach` section of the issue(s) it is closing? **Yes or No**

Please work on addressing any **No** items.

@mohamed-dek1 mohamed-dek1 requested a review from elsevers May 6, 2026 16:37
@mohamed-dek1 mohamed-dek1 self-assigned this May 6, 2026
@mohamed-dek1
Copy link
Copy Markdown
Contributor Author

In the captures below, Channel 1 is toggled every time the process routing function is called. From what I've seen, it's never leaving the function, and all the daisy-chained data is getting sent out through one function call. The problem is that we're calling it too fast, and when we get to the conditional for the dual-stream fast path, it isn't true. This forces us into the single-stream fast path or, even worse, the fragmented path.

Before any additional delay

image

Delay in the interrupt handler before process_routing

image

Additional delay at the end of dual-stream fast path

image

@elsevers
Copy link
Copy Markdown
Contributor

elsevers commented May 6, 2026

@mohamed-dek1 -- I think this should be a draft because it is not ready for review and merging, right?

@elsevers elsevers marked this pull request as draft May 6, 2026 20:10
@mohamed-dek1
Copy link
Copy Markdown
Contributor Author

@elsevers, I've replaced the delays with timeouts and some additional conditionals. There are still timed-out bytes, but it's about 1 in 15kB. I'm not sure what else can be done to improve this. This PR is read for review.

@mohamed-dek1 mohamed-dek1 marked this pull request as ready for review May 6, 2026 22:09
@elsevers
Copy link
Copy Markdown
Contributor

elsevers commented May 7, 2026

Thanks @mohamed-dek1. Your logic seems okay.

Need to debug this

We need to figure out what's going on and leading to the timed-out bytes though. It needs to be 0 bytes. We can't be dropping bytes.

Are you seeing anything in your log to help you figure out the root cause?

1.

I recommend starting by toggling your IO pin to see if any of these break statements are ever getting used in process_routing():

// Break if we reach the 2us timeout
if ((DWT->CYCCNT - start_cycles) > wait_cycles) {
    break;
}

2.

If not: then I would look to see if you ever end up on one of the single stream fast paths or the slow path.

3.

If not that, then I'd look to see if you ever end up calling try_process_routing() from main. I.e., toggle your pin if this conditional returns true:

if (drv_uart_has_dma_data())
     try_process_routing(); // This try function is thread safe

Need to check our timing

Before merging, we need to report out on what the new timing is to send the complete daisy chain data to the AMDC

@mohamed-dek1
Copy link
Copy Markdown
Contributor Author

Increasing the start bit timeout has fixed one of the lines, but the other is still timing out. The timeout below that's in the firmware is occasionally getting called, and it rarely falls through to the slower paths. When the timeout happens, it sometimes will exit the function and continue when it gets called again in the main loop.

// Break if we reach the 2us timeout
if ((DWT->CYCCNT - start_cycles) > wait_cycles) {
    break;
}

Slow path

image

Counters

image

@mohamed-dek1
Copy link
Copy Markdown
Contributor Author

One of the last issues with the firmware now is a parity problem on one of the UART lines. Tried debugging today and wasn't able to fix it. Seems to only happen on one line and has something to do with the waits.

image image image

@elsevers
Copy link
Copy Markdown
Contributor

elsevers commented May 9, 2026

I have been thinking over what we went through in lab yesterday, and am still in the camp of being a firmware timing and interrupt race condition issue as opposed to a hardware problem.

I have detailed questions and a plan below. In these, I am referring to your daisy chain hardware chain using this nomenclature:

AMDC <-- FBC1 <-- FBC2 <-- FBC3

Clarifications

  1. Overrun vs. Parity: I understand that Parity (PE), Framing (FE), and Overrun Errors (ORE) all trigger the same USART interrupt. Can you confirm that this is indeed a parity error that you are seeing? Keep in mind that when the MCU stops for a breakpoint, if another FBC is still sending it data, this may trigger overrun errors...
  2. I understand that most of your debugging with breakpoints was in FBC2 and that in here you concluded that the parity error was occurring only on one of the USARTs and only in the last 3 bytes of data. These are actually the same statement, since FBC2 is only receiving 1 packet on each USART. So... which USART is having the problem? And, how does this statement change / stay the same for FBC1, where now each USART is receiving 2 packets of data? (if you don't know the answer right now, that is okay, you'll figure it out in the "next steps" below)
  3. The role of drv_uart_wait_TC():
    • TXE (Transmit Data Register Empty) means we can load the next byte and therefore it is safe to call drv_uart_putc_fast repeatedly, with no delays (because it waits for TXE).
    • TC (Transmission Complete) means the shift register is completely empty and the physical line is idle.
    • We only ever need to wait for TC if we are disabling the UART or changing pin directions. Calling it between bytes (or even at the end of a transmission when we aren't disabling the UART) forces unnecessary idle gaps on the wire. That is exactly why your link became super slow and timed out when you moved it into putc_fast().
    • This means we should never call drv_uart_wait_TC() because we are never disabling the usarts.

Next steps

1. Clean up drv_uart_wait_TC()
Remove all calls to drv_uart_wait_TC() in the firmware. I don't think we should ever be calling this function since we aren't disabling our USARTs. Update the function definition with this comment so we don't trip over it again:

static inline void drv_uart_wait_TC(USART_TypeDef *uart)
{
    // ONLY USE THIS IF DISABLING THE UART OR GOING TO SLEEP!
    // This function waits for all data to be sent from the USART 
    //    (it waits for both the TDR and the Shift Register to be 
    //     completely empty)
    // 
    // Do NOT USE THIS during normal continuous data transmission
    //       as it will add significant delays
    while (!(uart->ISR & UART_FLAG_TC)) {
        asm("nop");
    }
}

2. Verify the Error Type
The next time you hit that error interrupt (without breakpoints messing with the timing), check the bits in the USARTx->ISR register. Is it actually the PE bit, or is it the ORE (Overrun) bit or something else?

3. Setup a Controlled Test Environment
Let's set up your test hardware to be the simplest possible demonstration of the problem. I'd like to see you do this: AMDC <-- FBC1 <-- FBC2 (no FBC3!!)

  • Put FBC2 into benchmark mode where it pretends to have an FBC3 attached. It should just immediately blast 2x packets (6 bytes) of dummy data out both USARTs whenever it receives the CNVRT start signal from FBC1.
  • In FBC1, set appropriate breakpoints or use pin toggling to identify any parity errors.

4. Analyze with the Saleae
Carefully construct a timing diagram of the data between the AMDC and FBC1, and between FBC1 and FBC2. We should see a very well-behaved byte flow with no unexpected gaps and minimal total transfer time.

  • If there are still parity issues: Confirm the repeatability. Is it really always on only one USART? Is it still only in the last 3 bytes?
  • If there are no parity errors: GREAT! See if there are timed out bytes reported by the AMDC and focus on fixing those.

5. Verify Cables & Full System Test
If the byte flow is perfectly behaved with no gaps or errors, FANTASTIC! Now swap out the custom high-density DB15 cable between the FBCs with the other custom cables we ordered to confirm they are all reliable without parity or timeout issues. Finally, get rid of the benchmark code, add FBC3 back into the chain, and test the full system.

@elsevers elsevers linked an issue May 11, 2026 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fix issue with fragmented process_routing function

2 participants