Skip to content

More OTel metrics#4822

Merged
robinjam merged 5 commits intomainfrom
jar-otel-metrics
May 7, 2026
Merged

More OTel metrics#4822
robinjam merged 5 commits intomainfrom
jar-otel-metrics

Conversation

@robinjam
Copy link
Copy Markdown
Contributor

@robinjam robinjam commented Apr 14, 2026

https://trello.com/c/25A3gpXh/1358-replace-statsd-instrumentation-across-notify-with-otel

Probably best reviewed with whitespace changes hidden.

Adds the following metrics, which should hopefully replace the existing StatsD ones (the dashboards will be updated in a future PR).

The names are automatically normalised to follow the Prometheus naming convention by the OTel collector, so e.g. provider.request.duration becomes provider_request_duration_seconds. I have deliberately chosen names that do not conflict with the existing metrics that go via the StatsD exporter, so we can have both stacks running in parallel.


provider.request.duration {error.type, notification.type, provider.name}

Duration of (HTTP) requests to providers

Screenshot 2026-04-15 at 16 34 10

notification.sms.internationals {notification.status, notification.sms.country_code}

Number of international text messages sent

No screenshot for this one, but the behaviour should be identical to the existing notify_international_sms_total metric reported by StatsD.


notification.send.duration {error.type, key.type, notification.type, provider.name}

Elapsed time between notification creation and sending to provider

Screenshot 2026-04-15 at 17 04 01

notification.deliver.duration {key.type, notification.status, notification.type, provider.name}

Elapsed time between notification creation datetime and delivery datetime reported by provider

Screenshot 2026-04-15 at 16 35 24

@robinjam robinjam force-pushed the jar-otel-metrics branch 22 times, most recently from 1919c37 to af86a09 Compare April 16, 2026 09:51
@robinjam robinjam marked this pull request as ready for review April 17, 2026 15:33
@WesleyHindle WesleyHindle marked this pull request as draft April 20, 2026 07:05
@robinjam robinjam changed the title WIP: More OTel metrics More OTel metrics Apr 27, 2026
@robinjam robinjam marked this pull request as ready for review April 27, 2026 13:44
@robinjam
Copy link
Copy Markdown
Contributor Author

This is ready for review - the "WIP" in the title was a mistake

@risicle
Copy link
Copy Markdown
Member

risicle commented May 6, 2026

Might we be able to have international as a label in notification.deliver.duration? I realise it might be a bit of a big ask, because you've put the metric tallying before we've hit the db to find that out...

@risicle
Copy link
Copy Markdown
Member

risicle commented May 6, 2026

This made me think quite a bit about notification receipts, duplicate notification receipts and etcetera. I think you've made the right choice, but people using the metrics are going to need to understand that the statuses aren't "exclusive" (i.e. one notification will tally for both pending and e.g. permanent-failure). And using "delivery time" should hopefully nullify most weirdness that results from receipts making it to the front of the queue out-of-order. Assuming we can trust it.

Would it be paranoid to have another metric here in parallel using receipt_dt instead?

@robinjam robinjam force-pushed the jar-otel-metrics branch from daee79d to 382e7e8 Compare May 6, 2026 17:33
@robinjam robinjam force-pushed the jar-otel-metrics branch from 382e7e8 to c6da4ff Compare May 6, 2026 17:36
@robinjam
Copy link
Copy Markdown
Contributor Author

robinjam commented May 6, 2026

Might we be able to have international as a label in notification.deliver.duration

Good shout, done. I've left it off notification.send.duration because I can't imagine the international-ness being relevant there

people using the metrics are going to need to understand that the statuses aren't "exclusive" (i.e. one notification will tally for both pending and e.g. permanent-failure).

Also a good point, I've updated the descriptions on both notification.sms.internationals and notification.delivery.duration. Weirdly only firetext notifications seem to get a "pending" callback, at least in the dev envs.

Would it be paranoid to have another metric here in parallel using receipt_dt instead?

Hmmmmm... Maybe, not sure what we'd call it though. notification.delivery_callback.duration?

I suppose the name doesn't matter that much as long as the description makes sense.

@risicle
Copy link
Copy Markdown
Member

risicle commented May 7, 2026

Weirdly only firetext notifications seem to get a "pending" callback, at least in the dev envs.

I think you're right, looking for "duplicate callback warning" log messages I only saw them from SES and firetext, and the firetext outcomes panel in the "Notify Summary" grafana dashboard has a little tooltip alluding to this.

Copy link
Copy Markdown
Member

@risicle risicle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. Up to you on the parallel metric.

Changes the instrumentation of `notification.deliver.duration` so that
it does not fall back to the current time if `delivery_dt` is unset, and
adds a new metric `notification.callback.duration` based on the
timestamp at the moment we received the delivery receipt.

This makes the delivery receipt metric more honest while giving us a
backup metric if we decide we don't trust the delivery timestamps as
reported by a provider.
@robinjam robinjam merged commit 11caeff into main May 7, 2026
10 checks passed
@robinjam robinjam deleted the jar-otel-metrics branch May 7, 2026 16:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants