
[Grafana][PyTorch DevX] Fix viable/strict lag panel#645

Merged
jathu merged 2 commits into main from jathu/grafana--pytorch-devx--changes
May 27, 2026


@jathu
Collaborator

@jathu jathu commented May 27, 2026

Why the previous panel was wrong

The previous panel never actually compared viable/strict to main. It looked at push events on viable/strict and computed the gap between the push event time and the timestamp of the oldest commit included in that push. In other words, it measured how long the oldest commit in each push had been sitting around before the push event fired, which has no relationship to how far viable/strict is behind main.
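To make the problem concrete, here is a minimal Python sketch of what the old panel effectively computed, based only on the description above. The function name and the sample timestamps are hypothetical and do not come from the dashboard query.

```python
from datetime import datetime, timezone


def old_panel_gap_minutes(push_ts, commit_timestamps):
    """Gap between a viable/strict push event and the oldest commit in it.

    Hypothetical helper: it mirrors the old panel's behavior as described
    in this PR, not the actual SQL.
    """
    oldest = min(commit_timestamps)
    return (push_ts - oldest).total_seconds() / 60


def utc(text):
    # Parse an ISO timestamp and mark it as UTC (illustrative data only).
    return datetime.fromisoformat(text).replace(tzinfo=timezone.utc)


push_ts = utc("2026-05-01T12:00:00")
commits_in_push = [utc("2026-05-01T09:00:00"), utc("2026-05-01T11:30:00")]

gap = old_panel_gap_minutes(push_ts, commits_in_push)
print(gap)  # 180.0
```

Nothing from main appears among the inputs, so the result cannot express how far viable/strict lags main.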

What the new query does

The new query measures the intended metric: each time viable/strict is updated, how many minutes its new head commit is behind main's head commit.

How it works:

  1. viable CTE: every push event to refs/heads/viable/strict with its head commit's timestamp.
  2. main CTE: every push event to refs/heads/main with its head commit's timestamp.
  3. ASOF LEFT JOIN main m ON v.pushed_ts >= m.pushed_ts pairs each viable/strict push with the most recent main push at-or-before it — i.e., the state of main at the moment viable/strict was bumped.
  4. lag_minutes = (m.commit_ts − v.commit_ts) / 60: minutes between the two head commits' timestamps (committer time, given that head_commit.timestamp matches git's %ct in the sanity check below).
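
The four steps above can be emulated in plain Python to check the pairing logic on toy data. This is a sketch under stated assumptions, not the dashboard query: the `Push` type, the function name, and the sample values are hypothetical, timestamps are assumed to be epoch seconds, and returning `None` for a viable/strict push with no earlier main push is a choice of this sketch.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class Push(NamedTuple):
    pushed_ts: int  # epoch seconds of the push event (assumed unit)
    commit_ts: int  # epoch seconds of the head commit's timestamp


def asof_lag_minutes(viable: List[Push], main: List[Push]) -> List[Optional[float]]:
    """Pair each viable/strict push with the latest main push at or before it."""
    main_sorted = sorted(main, key=lambda p: p.pushed_ts)
    main_keys = [p.pushed_ts for p in main_sorted]
    lags: List[Optional[float]] = []
    for v in viable:
        # Number of main pushes with pushed_ts <= v.pushed_ts.
        i = bisect_right(main_keys, v.pushed_ts)
        if i == 0:
            lags.append(None)  # no main push yet: left side kept, no match
            continue
        m = main_sorted[i - 1]
        lags.append((m.commit_ts - v.commit_ts) / 60)
    return lags


main_pushes = [Push(100, 90), Push(400, 390), Push(1000, 990)]
viable_pushes = [Push(50, 40), Push(500, 90), Push(1000, 390)]

lags = asof_lag_minutes(viable_pushes, main_pushes)
print(lags)  # [None, 5.0, 10.0]
```

The third viable/strict push has the same `pushed_ts` as a main push and still matches it, which is the "at-or-before" behavior of the `>=` join condition.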

Test Plan: dashboard

Pushed to my test folder: https://pytorchci.grafana.net/dashboards/f/flrhzs

$ cd grafana
$ mise run push --folder flrhzs

Test Plan: sanity check against pytorch repo

Cross-checked against pytorch/pytorch for the last 12 months (3,489 rows, 5,088 distinct SHAs):

  • 5,087/5,088 SHAs present with head_commit.timestamp matching git's %ct exactly. The missing one was a force-pushed-and-reverted bad commit — real anomaly, not a query bug.
  • 30 sampled rows: v_commit is a git ancestor of the matched m_commit in all 30 (ASOF picks the right row).
  • 10 sampled rows spanning min/median/max lag: dashboard lag_minutes matches (repo_m_ct − repo_v_ct)/60 to the second.
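
A check along these lines could be scripted roughly as follows. This is a hedged sketch, not the script used for this PR: the helper names and the `repo` path argument are hypothetical, and it assumes a local clone of pytorch/pytorch. It relies only on `git show -s --format=%ct` (committer time as epoch seconds) and `git merge-base --is-ancestor` (exit status 0 when the first commit is an ancestor of the second, 1 when it is not).

```python
import subprocess


def commit_time(repo: str, sha: str) -> int:
    """Committer timestamp (%ct) of a commit, in epoch seconds."""
    result = subprocess.run(
        ["git", "-C", repo, "show", "-s", "--format=%ct", sha],
        check=True, capture_output=True, text=True,
    )
    return int(result.stdout.strip())


def is_ancestor(repo: str, ancestor: str, descendant: str) -> bool:
    """True if `ancestor` is reachable from `descendant`."""
    proc = subprocess.run(
        ["git", "-C", repo, "merge-base", "--is-ancestor", ancestor, descendant]
    )
    if proc.returncode not in (0, 1):
        raise RuntimeError(f"git merge-base failed with status {proc.returncode}")
    return proc.returncode == 0


def lag_matches(dashboard_lag_minutes: float, repo_m_ct: int, repo_v_ct: int) -> bool:
    """Compare the dashboard lag with (repo_m_ct - repo_v_ct) to the second."""
    return round(dashboard_lag_minutes * 60) == repo_m_ct - repo_v_ct


# Pure comparison on made-up numbers; the git helpers need a real clone.
ok = lag_matches(5.5, 1_000_330, 1_000_000)
off_by_one = lag_matches(5.5, 1_000_331, 1_000_000)
print(ok, off_by_one)  # True False
```

For a sampled row, `is_ancestor(repo, v_commit, m_commit)` would correspond to the ancestry check, and `commit_time` would supply `repo_m_ct` and `repo_v_ct`.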

Comment thread grafana/pytorch_devx.json Outdated
@jathu jathu marked this pull request as ready for review May 27, 2026 20:18
@jathu jathu force-pushed the jathu/grafana--pytorch-devx--changes branch from 8fdb29a to 328e755 Compare May 27, 2026 20:24
@georgehong
Contributor

georgehong commented May 27, 2026

Are the changes currently reflected in the cited test dashboards and would it be possible to see a before and after of the charts?

I'm wondering if it's reasonable to see a sawtooth pattern here, and I also remember we had a somewhat prolonged viable/strict divergence on 3/20 for more than 4 days.

Comment thread grafana/pytorch_devx.json Outdated
@jathu jathu force-pushed the jathu/grafana--pytorch-devx--changes branch from 328e755 to be48b08 Compare May 27, 2026 21:55
@jathu jathu force-pushed the jathu/grafana--pytorch-devx--changes branch from be48b08 to cbdb29c Compare May 27, 2026 22:03
@jathu jathu added this pull request to the merge queue May 27, 2026
Merged via the queue into main with commit 0e9afb6 May 27, 2026
11 checks passed
@jathu jathu deleted the jathu/grafana--pytorch-devx--changes branch May 27, 2026 22:07
@huydhn
Contributor

huydhn commented May 27, 2026

@claude check whether a similar query is used in test-infra: https://github.com/pytorch/test-infra/blob/main/torchci/clickhouse_queries/strict_lag_sec/query.sql

@huydhn
Contributor

huydhn commented May 27, 2026

Oh darn, we don't have claude bot here yet I think


4 participants