
ci: add amd-ci-job-monitor for runner fleet reporting #543

Open
amdfaa wants to merge 1 commit into main from feat/amd-ci-job-monitor


@amdfaa commented May 19, 2026

Adds a runner-fleet-report artifact pipeline for import into the AI Frameworks Dashboard (ported from ROCm/aiter).

Made with Cursor

Copilot AI review requested due to automatic review settings May 19, 2026 23:12

Copilot AI left a comment


Pull request overview

This PR adds a scheduled and manually dispatchable GitHub Actions workflow that inventories recent GitHub Actions job executions and generates artifacts (including a “runner fleet report”) intended for import into the AI Frameworks Dashboard.

Changes:

  • Added AMD CI Job Monitor workflow to discover workflows/jobs, snapshot recent Actions data, and publish per-job and fleet-wide reports as artifacts.
  • Added .github/scripts/query_job_status.py to query Actions runs/jobs (or consume a snapshot) and emit markdown summaries plus runner concurrency/fleet reporting.
  • Added .github/scripts/list_jobs.py and updated .github/runner-config.yml to support workflow/job discovery and runner metadata enrichment.
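As an illustration of the kind of markdown summary described above, here is a minimal sketch that aggregates job records per runner label. It is not the script's actual code: the record keys (`runner_label`, `conclusion`, `duration_s`) and the function name are hypothetical, and the real schema may differ.

```python
from collections import defaultdict


def runner_summary_markdown(jobs):
    """Render a per-runner-label markdown table from job records.

    Each record is a hypothetical dict with keys "runner_label",
    "conclusion" and "duration_s" (seconds); the real schema may differ.
    """
    stats = defaultdict(lambda: {"jobs": 0, "failed": 0, "seconds": 0.0})
    for job in jobs:
        entry = stats[job["runner_label"]]
        entry["jobs"] += 1
        if job["conclusion"] == "failure":
            entry["failed"] += 1
        entry["seconds"] += job["duration_s"]

    lines = [
        "| Runner label | Jobs | Failed | Busy hours |",
        "| --- | ---: | ---: | ---: |",
    ]
    # Sort labels so the report is stable between runs.
    for label in sorted(stats):
        entry = stats[label]
        hours = entry["seconds"] / 3600
        lines.append(f"| {label} | {entry['jobs']} | {entry['failed']} | {hours:.2f} |")
    return "\n".join(lines)
```

Passing two jobs on `runner-a` (1800 s and 5400 s, one failed) and one 3600 s job on `runner-b` yields rows reporting 2.00 and 1.00 busy hours respectively.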

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| .github/workflows/amd-ci-job-monitor.yml | Adds a scheduled/dispatch workflow that builds snapshots, per-job reports, and a runner fleet report artifact. |
| .github/scripts/query_job_status.py | Implements GitHub API querying + snapshotting and produces job/runner summary tables. |
| .github/scripts/list_jobs.py | Discovers workflow jobs and generates a matrix/workflow map for the monitor workflow. |
| .github/runner-config.yml | Updates runner label → GPU metadata used to annotate runner fleet reporting. |


Comment on lines +28 to +32

```python
parser.add_argument(
    "--exclude-jobs",
    default="",
    help="Comma-separated job names to skip.",
)
```
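The help text does not say how whitespace or empty entries in `--exclude-jobs` are treated. A minimal sketch of one tolerant parsing rule, assuming names are matched exactly after trimming; `parse_exclude_jobs` is a hypothetical helper, not part of the PR:

```python
import argparse


def parse_exclude_jobs(raw):
    """Turn a comma-separated --exclude-jobs value into a set of job names.

    Whitespace around each name is stripped, and empty entries (from a
    trailing comma or from the default "") are dropped.
    """
    return {name.strip() for name in raw.split(",") if name.strip()}


parser = argparse.ArgumentParser()
parser.add_argument(
    "--exclude-jobs",
    default="",
    help="Comma-separated job names to skip.",
)

args = parser.parse_args(["--exclude-jobs", "lint, docs ,"])
excluded = parse_exclude_jobs(args.exclude_jobs)  # {"lint", "docs"}
```

Returning a set makes the later "is this job excluded?" check a constant-time membership test, and the empty default naturally means "exclude nothing".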
Comment on lines +83 to +90

```python
for job_id in job_ids:
    job_def = jobs_dict.get(job_id) or {}
    raw_name = job_def.get("name") if isinstance(job_def, dict) else None
    if isinstance(raw_name, str) and "${{" not in raw_name:
        display_name = raw_name
    else:
        display_name = job_id
    display_names.append(display_name)
```
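The snippet above keeps a job's literal `name:` but falls back to the job ID when the name contains a `${{ ... }}` expression, which can only be resolved at run time. A sketch of how that rule might feed a matrix for the monitor workflow, assuming the workflow files are already parsed into dicts (for example by a YAML loader) and that exclusions match either the display name or the job ID. `build_job_matrix` and the `workflow`/`job` entry keys are hypothetical; only the `{"include": [...]}` shape is what `strategy.matrix` accepts through `fromJSON()`.

```python
import json


def build_job_matrix(workflows, exclude_jobs=frozenset()):
    """Build a GitHub Actions matrix from already-parsed workflow files.

    `workflows` maps a workflow file name to its parsed YAML (a dict with a
    top-level "jobs" mapping). Entry keys "workflow" and "job" are
    hypothetical.
    """
    include = []
    for workflow_file in sorted(workflows):
        # An empty `jobs:` key parses as None, so fall back to {}.
        jobs = workflows[workflow_file].get("jobs") or {}
        for job_id, job_def in jobs.items():
            name = job_def.get("name") if isinstance(job_def, dict) else None
            # Templated names like "${{ matrix.gpu }}" can't be resolved statically.
            display = name if isinstance(name, str) and "${{" not in name else job_id
            if display in exclude_jobs or job_id in exclude_jobs:
                continue
            include.append({"workflow": workflow_file, "job": display})
    return {"include": include}


workflows = {
    "ci.yml": {
        "jobs": {
            "build": {"name": "Build wheels"},
            "test": {"name": "Test ${{ matrix.gpu }}"},
            "lint": {},
        }
    }
}
print(json.dumps(build_job_matrix(workflows, exclude_jobs={"lint"})))
```

Here `build` keeps its literal name, `test` falls back to its job ID, and `lint` is dropped by the exclusion set.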
```yaml
  default: ""
  type: string
exclude_jobs:
  description: "Comma-separated job names to exclude"
```
Comment on lines +491 to +506

```python
events.sort(key=lambda item: (item[0], item[1]))
concurrent = 0
peak = 0
time_weighted_sum = 0.0
total_time = 0.0
previous_time = events[0][0]

for timestamp, delta in events:
    if concurrent > 0:
        elapsed = (timestamp - previous_time).total_seconds()
        if elapsed > 0:
            time_weighted_sum += concurrent * elapsed
            total_time += elapsed
    concurrent += delta
    peak = max(peak, concurrent)
    previous_time = timestamp
```
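The loop above is a sweep over start/end events. A self-contained sketch of the same technique, assuming each job contributes a `(start, +1)` and an `(end, -1)` event; `concurrency_stats` is a hypothetical name. Like the quoted loop, it averages only over time when at least one job is running, so idle gaps do not pull the average down. Unlike the quoted loop, it guards against an empty event list before reading `events[0]`.

```python
from datetime import datetime, timedelta


def concurrency_stats(intervals):
    """Return (peak, time-weighted average) concurrency for job intervals.

    `intervals` is a list of (start, end) datetimes. The average is taken
    only over busy time, i.e. while at least one job is running.
    """
    if not intervals:
        return 0, 0.0
    events = []
    for start, end in intervals:
        events.append((start, 1))
        events.append((end, -1))
    # Sorting on (timestamp, delta) puts -1 before +1 at equal timestamps,
    # so a job ending exactly when another starts is not counted as overlap.
    events.sort(key=lambda item: (item[0], item[1]))

    concurrent = peak = 0
    weighted = busy = 0.0
    previous = events[0][0]
    for timestamp, delta in events:
        if concurrent > 0:
            elapsed = (timestamp - previous).total_seconds()
            weighted += concurrent * elapsed
            busy += elapsed
        concurrent += delta
        peak = max(peak, concurrent)
        previous = timestamp
    return peak, (weighted / busy if busy else 0.0)


t0 = datetime(2026, 5, 19, 12, 0)
jobs = [
    (t0, t0 + timedelta(minutes=60)),
    (t0 + timedelta(minutes=30), t0 + timedelta(minutes=90)),
    (t0 + timedelta(minutes=90), t0 + timedelta(minutes=120)),
]
print(concurrency_stats(jobs))  # (2, 1.25)
```

For jobs spanning minutes 0 to 60, 30 to 90 and 90 to 120, two jobs overlap for 30 minutes out of 120 busy minutes, giving a peak of 2 and an average of 1.25.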
@gyohuangxin (Member)

@amdfaa Can you fix the code style error?
