CI: add MoE perf regression check (bench_moe) #3300

Open
zhiding512 wants to merge 3 commits into main from zhimding/add_flydsl_moe_benchmark_0521

@zhiding512
Contributor

Catch MoE kernel performance regressions per-PR by piggybacking on the existing test_moe_2stage.py run in aiter-test:

  • test_moe_2stage.py drops a moe_bench.csv (CSV-mode rows only, perf-only)
  • standard job uploads the csv alongside latest_test.log
  • new bench_moe job (ubuntu, no GPU) downloads the linux-aiter-mi35x-1 shard csv, compares it against the last main baseline (artifact moe-bench-<SHA>, 90d retention), and reports to STEP_SUMMARY (warn-only for now)
  • main push / workflow_dispatch publishes the next baseline

Warn thresholds default to 1.10/1.15 (slowdown ratio cur/base); --fail-on-regress stays off until the noise floor has been characterized over 2-4 weeks.
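
The threshold rule above can be sketched roughly as follows. This is an illustration only, not the code in scripts/compare_benchmark.py: the helper name `classify_slowdown`, the tier labels, and the reading of 1.10/1.15 as a lower and an upper warning tier are all assumptions.

```python
def classify_slowdown(cur_us: float, base_us: float,
                      warn: float = 1.10, strong_warn: float = 1.15) -> str:
    """Classify one benchmark row by its slowdown ratio cur/base.

    Hypothetical helper: the tier names and the two-tier interpretation
    of the 1.10/1.15 defaults are assumptions, not taken from the PR.
    """
    if base_us <= 0:
        return "skip"  # no usable baseline timing for this row
    ratio = cur_us / base_us  # > 1.0 means the current run is slower
    if ratio >= strong_warn:
        return "warn-high"
    if ratio >= warn:
        return "warn"
    return "ok"
```

In a warn-only setup the result would only be reported; a flag such as --fail-on-regress would be the place to turn a warning tier into a non-zero exit code.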

Catch MoE kernel performance regressions per-PR by piggybacking on the
existing test_moe_2stage.py run in aiter-test:

- test_moe_2stage.py drops a moe_bench.csv (CSV-mode rows only, perf-only)
- standard job uploads the csv alongside latest_test.log
- new bench_moe job (ubuntu, no GPU) downloads the linux-aiter-mi35x-1
  shard csv, compares vs the last main baseline (artifact moe-bench-<SHA>,
  90d retention), reports to STEP_SUMMARY (warn-only for now)
- main push / workflow_dispatch publishes the next baseline

Warn thresholds default 1.10/1.15 (slow ratio cur/base); --fail-on-regress
is off until noise floor is characterized over 2-4 weeks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| ci:triton-300x | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| ci:sglang | SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy |
| ci:atom | ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B |
| ci:atom_full | ATOM accuracy suite for PR and main models from ATOM models_accuracy.json |
| ci:vllm | vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5 |
| ci:all | All standard extended tests (excludes ci:atom_full) |

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3300 --add-label <label>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

Adds a per-PR MoE kernel performance regression signal to CI by exporting MoE benchmark results from the existing test_moe_2stage.py run, comparing them against a main-branch baseline artifact, and reporting the comparison in the GitHub Actions step summary.

Changes:

  • Export moe_bench.csv from op_tests/test_moe_2stage.py and upload it with existing standard test artifacts.
  • Add a CI job (bench_moe) that downloads the MoE CSV from the MI35X shard, fetches a baseline artifact from main, and emits a regression table to GITHUB_STEP_SUMMARY.
  • Introduce a benchmark comparison CLI (scripts/compare_benchmark.py) plus a small wrapper script used by CI.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| scripts/compare_benchmark.py | New Python CLI to diff baseline vs current MoE benchmark CSVs and classify regressions. |
| op_tests/test_moe_2stage.py | Writes moe_bench.csv from collected perf results (CSV-mode rows). |
| .github/workflows/aiter-test.yaml | Uploads moe_bench.csv from standard tests and adds the bench_moe comparison/publish job. |
| .github/scripts/check_moe_regression.sh | Wrapper to run the Python comparison with consistent labels/thresholds in CI. |

Comment on lines +213 to +215
baseline, _ = _read_csv(args.baseline_csv)
current, key_cols = _read_csv(args.current_csv)
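
The excerpt above keeps the key columns of the current CSV and discards those of the baseline. One possible guard, sketched here under assumptions, is to compare the two headers before joining. The names `key_columns` and `key_column_mismatch` and the contents of `NON_KEY` are hypothetical and are not taken from scripts/compare_benchmark.py.

```python
import csv
import io

# Assumption: columns that hold measurements rather than identify a case.
NON_KEY = {"us", "tflops", "bw"}

def key_columns(csv_text: str) -> tuple:
    """Return the key columns (header minus NON_KEY) of a CSV document."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return tuple(c for c in (reader.fieldnames or []) if c not in NON_KEY)

def key_column_mismatch(baseline_text: str, current_text: str) -> list:
    """List key columns that appear in only one of the two files."""
    base = set(key_columns(baseline_text))
    cur = set(key_columns(current_text))
    return sorted(base ^ cur)  # symmetric difference
```

A non-empty result would mean that row keys built from the two files cannot match, so the comparison would silently report every case as missing or new.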

Comment on lines +124 to +130
key_cols = tuple(c for c in reader.fieldnames if c not in NON_KEY)
for raw in reader:
    # Strip whitespace from every value to avoid silent join misses
    # caused by trailing/leading spaces.
    raw = {k: (v.strip() if isinstance(v, str) else v) for k, v in raw.items()}
    key = tuple(sorted((c, raw.get(c, "")) for c in key_cols))
    rows[key] = raw
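
To illustrate why the key is built from sorted (column, value) pairs after stripping whitespace, here is a small self-contained sketch. The function `read_rows`, the single-entry `NON_KEY`, and the toy CSV data are assumptions made for the example; only the loop body mirrors the excerpt above.

```python
import csv
import io

NON_KEY = {"us"}  # assumption: the only measurement column in this toy data

def read_rows(csv_text: str) -> dict:
    """Index CSV rows by an order-independent key of (column, value) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    key_cols = tuple(c for c in (reader.fieldnames or []) if c not in NON_KEY)
    rows = {}
    for raw in reader:
        # Strip values so " 128 " and "128" produce the same key.
        raw = {k: (v.strip() if isinstance(v, str) else v) for k, v in raw.items()}
        key = tuple(sorted((c, raw.get(c, "")) for c in key_cols))
        rows[key] = raw
    return rows

# Made-up data: same case, different column order and stray spaces.
baseline = read_rows("model,token,us\nmixtral, 128 ,10.0\n")
current = read_rows("token,model,us\n128,mixtral,11.5\n")

shared = baseline.keys() & current.keys()
ratios = {k: float(current[k]["us"]) / float(baseline[k]["us"]) for k in shared}
```

Sorting the pairs makes the key independent of column order, and stripping the values keeps padded cells from producing keys that never match.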
Comment thread scripts/compare_benchmark.py Outdated
Comment on lines +270 to +272
# Build display_cols: walk key_cols, drop hidden, splice derived in place
_derived_sources = {src for sources in DERIVED_TUPLE_COLS.values() for src in sources}
_derived_first_src = {sources[0]: name for name, sources in DERIVED_TUPLE_COLS.items()}
    csv_df = df[df["model"] != "legacy"].copy()
else:
    csv_df = df.copy()
csv_df = csv_df.drop(columns=["logits_diff"], errors="ignore")
Comment thread .github/workflows/aiter-test.yaml Outdated
Comment on lines +863 to +868
echo "## MoE Bench (vs baseline)" >> "$GITHUB_STEP_SUMMARY"
echo '```' >> "$GITHUB_STEP_SUMMARY"
bash .github/scripts/check_moe_regression.sh \
  "$baseline_csv" /tmp/current.csv \
  | tee -a "$GITHUB_STEP_SUMMARY"
echo '```' >> "$GITHUB_STEP_SUMMARY"
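
The workflow step above wraps the comparison output in a fenced block inside the job summary. For comparison, a rough Python equivalent might look like the sketch below. `summary_path` and `append_summary` are hypothetical helpers that are not part of the PR; the only fact relied on is that GitHub Actions exposes the summary file path in the GITHUB_STEP_SUMMARY environment variable.

```python
import os
import tempfile

def summary_path() -> str:
    """Use GITHUB_STEP_SUMMARY in CI, else a scratch file for local runs."""
    return os.environ.get("GITHUB_STEP_SUMMARY") or os.path.join(
        tempfile.gettempdir(), "step_summary.md")

def append_summary(title: str, body: str, path: str) -> None:
    """Append a heading plus a fenced block to the step-summary file."""
    fence = "`" * 3  # built at runtime so this example stays fence-safe
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(f"## {title}\n{fence}\n{body.rstrip()}\n{fence}\n")
```

A caller would pass the comparison table as `body`, for example `append_summary("MoE Bench (vs baseline)", table_text, summary_path())`, where `table_text` stands for whatever the comparison script printed.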
Co-authored-by: Cursor <cursoragent@cursor.com>