Separate eval-only workflow and change to 8k1k by Oseltamivir · Pull Request #911 · SemiAnalysisAI/InferenceX

Oseltamivir · 2026-03-15T21:15:10Z

Summary

Remove 1k8k
Decouple eval jobs from benchmark jobs into a dedicated eval / workflow stage, so evals can run independently with evals-only: true in perf-changelog
Add eval-only input to benchmark-tmpl.yml that skips throughput benchmarks and result file checks, running only lm_eval
Fix GSM8k prompt template: change #### <answer> to #### [number] to prevent models (e.g. Kimi-K2.5) from outputting the literal <answer> tag, which caused strict-match to return [invalid]
Add eval support to all 49 single-node benchmark scripts (server restart with native max context, eval context override, request timeout increase)
Update process_changelog.py and generate_sweep_configs.py to emit a separate evals matrix for eval-only jobs
Changed from 1k8k to 8k1k
Groups now include dp-attn (previously missing, so dp-attn variants were merged)
Instead of picking max-conc at highest/lowest TP, it picks all TPs at the highest conc + all TPs at the median conc.

Separate jobs on a normal run tested with:

- config-keys:
    - gptoss-fp4-h100-vllm
    - dsr1-fp8-mi325x-sglang

Implementation Details

get_native_max_context_length(model) — reads max_position_embeddings / max_sequence_length / seq_length / n_positions from the model's HuggingFace config.
max_output_tokens: dynamically computed as eval_context_len - 4096 (capped at
16384), was fixed 8192
Score validation: new validate_scores.py — post-eval gate checking exact_match metrics ≥ 0.85, exits non-zero on failure or zero metrics found
Failure mode: eval-only uses if-no-files-found: error (was ignore)
GSM8k prompt: #### <answer> → #### [number]
ATOM torchvision workaround: uninstalls torchvision before lm-eval install to avoid circular import
PYTHONDONTWRITEBYTECODE=1 added to workflow env to avoid EACCES error on pycache.

New job step: evals

Normal job with perf-changelog, no 1k8k

- Switch eval selection from 1k8k to 8k1k in mark_eval_entries() - After throughput benchmark, kill server and restart with model's native max context length (max_position_embeddings) for eval - Replace hardcoded gen_max_tokens=16384 and max_tokens=8192 with the native max model length - Add _start_eval_server() supporting sglang, vllm, trtllm, and atom - Add EVAL_SERVER_EXTRA_ARGS to benchmark scripts for correctness flags Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

max_tokens in gen_kwargs must be less than max_length to leave room for the input prompt. Without this cap, the server rejects requests where input_tokens + max_tokens > max_model_len. Uses max_length - 4096 as the generation cap (min 8192). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Use 70% of max_length for max_tokens generation cap, leaving 30% for the prompt. Echo the budget for visibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Add --no-skip-eval-only flag to wait_for_server_ready so _start_eval_server properly waits for the eval server - Fix pip3 install in dsr1_fp8_h200.sh with --break-system-packages Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Bump eval concurrency to 64 via EVAL_CONCURRENT_REQUESTS, independent of benchmark CONC - Cap eval max_gen_tokens to 8192 to avoid KV cache issues - Read num_fewshot from eval YAML instead of CLI override - Add 'trt' as alias for trtllm in eval server case - Reduce seq len configs for OOM prevention - Change eval sweep to top of curve and middle configs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…t error The ATOM container has a broken torchvision that causes circular import errors when lm_eval loads. Since we use local-chat-completions (API-based), torchvision is not needed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…val call Pass utils/evals/ directory to --tasks so lm-eval globs all *.yaml files and runs gsm8k + gpqa_diamond consecutively in one invocation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…jection max_gen_tokens was hardcoded to 16384 but servers with smaller EVAL_MAX_MODEL_LEN (e.g. 9416) reject requests exceeding their limit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

15 SGLang scripts now pass --context-length with compute_eval_context_length when EVAL_ONLY=true. 3 vLLM scripts override MAX_MODEL_LEN similarly. Reverts the max_gen_tokens cap since the server should have sufficient context. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… file Was defaulting to utils/evals/gsm8k.yaml which caused lm-eval to only run gsm8k. Directory path lets lm-eval glob all *.yaml files (gsm8k + gpqa). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…t to 1800s 5x avoids TRT OOM on H200 (47K vs 94K context). 1800s timeout prevents single-request timeouts on slow models like Kimi K2.5. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

# Conflicts: # benchmarks/single_node/dsr1_fp4_b200.sh # benchmarks/single_node/dsr1_fp4_mi355x.sh # benchmarks/single_node/dsr1_fp8_b200.sh # benchmarks/single_node/dsr1_fp8_b200_mtp.sh # benchmarks/single_node/dsr1_fp8_b200_trt_mtp.sh # benchmarks/single_node/dsr1_fp8_h200_trt_mtp.sh # benchmarks/single_node/dsr1_fp8_mi300x.sh # benchmarks/single_node/dsr1_fp8_mi355x.sh # benchmarks/single_node/glm5_fp8_mi355x.sh # benchmarks/single_node/gptoss_fp4_mi300x.sh # benchmarks/single_node/gptoss_fp4_mi325x.sh # benchmarks/single_node/gptoss_fp4_mi355x.sh # benchmarks/single_node/kimik2.5_fp4_b200.sh # benchmarks/single_node/kimik2.5_int4_b200.sh # benchmarks/single_node/kimik2.5_int4_mi325x.sh # benchmarks/single_node/kimik2.5_int4_mi355x.sh # benchmarks/single_node/minimaxm2.5_fp8_b200.sh # benchmarks/single_node/minimaxm2.5_fp8_h200.sh # benchmarks/single_node/minimaxm2.5_fp8_mi300x.sh # benchmarks/single_node/minimaxm2.5_fp8_mi325x.sh # benchmarks/single_node/minimaxm2.5_fp8_mi355x.sh # benchmarks/single_node/qwen3.5_bf16_mi300x.sh # benchmarks/single_node/qwen3.5_bf16_mi325x.sh # benchmarks/single_node/qwen3.5_bf16_mi355x.sh # benchmarks/single_node/qwen3.5_fp8_mi300x.sh # benchmarks/single_node/qwen3.5_fp8_mi325x.sh # benchmarks/single_node/qwen3.5_fp8_mi355x.sh

github-actions · 2026-03-15T21:15:18Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

functionstackx

thanks for the PR @Oseltamivir ! seems great that we now parallel evals instead of having them in sequential

mostly LGTM

just had a couple of questions

does this PR change the evals to use the industry standard num of shots for gsm8k?
previously when evals hard crashed, the script still exits with success exit code, is this fixed now?

.github/workflows/benchmark-tmpl.yml

.github/workflows/e2e-tests.yml

benchmarks/single_node/kimik2.5_fp4_b200.sh

benchmarks/benchmark_lib.sh

utils/evals/EVALS.md

utils/matrix_logic/generate_sweep_configs.py

Oseltamivir · 2026-03-15T22:08:58Z

Changed to 5-shot
Yes, also added suggestion of failing when result <85%

perf-changelog.yaml

.github/workflows/profile.yml

functionstackx

.

benchmarks/single_node/dsr1_fp8_mi355x_atom_mtp.sh

claude

Two critical bugs flagged in prior review rounds remain unpatched in the actual code despite their GitHub conversation threads being marked resolved — please verify the fixes landed in the branch.

Extended reasoning...

Overview

This PR (72 files) separates eval-only CI runs from throughput benchmarks and switches eval workloads to an 8k/1k (ISL/OSL) profile. The changes touch CI workflow YAML, benchmark_lib.sh, generate_sweep_configs.py, process_changelog.py, validate_scores.py, and every single-node benchmark shell script.

Bugs Still Present in Current Branch

Two issues previously marked resolved on GitHub are not fixed in the files on the branch:

benchmarks/benchmark_lib.sh line 624 — local eval_ctx=$(( benchmark_ctx * 1 )) still multiplies by 1 instead of 5. This makes compute_eval_context_length() a no-op: the server launched in eval-only mode gets exactly the same (tiny) context window as during throughput benchmarks, defeating the entire purpose of the EVAL_ONLY flow.
benchmarks/single_node/dsr1_fp8_mi355x_atom_mtp.sh lines 87–88 — A stray set -x followed by rm -rf ./utils/bench_serving\ (with a bash line-continuation backslash) still unconditionally deletes the benchmark serving utilities after every run on MI355X hardware.

Security / Stability Risk

The rm -rf on line 88 is a destructive filesystem operation that runs outside any conditional block. It will silently destroy tooling on every benchmark execution, requiring manual recovery. The * 1 multiplier bug silently produces incorrect eval results (accuracy measured at the wrong context length) without any error.

Level of Scrutiny

Both of these were flagged as critical in earlier review rounds. The inline comments were marked resolved, which typically signals the author fixed them — but cross-checking the actual file contents shows the fixes were not committed. This needs explicit human confirmation that the correct patch is in place before merging.

Other Factors

A fix verified in code (dsr1_fp8_b200_trt_mtp.sh EVAL_ONLY ordering) shows the process works when changes are actually committed. The remaining two issues just need the already-identified one-line fixes to be applied and pushed.

benchmarks/single_node/kimik2.5_int4_mi355x.sh

Oseltamivir · 2026-03-28T05:55:20Z

@claude

Review this PR. Ignore earlier resolved issues, even if not changed like only 1 conc being ran if there are only 2 to sweep.

Klaud-Cold · 2026-03-28T05:55:41Z

Claude finished @Oseltamivir's task in 2m 11s —— View job

PR Review: Separate eval-only workflow and change to 8k1k

Read key files and gather context
Review benchmark_lib.sh changes
Review workflow changes (benchmark-tmpl, e2e-tests, run-sweep)
Review benchmark scripts
Review Python utility changes (generate_sweep_configs, process_changelog, validate_scores)
Review documentation and config changes
Post final review

Overall: Well-structured PR. The eval-only workflow separation is clean, process_changelog.py correctly generates separate benchmark/eval matrices with independent dedup, validate_scores.py properly fails on zero metrics checked, and all Docker launch scripts correctly forward EVAL_ONLY. Most previously flagged issues have been addressed. Two critical issues remain.

Critical Issues

1. compute_eval_context_length multiplier is 1 instead of 5 — benchmarks/benchmark_lib.sh:624

local eval_ctx=$(( benchmark_ctx * 1 ))

The function's own docstring (line 609) says "Uses 5x the benchmark context capped at the model's native max." With * 1, the eval context is never expanded — for a typical 8k1k workload, the server gets 9416 tokens instead of ~47080. This makes eval-only mode's core feature (expanded context) a complete no-op across all 49 scripts. This was flagged multiple times in prior review rounds and marked resolved, but the code was never actually changed.

Fix this →

2. Stray destructive rm -rf in dsr1_fp8_mi355x_atom_mtp.sh — lines 87-88

set -x
rm -rf ./utils/bench_serving\

After the script's normal ending (stop_gpu_monitor / set +x at line 85), these two stray lines execute unconditionally. The trailing backslash is a line continuation — bash resolves this as rm -rf ./utils/bench_serving, deleting the benchmark serving utilities directory from the workspace on every run. Compare with the sibling dsr1_fp4_mi355x_atom_mtp.sh which ends cleanly at set +x. This was also flagged previously and marked resolved but remains in the code.

Fix this →

Minor Issues

3. Cleanup step doesn't remove sample*.jsonl — benchmark-tmpl.yml:227-230

The cleanup step removes meta_env.json and results*.json but not sample*.jsonl, which append_lm_eval_summary also moves into the workspace CWD and which is included in the artifact upload. On persistent self-hosted runners, these accumulate and contaminate future artifact uploads.

# Add this line:
rm -f sample*.jsonl || true

4. collect-evals in run-sweep.yml runs when sweep-evals is skipped — line 209

if: ${{ always() && needs.setup.result != 'skipped' }}

When a PR produces no 8k1k eval configs, sweep-evals is skipped, no eval_* artifacts exist, but collect-evals still runs. actions/download-artifact@v8 will fail when no artifacts match the pattern. Same issue in e2e-tests.yml:171-173 where collect-evals uses if: ${{ always() }}.

Fix: add && needs.sweep-evals.result != 'skipped' in run-sweep.yml and && needs.test-sweep-evals.result != 'skipped' in e2e-tests.yml.

5. e2e-tests.yml SINGLE filter includes eval entries — line 57

When --evals-only is passed, generate_sweep_configs.py outputs only entries with run-eval: True. The SINGLE filter [x for x in d if 'prefill' not in x] captures all of them, so every config runs as both a full benchmark (via test-sweep-single-node) and an eval-only run (via test-sweep-evals), wasting GPU time. Adding and not x.get('run-eval', False) to the SINGLE filter would fix this.

Looks Good

GSM8k prompt fix (#### [number] instead of #### <answer>) with 5-shot — good fix
validate_scores.py correctly returns 1 when checked == 0 (line 86) — the zero-metrics silent pass issue is properly handled
EVAL_ONLY correctly forwarded in all 4 Docker launch scripts (-e EVAL_ONLY on line 36 of launch_b200-dgxc.sh)
All --concurrent-requests $CONC removed from benchmark scripts (0 grep matches)
process_changelog.py uses separate benchmark_configs_seen / eval_configs_seen sets (lines 97-98) — no cross-contamination
setup_eval_context() wrapper (line 640) correctly avoids the subshell export issue by calling compute_eval_context_length directly and exporting afterward
Artifact naming collision addressed with eval_server_logs_ / eval_gpu_metrics_ conditional prefixes (lines 202, 210)
always() guard correctly added to eval results upload (line 215) and cleanup step (line 226)
thresholds.json with per-task thresholds (gsm8k: 0.85, gpqa: 0.30) — sensible approach
EVALS.md and AGENTS.md documentation updated to reflect new selection policy
glm5_fp8_h200.sh has the EVAL_ONLY block and $EVAL_CONTEXT_ARGS correctly wired up

|

claude

Two critical bugs from earlier rounds remain in the code despite their threads being marked resolved: (1) compute_eval_context_length still uses benchmark_ctx * 1 instead of * 5 (benchmarks/benchmark_lib.sh:624), so eval context is never actually expanded — defeating the core feature of this PR; (2) stray set -x / rm -rf ./utils/bench_serving\\ debug lines remain at benchmarks/single_node/dsr1_fp8_mi355x_atom_mtp.sh:87-88, which will execute at runtime and destructively delete the bench_serving directory.

Extended reasoning...

Overview

PR #911 decouples eval-only workflows from benchmark jobs across 74 files: new eval-only input in benchmark-tmpl.yml, a new sweep-evals job, eval context expansion across ~49 single-node scripts via setup_eval_context(), and supporting tooling updates (validate_scores.py, process_changelog.py, GSM8K prompt fix, documentation).

Security Risks

No security-sensitive code paths are involved. This is CI/CD pipeline automation for GPU benchmark infrastructure.

Level of Scrutiny

This is a large, complex PR touching 74 files across critical CI/CD workflows and benchmark scripts that execute on expensive GPU hardware. The eval context expansion feature is the primary new capability, making multiplier correctness essential. Multiple rounds of review (38 inline comments) were needed to surface and address numerous bugs; extra scrutiny is warranted.

Other Factors

All 38 previously posted inline comments are now marked as resolved, but at least two critical bugs are still present in the HEAD commit. Because GitHub hides resolved threads by default, a human reviewer would not see these without explicitly expanding them:

Multiplier still 1 instead of 5 — benchmarks/benchmark_lib.sh:624 reads local eval_ctx=$(( benchmark_ctx * 1 )). The function docstring says it uses "5x the benchmark context" but the multiplier is 1, so the eval server is given the same context window as throughput benchmarks. This breaks the stated purpose of eval-only mode.
Stray destructive commands — benchmarks/single_node/dsr1_fp8_mi355x_atom_mtp.sh:87-88 contains set -x followed by rm -rf ./utils/bench_serving\\ (with bash line-continuation). These are leftover debug lines that will execute on every run, deleting the bench_serving directory and causing failures for all subsequent benchmark operations on that runner.

.github/workflows/benchmark-tmpl.yml

Oseltamivir and others added 22 commits March 8, 2026 08:31

fix: reserve 30% of context for prompt in eval gen_kwargs

4452868

Use 70% of max_length for max_tokens generation cap, leaving 30% for the prompt. Echo the budget for visibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

change eval calling

83cd7de

change gsm8k to 8-shot

73b6846

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

decouple

aad5336

add gpqa

1cdc72a

fix: pass multiple eval tasks as separate args for older lm-eval compat

49b0b90

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: run eval tasks sequentially for cross-version lm-eval compat

cc59d50

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: use directory-based task discovery for multi-eval in single lm_e…

c8b5858

…val call Pass utils/evals/ directory to --tasks so lm-eval globs all *.yaml files and runs gsm8k + gpqa_diamond consecutively in one invocation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

gsm8k only

ffee6c5

Merge branch 'main' into eval-8k1k-server-restart

c577ca2

fix: cap max_gen_tokens to server's max_model_len to avoid request re…

dc25ccd

…jection max_gen_tokens was hardcoded to 16384 but servers with smaller EVAL_MAX_MODEL_LEN (e.g. 9416) reject requests exceeding their limit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: default EVAL_TASKS_DIR to utils/evals directory, not single yaml…

2d9d7ba

… file Was defaulting to utils/evals/gsm8k.yaml which caused lm-eval to only run gsm8k. Directory path lets lm-eval glob all *.yaml files (gsm8k + gpqa). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: reduce eval context multiplier to 5x and increase request timeou…

1ca2173

…t to 1800s 5x avoids TRT OOM on H200 (47K vs 94K context). 1800s timeout prevents single-request timeouts on slow models like Kimi K2.5. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

test other prompt

6999263

pr

ba45203

github-project-automation bot added this to InferenceMAX Board Mar 15, 2026

functionstackx requested changes Mar 15, 2026

View reviewed changes

claude bot reviewed Mar 15, 2026

View reviewed changes

Oseltamivir requested a review from a team March 15, 2026 22:08

test evals

826035c

Oseltamivir force-pushed the eval-8k1k-server-restart branch from 600582e to 826035c Compare March 15, 2026 22:08

cquil11 reviewed Mar 23, 2026

View reviewed changes

perf-changelog.yaml Show resolved Hide resolved

cquil11 requested changes Mar 23, 2026

View reviewed changes

.github/workflows/profile.yml Outdated Show resolved Hide resolved

functionstackx reviewed Mar 23, 2026

View reviewed changes

Oseltamivir added 2 commits March 23, 2026 21:40

remove sudo rm

99f5d21

Merge branch 'main' into eval-8k1k-server-restart

c23031b

claude bot reviewed Mar 24, 2026

View reviewed changes

benchmarks/single_node/dsr1_fp8_mi355x_atom_mtp.sh Show resolved Hide resolved

Oseltamivir added 2 commits March 25, 2026 22:05

Merge branch 'main' into eval-8k1k-server-restart

ac655d3

Fix newline at end of file in perf-changelog.yaml

bec4dba

claude bot reviewed Mar 26, 2026

View reviewed changes

Merge main

f4c332b

Oseltamivir force-pushed the eval-8k1k-server-restart branch from 79db8ec to f4c332b Compare March 27, 2026 04:35

Oseltamivir requested review from billishyahao and chunfangamd as code owners March 27, 2026 04:35

Merge branch 'main' into eval-8k1k-server-restart

87f711e

claude bot reviewed Mar 27, 2026

View reviewed changes

benchmarks/single_node/kimik2.5_int4_mi355x.sh Show resolved Hide resolved

eliminate 1k8k

ed3b2fc

Oseltamivir requested review from jgangani and kedarpotdar-nv as code owners March 28, 2026 05:12

Oseltamivir removed the sweep-enabled label Mar 28, 2026

Merge branch 'main' into eval-8k1k-server-restart

194d52a

Oseltamivir added the sweep-enabled label Mar 28, 2026

changelog

f5fd82c

claude bot reviewed Mar 28, 2026

View reviewed changes

.github/workflows/benchmark-tmpl.yml Show resolved Hide resolved

final

ecb4717

Oseltamivir merged commit 626932a into main Mar 28, 2026
61 of 159 checks passed

Oseltamivir deleted the eval-8k1k-server-restart branch March 28, 2026 06:38

github-project-automation bot moved this to Done in InferenceMAX Board Mar 28, 2026

Klaud-Cold mentioned this pull request Mar 28, 2026

Disable prefix caching for qwen3.5 & glm5 AMD benchmarks #969

Closed

Conversation

Oseltamivir commented Mar 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Implementation Details

Uh oh!

github-actions bot commented Mar 15, 2026

Uh oh!

functionstackx left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Oseltamivir commented Mar 15, 2026

Uh oh!

Uh oh!

Uh oh!

functionstackx left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

claude bot left a comment

Choose a reason for hiding this comment

Overview

Bugs Still Present in Current Branch

Security / Stability Risk

Level of Scrutiny

Other Factors

Uh oh!

Uh oh!

Oseltamivir commented Mar 28, 2026

Uh oh!

Klaud-Cold commented Mar 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Separate eval-only workflow and change to 8k1k

Critical Issues

Minor Issues

Looks Good

Uh oh!

claude bot left a comment

Choose a reason for hiding this comment

Overview

Security Risks

Level of Scrutiny

Other Factors

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Oseltamivir commented Mar 15, 2026 •

edited

Loading

functionstackx left a comment •

edited

Loading

Klaud-Cold commented Mar 28, 2026 •

edited

Loading