Conversation

@claude please add a PR description

Claude finished @kedarpotdar-nv's task in 40s — View job
Adding PR Description
PR description has been added with a summary of all changes.

@claude please review the PR

Claude finished @kedarpotdar-nv's task in 2m 41s — View job
Reviewing PR

Review:
| Area | Verdict |
|---|---|
| Config YAML | ✅ Good |
| Launch script | 🟡 Minor — missing --tokenizer-worker-num, hardcoded scheduler interval |
| Perf changelog | ✅ Good |
The script is functional as-is. The two items flagged (tokenizer workers and dynamic scheduler interval) are nice-to-haves that would align with the BF16 sibling script and could improve low-concurrency latency. Up to you whether to address now or in a follow-up.
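For reference, a minimal sketch of what addressing the two flagged items could look like in the launch script. This is an assumption-laden illustration, not the actual change: the `--tokenizer-worker-num` flag name comes from the review above, but the `--scheduler-recv-interval` flag name, the worker count, the interval values, and the concurrency threshold are all placeholders of mine.

```shell
#!/usr/bin/env bash
# Sketch only. CONCURRENCY is assumed to be provided by the benchmark
# harness; flag names and values below are illustrative assumptions.
CONCURRENCY="${CONCURRENCY:-32}"

# Dynamic scheduler interval: poll more aggressively at low concurrency
# (to cut per-request latency), back off at high concurrency.
if [ "${CONCURRENCY}" -le 8 ]; then
  SCHEDULER_RECV_INTERVAL=1
else
  SCHEDULER_RECV_INTERVAL=4
fi

EXTRA_ARGS="--tokenizer-worker-num 4 --scheduler-recv-interval ${SCHEDULER_RECV_INTERVAL}"
echo "${EXTRA_ARGS}"
```

The point of the guard is simply that a fixed interval tuned for high concurrency leaves latency on the table at low concurrency, which is what the review calls out.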

@functionstackx collaborating on this existing PR to get our FP8 and FP4 configs merged upstream in sgl-cookbook. sgl-project/sgl-cookbook#168 Thanks!

Thanks! Excited to see continued focus on ensuring that end users in the ML community can benefit from these optimized flag recipes for selecting the right MoE and attention kernel backends.

/sweep run only evals

@kedarpotdar-nv Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/22507799181

@kedarpotdar-nv some merge conflicts, then looks ready to go?

@ankursingh-nv will take over
(force-pushed from 86e3eec to 8d9f7c9)

@claude resolve the merge conflicts & update the PRs to follow the updated conventions. remember that qwen3.5-fp4-b200-sglang should be at the bottom of perf-changelog.yaml, remember to delete 1k/8k from nvidia-master, remember to do the EVAL_CONTEXT_ARGS thing

Claude finished @functionstackx's task in 3m 16s — View job
Resolving Merge Conflicts & Updating Conventions

Summary:
- Resolved merge conflicts by rebasing cleanly.
- nvidia-master.yaml — added the qwen3.5-fp4-b200-sglang config.
- benchmarks/single_node/qwen3.5_fp4_b200.sh — added and rewritten to follow the updated conventions from the FP8 sibling script.
- perf-changelog.yaml — entry added at the bottom.
- Add qwen3.5-fp4-b200-sglang config to nvidia-master.yaml (1k1k, 8k1k)
- Add launch script following updated conventions (dynamic scheduler interval, EVAL_CONTEXT_ARGS, tokenizer-worker-num, conditional allreduce fusion)
- Add perf-changelog entry

Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
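The "conditional allreduce fusion" convention mentioned above can be sketched as a guard in the launch script. The reasoning (fusion only matters when tensor parallelism actually performs an allreduce, i.e. TP > 1) is my assumption, not something stated in the thread, and `TP_SIZE` is a placeholder variable:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumed input: TP size comes from the benchmark matrix (placeholder here).
TP_SIZE="${TP_SIZE:-8}"

# Conditional allreduce fusion (sketch): append the fusion flag only when
# tensor parallelism will actually issue an allreduce, i.e. TP > 1.
FUSION_ARGS=""
if [ "${TP_SIZE}" -gt 1 ]; then
  FUSION_ARGS="--enable-flashinfer-allreduce-fusion"
fi

echo "fusion args: ${FUSION_ARGS}"
```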
(force-pushed from e5c143d to e1e8cb2)

Summary

Add FP4 benchmark configuration and launch script for Qwen3.5-397B-A17B on NVIDIA B200 GPUs using SGLang.

Changes

New Benchmark Config (nvidia-master.yaml)
- Config: qwen3.5-fp4-b200-sglang
- Model: nvidia/Qwen3.5-397B-A17B-NVFP4
- Image: lmsysorg/sglang:v0.5.9-cu129-amd64
- 1k1k — TP4/EP1 (conc 4–32), TP8/EP1 (conc 4–64), TP8/EP8 (conc 128)
- 1k8k — TP4/EP1 (conc 4–32), TP8/EP1 (conc 4–128)
- 8k1k — TP4/EP1 (conc 4–32), TP8/EP1 (conc 4–128)

New Launch Script (benchmarks/single_node/qwen3.5_fp4_b200.sh)
SGLang server configuration with:
- --quantization modelopt_fp4 with --fp4-gemm-backend flashinfer_cutlass
- --kv-cache-dtype fp8_e4m3
- --attention-backend trtllm_mha / --moe-runner-backend flashinfer_trtllm
- --enable-flashinfer-allreduce-fusion
- --chunked-prefill-size 32768 / --max-prefill-tokens 32768
- --disable-radix-cache
- --mem-fraction-static 0.85

Perf Changelog
- Entry added for the qwen3.5-fp4-b200-sglang config.
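The flag list above would assemble into a server launch roughly along these lines. This is a sketch, not the actual script from the PR: the port, TP/EP values, and the decision to echo rather than exec are illustrative, while the model path and all server flags are taken from the description above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Sketch of the server launch implied by the flag list above.
# PORT, TP_SIZE, and EP_SIZE are illustrative placeholders.
MODEL_PATH="nvidia/Qwen3.5-397B-A17B-NVFP4"
PORT="${PORT:-30000}"
TP_SIZE="${TP_SIZE:-8}"
EP_SIZE="${EP_SIZE:-1}"

SERVER_ARGS=(
  --model-path "${MODEL_PATH}"
  --port "${PORT}"
  --tp-size "${TP_SIZE}"
  --ep-size "${EP_SIZE}"
  --quantization modelopt_fp4
  --fp4-gemm-backend flashinfer_cutlass
  --kv-cache-dtype fp8_e4m3
  --attention-backend trtllm_mha
  --moe-runner-backend flashinfer_trtllm
  --enable-flashinfer-allreduce-fusion
  --chunked-prefill-size 32768
  --max-prefill-tokens 32768
  --disable-radix-cache
  --mem-fraction-static 0.85
)

# Print the command rather than launching it, so the sketch can be
# inspected without GPUs or the SGLang image present.
echo python3 -m sglang.launch_server "${SERVER_ARGS[@]}"
```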