MAF-19394: feat(preset): add vLLM v0.15.1 E2E presets for H100/H200 GPUs #98

Merged
hhk7734 merged 18 commits into main from MAF-19394 on Mar 27, 2026

Conversation


@hhk7734 hhk7734 commented Mar 27, 2026

Summary

  • Add vLLM v0.15.1 E2E presets for NVIDIA H100-SXM and H200-SXM GPUs covering DeepSeek-R1, Kimi-K2.5, OpenAI GPT-OSS-120B, and GLM-4.7-Flash models
  • Include various parallelism strategies (TP, DP, EP, MoE) with E2E-validated configurations
  • Add HF_MODULES_CACHE to vllm-hf-hub-offline utility for trust-remote-code models
  • Align GPU model names (h100-sxm, h200-sxm) with NFD accelerator definitions

Test plan

  • Verify helm template renders all new presets without errors
  • Confirm preset nodeSelector values match NFD moai-accelerator.yaml labels
  • Validate preset naming convention follows AGENTS.md rules

🤖 Generated with Claude Code

hhk7734 and others added 9 commits March 27, 2026 14:11
Add 19 production-ready Odin presets for DeepSeek-R1, gpt-oss-120b,
GLM-4.7-Flash, and Kimi-K2.5 on NVIDIA H100/H200 configurations with
TP, EP, DP, and Wide-EP parallelism strategies. Include research
document with architecture analysis, memory fit calculations, and
known issues.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Change GLM-4.7-Flash image from v0.15.1 to glm5 (glm4_moe_lite arch
  not supported in bundled transformers)
- Remove gpt-oss-120b single-GPU H100 preset (OOM: 64 GiB model leaves
  no headroom for KV cache on 80 GB GPU)
- Update H100 accelerator.model label and nodeSelector from h100 to
  h100-80gb-hbm3 to match actual node labels

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
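The OOM reasoning behind removing the single-GPU gpt-oss-120b H100 preset can be made concrete with a rough headroom calculation. This is an illustrative sketch only: the 0.9 memory-utilization fraction matches vLLM's default `gpu_memory_utilization`, but the 4 GiB runtime-overhead figure is an assumption, not a measured value.

```python
def kv_cache_headroom_gib(gpu_mem_gib: float, model_gib: float,
                          gpu_mem_util: float = 0.9,
                          overhead_gib: float = 4.0) -> float:
    """Rough KV-cache headroom: usable VRAM minus weights minus an
    assumed runtime overhead (CUDA context, activations, graphs)."""
    usable = gpu_mem_gib * gpu_mem_util
    return usable - model_gib - overhead_gib

# ~64 GiB of weights on a single 80 GB H100: ~4 GiB left for KV cache,
# too little to serve meaningful context lengths.
single_gpu = kv_cache_headroom_gib(80.0, 64.0)

# Sharded TP2 across two H100s (~32 GiB weights per GPU): ~36 GiB headroom.
tp2_per_gpu = kv_cache_headroom_gib(80.0, 32.0)
print(single_gpu, tp2_per_gpu)
```

Under these assumptions the single-GPU preset simply has no room for a useful KV cache, which is why only the multi-GPU variants were kept.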
…gs and GPU models

- Move GLM-4.7-Flash presets from v0.15.1/ to glm5/ directory to match
  actual vllm/vllm-openai:glm5 image tag
- Rename H100 presets from h100 to h100-80gb-hbm3 in file names and
  metadata names to match actual node labels
- Update research doc with E2E test results and removed preset

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…gs and E2E test results

Add --compilation_config.pass_config.fuse_allreduce_rms and --enable-auto-tool-choice
to all 3 Kimi-K2.5 presets per official vLLM recipe. Update research doc with GLM-4.7-Flash
H100 TP2/TP4 PASS results, Kimi-K2.5 H100 garbled output analysis, and corrected known
issues (tokenizer issue was PVC data problem, not vLLM compatibility).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…r trust-remote-code models

Models using --trust-remote-code on read-only PVCs fail because transformers
cannot create its dynamic module directory. Adding HF_MODULES_CACHE=/tmp/hf_modules
redirects the cache to a writable path.

Also updates research doc with Kimi-K2.5 H100 retest results: garbled output
confirmed as memory pressure (not thinking mode misconfiguration) after testing
both instant and thinking modes with correct temperature/top_p per HF model card.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Preset ISVC_EXTRA_ARGS fully overrides runtime-base's value during Odin
strategic merge patch (env vars merge by name key), so logging flags from
the runtime base are lost. Add --disable-uvicorn-access-log and
--no-enable-log-requests to all 18 E2E presets and update AGENTS.md
responsibility boundaries accordingly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
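The "env vars merge by name key" behavior is standard Kubernetes strategic-merge-patch semantics for container `env` lists, and it explains why the preset's ISVC_EXTRA_ARGS silently drops the runtime base's logging flags. A minimal simulation of that merge rule:

```python
def merge_env_by_name(base: list[dict], patch: list[dict]) -> list[dict]:
    """Strategic merge patch treats container env as a list keyed by
    'name': a patch entry with the same name replaces the base entry
    wholesale rather than concatenating values."""
    merged = {e["name"]: e for e in base}
    for e in patch:
        merged[e["name"]] = e
    return list(merged.values())

base = [{"name": "ISVC_EXTRA_ARGS",
         "value": "--disable-uvicorn-access-log --no-enable-log-requests"}]
patch = [{"name": "ISVC_EXTRA_ARGS",
          "value": "--tensor-parallel-size 8"}]

# The preset value wins; the runtime base's logging flags are lost,
# which is why each preset must repeat them explicitly.
print(merge_env_by_name(base, patch))
```

This is why the fix is duplication by design: every preset carries the logging flags itself instead of relying on inheritance from the runtime base.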
FP8 KV cache causes numerical instability on H100 for Kimi-K2.5 —
output degenerates into garbled tokens after ~30 tokens. Short responses
work fine but longer generation fails in both instant and thinking modes.
Switching to --kv-cache-dtype auto (BF16) resolves the issue completely.

Previous failures were compounded by incomplete model files on PVC
(only 22/64 weight shards). After re-downloading all 64 shards and
changing to BF16 KV cache, both modes produce correct output.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
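The incomplete-download failure mode (22/64 weight shards on the PVC) is easy to detect before serving, since Hugging Face checkpoints name shards with a `model-NNNNN-of-MMMMM.safetensors` convention. A hypothetical pre-flight check along those lines:

```python
import re

def missing_shards(filenames: list[str]) -> set[str]:
    """Given the files present in a HF checkpoint directory, return the
    shard names that are missing, inferred from the
    model-NNNNN-of-MMMMM.safetensors naming convention."""
    pat = re.compile(r"model-(\d{5})-of-(\d{5})\.safetensors")
    present, total = set(), 0
    for name in filenames:
        m = pat.fullmatch(name)
        if m:
            present.add(int(m.group(1)))
            total = int(m.group(2))
    return {f"model-{i:05d}-of-{total:05d}.safetensors"
            for i in range(1, total + 1) if i not in present}

# A partially downloaded checkpoint (2 of 4 shards) is flagged:
print(missing_shards([
    "model-00001-of-00004.safetensors",
    "model-00003-of-00004.safetensors",
    "config.json",
]))
```

Running a check like this in the model-manager pod would have separated the "incomplete files" failure from the genuine FP8 KV-cache instability much earlier.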
…ntations

- Fix naming convention: vllm-<image-tag>-... pattern, h100-80gb-hbm3 GPU identifier
- Fix GLM preset names: vllm-v0.15.1- → vllm-glm5- prefix
- Remove --enable-expert-parallel from ISVC_EXTRA_ARGS blocks; EP is controlled
  via spec.parallelism.expert: true in Odin spec, not the vLLM flag directly
- Fix Kimi-K2.5 H100 preset: --kv-cache-dtype fp8 → auto (matches actual file)
- Add explicit spec.parallelism blocks to EP/DP preset documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…finitions

Rename h100-80gb-hbm3 to h100-sxm and h200 to h200-sxm in preset
filenames, labels, and nodeSelectors to match the models defined in
moai-accelerator.yaml. Remove research doc superseded by preset files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@hhk7734 hhk7734 requested a review from a team as a code owner March 27, 2026 05:18
Copilot AI review requested due to automatic review settings March 27, 2026 05:18
@hhk7734 hhk7734 changed the title feat(preset): add vLLM v0.15.1 E2E presets for H100/H200 GPUs MAF-19394: feat(preset): add vLLM v0.15.1 E2E presets for H100/H200 GPUs Mar 27, 2026

Copilot AI left a comment

Pull request overview

Adds a set of Helm preset templates for vLLM-based E2E deployments on NVIDIA H100-SXM and H200-SXM, plus small supporting updates to utilities and preset documentation to accommodate trust-remote-code models and clarify responsibility boundaries.

Changes:

  • Add vLLM v0.15.1 E2E preset templates for DeepSeek-R1, Kimi-K2.5, and OpenAI GPT-OSS-120B across H100-SXM/H200-SXM with multiple parallelism strategies.
  • Add GLM-4.7-Flash E2E preset templates under the glm5 image tag for H100-SXM/H200-SXM.
  • Update the vllm-hf-hub-offline utility to set HF_MODULES_CACHE, and clarify in deploy/helm/AGENTS.md that logging args must be included in presets due to env-var merge semantics.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 11 comments.

Summary per file:

  • deploy/helm/moai-inference-preset/templates/utils/vllm-hf-hub-offline.helm.yaml: Adds HF_MODULES_CACHE for offline HF usage with trust-remote-code models.
  • deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/openai-gpt-oss-120b-nvidia-h200-sxm-tp2-moe-tp2.helm.yaml: New GPT-OSS-120B H200-SXM TP2 preset.
  • deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/openai-gpt-oss-120b-nvidia-h200-sxm-1.helm.yaml: New GPT-OSS-120B H200-SXM 1-GPU preset.
  • deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/openai-gpt-oss-120b-nvidia-h100-sxm-tp8-moe-tp8.helm.yaml: New GPT-OSS-120B H100-SXM TP8 preset.
  • deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/openai-gpt-oss-120b-nvidia-h100-sxm-tp4-moe-tp4.helm.yaml: New GPT-OSS-120B H100-SXM TP4 preset.
  • deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/openai-gpt-oss-120b-nvidia-h100-sxm-tp2-moe-tp2.helm.yaml: New GPT-OSS-120B H100-SXM TP2 preset.
  • deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/moonshotai-kimi-k2.5-nvidia-h200-sxm-tp8-moe-tp8.helm.yaml: New Kimi-K2.5 H200-SXM TP8 MoE(TP) preset.
  • deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/moonshotai-kimi-k2.5-nvidia-h200-sxm-tp8-moe-ep8.helm.yaml: New Kimi-K2.5 H200-SXM TP8 + expert-parallel preset.
  • deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/moonshotai-kimi-k2.5-nvidia-h100-sxm-tp8-moe-ep8.helm.yaml: New Kimi-K2.5 H100-SXM TP8 + expert-parallel preset.
  • deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/deepseek-ai-deepseek-r1-nvidia-h200-sxm-tp8-moe-tp8.helm.yaml: New DeepSeek-R1 H200-SXM TP8 MoE(TP) preset.
  • deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/deepseek-ai-deepseek-r1-nvidia-h200-sxm-tp8-moe-ep8.helm.yaml: New DeepSeek-R1 H200-SXM TP8 + expert-parallel preset.
  • deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/deepseek-ai-deepseek-r1-nvidia-h200-sxm-dp8-moe-ep8.helm.yaml: New DeepSeek-R1 H200-SXM DP8 + expert-parallel preset.
  • deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/deepseek-ai-deepseek-r1-nvidia-h200-sxm-dp16-moe-ep16.helm.yaml: New DeepSeek-R1 H200-SXM DP16 + expert-parallel preset.
  • deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/deepseek-ai-deepseek-r1-nvidia-h100-sxm-dp16-moe-ep16.helm.yaml: New DeepSeek-R1 H100-SXM DP16 + expert-parallel preset.
  • deploy/helm/moai-inference-preset/templates/presets/vllm/glm5/zai-org-glm-4.7-flash-nvidia-h200-sxm-tp2-moe-tp2.helm.yaml: New GLM-4.7-Flash H200-SXM TP2 preset (glm5 image).
  • deploy/helm/moai-inference-preset/templates/presets/vllm/glm5/zai-org-glm-4.7-flash-nvidia-h200-sxm-1.helm.yaml: New GLM-4.7-Flash H200-SXM 1-GPU preset (glm5 image).
  • deploy/helm/moai-inference-preset/templates/presets/vllm/glm5/zai-org-glm-4.7-flash-nvidia-h100-sxm-tp4-moe-tp4.helm.yaml: New GLM-4.7-Flash H100-SXM TP4 preset (glm5 image).
  • deploy/helm/moai-inference-preset/templates/presets/vllm/glm5/zai-org-glm-4.7-flash-nvidia-h100-sxm-tp2-moe-tp2.helm.yaml: New GLM-4.7-Flash H100-SXM TP2 preset (glm5 image).
  • deploy/helm/moai-inference-preset/templates/presets/vllm/glm5/zai-org-glm-4.7-flash-nvidia-h100-sxm-1.helm.yaml: New GLM-4.7-Flash H100-SXM 1-GPU preset (glm5 image).
  • deploy/helm/AGENTS.md: Documents logging-arg responsibility and explains env-var merge behavior.

hhk7734 and others added 6 commits March 27, 2026 16:49
…ta.name

All H200 presets had the -sxm suffix in filenames and labels but not in
metadata.name, inconsistent with H100 presets which correctly included it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… offline cache guidance

Clarify that offline HF cache env vars (HF_HOME, HF_HUB_OFFLINE,
HF_MODULES_CACHE) belong in *-hf-hub-offline runtime base templates.
Add root-level guidance on pre-populating HF_MODULES_CACHE for
air-gapped trust_remote_code deployments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ine utility templates

Change HF_MODULES_CACHE from /tmp/hf_modules to /mnt/models/modules so
inference pods read pre-populated remote-code modules from the shared
volume. Add the variable to dp and pp offline variants for consistency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
HF_MODULES_CACHE now belongs in the *-hf-hub-offline runtime base
templates, not in individual presets. Remove it from the three
Phi-mini-MoE quickstart presets to follow the updated responsibility
boundary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…warm-up to PV guide

Add HF_MODULES_CACHE=/mnt/models/modules to the model manager pod and
offline template examples. Document the warm-up step that pre-populates
dynamic module sources for trust_remote_code models on the shared volume.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…resets and runtime bases

Presets declare spec.parallelism values and labels; runtime bases
assemble the actual CLI flags from those values. Update the
responsibility boundaries section to reflect this split accurately.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
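The preset/runtime-base split described above can be sketched as a pure function: presets contribute declarative parallelism values, and the runtime base turns them into CLI flags. The flag names (`--tensor-parallel-size`, `--data-parallel-size`, `--enable-expert-parallel`) are real vLLM options, but the mapping function itself is an illustrative assumption about the runtime-base logic, not code from this repository.

```python
def parallelism_flags(spec: dict) -> list[str]:
    """Illustrative runtime-base logic: render a preset's declarative
    spec.parallelism values into vLLM CLI flags. Presets never write
    these flags directly (e.g. --enable-expert-parallel was removed
    from ISVC_EXTRA_ARGS in favor of spec.parallelism.expert)."""
    flags: list[str] = []
    if spec.get("tensor", 1) > 1:
        flags += ["--tensor-parallel-size", str(spec["tensor"])]
    if spec.get("data", 1) > 1:
        flags += ["--data-parallel-size", str(spec["data"])]
    if spec.get("expert"):
        flags.append("--enable-expert-parallel")
    return flags

# A DP8 + expert-parallel preset (e.g. DeepSeek-R1 dp8-moe-ep8):
print(parallelism_flags({"data": 8, "expert": True}))
```

Keeping flag assembly in one place means a vLLM flag rename touches the runtime base once instead of all 18 presets.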
Copilot AI review requested due to automatic review settings March 27, 2026 09:09

Copilot AI left a comment

Pull request overview

Copilot reviewed 29 out of 29 changed files in this pull request and generated 3 comments.

hhk7734 and others added 3 commits March 27, 2026 18:24
…8 MoE-EP8 preset

Each worker in the DP8 configuration needs all 8 GPUs on the node
for expert-parallel execution, not just 1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ote-code warm-up in PV guide

Install transformers alongside huggingface_hub so the model-manager
pod can resolve dynamic modules directly. Remove the now-unnecessary
HF_HOME/HF_MODULES_CACHE exports since the defaults align with the
mount path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…roduct-team section in AGENTS.md

Move HF offline cache env vars from "Runtime bases define" to a new
"Utils define" section since they belong to shared utility templates,
not runtime bases. Remove the "Product team templates configure"
section as those flags are no longer preset-relevant.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@hhk7734 hhk7734 merged commit 244f2c9 into main Mar 27, 2026
3 checks passed
@hhk7734 hhk7734 deleted the MAF-19394 branch March 27, 2026 09:30