MAF-19394: feat(preset): add vLLM v0.15.1 E2E presets for H100/H200 GPUs #98

Merged
hhk7734 merged 18 commits into main from MAF-19394 on Mar 27, 2026

Conversation


@hhk7734 hhk7734 commented Mar 27, 2026

Summary

  • Add vLLM v0.15.1 E2E presets for NVIDIA H100-SXM and H200-SXM GPUs covering DeepSeek-R1, Kimi-K2.5, OpenAI GPT-OSS-120B, and GLM-4.7-Flash models
  • Include various parallelism strategies (TP, DP, EP, MoE) with E2E-validated configurations
  • Add HF_MODULES_CACHE to vllm-hf-hub-offline utility for trust-remote-code models
  • Align GPU model names (h100-sxm, h200-sxm) with NFD accelerator definitions

Test plan

  • Verify helm template renders all new presets without errors
  • Confirm preset nodeSelector values match NFD moai-accelerator.yaml labels
  • Validate preset naming convention follows AGENTS.md rules

🤖 Generated with Claude Code

hhk7734 and others added 9 commits March 27, 2026 14:11
Add 19 production-ready Odin presets for DeepSeek-R1, gpt-oss-120b,
GLM-4.7-Flash, and Kimi-K2.5 on NVIDIA H100/H200 configurations with
TP, EP, DP, and Wide-EP parallelism strategies. Include research
document with architecture analysis, memory fit calculations, and
known issues.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Change GLM-4.7-Flash image from v0.15.1 to glm5 (glm4_moe_lite arch
  not supported in bundled transformers)
- Remove gpt-oss-120b single-GPU H100 preset (OOM: 64 GiB model leaves
  no headroom for KV cache on 80 GB GPU)
- Update H100 accelerator.model label and nodeSelector from h100 to
  h100-80gb-hbm3 to match actual node labels

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
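The OOM reasoning behind removing the single-GPU gpt-oss-120b H100 preset can be made concrete with a rough headroom calculation. This is an illustrative sketch only: the 0.9 memory-utilization fraction matches vLLM's default `gpu_memory_utilization`, but the 4 GiB runtime-overhead figure is an assumption, not a measured value.

```python
def kv_cache_headroom_gib(gpu_mem_gib: float, model_gib: float,
                          gpu_mem_util: float = 0.9,
                          overhead_gib: float = 4.0) -> float:
    """Rough KV-cache headroom: usable VRAM minus weights minus an
    assumed runtime overhead (CUDA context, activations, graphs)."""
    usable = gpu_mem_gib * gpu_mem_util
    return usable - model_gib - overhead_gib

# ~64 GiB of weights on a single 80 GB H100: ~4 GiB left for KV cache,
# too little to serve meaningful context lengths.
single_gpu = kv_cache_headroom_gib(80.0, 64.0)

# Sharded TP2 across two H100s (~32 GiB weights per GPU): ~36 GiB headroom.
tp2_per_gpu = kv_cache_headroom_gib(80.0, 32.0)
print(single_gpu, tp2_per_gpu)
```

Under these assumptions the single-GPU preset simply has no room for a useful KV cache, which is why only the multi-GPU variants were kept.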
…gs and GPU models

- Move GLM-4.7-Flash presets from v0.15.1/ to glm5/ directory to match
  actual vllm/vllm-openai:glm5 image tag
- Rename H100 presets from h100 to h100-80gb-hbm3 in file names and
  metadata names to match actual node labels
- Update research doc with E2E test results and removed preset

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…gs and E2E test results

Add --compilation_config.pass_config.fuse_allreduce_rms and --enable-auto-tool-choice
to all 3 Kimi-K2.5 presets per official vLLM recipe. Update research doc with GLM-4.7-Flash
H100 TP2/TP4 PASS results, Kimi-K2.5 H100 garbled output analysis, and corrected known
issues (tokenizer issue was PVC data problem, not vLLM compatibility).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…r trust-remote-code models

Models using --trust-remote-code on read-only PVCs fail because transformers
cannot create its dynamic module directory. Adding HF_MODULES_CACHE=/tmp/hf_modules
redirects the cache to a writable path.

Also updates research doc with Kimi-K2.5 H100 retest results: garbled output
confirmed as memory pressure (not thinking mode misconfiguration) after testing
both instant and thinking modes with correct temperature/top_p per HF model card.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Preset ISVC_EXTRA_ARGS fully overrides runtime-base's value during Odin
strategic merge patch (env vars merge by name key), so logging flags from
the runtime base are lost. Add --disable-uvicorn-access-log and
--no-enable-log-requests to all 18 E2E presets and update AGENTS.md
responsibility boundaries accordingly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
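The "env vars merge by name key" behavior is standard Kubernetes strategic-merge-patch semantics for container `env` lists, and it explains why the preset's ISVC_EXTRA_ARGS silently drops the runtime base's logging flags. A minimal simulation of that merge rule:

```python
def merge_env_by_name(base: list[dict], patch: list[dict]) -> list[dict]:
    """Strategic merge patch treats container env as a list keyed by
    'name': a patch entry with the same name replaces the base entry
    wholesale rather than concatenating values."""
    merged = {e["name"]: e for e in base}
    for e in patch:
        merged[e["name"]] = e
    return list(merged.values())

base = [{"name": "ISVC_EXTRA_ARGS",
         "value": "--disable-uvicorn-access-log --no-enable-log-requests"}]
patch = [{"name": "ISVC_EXTRA_ARGS",
          "value": "--tensor-parallel-size 8"}]

# The preset value wins; the runtime base's logging flags are lost,
# which is why each preset must repeat them explicitly.
print(merge_env_by_name(base, patch))
```

This is why the fix is duplication by design: every preset carries the logging flags itself instead of relying on inheritance from the runtime base.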
FP8 KV cache causes numerical instability on H100 for Kimi-K2.5 —
output degenerates into garbled tokens after ~30 tokens. Short responses
work fine but longer generation fails in both instant and thinking modes.
Switching to --kv-cache-dtype auto (BF16) resolves the issue completely.

Previous failures were compounded by incomplete model files on PVC
(only 22/64 weight shards). After re-downloading all 64 shards and
changing to BF16 KV cache, both modes produce correct output.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
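The incomplete-download failure mode (22/64 weight shards on the PVC) is easy to detect before serving, since Hugging Face checkpoints name shards with a `model-NNNNN-of-MMMMM.safetensors` convention. A hypothetical pre-flight check along those lines:

```python
import re

def missing_shards(filenames: list[str]) -> set[str]:
    """Given the files present in a HF checkpoint directory, return the
    shard names that are missing, inferred from the
    model-NNNNN-of-MMMMM.safetensors naming convention."""
    pat = re.compile(r"model-(\d{5})-of-(\d{5})\.safetensors")
    present, total = set(), 0
    for name in filenames:
        m = pat.fullmatch(name)
        if m:
            present.add(int(m.group(1)))
            total = int(m.group(2))
    return {f"model-{i:05d}-of-{total:05d}.safetensors"
            for i in range(1, total + 1) if i not in present}

# A partially downloaded checkpoint (2 of 4 shards) is flagged:
print(missing_shards([
    "model-00001-of-00004.safetensors",
    "model-00003-of-00004.safetensors",
    "config.json",
]))
```

Running a check like this in the model-manager pod would have separated the "incomplete files" failure from the genuine FP8 KV-cache instability much earlier.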
…ntations

- Fix naming convention: vllm-<image-tag>-... pattern, h100-80gb-hbm3 GPU identifier
- Fix GLM preset names: vllm-v0.15.1- → vllm-glm5- prefix
- Remove --enable-expert-parallel from ISVC_EXTRA_ARGS blocks; EP is controlled
  via spec.parallelism.expert: true in Odin spec, not the vLLM flag directly
- Fix Kimi-K2.5 H100 preset: --kv-cache-dtype fp8 → auto (matches actual file)
- Add explicit spec.parallelism blocks to EP/DP preset documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…finitions

Rename h100-80gb-hbm3 to h100-sxm and h200 to h200-sxm in preset
filenames, labels, and nodeSelectors to match the models defined in
moai-accelerator.yaml. Remove research doc superseded by preset files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@hhk7734 hhk7734 requested a review from a team as a code owner March 27, 2026 05:18
Copilot AI review requested due to automatic review settings March 27, 2026 05:18
@hhk7734 hhk7734 changed the title feat(preset): add vLLM v0.15.1 E2E presets for H100/H200 GPUs MAF-19394: feat(preset): add vLLM v0.15.1 E2E presets for H100/H200 GPUs Mar 27, 2026

Copilot AI left a comment

Pull request overview

Adds a set of Helm preset templates for vLLM-based E2E deployments on NVIDIA H100-SXM and H200-SXM, plus small supporting updates to utilities and preset documentation to accommodate trust-remote-code models and clarify responsibility boundaries.

Changes:

  • Add vLLM v0.15.1 E2E preset templates for DeepSeek-R1, Kimi-K2.5, and OpenAI GPT-OSS-120B across H100-SXM/H200-SXM with multiple parallelism strategies.
  • Add GLM-4.7-Flash E2E preset templates under the glm5 image tag for H100-SXM/H200-SXM.
  • Update the vllm-hf-hub-offline utility to set HF_MODULES_CACHE, and clarify in deploy/helm/AGENTS.md that logging args must be included in presets due to env-var merge semantics.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 11 comments.

Summary per file:

  • deploy/helm/moai-inference-preset/templates/utils/vllm-hf-hub-offline.helm.yaml: Adds HF_MODULES_CACHE for offline HF usage with trust-remote-code models.
  • deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/openai-gpt-oss-120b-nvidia-h200-sxm-tp2-moe-tp2.helm.yaml: New GPT-OSS-120B H200-SXM TP2 preset.
  • deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/openai-gpt-oss-120b-nvidia-h200-sxm-1.helm.yaml: New GPT-OSS-120B H200-SXM 1-GPU preset.
  • deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/openai-gpt-oss-120b-nvidia-h100-sxm-tp8-moe-tp8.helm.yaml: New GPT-OSS-120B H100-SXM TP8 preset.
  • deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/openai-gpt-oss-120b-nvidia-h100-sxm-tp4-moe-tp4.helm.yaml: New GPT-OSS-120B H100-SXM TP4 preset.
  • deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/openai-gpt-oss-120b-nvidia-h100-sxm-tp2-moe-tp2.helm.yaml: New GPT-OSS-120B H100-SXM TP2 preset.
  • deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/moonshotai-kimi-k2.5-nvidia-h200-sxm-tp8-moe-tp8.helm.yaml: New Kimi-K2.5 H200-SXM TP8 MoE(TP) preset.
  • deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/moonshotai-kimi-k2.5-nvidia-h200-sxm-tp8-moe-ep8.helm.yaml: New Kimi-K2.5 H200-SXM TP8 + expert-parallel preset.
  • deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/moonshotai-kimi-k2.5-nvidia-h100-sxm-tp8-moe-ep8.helm.yaml: New Kimi-K2.5 H100-SXM TP8 + expert-parallel preset.
  • deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/deepseek-ai-deepseek-r1-nvidia-h200-sxm-tp8-moe-tp8.helm.yaml: New DeepSeek-R1 H200-SXM TP8 MoE(TP) preset.
  • deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/deepseek-ai-deepseek-r1-nvidia-h200-sxm-tp8-moe-ep8.helm.yaml: New DeepSeek-R1 H200-SXM TP8 + expert-parallel preset.
  • deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/deepseek-ai-deepseek-r1-nvidia-h200-sxm-dp8-moe-ep8.helm.yaml: New DeepSeek-R1 H200-SXM DP8 + expert-parallel preset.
  • deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/deepseek-ai-deepseek-r1-nvidia-h200-sxm-dp16-moe-ep16.helm.yaml: New DeepSeek-R1 H200-SXM DP16 + expert-parallel preset.
  • deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/deepseek-ai-deepseek-r1-nvidia-h100-sxm-dp16-moe-ep16.helm.yaml: New DeepSeek-R1 H100-SXM DP16 + expert-parallel preset.
  • deploy/helm/moai-inference-preset/templates/presets/vllm/glm5/zai-org-glm-4.7-flash-nvidia-h200-sxm-tp2-moe-tp2.helm.yaml: New GLM-4.7-Flash H200-SXM TP2 preset (glm5 image).
  • deploy/helm/moai-inference-preset/templates/presets/vllm/glm5/zai-org-glm-4.7-flash-nvidia-h200-sxm-1.helm.yaml: New GLM-4.7-Flash H200-SXM 1-GPU preset (glm5 image).
  • deploy/helm/moai-inference-preset/templates/presets/vllm/glm5/zai-org-glm-4.7-flash-nvidia-h100-sxm-tp4-moe-tp4.helm.yaml: New GLM-4.7-Flash H100-SXM TP4 preset (glm5 image).
  • deploy/helm/moai-inference-preset/templates/presets/vllm/glm5/zai-org-glm-4.7-flash-nvidia-h100-sxm-tp2-moe-tp2.helm.yaml: New GLM-4.7-Flash H100-SXM TP2 preset (glm5 image).
  • deploy/helm/moai-inference-preset/templates/presets/vllm/glm5/zai-org-glm-4.7-flash-nvidia-h100-sxm-1.helm.yaml: New GLM-4.7-Flash H100-SXM 1-GPU preset (glm5 image).
  • deploy/helm/AGENTS.md: Documents logging-arg responsibility and explains env-var merge behavior.

hhk7734 and others added 6 commits March 27, 2026 16:49
…ta.name

All H200 presets had the -sxm suffix in filenames and labels but not in
metadata.name, inconsistent with H100 presets which correctly included it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… offline cache guidance

Clarify that offline HF cache env vars (HF_HOME, HF_HUB_OFFLINE,
HF_MODULES_CACHE) belong in *-hf-hub-offline runtime base templates.
Add root-level guidance on pre-populating HF_MODULES_CACHE for
air-gapped trust_remote_code deployments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ine utility templates

Change HF_MODULES_CACHE from /tmp/hf_modules to /mnt/models/modules so
inference pods read pre-populated remote-code modules from the shared
volume. Add the variable to dp and pp offline variants for consistency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
HF_MODULES_CACHE now belongs in the *-hf-hub-offline runtime base
templates, not in individual presets. Remove it from the three
Phi-mini-MoE quickstart presets to follow the updated responsibility
boundary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…warm-up to PV guide

Add HF_MODULES_CACHE=/mnt/models/modules to the model manager pod and
offline template examples. Document the warm-up step that pre-populates
dynamic module sources for trust_remote_code models on the shared volume.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…resets and runtime bases

Presets declare spec.parallelism values and labels; runtime bases
assemble the actual CLI flags from those values. Update the
responsibility boundaries section to reflect this split accurately.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
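The preset/runtime-base split described above can be sketched as a pure function: presets contribute declarative parallelism values, and the runtime base turns them into CLI flags. The flag names (`--tensor-parallel-size`, `--data-parallel-size`, `--enable-expert-parallel`) are real vLLM options, but the mapping function itself is an illustrative assumption about the runtime-base logic, not code from this repository.

```python
def parallelism_flags(spec: dict) -> list[str]:
    """Illustrative runtime-base logic: render a preset's declarative
    spec.parallelism values into vLLM CLI flags. Presets never write
    these flags directly (e.g. --enable-expert-parallel was removed
    from ISVC_EXTRA_ARGS in favor of spec.parallelism.expert)."""
    flags: list[str] = []
    if spec.get("tensor", 1) > 1:
        flags += ["--tensor-parallel-size", str(spec["tensor"])]
    if spec.get("data", 1) > 1:
        flags += ["--data-parallel-size", str(spec["data"])]
    if spec.get("expert"):
        flags.append("--enable-expert-parallel")
    return flags

# A DP8 + expert-parallel preset (e.g. DeepSeek-R1 dp8-moe-ep8):
print(parallelism_flags({"data": 8, "expert": True}))
```

Keeping flag assembly in one place means a vLLM flag rename touches the runtime base once instead of all 18 presets.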
Copilot AI review requested due to automatic review settings March 27, 2026 09:09

Copilot AI left a comment

Pull request overview

Copilot reviewed 29 out of 29 changed files in this pull request and generated 3 comments.

hhk7734 and others added 3 commits March 27, 2026 18:24
…8 MoE-EP8 preset

Each worker in the DP8 configuration needs all 8 GPUs on the node
for expert-parallel execution, not just 1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ote-code warm-up in PV guide

Install transformers alongside huggingface_hub so the model-manager
pod can resolve dynamic modules directly. Remove the now-unnecessary
HF_HOME/HF_MODULES_CACHE exports since the defaults align with the
mount path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…roduct-team section in AGENTS.md

Move HF offline cache env vars from "Runtime bases define" to a new
"Utils define" section since they belong to shared utility templates,
not runtime bases. Remove the "Product team templates configure"
section as those flags are no longer preset-relevant.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@hhk7734 hhk7734 merged commit 244f2c9 into main Mar 27, 2026
3 checks passed
@hhk7734 hhk7734 deleted the MAF-19394 branch March 27, 2026 09:30