Conversation
Add 19 production-ready Odin presets for DeepSeek-R1, gpt-oss-120b, GLM-4.7-Flash, and Kimi-K2.5 on NVIDIA H100/H200 configurations with TP, EP, DP, and Wide-EP parallelism strategies. Include research document with architecture analysis, memory fit calculations, and known issues. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Change GLM-4.7-Flash image from v0.15.1 to glm5 (glm4_moe_lite arch not supported in bundled transformers)
- Remove gpt-oss-120b single-GPU H100 preset (OOM: the 64 GiB model leaves no headroom for KV cache on an 80 GB GPU)
- Update H100 accelerator.model label and nodeSelector from h100 to h100-80gb-hbm3 to match actual node labels

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…gs and GPU models

- Move GLM-4.7-Flash presets from v0.15.1/ to glm5/ directory to match the actual vllm/vllm-openai:glm5 image tag
- Rename H100 presets from h100 to h100-80gb-hbm3 in file names and metadata names to match actual node labels
- Update research doc with E2E test results and removed preset

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…gs and E2E test results

Add --compilation_config.pass_config.fuse_allreduce_rms and --enable-auto-tool-choice to all 3 Kimi-K2.5 presets per the official vLLM recipe. Update research doc with GLM-4.7-Flash H100 TP2/TP4 PASS results, Kimi-K2.5 H100 garbled output analysis, and corrected known issues (the tokenizer issue was a PVC data problem, not a vLLM compatibility problem).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…r trust-remote-code models

Models using --trust-remote-code on read-only PVCs fail because transformers cannot create its dynamic module directory. Adding HF_MODULES_CACHE=/tmp/hf_modules redirects the cache to a writable path. Also updates the research doc with Kimi-K2.5 H100 retest results: garbled output confirmed as memory pressure (not thinking mode misconfiguration) after testing both instant and thinking modes with correct temperature/top_p per the HF model card.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
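The fix described above can be sketched as a small env addition in the utility template. This is an illustrative fragment only; the surrounding template structure is assumed, not copied from the actual file (and a later commit in this PR moves the path to /mnt/models/modules):

```yaml
# Sketch: redirect the transformers dynamic-module cache to a writable
# location, since the model PVC is mounted read-only. Path as of this
# commit; template structure assumed.
env:
  - name: HF_MODULES_CACHE
    value: /tmp/hf_modules
```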
Preset ISVC_EXTRA_ARGS fully overrides runtime-base's value during Odin strategic merge patch (env vars merge by name key), so logging flags from the runtime base are lost. Add --disable-uvicorn-access-log and --no-enable-log-requests to all 18 E2E presets and update AGENTS.md responsibility boundaries accordingly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
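The override behavior described above follows from Kubernetes strategic-merge semantics for env lists: entries merge by their `name` key, and a patched entry replaces the base entry wholesale. A minimal sketch (not the actual Odin merge code) illustrates why the preset must restate the logging flags:

```python
def merge_env_by_name(base, patch):
    """Mimic strategic-merge semantics for container env lists:
    entries merge by their 'name' key, and a patched entry replaces
    the base entry wholesale (values are not concatenated)."""
    merged = {e["name"]: dict(e) for e in base}
    for e in patch:
        merged[e["name"]] = dict(e)  # whole entry replaced, not merged
    return list(merged.values())

# Runtime base defines logging flags; preset defines parallelism flags.
base = [{"name": "ISVC_EXTRA_ARGS",
         "value": "--disable-uvicorn-access-log --no-enable-log-requests"}]
preset = [{"name": "ISVC_EXTRA_ARGS",
           "value": "--tensor-parallel-size 8"}]

result = merge_env_by_name(base, preset)
# The base's logging flags are gone; the preset must restate them.
```

This is why every preset's ISVC_EXTRA_ARGS has to carry the logging flags itself rather than inheriting them from the runtime base.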
FP8 KV cache causes numerical instability on H100 for Kimi-K2.5 — output degenerates into garbled tokens after ~30 tokens. Short responses work fine but longer generation fails in both instant and thinking modes. Switching to --kv-cache-dtype auto (BF16) resolves the issue completely. Previous failures were compounded by incomplete model files on PVC (only 22/64 weight shards). After re-downloading all 64 shards and changing to BF16 KV cache, both modes produce correct output. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
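The resulting preset change can be sketched as the following env fragment. The flag itself (`--kv-cache-dtype auto`) is a real vLLM engine argument; the surrounding structure and companion flags are assumed for illustration, not copied from the actual preset file:

```yaml
# Sketch of the Kimi-K2.5 H100 preset args after the fix:
# BF16 KV cache ("auto") instead of fp8, which degraded on H100.
- name: ISVC_EXTRA_ARGS
  value: >-
    --kv-cache-dtype auto
    --disable-uvicorn-access-log
    --no-enable-log-requests
```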
…ntations

- Fix naming convention: vllm-<image-tag>-... pattern, h100-80gb-hbm3 GPU identifier
- Fix GLM preset names: vllm-v0.15.1- → vllm-glm5- prefix
- Remove --enable-expert-parallel from ISVC_EXTRA_ARGS blocks; EP is controlled via spec.parallelism.expert: true in the Odin spec, not the vLLM flag directly
- Fix Kimi-K2.5 H100 preset: --kv-cache-dtype fp8 → auto (matches the actual file)
- Add explicit spec.parallelism blocks to EP/DP preset documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…finitions

Rename h100-80gb-hbm3 to h100-sxm and h200 to h200-sxm in preset filenames, labels, and nodeSelectors to match the models defined in moai-accelerator.yaml. Remove the research doc superseded by the preset files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Adds a set of Helm preset templates for vLLM-based E2E deployments on NVIDIA H100-SXM and H200-SXM, plus small supporting updates to utilities and preset documentation to accommodate trust-remote-code models and clarify responsibility boundaries.
Changes:
- Add vLLM v0.15.1 E2E preset templates for DeepSeek-R1, Kimi-K2.5, and OpenAI GPT-OSS-120B across H100-SXM/H200-SXM with multiple parallelism strategies.
- Add GLM-4.7-Flash E2E preset templates under the `glm5` image tag for H100-SXM/H200-SXM.
- Update the `vllm-hf-hub-offline` utility to set `HF_MODULES_CACHE`, and clarify in `deploy/helm/AGENTS.md` that logging args must be included in presets due to env-var merge semantics.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| deploy/helm/moai-inference-preset/templates/utils/vllm-hf-hub-offline.helm.yaml | Adds HF_MODULES_CACHE for offline HF usage with trust-remote-code models. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/openai-gpt-oss-120b-nvidia-h200-sxm-tp2-moe-tp2.helm.yaml | New GPT-OSS-120B H200-SXM TP2 preset. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/openai-gpt-oss-120b-nvidia-h200-sxm-1.helm.yaml | New GPT-OSS-120B H200-SXM 1-GPU preset. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/openai-gpt-oss-120b-nvidia-h100-sxm-tp8-moe-tp8.helm.yaml | New GPT-OSS-120B H100-SXM TP8 preset. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/openai-gpt-oss-120b-nvidia-h100-sxm-tp4-moe-tp4.helm.yaml | New GPT-OSS-120B H100-SXM TP4 preset. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/openai-gpt-oss-120b-nvidia-h100-sxm-tp2-moe-tp2.helm.yaml | New GPT-OSS-120B H100-SXM TP2 preset. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/moonshotai-kimi-k2.5-nvidia-h200-sxm-tp8-moe-tp8.helm.yaml | New Kimi-K2.5 H200-SXM TP8 MoE(TP) preset. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/moonshotai-kimi-k2.5-nvidia-h200-sxm-tp8-moe-ep8.helm.yaml | New Kimi-K2.5 H200-SXM TP8 + expert-parallel preset. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/moonshotai-kimi-k2.5-nvidia-h100-sxm-tp8-moe-ep8.helm.yaml | New Kimi-K2.5 H100-SXM TP8 + expert-parallel preset. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/deepseek-ai-deepseek-r1-nvidia-h200-sxm-tp8-moe-tp8.helm.yaml | New DeepSeek-R1 H200-SXM TP8 MoE(TP) preset. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/deepseek-ai-deepseek-r1-nvidia-h200-sxm-tp8-moe-ep8.helm.yaml | New DeepSeek-R1 H200-SXM TP8 + expert-parallel preset. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/deepseek-ai-deepseek-r1-nvidia-h200-sxm-dp8-moe-ep8.helm.yaml | New DeepSeek-R1 H200-SXM DP8 + expert-parallel preset. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/deepseek-ai-deepseek-r1-nvidia-h200-sxm-dp16-moe-ep16.helm.yaml | New DeepSeek-R1 H200-SXM DP16 + expert-parallel preset. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.15.1/deepseek-ai-deepseek-r1-nvidia-h100-sxm-dp16-moe-ep16.helm.yaml | New DeepSeek-R1 H100-SXM DP16 + expert-parallel preset. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/glm5/zai-org-glm-4.7-flash-nvidia-h200-sxm-tp2-moe-tp2.helm.yaml | New GLM-4.7-Flash H200-SXM TP2 preset (glm5 image). |
| deploy/helm/moai-inference-preset/templates/presets/vllm/glm5/zai-org-glm-4.7-flash-nvidia-h200-sxm-1.helm.yaml | New GLM-4.7-Flash H200-SXM 1-GPU preset (glm5 image). |
| deploy/helm/moai-inference-preset/templates/presets/vllm/glm5/zai-org-glm-4.7-flash-nvidia-h100-sxm-tp4-moe-tp4.helm.yaml | New GLM-4.7-Flash H100-SXM TP4 preset (glm5 image). |
| deploy/helm/moai-inference-preset/templates/presets/vllm/glm5/zai-org-glm-4.7-flash-nvidia-h100-sxm-tp2-moe-tp2.helm.yaml | New GLM-4.7-Flash H100-SXM TP2 preset (glm5 image). |
| deploy/helm/moai-inference-preset/templates/presets/vllm/glm5/zai-org-glm-4.7-flash-nvidia-h100-sxm-1.helm.yaml | New GLM-4.7-Flash H100-SXM 1-GPU preset (glm5 image). |
| deploy/helm/AGENTS.md | Documents logging-arg responsibility and explains env-var merge behavior. |
…ta.name

All H200 presets had the -sxm suffix in filenames and labels but not in metadata.name, unlike the H100 presets, which correctly included it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… offline cache guidance

Clarify that offline HF cache env vars (HF_HOME, HF_HUB_OFFLINE, HF_MODULES_CACHE) belong in *-hf-hub-offline runtime base templates. Add root-level guidance on pre-populating HF_MODULES_CACHE for air-gapped trust_remote_code deployments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ine utility templates

Change HF_MODULES_CACHE from /tmp/hf_modules to /mnt/models/modules so inference pods read pre-populated remote-code modules from the shared volume. Add the variable to the dp and pp offline variants for consistency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
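After this change, the offline utility's env block can be sketched roughly as below. The three variable names come from this PR; the exact template structure and the HF_HOME value are assumptions for illustration:

```yaml
# Sketch of the *-hf-hub-offline utility env (structure assumed):
env:
  - name: HF_HUB_OFFLINE
    value: "1"                  # never hit the Hub at runtime
  - name: HF_HOME
    value: /mnt/models          # assumed shared-volume mount path
  - name: HF_MODULES_CACHE
    value: /mnt/models/modules  # pre-populated remote-code modules
```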
HF_MODULES_CACHE now belongs in the *-hf-hub-offline runtime base templates, not in individual presets. Remove it from the three Phi-mini-MoE quickstart presets to follow the updated responsibility boundary. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…warm-up to PV guide

Add HF_MODULES_CACHE=/mnt/models/modules to the model manager pod and offline template examples. Document the warm-up step that pre-populates dynamic module sources for trust_remote_code models on the shared volume.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…resets and runtime bases

Presets declare spec.parallelism values and labels; runtime bases assemble the actual CLI flags from those values. Update the responsibility boundaries section to reflect this split accurately.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…8 MoE-EP8 preset

Each worker in the DP8 configuration needs all 8 GPUs on the node for expert-parallel execution, not just 1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ote-code warm-up in PV guide

Install transformers alongside huggingface_hub so the model-manager pod can resolve dynamic modules directly. Remove the now-unnecessary HF_HOME/HF_MODULES_CACHE exports since the defaults align with the mount path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…roduct-team section in AGENTS.md

Move HF offline cache env vars from "Runtime bases define" to a new "Utils define" section since they belong to shared utility templates, not runtime bases. Remove the "Product team templates configure" section as those flags are no longer preset-relevant.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- Add `HF_MODULES_CACHE` to `vllm-hf-hub-offline` utility for trust-remote-code models
- … (`h100-sxm`, `h200-sxm`) with NFD accelerator definitions

Test plan

- `helm template` renders all new presets without errors
- `nodeSelector` values match NFD `moai-accelerator.yaml` labels
- … `AGENTS.md` rules

🤖 Generated with Claude Code