Summary
The Llama 3.3 70B documentation at https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/models/llama3/llama_33_70b.html contains several errors that prevent the examples from working correctly.
Issues Found
1. Offline Serving Example - Incorrect Parameters
Location: Offline serving section for trn2.48xlarge (lines 8-16)
Current Code:
```python
llm = LLM(
    model="~/models/Llama-3.3-70B-Instruct/",
    tensor_parallel_size=64,
    max_num_seqs=1,
    max_model_len=16384,
    device="neuron",            # ❌ INCORRECT
    use_v2_block_manager=True,  # ❌ INCORRECT
    override_neuron_config={},  # ❌ INCORRECT
)
```
Problems:
- `device="neuron"` - Not needed/supported in vLLM Neuron
- `use_v2_block_manager=True` - Deprecated/incorrect parameter
- `override_neuron_config={}` - Should be `additional_config` with the proper nested structure
- Missing `dtype` specification (should be `bfloat16` for best quality)
- Missing `block_size` parameter (required when prefix caching is enabled)
- Missing `enable_prefix_caching` and `enable_chunked_prefill` flags
Corrected Code:
```python
llm = LLM(
    model="~/models/Llama-3.3-70B-Instruct/",
    tensor_parallel_size=64,
    max_num_seqs=1,
    max_model_len=16384,
    dtype="bfloat16",
    block_size=16384,
    enable_prefix_caching=False,
    enable_chunked_prefill=False,
    additional_config={
        'override_neuron_config': {
            'batch_size': 1,
            'tp_degree': 64,
            'enable_bucketing': True,
            'is_continuous_batching': True,
            'logical_nc_config': 2,
            'seq_len': 16384,
            'torch_dtype': 'bfloat16',
        }
    },
)
```
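For completeness, a minimal sketch of exercising the corrected offline example end to end; the prompt and sampling parameters below are illustrative, not taken from the documentation:

```python
from vllm import LLM, SamplingParams

# `llm` is constructed exactly as in the corrected example above.
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)
outputs = llm.generate(
    ["Explain tensor parallelism in one short paragraph."],
    sampling_params,
)
for output in outputs:
    # Each RequestOutput carries the prompt and its generated completions.
    print(output.outputs[0].text)
```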
2. Online Serving Section - Missing Similar Options
Location: Online serving section for trn2.48xlarge
Expected Content:
```bash
VLLM_NEURON_FRAMEWORK='neuronx-distributed-inference' python -m vllm.entrypoints.openai.api_server \
  --model="~/models/Llama-3.3-70B-Instruct/" \
  --tensor-parallel-size=64 \
  --max-num-seqs=1 \
  --max-model-len=16384 \
  --dtype=bfloat16 \
  --additional-config='{"override_neuron_config": {"async_mode": true, "batch_size": 1, "tp_degree": 64, "attn_block_tkg_nki_kernel_cache_update": true, "attn_block_tkg_nki_kernel_enabled": true, "attn_kernel_enabled": true, "cc_pipeline_tiling_factor": 1, "enable_bucketing": true, "fused_qkv": true, "is_continuous_batching": true, "k_cache_transposed": true, "kv_cache_tiling": false, "logical_nc_config": 2, "mlp_kernel_enabled": true, "qkv_kernel_enabled": true, "seq_len": 16384, "sequence_parallel_enabled": true, "token_generation_buckets": [256, 512, 1024, 2048, 4096, 8192, 10240, 12288, 16384], "context_encoding_buckets": [256, 512, 1024, 2048, 4096, 8192, 10240, 12288, 16384], "on_device_sampling_config": {"do_sample": true, "dynamic": true}, "torch_dtype": "bfloat16"}}' \
  --port=8080
```
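Once the server is up, it can be queried through the standard OpenAI-compatible endpoints. A minimal sketch using the `openai` Python client (the `api_key` value is a placeholder; vLLM only checks it when the server is started with `--api-key`):

```python
from openai import OpenAI

# Point the client at the vLLM server started above (port 8080).
client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

response = client.completions.create(
    # By default vLLM serves the model under the path passed to --model.
    model="~/models/Llama-3.3-70B-Instruct/",
    prompt="Summarize the benefits of continuous batching.",
    max_tokens=64,
)
print(response.choices[0].text)
```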
3. Recommended Configuration - Missing torch_dtype
Location: Both offline and online serving recommended configurations
Problem: The recommended `NeuronConfig` examples don't specify `torch_dtype`, which should be `"bfloat16"` for optimal quality with Llama 3.3 70B.
Suggested Addition:
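A minimal sketch of the change, mirroring the corrected offline example above; only the `torch_dtype` entry is new, and the remaining recommended settings are elided:

```python
'override_neuron_config': {
    # ... existing recommended settings unchanged ...
    'torch_dtype': 'bfloat16',
}
```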
Testing Status
✅ Tested on: trn2.48xlarge instance with 64 NeuronCores
✅ Model compilation: Successfully completed with corrected parameters
✅ Configuration: Using bf16 dtype with 16K sequence length
✅ Compilation time: ~4.5 minutes for all HLOs
Reference
The Qwen3 235B documentation (https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/models/qwen3/qwen3_moe_235b.html) shows the correct pattern with:
- Proper use of `additional_config` with `override_neuron_config`
- Correct `dtype` specification
- Complete online serving commands
- Proper use of `--no-enable-chunked-prefill` and `--no-enable-prefix-caching` flags
The Llama 3.3 70B documentation should follow the same pattern.
Environment
- Instance: trn2.48xlarge (64 NeuronCores)
- vLLM version: 0.13
- Neuron SDK: Latest DLC container (aws_neuronx_venv_pytorch_inference_vllm_0_13)
- Model: meta-llama/Llama-3.3-70B-Instruct