
Issue on page /libraries/nxd-inference/models/llama3/llama_33_70b.html #1279

@jimburtoft

Description


Summary

The Llama 3.3 70B documentation at https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/models/llama3/llama_33_70b.html contains several errors that prevent the examples from working correctly.

Issues Found

1. Offline Serving Example - Incorrect Parameters

Location: Offline serving section for trn2.48xlarge (lines 8-16)

Current Code:

llm = LLM(
    model="~/models/Llama-3.3-70B-Instruct/",
    tensor_parallel_size=64,
    max_num_seqs=1,
    max_model_len=16384,
    device="neuron",  # ❌ INCORRECT
    use_v2_block_manager=True,  # ❌ INCORRECT
    override_neuron_config={},  # ❌ INCORRECT
)

Problems:

  • device="neuron" - no longer needed or supported by the vLLM Neuron integration
  • use_v2_block_manager=True - deprecated parameter no longer accepted by recent vLLM releases
  • override_neuron_config={} - should be passed via additional_config with the proper structure
  • Missing dtype specification (should be bfloat16 for best quality)
  • Missing block_size parameter (required when prefix caching is enabled)
  • Missing enable_prefix_caching and enable_chunked_prefill flags

Corrected Code:

llm = LLM(
    model="~/models/Llama-3.3-70B-Instruct/",
    tensor_parallel_size=64,  # one rank per NeuronCore on trn2.48xlarge
    max_num_seqs=1,
    max_model_len=16384,
    dtype="bfloat16",
    block_size=16384,
    enable_prefix_caching=False,
    enable_chunked_prefill=False,
    additional_config={
        'override_neuron_config': {
            'batch_size': 1,           # matches max_num_seqs
            'tp_degree': 64,           # matches tensor_parallel_size
            'enable_bucketing': True,
            'is_continuous_batching': True,
            'logical_nc_config': 2,
            'seq_len': 16384,          # matches max_model_len
            'torch_dtype': 'bfloat16',
        }
    },
)
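
For a quick check of the corrected offline configuration, the resulting LLM object can be exercised as follows (a minimal sketch; the prompt and sampling values are illustrative):

from vllm import SamplingParams

# Illustrative request against the llm object constructed above
sampling_params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["What is AWS Trainium?"], sampling_params)
print(outputs[0].outputs[0].text)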

2. Online Serving Section - Missing the Corresponding Options

Location: Online serving section for trn2.48xlarge

Expected Content:

VLLM_NEURON_FRAMEWORK='neuronx-distributed-inference' python -m vllm.entrypoints.openai.api_server \
    --model="~/models/Llama-3.3-70B-Instruct/" \
    --tensor-parallel-size=64 \
    --max-num-seqs=1 \
    --max-model-len=16384 \
    --dtype=bfloat16 \
    --additional-config='{"override_neuron_config": {"async_mode": true, "batch_size": 1, "tp_degree": 64, "attn_block_tkg_nki_kernel_cache_update": true, "attn_block_tkg_nki_kernel_enabled": true, "attn_kernel_enabled": true, "cc_pipeline_tiling_factor": 1, "enable_bucketing": true, "fused_qkv": true, "is_continuous_batching": true, "k_cache_transposed": true, "kv_cache_tiling": false, "logical_nc_config": 2, "mlp_kernel_enabled": true, "qkv_kernel_enabled": true, "seq_len": 16384, "sequence_parallel_enabled": true, "token_generation_buckets": [256, 512, 1024, 2048, 4096, 8192, 10240, 12288, 16384], "context_encoding_buckets": [256, 512, 1024, 2048, 4096, 8192, 10240, 12288, 16384], "on_device_sampling_config": {"do_sample": true, "dynamic": true}, "torch_dtype": "bfloat16"}}' \
    --port=8080
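
Once the server is running, it can be smoke-tested against the standard OpenAI-compatible completions endpoint (a minimal sketch; the prompt and sampling values are placeholders, and the model field must match the --model value used to launch the server):

import requests

# Illustrative request to the server started above
resp = requests.post(
    "http://localhost:8080/v1/completions",
    json={
        "model": "~/models/Llama-3.3-70B-Instruct/",
        "prompt": "What is AWS Trainium?",
        "max_tokens": 128,
        "temperature": 0,
    },
)
print(resp.json()["choices"][0]["text"])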

3. Recommended Configuration - Missing torch_dtype

Location: Both offline and online serving recommended configurations

Problem: The recommended NeuronConfig examples don't specify torch_dtype, which should be "bfloat16" for optimal quality with Llama 3.3 70B.

Suggested Addition:
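
A minimal sketch of the change, with the other recommended keys elided:

additional_config={
    'override_neuron_config': {
        # ... existing recommended settings ...
        'torch_dtype': 'bfloat16',  # explicit dtype for best quality with Llama 3.3 70B
    }
},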

Testing Status

  • Tested on: trn2.48xlarge instance with 64 NeuronCores
  • Model compilation: completed successfully with the corrected parameters
  • Configuration: bf16 dtype with a 16K sequence length
  • Compilation time: ~4.5 minutes for all HLOs

Reference

The Qwen3 235B documentation (https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/models/qwen3/qwen3_moe_235b.html) shows the correct pattern with:

  • Proper use of additional_config with override_neuron_config
  • Correct dtype specification
  • Complete online serving commands
  • Proper use of --no-enable-chunked-prefill and --no-enable-prefix-caching flags

The Llama 3.3 70B documentation should follow the same pattern.

Environment

  • Instance: trn2.48xlarge (64 NeuronCores)
  • vLLM version: 0.13
  • Neuron SDK: Latest DLC container (aws_neuronx_venv_pytorch_inference_vllm_0_13)
  • Model: meta-llama/Llama-3.3-70B-Instruct
