Summary
The Llama 3.3 70B documentation at https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/models/llama3/llama_33_70b.html contains several errors that prevent the examples from working correctly.
Issues Found
1. Offline Serving Example - Incorrect Parameters
Location: Offline serving section for trn2.48xlarge (lines 8-16)
Current Code:
```python
llm = LLM(
    model="~/models/Llama-3.3-70B-Instruct/",
    tensor_parallel_size=64,
    max_num_seqs=1,
    max_model_len=16384,
    device="neuron",            # ❌ INCORRECT
    use_v2_block_manager=True,  # ❌ INCORRECT
    override_neuron_config={},  # ❌ INCORRECT
)
```
Problems:
- `device="neuron"` - Not needed/supported in vLLM Neuron
- `use_v2_block_manager=True` - Deprecated/incorrect parameter
- `override_neuron_config={}` - Should be `additional_config` with the proper nested structure
- Missing `dtype` specification (should be `bfloat16` for best quality)
- Missing `block_size` parameter (required when prefix caching is enabled)
- Missing `enable_prefix_caching` and `enable_chunked_prefill` flags
Corrected Code:
```python
llm = LLM(
    model="~/models/Llama-3.3-70B-Instruct/",
    tensor_parallel_size=64,
    max_num_seqs=1,
    max_model_len=16384,
    dtype="bfloat16",
    block_size=16384,
    enable_prefix_caching=False,
    enable_chunked_prefill=False,
    additional_config={
        'override_neuron_config': {
            'batch_size': 1,
            'tp_degree': 64,
            'enable_bucketing': True,
            'is_continuous_batching': True,
            'logical_nc_config': 2,
            'seq_len': 16384,
            'torch_dtype': 'bfloat16',
        }
    },
)
```
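For completeness, a minimal sketch of exercising the corrected offline example end to end; the prompt and sampling parameters below are illustrative, not taken from the documentation:

```python
from vllm import LLM, SamplingParams

# `llm` is constructed exactly as in the corrected example above.
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)
outputs = llm.generate(
    ["Explain tensor parallelism in one short paragraph."],
    sampling_params,
)
for output in outputs:
    # Each RequestOutput carries the prompt and its generated completions.
    print(output.outputs[0].text)
```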
2. Online Serving Section - Missing Similar Options
Location: Online serving section for trn2.48xlarge
Expected Content:
```bash
VLLM_NEURON_FRAMEWORK='neuronx-distributed-inference' python -m vllm.entrypoints.openai.api_server \
  --model="~/models/Llama-3.3-70B-Instruct/" \
  --tensor-parallel-size=64 \
  --max-num-seqs=1 \
  --max-model-len=16384 \
  --dtype=bfloat16 \
  --additional-config='{"override_neuron_config": {"async_mode": true, "batch_size": 1, "tp_degree": 64, "attn_block_tkg_nki_kernel_cache_update": true, "attn_block_tkg_nki_kernel_enabled": true, "attn_kernel_enabled": true, "cc_pipeline_tiling_factor": 1, "enable_bucketing": true, "fused_qkv": true, "is_continuous_batching": true, "k_cache_transposed": true, "kv_cache_tiling": false, "logical_nc_config": 2, "mlp_kernel_enabled": true, "qkv_kernel_enabled": true, "seq_len": 16384, "sequence_parallel_enabled": true, "token_generation_buckets": [256, 512, 1024, 2048, 4096, 8192, 10240, 12288, 16384], "context_encoding_buckets": [256, 512, 1024, 2048, 4096, 8192, 10240, 12288, 16384], "on_device_sampling_config": {"do_sample": true, "dynamic": true}, "torch_dtype": "bfloat16"}}' \
  --port=8080
```
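Once the server is up, it can be queried through the standard OpenAI-compatible endpoints. A minimal sketch using the `openai` Python client (the `api_key` value is a placeholder; vLLM only checks it when the server is started with `--api-key`):

```python
from openai import OpenAI

# Point the client at the vLLM server started above (port 8080).
client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

response = client.completions.create(
    # By default vLLM serves the model under the path passed to --model.
    model="~/models/Llama-3.3-70B-Instruct/",
    prompt="Summarize the benefits of continuous batching.",
    max_tokens=64,
)
print(response.choices[0].text)
```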
3. Recommended Configuration - Missing torch_dtype
Location: Both offline and online serving recommended configurations
Problem: The recommended `NeuronConfig` examples don't specify `torch_dtype`, which should be `"bfloat16"` for optimal quality with Llama 3.3 70B.
Suggested Addition:
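A minimal sketch of the change, mirroring the corrected offline example above; only the `torch_dtype` entry is new, and the remaining recommended settings are elided:

```python
'override_neuron_config': {
    # ... existing recommended settings unchanged ...
    'torch_dtype': 'bfloat16',
}
```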
Testing Status
✅ Tested on: trn2.48xlarge instance with 64 NeuronCores
✅ Model compilation: Successfully completed with corrected parameters
✅ Configuration: Using bf16 dtype with 16K sequence length
✅ Compilation time: ~4.5 minutes for all HLOs
Reference
The Qwen3 235B documentation (https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/models/qwen3/qwen3_moe_235b.html) shows the correct pattern with:
- Proper use of `additional_config` with `override_neuron_config`
- Correct `dtype` specification
- Complete online serving commands
- Proper use of `--no-enable-chunked-prefill` and `--no-enable-prefix-caching` flags
The Llama 3.3 70B documentation should follow the same pattern.
Environment
- Instance: trn2.48xlarge (64 NeuronCores)
- vLLM version: 0.13
- Neuron SDK: Latest DLC container (aws_neuronx_venv_pytorch_inference_vllm_0_13)
- Model: meta-llama/Llama-3.3-70B-Instruct