Extends NeuronLlamaForCausalLM with custom weight conversion for Baichuan2's W_pack fused QKV split and NormHead lm_head normalization. Bypasses trust_remote_code by loading config.json and safetensors directly. Validated: 54.84% greedy, 98.59% teacher-forced (TP=2, bs=1, seq=128, bf16). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SLURM job 7871 confirms contrib model passes token matching:
- Greedy match: 54.84% (351/640 tokens, >= 50% threshold)
- Teacher-forced match: 98.59% (>= 95% threshold)
- 4/10 prompts at 100% greedy match

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add Apache license header and expanded docstring to modeling file
- Add copyright header to __init__.py
- Add standard helper functions (load_neuron_config_from_compiled, create_model_for_inference) to test_model.py
- Add generation_config fixture and performance tests (TTFT, throughput)
- Use /home/ubuntu/ path convention for MODEL_PATH/COMPILED_MODEL_PATH
- Use standard __main__ block with create_model_for_inference
- Simplify README architecture section to match Llama-2 format
- Add manual run instructions to README Testing section
- Remove non-standard files: test_token_match.py, run_validation.sh, validation_7871.out

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use consistent CE/TG column table format across all contrib models. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Description
NeuronX Distributed Inference port of baichuan-inc/Baichuan2-7B-Base, a Llama-2 architecture variant with fused W_pack QKV weights and a NormHead lm_head (L2-normalized). The port handles direct loading to bypass trust_remote_code, fused QKV decomposition, and pre-normalized lm_head weight conversion.

Model Information
Model Name: Baichuan2-7B-Base
Model Architecture: Decoder-only transformer (Llama-2 variant) -- 32 layers, 32 MHA heads (head_dim=128), fused W_pack QKV, NormHead lm_head with L2 normalization
Purpose: Multilingual text generation (Chinese/English)
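The two weight conversions named above (fused W_pack decomposition and NormHead pre-normalization) can be sketched as follows. This is an illustrative sketch, not the port's actual code: the helper names `split_w_pack` and `normalize_lm_head` are invented here, and only the checkpoint shape (`W_pack.weight` is [3*H, H]) and the L2 normalization are taken from the description.

```python
import torch
import torch.nn.functional as F

def split_w_pack(w_pack: torch.Tensor):
    """Split a fused [3*H, H] QKV weight into separate Q/K/V projections."""
    hidden = w_pack.shape[1]
    assert w_pack.shape[0] == 3 * hidden, "expected fused QKV of shape [3*H, H]"
    q, k, v = torch.split(w_pack, hidden, dim=0)
    return q, k, v

def normalize_lm_head(lm_head: torch.Tensor) -> torch.Tensor:
    """Pre-apply NormHead: L2-normalize each output row of lm_head,
    so inference can use a plain (non-normalizing) lm_head."""
    return F.normalize(lm_head, dim=-1)

# Toy example with hidden size 8, so W_pack is [24, 8].
q, k, v = split_w_pack(torch.randn(24, 8))
print(q.shape, k.shape, v.shape)  # three [8, 8] projections
```

Pre-normalizing at conversion time means the compiled graph needs no custom head module; the standard Llama lm_head matmul produces NormHead-equivalent logits up to the normalization applied once offline.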
Checklist
Required Components
- test/integration/test_model.py
- src/

Optional Components
Folder Structure
Testing
Model was compiled and tested with TP=2, batch_size=1, seq_len=256, bfloat16 on trn1.32xlarge.
Test Results:
The high teacher-forced rate confirms the model is functionally correct. The lower greedy match on some prompts is due to BF16 precision: small numeric differences cause early divergence, which then cascades into entirely different generation paths.
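The two metrics reported above differ only in how the compared tokens are produced: greedy match scores free-running generation (so one early divergence costs all later positions), while teacher-forced match feeds the reference prefix at each step and scores each position independently. A minimal sketch of the shared position-wise comparison (illustrative only; the actual test harness differs):

```python
def token_match_rate(model_tokens, reference_tokens):
    """Fraction of positions where the two token sequences agree."""
    n = min(len(model_tokens), len(reference_tokens))
    if n == 0:
        return 0.0
    matches = sum(m == r for m, r in zip(model_tokens, reference_tokens))
    return matches / n

ref    = [1, 2, 3, 4, 5, 6, 7, 8]
greedy = [1, 2, 9, 9, 9, 9, 9, 9]  # diverged at position 2, never recovers
forced = [1, 2, 9, 4, 5, 6, 7, 8]  # only position 2 mismatches
print(token_match_rate(greedy, ref))  # 0.25
print(token_match_rate(forced, ref))  # 0.875
```

This is why a 98.59% teacher-forced rate can coexist with a 54.84% greedy rate: the same few per-token disagreements are amortized in the first metric but compounded in the second.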
Compatibility
Tested with:
Additional Information
- Fused QKV: W_pack.weight is [3*H, H], split into separate projections during weight conversion.
- Bypasses trust_remote_code by loading config.json and safetensors directly, adding missing Llama-required keys.

Related Issues
N/A
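The trust_remote_code bypass described in Additional Information amounts to reading config.json directly and filling in the Llama-style keys the Baichuan2 config omits. A hedged sketch of that key-filling step; the specific defaults below are assumptions for illustration, not the port's actual mapping:

```python
import json

# Llama-style keys Baichuan2's config.json may omit. These particular
# keys and defaults are assumed for illustration only.
LLAMA_DEFAULTS = {
    "num_key_value_heads": None,  # MHA model: falls back to num_attention_heads
    "rope_theta": 10000.0,
    "attention_bias": False,
}

def to_llama_config(baichuan_config: dict) -> dict:
    """Return a copy of the raw config with missing Llama-required keys added."""
    cfg = dict(baichuan_config)
    for key, default in LLAMA_DEFAULTS.items():
        cfg.setdefault(key, default)
    if cfg["num_key_value_heads"] is None:
        cfg["num_key_value_heads"] = cfg["num_attention_heads"]
    return cfg

# Simulate reading a pared-down config.json.
raw = json.loads('{"num_attention_heads": 32, "hidden_size": 4096}')
print(to_llama_config(raw)["num_key_value_heads"])  # 32
```

Loading config.json and the safetensors shards directly, instead of importing the repo's modeling code, is what removes the need for trust_remote_code=True.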
vLLM Integration
By submitting this PR, I confirm that: