
Add git-base contrib model port #85

Open
dhwanw wants to merge 6 commits into main from contrib/git-base

Conversation


@dhwanw dhwanw commented Mar 18, 2026

Summary

  • Adds a NeuronX Distributed Inference implementation of microsoft/git-base (Generative Image-to-text Transformer)
  • Compiles both the CLIP vision encoder and the text decoder on Neuron as separate NEFFs
  • Validated with a 95.13% teacher-forced token match on 5 random COCO images (20 tokens each)
  • Profiled at 714.4 tok/s on trn1.32xlarge (TP=1)

Model Details

  • Architecture: Vision-language model (BERT-style text decoder + CLIP ViT-B/16)
  • Parameters: ~130M (text) + ~86M (vision)
  • TP Degree: 1
  • Precision: FP32

Validation (Vision+Text)

  • Teacher-forced match: 95.13% (5 random COCO images, 20 tokens each)
  • Greedy match: 46.41% (2/5 images at 100%; cascading divergence on the others)
  • Throughput: 714.4 tok/s (TP=1, BS=1, seq_len=128)
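For context, a teacher-forced match like the 95.13% figure above is typically computed by feeding both models the same golden prefix at every step and comparing their greedy next-token picks position by position. A minimal sketch (the helper name and list-based interface are hypothetical, not from this PR's test code):

```python
def teacher_forced_match(ref_tokens, test_tokens):
    """Percentage of positions where the candidate model's greedy pick
    agrees with the reference, given identical (teacher-forced) prefixes."""
    assert len(ref_tokens) == len(test_tokens), "sequences must align"
    hits = sum(r == t for r, t in zip(ref_tokens, test_tokens))
    return 100.0 * hits / len(ref_tokens)
```

Unlike the greedy match, this metric cannot cascade: one wrong token does not corrupt later comparisons, which is why the teacher-forced number is much higher than the free-running greedy number.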

Files

  • contrib/models/git-base/src/modeling_git.py — Text-only decoder implementation
  • contrib/models/git-base/src/modeling_git_vision.py — Full vision+text implementation
  • contrib/models/git-base/test/ — Integration tests
  • contrib/models/git-base/README.md — Documentation with COCO captioning demo

🤖 Generated with Claude Code

dhwanw and others added 6 commits March 16, 2026 21:33
Port microsoft/git-base text decoder for NeuronX Distributed Inference.
BERT-style architecture with post-LN residual blocks, absolute position
embeddings, and embedding LayerNorm. Vision encoder runs on CPU.

Validated: 100% token match, TTFT 1.28ms, 728 tok/s throughput.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
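The post-LN residual structure mentioned in the commit message can be sketched as follows (a hypothetical illustration of the BERT-style block, not code from this port):

```python
import numpy as np

def layer_norm(x, eps=1e-12):
    """Normalize over the last (hidden) dimension."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # Post-LN (BERT-style): normalize AFTER adding the residual,
    # in contrast to the pre-LN ordering common in newer decoders.
    return layer_norm(x + sublayer(x))
```

Each attention and feed-forward sublayer in a post-LN decoder follows this add-then-normalize pattern, so the block output is always normalized.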
Not needed — validation instructions are in README.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Both CLIP vision encoder and text decoder now compile on Neuron
via NeuronBaseForImageToText. Key fix: text position embeddings
are offset by num_image_tokens to match HF behavior where text
positions start from 0 regardless of vision token count.

Token match: 95.13% teacher-forced (5 random images, 20 tokens each).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
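The position-embedding offset described in the commit above can be sketched like this (the helper name and list-based interface are hypothetical, shown only to illustrate the indexing):

```python
def text_position_ids(num_image_tokens, seq_len):
    """Position-embedding ids for the text portion of a fused
    [image_tokens | text_tokens] sequence of length seq_len.
    Text positions restart at 0 after the image prefix, matching
    the HF GIT convention the commit describes."""
    return [i - num_image_tokens for i in range(num_image_tokens, seq_len)]
```

Without this offset, the first text token would be assigned position `num_image_tokens` and look up the wrong embedding row, diverging from the HF golden model.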
Tested with real COCO val2017 photos (cats, skateboarding, bus,
kitchen) plus synthetic images. Neuron captions match or closely
track HF golden model on all test images.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove text-only validation section (not the real use case)
- Keep vision+text results: 95.13% TF / 46.41% greedy on 5 COCO images
- Remove token_match_vision_results.json data artifact

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dhwanw dhwanw marked this pull request as ready for review March 19, 2026 22:45
