
Add git-base contrib model port #85

Open
dhwanw wants to merge 6 commits into main from contrib/git-base

Conversation


@dhwanw dhwanw commented Mar 18, 2026

Summary

  • Adds a NeuronX Distributed Inference implementation of microsoft/git-base (Generative Image-to-text Transformer)
  • Compiles both the CLIP vision encoder and the text decoder on Neuron as separate NEFFs
  • Validated with a 95.13% teacher-forced token match on 5 random COCO images (20 tokens each)
  • Profiled at 714.4 tok/s on trn1.32xlarge (TP=1)

Model Details

  • Architecture: Vision-language model (BERT-style text decoder + CLIP ViT-B/16)
  • Parameters: ~130M (text) + ~86M (vision)
  • TP Degree: 1
  • Precision: FP32

Validation (Vision+Text)

  • Teacher-forced match: 95.13% (5 random COCO images, 20 tokens each)
  • Greedy match: 46.41% (2/5 images at 100%; cascading divergence on the others)
  • Throughput: 714.4 tok/s (TP=1, BS=1, seq_len=128)
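For context, a teacher-forced match like the 95.13% figure above is typically computed by feeding both models the same golden prefix at every step and comparing their greedy next-token picks position by position. A minimal sketch (the helper name and list-based interface are hypothetical, not from this PR's test code):

```python
def teacher_forced_match(ref_tokens, test_tokens):
    """Percentage of positions where the candidate model's greedy pick
    agrees with the reference, given identical (teacher-forced) prefixes."""
    assert len(ref_tokens) == len(test_tokens), "sequences must align"
    hits = sum(r == t for r, t in zip(ref_tokens, test_tokens))
    return 100.0 * hits / len(ref_tokens)
```

Unlike the greedy match, this metric cannot cascade: one wrong token does not corrupt later comparisons, which is why the teacher-forced number is much higher than the free-running greedy number.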

Files

  • contrib/models/git-base/src/modeling_git.py — Text-only decoder implementation
  • contrib/models/git-base/src/modeling_git_vision.py — Full vision+text implementation
  • contrib/models/git-base/test/ — Integration tests
  • contrib/models/git-base/README.md — Documentation with COCO captioning demo

🤖 Generated with Claude Code

dhwanw and others added 6 commits March 16, 2026 21:33
Port microsoft/git-base text decoder for NeuronX Distributed Inference.
BERT-style architecture with post-LN residual blocks, absolute position
embeddings, and embedding LayerNorm. Vision encoder runs on CPU.

Validated: 100% token match, TTFT 1.28ms, 728 tok/s throughput.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
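The post-LN residual structure mentioned in the commit message can be sketched as follows (a hypothetical illustration of the BERT-style block, not code from this port):

```python
import numpy as np

def layer_norm(x, eps=1e-12):
    """Normalize over the last (hidden) dimension."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # Post-LN (BERT-style): normalize AFTER adding the residual,
    # in contrast to the pre-LN ordering common in newer decoders.
    return layer_norm(x + sublayer(x))
```

Each attention and feed-forward sublayer in a post-LN decoder follows this add-then-normalize pattern, so the block output is always normalized.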
Not needed — validation instructions are in README.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Both CLIP vision encoder and text decoder now compile on Neuron
via NeuronBaseForImageToText. Key fix: text position embeddings
are offset by num_image_tokens to match HF behavior where text
positions start from 0 regardless of vision token count.

Token match: 95.13% teacher-forced (5 random images, 20 tokens each).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
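The position-embedding offset described in the commit above can be sketched like this (the helper name and list-based interface are hypothetical, shown only to illustrate the indexing):

```python
def text_position_ids(num_image_tokens, seq_len):
    """Position-embedding ids for the text portion of a fused
    [image_tokens | text_tokens] sequence of length seq_len.
    Text positions restart at 0 after the image prefix, matching
    the HF GIT convention the commit describes."""
    return [i - num_image_tokens for i in range(num_image_tokens, seq_len)]
```

Without this offset, the first text token would be assigned position `num_image_tokens` and look up the wrong embedding row, diverging from the HF golden model.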
Tested with real COCO val2017 photos (cats, skateboarding, bus,
kitchen) plus synthetic images. Neuron captions match or closely
track HF golden model on all test images.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove text-only validation section (not the real use case)
- Keep vision+text results: 95.13% TF / 46.41% greedy on 5 COCO images
- Remove token_match_vision_results.json data artifact

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dhwanw dhwanw marked this pull request as ready for review March 19, 2026 22:45
