
refactor(vLLM): Move video support from example to backend #7663

Merged
rmccorm4 merged 15 commits into main from rmccormick/vllm-video
Apr 2, 2026
Conversation


@rmccorm4 rmccorm4 commented Mar 27, 2026

Overview:

  • replace model-name allowlists with capability-driven vision loading and multimodal handling
  • add native video_url loading in the standard TokensPrompt multi_modal_data flow
  • move the video agg/disagg launch scripts under examples/backends/vllm and update docs/tests
  • remove old LLaVA video model support for simplicity, until explicitly requested

Details:

Quick Benchmark: Dynamo vs vllm serve for Video Inference

I ran a quick apples-to-apples comparison between Dynamo aggregated mode (examples/backends/vllm/launch/video_agg.sh) and plain vllm serve, both serving Qwen/Qwen3-VL-2B-Instruct on the same machine and GPU configuration.

Dynamo command:

bash examples/backends/vllm/launch/video_agg.sh \
    --model Qwen/Qwen3-VL-2B-Instruct

vllm serve command:

vllm serve \
    --model Qwen/Qwen3-VL-2B-Instruct \
    --served-model-name Qwen/Qwen3-VL-2B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 8192 \
    --allowed-local-media-path / \
    --limit-mm-per-prompt '{"video":1}' \
    --media-io-kwargs '{"video":{"num_frames":32}}'
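Either deployment accepts the same OpenAI-compatible chat request with a video_url content part. A minimal sketch of such a payload is below; the file URL is a placeholder (local file:// URLs work here because of the --allowed-local-media-path flag above), and only one video per prompt is allowed per the --limit-mm-per-prompt setting.

```python
# Example /v1/chat/completions payload with a `video_url` content part.
# The video URL below is a placeholder, not a real file from this PR.
payload = {
    "model": "Qwen/Qwen3-VL-2B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what happens in this video."},
                {
                    "type": "video_url",
                    "video_url": {"url": "file:///data/sample.mp4"},  # placeholder
                },
            ],
        }
    ],
    "max_tokens": 256,
}

# Content part types in order of appearance:
print([part["type"] for part in payload["messages"][0]["content"]])  # ['text', 'video_url']
```

Sending it is then an ordinary POST to the server, e.g. `requests.post("http://localhost:8000/v1/chat/completions", json=payload)`.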

Benchmark command:

aiperf profile \
    --model Qwen/Qwen3-VL-2B-Instruct \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --url localhost:8000 \
    --video-width 640 \
    --video-height 480 \
    --video-fps 4 \
    --video-duration 5.0 \
    --video-format mp4 \
    --video-codec libx264 \
    --request-count 20 \
    --concurrency 1 \
    --osl 1200 \
    --osl-stddev 0 \
    --extra-inputs '{"ignore_eos": true, "min_tokens": 1200}' \
    --use-server-token-count \
    --ui none \
    --no-server-metrics \
    --no-gpu-telemetry

Both runs completed successfully with identical prompt/completion lengths:

  • Average ISL: 962
  • Average OSL: 1200
  • Success rate: 20/20
Deployment      Concurrency   Avg latency   Req/s    Output tok/s   Benchmark duration
vLLM serve      1             6459.49 ms    0.1548   185.72         129.23 s
Dynamo + vLLM   1             6483.57 ms    0.1542   185.02         129.71 s
vLLM serve      2             7295.91 ms    0.2740   328.85         72.98 s
Dynamo + vLLM   2             7341.10 ms    0.2724   326.82         73.43 s
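The table is internally consistent: with a fixed OSL of 1200 tokens per request, output token throughput should equal requests/sec times 1200, and it does to within rounding for every row.

```python
# Sanity check on the benchmark table above: output tok/s ~= req/s * OSL (1200).
rows = [
    ("vLLM serve",    1, 0.1548, 185.72),
    ("Dynamo + vLLM", 1, 0.1542, 185.02),
    ("vLLM serve",    2, 0.2740, 328.85),
    ("Dynamo + vLLM", 2, 0.2724, 326.82),
]

OSL = 1200  # fixed output sequence length per request

for name, concurrency, req_s, tok_s in rows:
    predicted = req_s * OSL
    # Reported throughput agrees with the predicted value to within rounding.
    assert abs(predicted - tok_s) < 0.5, (name, concurrency)
    print(f"{name} (c={concurrency}): {predicted:.2f} predicted vs {tok_s} reported tok/s")
```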

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Future

  • Can probably condense the image and video bash scripts into a single vision script, as the Qwen3-VL model should work for both cases

Summary by CodeRabbit

Release Notes

  • New Features

    • Video input support now integrated into vLLM multimodal backend with configurable frame sampling
    • Unified video serving available for both aggregated and disaggregated inference modes
  • Documentation

    • Updated multimodal documentation to reflect video support in vLLM backend
    • Added launch examples for video-enabled deployments
  • Chores

    • Removed legacy video encoding components
    • Updated example configurations to use standardized video infrastructure

@github-actions github-actions Bot added refactor documentation Improvements or additions to documentation backend::vllm Relates to the vllm backend multimodal labels Mar 27, 2026


rmccorm4 commented Apr 1, 2026

/ok to test 3057755

@rmccorm4 rmccorm4 marked this pull request as ready for review April 2, 2026 01:15
@rmccorm4 rmccorm4 requested a review from a team as a code owner April 2, 2026 01:15

@krishung5 krishung5 left a comment


LGTM! Left two minor comments which are not blocking. Great to see the quick benchmark result for the video pipeline.

Comment thread components/src/dynamo/common/tests/multimodal/test_video_loader.py
Comment thread examples/multimodal/utils/model.py
@github-actions github-actions Bot added the backend::sglang Relates to the sglang backend label Apr 2, 2026
Comment thread components/src/dynamo/common/multimodal/video_loader.py
@rmccorm4 rmccorm4 enabled auto-merge (squash) April 2, 2026 23:22
@rmccorm4 rmccorm4 merged commit 4791aaa into main Apr 2, 2026
91 of 92 checks passed
@rmccorm4 rmccorm4 deleted the rmccormick/vllm-video branch April 2, 2026 23:22
nealvaidya added a commit that referenced this pull request Apr 7, 2026
Refactor AudioLoader to delegate to vLLM's MediaConnector + AudioMediaIO,
matching the VideoLoader pattern from PR #7663. Returns (waveform, sample_rate)
tuples at native sample rate so vLLM's model-specific MultiModalDataParser
handles resampling and normalization downstream.

Integrate AudioLoader into BaseWorkerHandler._extract_multimodal_data() so
audio_url content parts flow through to vLLM's engine for omni models
(Qwen3-Omni, Nemotron Omni).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
nealvaidya added a commit that referenced this pull request Apr 7, 2026
Delete examples/multimodal/utils/audio_loader.py — the backend
AudioLoader in components/ now handles all audio loading. Update the
example encode worker import to use the components package.

Matches the pattern from PR #7663 which removed the example video loader
when video support moved into the backend.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
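The (waveform, sample_rate) contract described in the commit above can be sketched as follows. This is a stub for illustration only — the real AudioLoader delegates decoding to vLLM's MediaConnector + AudioMediaIO; the stub simply synthesizes a tone to show the tuple shape that flows downstream to vLLM's model-specific MultiModalDataParser, which handles resampling and normalization.

```python
# Illustrative stub of the AudioLoader output contract: (waveform, sample_rate)
# at the native sample rate, with resampling left to the downstream parser.
import numpy as np


def load_audio_stub(duration_s: float = 1.0, sample_rate: int = 44_100):
    """Stand-in for the real loader: returns a (waveform, sample_rate) tuple."""
    t = np.linspace(0.0, duration_s, int(duration_s * sample_rate), endpoint=False)
    waveform = np.sin(2 * np.pi * 440.0 * t).astype(np.float32)  # 440 Hz tone
    return waveform, sample_rate


waveform, sr = load_audio_stub()
# Downstream, the tuple is passed through unchanged in multi_modal_data:
multi_modal_data = {"audio": [(waveform, sr)]}
print(waveform.shape, sr)
```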