# [Data] Apply DataProto to vLLM Inference & Align API with SGLang #967
wheresmyhair merged 2 commits into `main`
## Conversation

**Code review** found 1 issue: a docstring/example mismatch between `LMFlow/examples/vllm_inference.py` (lines 40 to 45) and `LMFlow/src/lmflow/models/hf_model_mixin.py` (lines 559 to 581) at `dee43cf`. Review prepared by @Jingyuan-zhu.

Re-checked at `7408430`: the docstring/example mismatch flagged above is fully fixed.
## Overview

This PR applies `DataProto` to the vLLM inference pipeline, aligning its API with the SGLang inferencer introduced in #960 (Unified data exchange protocol across modules). This unifies data exchange across inference engines and modernizes the vLLM integration.

## Detailed Description
### DataProto integration

- `VLLMInferencer` now returns `DataProto` instead of `list[VLLMInferenceResultWithInput]`, with prompts in `non_tensor_batch["inputs"]` and generated text in `non_tensor_batch["outputs"]`.
- `prepare_inputs_for_inference` creates `DataProto` for both SGLang and vLLM through a unified code path.
- `__vllm_inference` in `HFDecoderModel` extracts prompts and sampling params from `DataProto`, converts them to `vllm.SamplingParams`, and stores outputs back into the proto.
- Results can be persisted with `DataProto.save_to_disk` / `load_from_disk`; `inference_results_path` now accepts a directory, and results are automatically saved as `inference_results.pkl` inside it (a consumption sketch follows this list).
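For orientation, here is a minimal consumption sketch. It assumes a verl-style `DataProto` as introduced in #960; the import path and the `from_dict` constructor below are assumptions for illustration, not confirmed LMFlow API.

```python
import numpy as np

from lmflow.utils.data_utils import DataProto  # hypothetical import path

# Build a proto shaped like the inferencer's output: prompts and
# generations live in non_tensor_batch under "inputs" and "outputs".
results = DataProto.from_dict(non_tensors={
    "inputs": np.array(["What is 2+2?"], dtype=object),
    "outputs": np.array(["4"], dtype=object),
})

for prompt, completion in zip(results.non_tensor_batch["inputs"],
                              results.non_tensor_batch["outputs"]):
    print(f"{prompt!r} -> {completion!r}")

# Round-trip to disk. Note that the inferencer's inference_results_path
# accepts a directory and writes inference_results.pkl inside it; the
# explicit file path here is for the raw save/load calls.
results.save_to_disk("outputs/inference_results.pkl")
restored = DataProto.load_from_disk("outputs/inference_results.pkl")
```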
### API alignment with SGLang and modernization

- `VLLMInferencer` now mirrors `SGLangInferencer`.
- Removed the `InferencerWithOffloading` base class and all Ray-based distributed inference code: vLLM >= 0.8 supports `data_parallel_size` natively in `vllm.LLM()`, using a multiprocessing backend with no Ray dependency (see the engine-setup sketch after this list).
- Added the `--inference_data_parallel_size` argument; the total GPU count is `tensor_parallel_size × data_parallel_size`.
- Removed `use_beam_search` from sampling params (dropped in vLLM V1) and added a deprecation warning.
- Simplified `deactivate_model_for_inference`: the old cleanup code referenced `llm_engine.model_executor.driver_worker`, which no longer exists in V1.
- Added `--inference_max_model_len` to cap context length (prompt and output) for models with large defaults.
- Bumped the vLLM requirement from `>=0.4.3` to `>=0.8.0` in `setup.py`.
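The removal of the Ray launcher rests on vLLM's native data-parallel support. Below is a sketch of the modernized engine setup under that assumption; the model path and parallel sizes are placeholders, not values from this PR.

```python
from vllm import LLM, SamplingParams

# Per this PR, vLLM >= 0.8 accepts data_parallel_size directly in
# vllm.LLM(), using a multiprocessing backend with no Ray dependency.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    tensor_parallel_size=2,
    data_parallel_size=2,  # --inference_data_parallel_size; 2 x 2 = 4 GPUs total
    max_model_len=4096,    # --inference_max_model_len caps prompt + output length
)

# use_beam_search was dropped in vLLM V1 and is no longer forwarded.
sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
outputs = llm.generate(["Explain DataProto in one sentence."], sampling)
print(outputs[0].outputs[0].text)
```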
## Files changed

- `src/lmflow/pipeline/vllm_inferencer.py`
- `src/lmflow/models/hf_decoder_model.py`
- `src/lmflow/models/hf_model_mixin.py`
- `src/lmflow/args.py`
- `src/lmflow/pipeline/sglang_inferencer.py`
- `src/lmflow/pipeline/utils/memory_safe_vllm_inference.py`
- `examples/vllm_inference.py`
- `scripts/run_vllm_inference.sh`
- `scripts/run_sglang_inference.sh`
- `setup.py`
- `tests/pipeline/test_vllm_inferencer.py`
## Downstream impact

- `MemorySafeVLLMInferencer` is updated to return `DataProto`.
- `iterative_dpo_aligner.py` consumes `MemorySafeVLLMInferencer` and will need a separate update to handle `DataProto` instead of `list[VLLMInferenceResultWithInput]` (see the migration sketch below).
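A hedged migration sketch for such consumers; the old per-item access in the comment is an assumption for illustration, and only the `DataProto` access pattern comes from this PR.

```python
from lmflow.utils.data_utils import DataProto  # hypothetical import path

def consume(results: DataProto) -> None:
    """Consume the new MemorySafeVLLMInferencer output.

    Before: results was a list[VLLMInferenceResultWithInput],
    iterated item by item. After: prompts and generations are
    columnar fields on the proto.
    """
    for prompt, completion in zip(results.non_tensor_batch["inputs"],
                                  results.non_tensor_batch["outputs"]):
        print(prompt, "->", completion)  # placeholder for real handling
```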
## Tests

- Ran `scripts/run_vllm_inference.sh` end-to-end with the target model.