Skip to content

Commit 3ddb397

Browse files
committed
Chane the vllm-module.ipynb output error ,vllm_app.
1 parent 0fc22a9 commit 3ddb397

4 files changed

Lines changed: 423 additions & 0 deletions

File tree

modules/src/vllm_module/item.yaml

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,16 @@
1+
apiVersion: v1
2+
categories:
3+
- genai
4+
description: Deploys a vLLM OpenAI-compatible LLM server as an MLRun application runtime, with configurable GPU usage, node selection, tensor parallelism, and runtime flags.
5+
example: vllm_module.ipynb
6+
generationDate: 2025-12-17:12-25
7+
hidden: false
8+
labels:
9+
author: Iguazio
10+
mlrunVersion: 1.10.0
11+
name: vllm_module
12+
spec:
13+
filename: vllm_module.py
14+
image: mlrun/mlrun
15+
kind: generic
16+
version: 1.0.0
Lines changed: 35 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,35 @@
1+
# Copyright 2025 Iguazio
2+
#
3+
# Licensed under the Apache License, Version 2.0 (the "License");
4+
# you may not use this file except in compliance with the License.
5+
# You may obtain a copy of the License at
6+
#
7+
# http://www.apache.org/licenses/LICENSE-2.0
8+
#
9+
# Unless required by applicable law or agreed to in writing, software
10+
# distributed under the License is distributed on an "AS IS" BASIS,
11+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12+
# See the License for the specific language governing permissions and
13+
# limitations under the License.
14+
#
15+
16+
from vllm_module import VLLMModule
17+
import mlrun
18+
19+
20+
class TestVllmModule:
21+
"""Test suite for VLLMModule class."""
22+
23+
def setup_method(self):
24+
project = mlrun.new_project("vllm", save=False)
25+
26+
# if your VLLMModule requires node_selector as keyword-only, keep it here
27+
self.TestVllmModule = VLLMModule(
28+
project,
29+
node_selector={"alpha.eksctl.io/nodegroup-name": "added-gpu"},
30+
)
31+
32+
def test_vllm_module(self):
33+
assert (
34+
type(self.TestVllmModule.vllm_app) == mlrun.runtimes.nuclio.application.application.ApplicationRuntime
35+
)
Lines changed: 234 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,234 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"id": "7d551647-dfc2-47da-bc8a-3792af622073",
6+
"metadata": {},
7+
"source": [
8+
"# vLLM Module with MLRun\n",
9+
"\n",
10+
"This notebook shows how to configure and deploy a vLLM OpenAI compatible server as an MLRun application runtime, then showcases how to send a chat request to it to the vLLM server."
11+
]
12+
},
13+
{
14+
"cell_type": "code",
15+
"execution_count": 1,
16+
"id": "7707b270-30cc-448a-a828-cb93aa28030d",
17+
"metadata": {},
18+
"outputs": [],
19+
"source": [
20+
"import mlrun\n"
21+
]
22+
},
23+
{
24+
"cell_type": "markdown",
25+
"id": "d5cff681-bfdf-4468-a1d1-2aeadb56065e",
26+
"metadata": {},
27+
"source": [
28+
"## Prerequisite\n",
29+
"* At lease one GPU is required for running this notebook."
30+
]
31+
},
32+
{
33+
"cell_type": "markdown",
34+
"id": "d5c84798-289f-4b4f-8c1b-f4dd12a3bda5",
35+
"metadata": {},
36+
"source": [
37+
"## What this notebook does\n",
38+
"\n",
39+
"In this notebook we will:\n",
40+
"\n",
41+
"- Create or load an **MLRun project**\n",
42+
"- Import a custom **vLLM module** from the MLRun Hub\n",
43+
"- Deploy a **vLLM OpenAI-compatible server** as an MLRun application runtime\n",
44+
"- Configure deployment parameters such as model, GPU count, memory, node selector, port, and log level\n",
45+
"- Invoke the deployed service using the `/v1/chat/completions` endpoint\n",
46+
"- Parse the response and extract only the assistant’s generated text\n",
47+
"\n",
48+
"By the end of this notebook, you will have a working vLLM deployment that can be queried directly from a Jupyter notebook using OpenAI-style APIs.\n",
49+
"\n",
50+
"For more information about [vLLM documentation](https://docs.vllm.ai/en/latest/serving/openai_compatible_server/)"
51+
]
52+
},
53+
{
54+
"cell_type": "markdown",
55+
"id": "879ca641-ee35-4682-9995-4eb319d89090",
56+
"metadata": {},
57+
"source": [
58+
"## 1. Create an MLRun project\n",
59+
"\n",
60+
"In this section we create or load an MLRun project that will own the deployed vLLM application runtime."
61+
]
62+
},
63+
{
64+
"cell_type": "code",
65+
"execution_count": null,
66+
"id": "6eac263a-17d1-4454-9e19-459dfbe2f231",
67+
"metadata": {},
68+
"outputs": [],
69+
"source": [
70+
"project = mlrun.get_or_create_project(name=\"vllm-module\", context=\"\", user_project=True)"
71+
]
72+
},
73+
{
74+
"cell_type": "markdown",
75+
"id": "da49d335-b704-4fb6-801f-4d07b64f9be6",
76+
"metadata": {},
77+
"source": [
78+
"## 2. Import the vLLM module from the MLRun Hub\n",
79+
"\n",
80+
"In this section we import the vLLM module from the MLRun Hub so we can instantiate `VLLMModule` and deploy it as an application runtime."
81+
]
82+
},
83+
{
84+
"cell_type": "code",
85+
"execution_count": null,
86+
"id": "e6d89dee-db58-4c0c-8009-b37020c9599a",
87+
"metadata": {},
88+
"outputs": [],
89+
"source": [
90+
"vllm = mlrun.import_module(\"hub://vllm-module\")"
91+
]
92+
},
93+
{
94+
"cell_type": "markdown",
95+
"id": "1202ddd5-0ce7-4769-be29-8fc264c1f80e",
96+
"metadata": {},
97+
"source": [
98+
"## 3. Deploy the vLLM application runtime\n",
99+
"\n",
100+
"Configure the vLLM deployment parameters and deploy the application.\n",
101+
"\n",
102+
"The returned address is the service URL for the application runtime."
103+
]
104+
},
105+
{
106+
"cell_type": "code",
107+
"execution_count": null,
108+
"id": "e433123a-e64b-4a7a-8c7f-8165bcdcc6d1",
109+
"metadata": {},
110+
"outputs": [],
111+
"source": [
112+
"# Initialize the vLLM app\n",
113+
"vllm_module = vllm.VLLMModule(\n",
114+
" project=project,\n",
115+
" node_selector={\"alpha.eksctl.io/nodegroup-name\": \"added-gpu\"},\n",
116+
" name=\"qwen-vllm\",\n",
117+
" image=\"vllm/vllm-openai:latest\",\n",
118+
" model=\"Qwen/Qwen2.5-Omni-3B\",\n",
119+
" gpus=1,\n",
120+
" mem=\"10G\",\n",
121+
" port=8000,\n",
122+
" dtype=\"auto\",\n",
123+
" uvicorn_log_level=\"info\",\n",
124+
" max_tokens = 501,\n",
125+
")\n",
126+
"\n",
127+
"# Deploy the vLLM app\n",
128+
"addr = vllm_module.vllm_app.deploy(with_mlrun=True)\n",
129+
"addr"
130+
]
131+
},
132+
{
133+
"cell_type": "markdown",
134+
"id": "06832de3-5c31-43bf-b07b-0e71fb2d072d",
135+
"metadata": {},
136+
"source": [
137+
"## 4. Get the runtime handle\n",
138+
"\n",
139+
"Fetch the runtime object and invoke the service using `app.invoke(...)`."
140+
]
141+
},
142+
{
143+
"cell_type": "code",
144+
"execution_count": null,
145+
"id": "102d3fd0-1ee6-49b8-8c86-df742ac1c559",
146+
"metadata": {},
147+
"outputs": [],
148+
"source": [
149+
"# Optional: get_runtime() method uses to get the MLRun application runtime\n",
150+
"app = vllm_module.get_runtime()"
151+
]
152+
},
153+
{
154+
"cell_type": "markdown",
155+
"id": "925730c1-0ac5-454b-8fb2-ab8cebb3f3ac",
156+
"metadata": {},
157+
"source": [
158+
"## 5. Send a chat request for testing\n",
159+
"\n",
160+
"Call the OpenAI compatible endpoint `/v1/chat/completions`, parse the JSON response, and print only the assistant message text."
161+
]
162+
},
163+
{
164+
"cell_type": "code",
165+
"execution_count": 28,
166+
"id": "31bc78d4-1c6f-439c-b894-1522e3a6d3e6",
167+
"metadata": {},
168+
"outputs": [],
169+
"source": [
170+
"body = {\n",
171+
" \"model\": vllm_module.model,\n",
172+
" \"messages\": [{\"role\": \"user\", \"content\": \"what are the 3 countries with the most gpu as far as you know\"}],\n",
173+
" \"max_tokens\": vllm_module.max_tokens, # start smaller for testing\n",
174+
"}\n",
175+
"\n",
176+
"resp = app.invoke(path=\"/v1/chat/completions\", body=body)"
177+
]
178+
},
179+
{
180+
"cell_type": "code",
181+
"execution_count": 22,
182+
"id": "a459d5f8-dad0-4735-94c2-3801d4f94bb5",
183+
"metadata": {},
184+
"outputs": [
185+
{
186+
"name": "stdout",
187+
"output_type": "stream",
188+
"text": [
189+
"\n",
190+
"assistant:\n",
191+
"\n",
192+
"As of the most commonly cited estimates, the three countries with the largest GPU capacity for AI workloads are the United States, China, and India.\n"
193+
]
194+
}
195+
],
196+
"source": [
197+
"data = resp\n",
198+
"assistant_text = data[\"choices\"][0][\"message\"][\"content\"]\n",
199+
"\n",
200+
"print(\"\\nassistant:\\n\")\n",
201+
"print(assistant_text.strip())"
202+
]
203+
},
204+
{
205+
"cell_type": "code",
206+
"execution_count": null,
207+
"id": "957b5d21-7ade-4131-9100-878652c477fc",
208+
"metadata": {},
209+
"outputs": [],
210+
"source": []
211+
}
212+
],
213+
"metadata": {
214+
"kernelspec": {
215+
"display_name": "mlrun-base",
216+
"language": "python",
217+
"name": "conda-env-mlrun-base-py"
218+
},
219+
"language_info": {
220+
"codemirror_mode": {
221+
"name": "ipython",
222+
"version": 3
223+
},
224+
"file_extension": ".py",
225+
"mimetype": "text/x-python",
226+
"name": "python",
227+
"nbconvert_exporter": "python",
228+
"pygments_lexer": "ipython3",
229+
"version": "3.9.22"
230+
}
231+
},
232+
"nbformat": 4,
233+
"nbformat_minor": 5
234+
}

0 commit comments

Comments
 (0)