vLLM Manager provides multi-instance vLLM cluster management, automatic log collection, and load balancing.
- Start vLLM: Uses official CLI (
python -m vllm.entrypoints.openai.api_server) - Send Requests: Uses official OpenAI SDK (
from openai import OpenAI) - Cluster Management: Auto start/stop, health checks, failover
- Log Collection: All instance logs automatically saved to files
- π― Multi-Instance Management: Start/stop multiple vLLM instances with one command
- π Automatic Logging: Log files named by model and port for easy identification
- π Failover: Auto-retry on other instances when request fails
- β€οΈ Health Monitoring: Continuous instance health checks
- π§ OpenAI SDK: Returns standard OpenAI client, seamless integration
- βοΈ Load Balancing: Round-robin request distribution
- Python 3.8+
- vLLM - LLM inference engine
- OpenAI SDK - API client
- Requests - HTTP client
# 1. Install vLLM
pip install vllm
# 2. Install dependencies
pip install -r requirements.txt
# Or install individually
pip install openai requestsfrom vllm_manager import VLLMCluster, VLLMInstance
# 1. Create cluster
cluster = VLLMCluster(log_dir="./vllm_logs")
# 2. Add instances
cluster.add_instance(VLLMInstance(
name="server1",
model="facebook/opt-125m",
port=8000,
gpu_memory_utilization=0.5,
))
# 3. Start all instances
cluster.start_all()
# 4. Get OpenAI client
client = cluster.get_openai_client()
# 5. Send requests (auto load-balanced)
response = client.completions.create(
model="facebook/opt-125m",
prompt="San Francisco is a",
)
print(response)
# 6. Stop cluster
cluster.stop_all()from vllm_manager import VLLMCluster, VLLMInstance
cluster = VLLMCluster()
# Add instances with different models
cluster.add_instance(VLLMInstance(
name="qwen-server",
model="Qwen/Qwen2.5-1.5B-Instruct",
port=8000,
))
cluster.add_instance(VLLMInstance(
name="llama-server",
model="meta-llama/Llama-2-7b-chat",
port=8001,
))
cluster.start_all()
# View model name for each instance
for instance in cluster.instances.values():
print(f"{instance.name}: {instance.served_model_name}")
# qwen-server: Qwen2.5-1.5B-Instruct
# llama-server: Llama-2-7b-chat
# Log files automatically include model name
# vllm_Qwen2.5-1.5B-Instruct_8000_20260227_101234.log
# vllm_Llama-2-7b-chat_8001_20260227_101235.logVLLMInstance(
name: str, # Instance name
model: str, # Model name/path
port: int = 8000, # Port
host: str = "0.0.0.0", # Host
log_dir: Optional[Path] = None,
# vLLM parameters (inherited from AsyncEngineArgs)
gpu_memory_utilization: float = 0.9,
tensor_parallel_size: int = 1,
pipeline_parallel_size: int = 1,
max_model_len: Optional[int] = None,
quantization: Optional[str] = None,
dtype: str = "auto",
# ... supports all AsyncEngineArgs parameters
)
# Properties
instance.served_model_name # Model name (last path component)
instance.base_url # http://host:port
instance.api_url # http://host:port/v1
instance.log_file # Log file pathcluster = VLLMCluster(log_dir="./vllm_logs")
cluster.add_instance(instance: VLLMInstance)
cluster.start_all()
cluster.stop_all()
cluster.health_check()
client = cluster.get_openai_client()Log files are named by model name + port + timestamp for easy identification:
./vllm_logs/
βββ vllm_manager_20260227_101234.log # Manager logs
βββ vllm_Qwen2.5-1.5B-Instruct_8000_101235.log # Qwen model
βββ vllm_Llama-2-7b-chat_8001_101236.log # Llama model
from vllm_manager import LogAggregator
aggregator = LogAggregator(log_dir="./vllm_logs")
# Get all logs
logs = aggregator.get_all_logs(limit=100)
for log in logs:
print(f"[{log.timestamp}] {log.instance}: {log.message}")
# Export to JSON
aggregator.export_json("logs.json")A: When you need to run multiple vLLM instances (different models, different GPUs), vLLM Manager provides unified cluster management and log collection.
A: All AsyncEngineArgs parameters are supported, since VLLMInstance inherits from AsyncEngineArgs.
A: Format is vllm_{model_name}_{port}_{timestamp}.log, where model_name is the last component of the model path (e.g., Qwen2.5-1.5B-Instruct).
A: Use the instance.served_model_name property:
for instance in cluster.instances.values():
print(f"{instance.name}: {instance.served_model_name}")Issues and Pull Requests are welcome!
# 1. Fork the repo
# 2. Create your branch (git checkout -b feature/AmazingFeature)
# 3. Commit your changes (git commit -m 'Add some AmazingFeature')
# 4. Push to the branch (git push origin feature/AmazingFeature)
# 5. Open a Pull RequestMIT License - See LICENSE file for details.
- Project URL: https://github.com/AiKiAi-stack/vllm_startup
- Issues: https://github.com/AiKiAi-stack/vllm_startup/issues
- Author: AiKiAi-stack
- vLLM - LLM inference engine
- OpenAI Python SDK - API client