Distributed model inference server with a router and multiple workers.
On macOS, keep the protobuf ABI aligned with the grpc and onnxruntime packages installed from Homebrew.
```bash
cmake -S . -B build
cmake --build build
```

Use the helper script after the build:

```bash
chmod +x scripts/start_cluster.sh
scripts/start_cluster.sh 2 resnet50 models/resnet50-v1-7.onnx
```

This starts:
- workers on `127.0.0.1:50052`, `127.0.0.1:50053`, ...
- router on `127.0.0.1:50051`
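As a quick sanity check that the cluster came up, a small TCP probe can confirm each port is accepting connections. This is a minimal sketch assuming the default addresses above; adjust the ports if you started a different number of workers:

```python
import socket

def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Router on 50051, workers on 50052+ (defaults from start_cluster.sh)
    for port in (50051, 50052, 50053):
        status = "up" if is_listening("127.0.0.1", port) else "down"
        print(f"127.0.0.1:{port} -> {status}")
```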
Install Python dependencies:

```bash
python -m pip install -r requirements.txt
```

Generate Python gRPC stubs (if needed):

```bash
python -m grpc_tools.protoc -I proto --python_out=. --grpc_python_out=. proto/inference.proto
```

Run the client:

```bash
python client.py --router localhost:50051 --model_id resnet50 --tokens 1.0,2.0,3.0,4.0
```

Workload used for comparison:
- model: `resnet50`
- input shape: `1 x 3 x 224 x 224`
- total requests: 200
- concurrency: 50 threads
Measured results from the latest run:
| Workers | Throughput (req/s) | Success | Errors |
|---|---|---|---|
| 1 | 10.512 | 200 | 0 |
| 3 | 22.259 | 200 | 0 |
| 5 | 36.883 | 200 | 0 |
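Speedup and per-worker efficiency follow directly from the table; a short sketch of the arithmetic:

```python
# Throughput figures from the table above (workers -> req/s).
results = {1: 10.512, 3: 22.259, 5: 36.883}
base = results[1]

for workers, rps in results.items():
    speedup = rps / base          # relative to the single-worker run
    efficiency = speedup / workers  # how much of ideal linear scaling is kept
    print(f"{workers} workers: {speedup:.2f}x speedup, {efficiency:.0%} efficiency")
```

Both multi-worker runs land around 70% per-worker efficiency: scaling is sublinear but consistent between the 3- and 5-worker configurations.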
Highlights:
- Request handling is stable (0 errors in all scenarios).
- Throughput now increases with worker count in this environment.
- Router dispatch is balanced across workers in multi-worker runs.
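The balanced-dispatch behavior can be illustrated with a minimal round-robin picker. This is a sketch of the policy only, not the router's actual implementation:

```python
from itertools import cycle

class RoundRobinPicker:
    """Cycle through worker addresses so each gets an equal share of requests."""

    def __init__(self, workers):
        self._workers = cycle(workers)

    def next_worker(self) -> str:
        return next(self._workers)

picker = RoundRobinPicker(["127.0.0.1:50052", "127.0.0.1:50053"])
print([picker.next_worker() for _ in range(4)])
# -> ['127.0.0.1:50052', '127.0.0.1:50053', '127.0.0.1:50052', '127.0.0.1:50053']
```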
To reproduce this benchmark:

```bash
python scripts/throughput_benchmark.py
```

To evaluate scaling under a latency SLO/SLA:

```bash
python scripts/scaling_benchmark.py --workers 1,3,5 --concurrency-levels 20,40,60,80 --total-requests 180 --warmup 6 --sla-p95-ms 1500
```
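For context on the `--sla-p95-ms` threshold, here is one way a p95 check can be computed from raw per-request latencies (nearest-rank percentile; the script's internals may differ):

```python
import math

def p95_ms(latencies_ms):
    """Nearest-rank 95th percentile of a list of latencies in milliseconds."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest-rank position
    return ordered[rank - 1]

# Illustrative latencies only, not measured data.
samples = [120, 135, 150, 180, 240, 260, 310, 420, 980, 1600]
print(p95_ms(samples), p95_ms(samples) <= 1500)  # 1600 False: this run misses the SLO
```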