Find the maximum queries per second (QPS) an LLM inference deployment can deliver on Alps, subject to latency constraints (e.g. TTFT p95 ≤ 500 ms). A deployment can be a single GPU (e.g. Apertus 8B), multiple GPUs on one node (e.g. Apertus 70B with TP=4), or multiple nodes (e.g. a 405B model across 4 nodes with TP=16). Workloads can be deployed on SLURM (clariden) or Kubernetes (breithorn). Everything is orchestrated from your laptop — no coordinator node is allocated on Alps.
See ARCHITECTURE.md for a full design description.
The tool runs an open-loop Poisson rate sweep: requests are sent at a fixed arrival rate λ (req/s), independent of whether previous requests have completed. As λ increases, the server queue fills and latency rises. The highest λ at which TTFT p95 stays below the SLO is the max QPS at SLO.
Why open-loop? A closed-loop design (keep N requests in-flight) artificially caps concurrency and hides queuing effects. Open-loop with Poisson arrivals faithfully reproduces what real users experience when the system approaches saturation.
For each experiment the coordinator:
- Uploads sbatch scripts, EDF files, and the load generator to /capstor/scratch/cscs/stefschu/ via FirecREST
- Submits a server job (inference server on GPU nodes) and a benchmarker job (load generator on CPU nodes)
- Polls job status every N seconds; once the server is healthy it triggers the benchmarker
- The benchmarker sweeps rate levels, writing per-request metrics to a SQLite DB on capstor
- On completion the coordinator downloads the DB and runs the next experiment
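The polling step above can be sketched as a loop with a deadline. This is an illustration only: `wait_until_ready` and the `is_healthy` callable are hypothetical names, and the defaults merely mirror the `server_ready_timeout_s` and `--poll-interval` values shown elsewhere in this README. The clock and sleep functions are injectable so the loop can be exercised without waiting.

```python
import time
from typing import Callable


def wait_until_ready(
    is_healthy: Callable[[], bool],
    timeout_s: float = 900.0,
    poll_interval_s: float = 15.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_healthy() until it returns True or timeout_s elapses.

    Hypothetical helper; the coordinator's real readiness check is not shown here.
    Returns True if the server became healthy, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if is_healthy():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

A False return would correspond to aborting the experiment because the server did not start in time.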
| Requirement | Notes |
|---|---|
| Python ≥ 3.11 | On your laptop |
| uv | Package manager |
| kubectl | Configured for the Alps K8s cluster |
| FirecREST MCP server | Running locally (see below) |
| Alps account | csstaff or similar |
# From the repo root
uv venv .venv
source .venv/bin/activate
uv pip install -e ".[analysis]"
The coordinator talks to Alps SLURM via a local FirecREST MCP server. Start it once per session:
source .venv/bin/activate
uv run firecrest-mcp/server.py
# Server listens at http://localhost:8888/mcp
Credentials are read from .env:
# .env (git-ignored)
OAUTH_CLIENT_ID=...
OAUTH_CLIENT_SECRET=...
OAUTH_TOKEN_URL=https://auth.cscs.ch/auth/realms/cscs/protocol/openid-connect/token
OAUTH_SCOPES=openid
BACKEND_API_BASE_URL=https://firecrest.cscs.ch
Register the MCP server with Claude Code (one-time setup):
claude mcp add firecrest http://localhost:8888/mcp --transport http
All benchmark parameters live in a single YAML file. The key fields:
defaults:
  account: csstaff
  partition: debug # server job partition (GPU)
  benchmarker_partition: normal # load generator partition (CPU, avoids QOS collision)
  scratch_base: /capstor/scratch/cscs/stefschu/benchmark-runs
  hf_token_path: ~/.hf_token
  hf_home: /capstor/scratch/cscs/stefschu/hf-cache
  warmup_duration_s: 30 # requests sent but not measured
  measurement_duration_s: 60 # measured window per rate level
  rate_levels: [5, 10, 20, 40, 60, 80, 100, 150] # req/s to sweep
  latency_slos:
    ttft_p95_ms: 500 # time-to-first-token SLO
    e2e_p95_ms: 10000 # end-to-end latency SLO (informational)
  gpus_per_node: 4 # GPUs per node on clariden (GH200); used to derive --nodes
  port: 8080
  server_time_limit: "01:00:00"
  benchmarker_time_limit: "01:00:00"
  max_parallel_experiments: 2 # how many (model, config) pairs run simultaneously
  server_ready_timeout_s: 900 # abort if server doesn't start within 15 min
  k8s_namespace: ml
  ingress_domain: swissai.svc.cscs.ch
dataset:
  sharegpt_fraction: 0.7 # 70% chat (ShareGPT), 30% coding (CodeAlpaca)
  num_prompts: 500 # total prompts to pre-sample
  input_length:
    distribution: normal # or: uniform, fixed
    mean: 512
    std: 200
  output_length:
    distribution: normal
    mean: 256
    std: 128
experiments:
  - model: swiss-ai/Apertus-8B-Instruct-2509
    backend: vllm # vllm | sglang | dynamo
    deployment: slurm # slurm | k8s
    backend_config:
      tensor_parallel_size: 1
      max_model_len: 32768
      gpu_memory_utilization: 0.90
# Multi-node example: 16-GPU deployment across 4 nodes (tensor_parallel_size=16,
# gpus_per_node=4 → 4 nodes allocated, Ray bootstrapped automatically)
# - model: meta-llama/Llama-3.1-405B-Instruct
# backend: vllm
# deployment: slurm
# backend_config:
# tensor_parallel_size: 16
# max_model_len: 32768
# gpu_memory_utilization: 0.90
# speculative_decoding: # optional
# draft_model: swiss-ai/Apertus-1B-Draft
# num_speculative_tokens: 5
| Parameter | distribution: normal | distribution: uniform | distribution: fixed |
|---|---|---|---|
| Required fields | mean, std | min, max | value |
| Token count | sampled from N(mean, std) | sampled from U(min, max) | exactly value |
Input tokens approximate the prompt length. Output tokens are passed as max_tokens to the server (actual output may be shorter if the model hits EOS).
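The three distribution modes in the table can be sketched as a single sampling function over a length spec. This is a hedged illustration: `sample_token_count` is a hypothetical name, the spec is assumed to be a plain dict with the fields listed above, and clamping to at least one token is an assumption (the README does not say how zero or negative normal draws are handled).

```python
import random


def sample_token_count(spec: dict, rng: random.Random) -> int:
    """Draw one token count from a length spec (hypothetical helper)."""
    kind = spec["distribution"]
    if kind == "normal":
        value = rng.gauss(spec["mean"], spec["std"])
    elif kind == "uniform":
        value = rng.uniform(spec["min"], spec["max"])
    elif kind == "fixed":
        value = spec["value"]
    else:
        raise ValueError(f"unknown distribution: {kind!r}")
    # Assumption: clamp to >= 1, since a normal draw can be zero or negative.
    return max(1, round(value))


rng = random.Random(0)
output_spec = {"distribution": "normal", "mean": 256, "std": 128}
max_tokens = sample_token_count(output_spec, rng)  # would be sent as max_tokens
```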
Why real data matters for some tests: Synthetic prompts are sufficient for throughput benchmarking. However, for speculative decoding acceptance rates and MoE expert routing benchmarks, use the real dataset mode (--no-synthetic, or omit --synthetic) so that ShareGPT and CodeAlpaca samples are downloaded and mixed. These are the same datasets used by the Medusa and EAGLE speculative decoding papers.
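As a rough sketch of the mixing step, the function below draws a fixed share of prompts from a chat pool and the rest from a code pool, in the spirit of sharegpt_fraction and num_prompts. The name `mix_prompts` is hypothetical, the pools are assumed to be already-loaded lists of strings, and sampling with replacement is an assumption made so that small pools still work; the actual prompt sampler may behave differently.

```python
import random


def mix_prompts(
    chat_pool: list[str],
    code_pool: list[str],
    num_prompts: int,
    chat_fraction: float,
    seed: int = 0,
) -> list[str]:
    """Return num_prompts prompts, chat_fraction of them from chat_pool.

    Hypothetical helper; samples with replacement (an assumption).
    """
    rng = random.Random(seed)
    n_chat = round(num_prompts * chat_fraction)
    n_code = num_prompts - n_chat
    picked = rng.choices(chat_pool, k=n_chat) + rng.choices(code_pool, k=n_code)
    # Shuffle so chat and code prompts are interleaved in the request stream.
    rng.shuffle(picked)
    return picked
```

With num_prompts=500 and chat_fraction=0.7 this selects 350 chat prompts and 150 code prompts.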
source .venv/bin/activate
benchmark run benchmark_config_test.yaml --synthetic
benchmark run benchmark_config_slurm.yaml --synthetic
Remove --synthetic to download real datasets (ShareGPT + CodeAlpaca) for the first run.
benchmark run [CONFIG] [OPTIONS]
--results-dir PATH Where to store downloaded results [default: ./results]
--poll-interval INT Seconds between coordinator polls [default: 15]
--synthetic Use synthetic prompts (no HuggingFace download)
--no-live Disable Rich live table (useful when stdout is a file)
--dry-run Render templates and validate config without submitting
When running interactively, a Rich terminal table updates every poll interval:
Inference Benchmark Sweep
Model Backend Deploy State Rate TTFT p95 QPS
────────────────────────────────────────────────────────────────────────
Apertus-8B-... vllm TP=1 slurm BENCHMARKING 40/s 127ms 38.2
Apertus-8B-... vllm TP=1 slurm BENCHMARKING 20/s 89ms 19.6
TTFT p95 turns red when it exceeds the configured SLO.
PYTHONUNBUFFERED=1 benchmark run benchmark_config_slurm.yaml \
--synthetic --no-live --poll-interval 20 \
> sweep.log 2>&1 &
tail -f sweep.log
Coordinator (laptop)
│ FirecREST MCP
├─► mkdir /capstor/.../runs/<run_id>/
├─► upload: run_config.json, prompts.json, load_generator.tar.gz, EDFs
├─► sbatch server.sbatch → Job A (GPU node, e.g. nid007656)
│ starts vLLM, writes endpoint.txt to capstor
└─► sbatch benchmarker.sbatch → Job B (CPU node, normal partition)
reads endpoint.txt, polls /health, runs rate sweep
writes results.db to capstor
│ FirecREST MCP
└─► download results.db → results/run_<id>.db
EDF files (CSCS Container Engine TOML format) are generated per experiment and referenced by absolute capstor path:
# templates/edf/vllm.toml.j2
image = "jfrog.svc.cscs.ch/docker/vllm-cxi:latest"
mounts = ["/users/stefschu", "/capstor"]
workdir = "{{ scratch_run_dir }}"
[annotations.com.hooks.aws_ofi_nccl]
enabled = "true"
variant = "cuda12" # enables HPE Slingshot for multi-node NCCL
Coordinator (laptop)
│ kubectl
├─► apply deployment.yaml, service.yaml, ingress.yaml, pvc.yaml
│ Pod pulls model from PVC (/data/huggingface)
│ Ingress: https://<run_id>.swissai.svc.cscs.ch (public, TLS)
│ FirecREST MCP
├─► upload benchmarker artifacts + endpoint.txt to capstor
└─► sbatch benchmarker.sbatch → SLURM CPU job
reads capstor endpoint.txt, hits K8s ingress over internet
writes results.db to capstor
Capstor is SLURM-only. The K8s inference pod mounts a PVC for model weights but cannot access /capstor. All capstor writes are done by the coordinator (via FirecREST) or the benchmarker SLURM job.
K8s CUDA note. The K8s nodes (breithorn) run CUDA driver 550 (12.4). The VLLM_ENABLE_CUDA_COMPATIBILITY=1 env var is injected automatically to enable CUDA Forward Compatibility so that cu129-compiled images run on the older driver. This is supported on datacenter GPUs (GH200).
The benchmarker writes progress.json to capstor every few seconds. The coordinator reads it via FirecREST and shows the current rate level and last-completed rate metrics in the live display.
To inspect the server logs from your laptop during a run:
# Find the job node
! squeue -j <JOB_ID> -h -o "%N"
# Tail server log via FirecREST (or SSH tunnel)
ssh -N -L 8080:<node>:8080 clariden.cscs.ch
curl http://localhost:8080/metrics # Prometheus metrics
curl http://localhost:8080/health
After the sweep completes, results are in results/run_<id>.db (one file per completed experiment). The coordinator merges them automatically at the end of the sweep.
# Merge all run_*.db files in results/
benchmark results merge results/
# Print max-QPS-at-SLO table
benchmark results summary results/results.db --slo 500
Example output:
Max QPS at TTFT p95 SLO
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━┳━━━━━━━━━━┳━━━━━━━━━┓
┃ Model ┃ Backend ┃ TP ┃ SLO (ms) ┃ Max QPS ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━╇━━━━━━━━━━╇━━━━━━━━━┩
│ Apertus-8B-... │ vllm │ 1 │ 500.0 │ 38.4 │
│ Apertus-8B-... │ sglang │ 1 │ 500.0 │ 41.2 │
└────────────────┴─────────┴────┴──────────┴─────────┘
source .venv/bin/activate
jupyter lab analysis/results.ipynb
The notebook produces five plot types (with the results/results.db path configurable at the top):
1. Latency-vs-rate curves: TTFT p50/p95/p99 and E2E p95 plotted against offered rate λ for each experiment. The horizontal red dashed line is the SLO. The operating point is where p95 crosses the SLO.
TTFT (ms)
500 ┤ ─ ─ ─ ─ ─ ─ ─ ─ ─ SLO ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
200 ┤ ╭──── p99
100 ┤ ╭─────╯ p95
50 ┤ ─────────────────────╯ p50
0 └────┬───────┬───────┬───────┬───────┬───────
10 20 40 60 80 100
λ (req/s)
2. Max QPS summary bar chart: one bar per experiment, showing the highest λ at which TTFT p95 ≤ SLO.
3. KV cache saturation overlay: TTFT p95 and KV cache utilisation (%) on dual axes vs λ. The saturation knee (where the cache fills and latency spikes) is visible here.
4. Speculative decoding acceptance rate: for experiments with speculative decoding enabled, shows acceptance rate vs λ. This requires real (non-synthetic) prompts to be meaningful.
5. Output token throughput: output tokens/second vs λ, i.e. the sustained throughput delivered to users.
New model: add an entry to experiments: in the config YAML. No code changes needed.
New backend: update three template files and register the image:
- Create templates/edf/<backend>.toml.j2 (container image + mounts)
- Add a branch to templates/slurm/server.sbatch.j2
- Add a branch to templates/k8s/deployment.yaml.j2
- Register the K8s image in backends/k8s.py → BACKEND_IMAGES
New Dockerfile (for CXI support): copy example-docker-images/Dockerfile and replace the FROM line. The aws-ofi-nccl build is the same for all backends.
inference-benchmarking/
├── benchmark_config.yaml # full reference config
├── benchmark_config_slurm.yaml # SLURM-only sweep with extended rates
├── benchmark_config_test.yaml # quick smoke-test (2 experiments, 3 rate levels)
├── coordinator/
│ ├── coordinator.py # CLI: benchmark run / results merge / results summary
│ ├── config.py # Pydantic config models
│ ├── experiment.py # per-experiment state machine
│ ├── display.py # Rich live terminal table
│ ├── results.py # DB merge + QPS-at-SLO analysis
│ └── prompt_sampler.py # ShareGPT + CodeAlpaca sampling (or synthetic)
├── backends/
│ ├── mcp_client.py # FirecREST MCP HTTP client
│ ├── slurm.py # SLURM backend (uploads, job submit/poll/teardown)
│ └── k8s.py # K8s backend (kubectl apply/delete)
├── load_generator/ # runs as a SLURM job on Alps
│ ├── __main__.py # entry point
│ ├── generator.py # open-loop Poisson sweep
│ ├── metrics.py # per-rate aggregation + Prometheus scrape
│ └── db.py # SQLite writer
├── templates/
│ ├── edf/ # CSCS Container Engine TOML (Jinja2)
│ ├── slurm/ # sbatch scripts (Jinja2)
│ └── k8s/ # K8s manifests (Jinja2)
├── example-docker-images/
│ ├── Dockerfile # vllm-cxi (existing, aarch64 + Slingshot)
│ ├── Dockerfile.sglang # sglang-cxi (to build)
│ └── Dockerfile.dynamo # dynamo-cxi (to build)
├── analysis/
│ └── results.ipynb # Jupyter notebook for all plots
├── firecrest-mcp/ # local FirecREST MCP server
└── example-{k8s,slurm}-deployment/ # reference deployment examples
| Symptom | Cause | Fix |
|---|---|---|
| Account parameter is required | FirecREST requires account for uploads | Set account: csstaff in config defaults |
| Requested time limit is invalid | Partition max walltime exceeded | Reduce server_time_limit (debug partition max ≈ 1 h) |
| QOSMaxJobsPerUserLimit on benchmarker | debug partition 1-job-per-user limit | Set benchmarker_partition: normal |
| K8s pod CrashLoopBackOff (CUDA driver error) | vllm-cxi:latest needs CUDA 12.6+, K8s nodes have 12.4 | VLLM_ENABLE_CUDA_COMPATIBILITY=1 is injected automatically |
| kubectl apply failed: failed to download openapi | Rancher API not reachable from laptop | --validate=false is added automatically |
| FirecREST 401 Access token expires too soon | Transient token refresh | Non-fatal; coordinator retries on next poll |
| outputSchema defined but no structured output | Large file triggered S3 transfer path | Coordinator verifies capstor file existence and proceeds |