
schuups/inference-benchmarking


Alps Inference Benchmarking

Find the maximum queries per second (QPS) an LLM inference deployment can deliver on Alps, subject to latency constraints (e.g. TTFT p95 ≤ 500 ms). A deployment can be a single GPU (e.g. Apertus 8B), multiple GPUs on one node (e.g. Apertus 70B with TP=4), or multiple nodes (e.g. a 405B model across 4 nodes with TP=16). Workloads can be deployed on SLURM (clariden) or Kubernetes (breithorn). Everything is orchestrated from your laptop — no coordinator node is allocated on Alps.

See ARCHITECTURE.md for a full design description.


How it works

The tool runs an open-loop Poisson rate sweep: requests arrive as a Poisson process with mean rate λ (req/s), so each request is sent on schedule whether or not earlier requests have completed. As λ increases, the server queue fills and latency rises. The highest λ at which TTFT p95 stays at or below the SLO is the max QPS at SLO.

Why open-loop? A closed-loop design (keep N requests in-flight) artificially caps concurrency and hides queuing effects. Open-loop with Poisson arrivals faithfully reproduces what real users experience when the system approaches saturation.

For each experiment:

  1. The coordinator uploads sbatch scripts, EDF files, and the load generator to /capstor/scratch/cscs/stefschu/ via FirecREST
  2. It submits a server job (inference server on GPU nodes) and a benchmarker job (load generator on CPU nodes)
  3. It polls job status every --poll-interval seconds (default 15); once the server is healthy it triggers the benchmarker
  4. The benchmarker sweeps the rate levels, writing per-request metrics to a SQLite DB on capstor
  5. On completion the coordinator downloads the DB and runs the next experiment

Prerequisites

| Requirement | Notes |
|---|---|
| Python ≥ 3.11 | On your laptop |
| uv | Package manager |
| kubectl | Configured for the Alps K8s cluster |
| FirecREST MCP server | Running locally (see below) |
| Alps account | csstaff or similar |

Install Python dependencies

# From the repo root
uv venv .venv
source .venv/bin/activate
uv pip install -e ".[analysis]"

Start the FirecREST MCP server

The coordinator talks to Alps SLURM via a local FirecREST MCP server. Start it once per session:

source .venv/bin/activate
uv run firecrest-mcp/server.py
# Server listens at http://localhost:8888/mcp

Credentials are read from .env:

# .env (git-ignored)
OAUTH_CLIENT_ID=...
OAUTH_CLIENT_SECRET=...
OAUTH_TOKEN_URL=https://auth.cscs.ch/auth/realms/cscs/protocol/openid-connect/token
OAUTH_SCOPES=openid
BACKEND_API_BASE_URL=https://firecrest.cscs.ch

Register the MCP server with Claude Code (one-time setup):

claude mcp add firecrest http://localhost:8888/mcp --transport http

Configuration

All benchmark parameters live in a single YAML file. The key fields:

defaults:
  account: csstaff
  partition: debug             # server job partition (GPU)
  benchmarker_partition: normal  # load generator partition (CPU, avoids QOS collision)
  scratch_base: /capstor/scratch/cscs/stefschu/benchmark-runs
  hf_token_path: ~/.hf_token
  hf_home: /capstor/scratch/cscs/stefschu/hf-cache
  warmup_duration_s: 30        # requests sent but not measured
  measurement_duration_s: 60   # measured window per rate level
  rate_levels: [5, 10, 20, 40, 60, 80, 100, 150]  # req/s to sweep
  latency_slos:
    ttft_p95_ms: 500           # time-to-first-token SLO
    e2e_p95_ms: 10000          # end-to-end latency SLO (informational)
  gpus_per_node: 4     # GPUs per node on clariden (GH200); used to derive --nodes
  port: 8080
  server_time_limit: "01:00:00"
  benchmarker_time_limit: "01:00:00"
  max_parallel_experiments: 2  # how many (model, config) pairs run simultaneously
  server_ready_timeout_s: 900  # abort if server doesn't start within 15 min
  k8s_namespace: ml
  ingress_domain: swissai.svc.cscs.ch

dataset:
  sharegpt_fraction: 0.7    # 70% chat (ShareGPT), 30% coding (CodeAlpaca)
  num_prompts: 500          # total prompts to pre-sample
  input_length:
    distribution: normal    # or: uniform, fixed
    mean: 512
    std: 200
  output_length:
    distribution: normal
    mean: 256
    std: 128

experiments:
  - model: swiss-ai/Apertus-8B-Instruct-2509
    backend: vllm            # vllm | sglang | dynamo
    deployment: slurm        # slurm | k8s
    backend_config:
      tensor_parallel_size: 1
      max_model_len: 32768
      gpu_memory_utilization: 0.90
      # speculative_decoding:     # optional
      #   draft_model: swiss-ai/Apertus-1B-Draft
      #   num_speculative_tokens: 5

  # Multi-node example: 16-GPU deployment across 4 nodes (tensor_parallel_size=16,
  # gpus_per_node=4 → 4 nodes allocated, Ray bootstrapped automatically)
  # - model: meta-llama/Llama-3.1-405B-Instruct
  #   backend: vllm
  #   deployment: slurm
  #   backend_config:
  #     tensor_parallel_size: 16
  #     max_model_len: 32768
  #     gpu_memory_utilization: 0.90
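
The config comments say that gpus_per_node is used to derive --nodes. A minimal sketch of that arithmetic, assuming the GPU count equals tensor_parallel_size and that partial nodes are rounded up. The helper `nodes_required` is hypothetical and not part of the repo.

```python
import math


def nodes_required(tensor_parallel_size: int, gpus_per_node: int = 4) -> int:
    """Number of nodes needed to host tensor_parallel_size GPUs.

    Assumption: one GPU per tensor-parallel rank, rounded up to whole nodes.
    """
    if tensor_parallel_size < 1 or gpus_per_node < 1:
        raise ValueError("both arguments must be positive integers")
    return math.ceil(tensor_parallel_size / gpus_per_node)


for tp in (1, 4, 16):
    print(f"TP={tp:>2} -> --nodes={nodes_required(tp)}")
```

With the default of 4 GPUs per node, TP=1 and TP=4 fit on one node, and TP=16 needs 4 nodes, which matches the multi-node example in the config.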

Distribution parameters

| Parameter | distribution: normal | distribution: uniform | distribution: fixed |
|---|---|---|---|
| Required fields | mean, std | min, max | value |
| Token count | sampled from N(mean, std) | sampled from U(min, max) | exactly value |

Input tokens approximate the prompt length. Output tokens are passed as max_tokens to the server (actual output may be shorter if the model hits EOS).
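
A minimal sketch of how a token count could be drawn from one of these specs. The field names follow the table above. The function `sample_token_count`, the rounding, and the lower bound of one token are assumptions made for illustration; the sampling code in the repo may handle out-of-range draws differently.

```python
import random


def sample_token_count(spec: dict, rng: random.Random) -> int:
    """Draw one token count from an input_length / output_length spec."""
    kind = spec["distribution"]
    if kind == "normal":
        value = rng.gauss(spec["mean"], spec["std"])
    elif kind == "uniform":
        value = rng.uniform(spec["min"], spec["max"])
    elif kind == "fixed":
        value = spec["value"]
    else:
        raise ValueError(f"unknown distribution: {kind!r}")
    # Assumption: round to an integer and never return fewer than 1 token,
    # since a normal draw can be zero or negative.
    return max(1, round(value))


rng = random.Random(0)
output_spec = {"distribution": "normal", "mean": 256, "std": 128}
max_tokens = sample_token_count(output_spec, rng)  # sent as max_tokens
print(max_tokens)
```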

Why real data matters for some tests: Synthetic prompts are sufficient for throughput benchmarking. However, for speculative decoding acceptance rates and MoE expert routing benchmarks, use the real dataset mode (--no-synthetic or omit --synthetic) so that ShareGPT and CodeAlpaca samples are downloaded and mixed. These are the same datasets used by the Medusa and EAGLE speculative decoding papers.


Running a benchmark sweep

Quick test (synthetic prompts, minimal rates)

source .venv/bin/activate

benchmark run benchmark_config_test.yaml --synthetic

Full sweep (realistic distributions, extended rate levels)

benchmark run benchmark_config_slurm.yaml --synthetic

Omit --synthetic to use the real datasets (ShareGPT + CodeAlpaca), which are downloaded on the first run.

Options

benchmark run [CONFIG] [OPTIONS]

  --results-dir PATH       Where to store downloaded results [default: ./results]
  --poll-interval INT      Seconds between coordinator polls [default: 15]
  --synthetic              Use synthetic prompts (no HuggingFace download)
  --no-live                Disable Rich live table (useful when stdout is a file)
  --dry-run                Render templates and validate config without submitting

Live display

When running interactively, a Rich terminal table updates every poll interval:

                     Inference Benchmark Sweep
 Model              Backend   Deploy  State        Rate   TTFT p95  QPS
────────────────────────────────────────────────────────────────────────
 Apertus-8B-...    vllm TP=1  slurm  BENCHMARKING 40/s   127ms     38.2
 Apertus-8B-...    vllm TP=1  slurm  BENCHMARKING 20/s   89ms      19.6

TTFT p95 turns red when it exceeds the configured SLO.

Background / CI mode

PYTHONUNBUFFERED=1 benchmark run benchmark_config_slurm.yaml \
  --synthetic --no-live --poll-interval 20 \
  > sweep.log 2>&1 &

tail -f sweep.log

What happens on Alps

SLURM path

Coordinator (laptop)
  │  FirecREST MCP
  ├─► mkdir /capstor/.../runs/<run_id>/
  ├─► upload: run_config.json, prompts.json, load_generator.tar.gz, EDFs
  ├─► sbatch server.sbatch        → Job A (GPU node, e.g. nid007656)
  │     starts vLLM, writes endpoint.txt to capstor
  └─► sbatch benchmarker.sbatch  → Job B (CPU node, normal partition)
        reads endpoint.txt, polls /health, runs rate sweep
        writes results.db to capstor
  │  FirecREST MCP
  └─► download results.db → results/run_<id>.db
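
The benchmarker's wait on /health can be sketched as below. This is a hedged illustration, not the code in load_generator/: the function names, the 5 s probe interval, and the reuse of server_ready_timeout_s (900 s) as the deadline are assumptions, and the base URL is passed in directly because the format of endpoint.txt is not documented here.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_health_ok(base_url: str, timeout_s: float = 5.0) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or a non-2xx status.
        return False


def wait_until_healthy(base_url: str,
                       ready_timeout_s: float = 900.0,
                       interval_s: float = 5.0,
                       probe: Callable[[str], bool] = http_health_ok,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll the server until it is healthy or the deadline passes."""
    deadline = time.monotonic() + ready_timeout_s
    while True:
        if probe(base_url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

A caller would start the rate sweep only when `wait_until_healthy("http://<node>:8080")` returns True, and abort the run otherwise. The `probe` and `sleep` parameters exist so the loop can be exercised without a live server.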

EDF files (CSCS Container Engine TOML format) are generated per experiment and referenced by absolute capstor path:

# templates/edf/vllm.toml.j2
image = "jfrog.svc.cscs.ch/docker/vllm-cxi:latest"
mounts = ["/users/stefschu", "/capstor"]
workdir = "{{ scratch_run_dir }}"

[annotations.com.hooks.aws_ofi_nccl]
enabled = "true"
variant  = "cuda12"   # enables HPE Slingshot for multi-node NCCL

K8s path

Coordinator (laptop)
  │  kubectl
  ├─► apply deployment.yaml, service.yaml, ingress.yaml, pvc.yaml
  │     Pod pulls model from PVC (/data/huggingface)
  │     Ingress: https://<run_id>.swissai.svc.cscs.ch  (public, TLS)
  │  FirecREST MCP
  ├─► upload benchmarker artifacts + endpoint.txt to capstor
  └─► sbatch benchmarker.sbatch  → SLURM CPU job
        reads capstor endpoint.txt, hits K8s ingress over internet
        writes results.db to capstor

Capstor is SLURM-only. The K8s inference pod mounts a PVC for model weights but cannot access /capstor. All capstor writes are done by the coordinator (via FirecREST) or the benchmarker SLURM job.

K8s CUDA note. The K8s nodes (breithorn) run CUDA driver 550 (12.4). The VLLM_ENABLE_CUDA_COMPATIBILITY=1 env var is injected automatically to enable CUDA Forward Compatibility so that cu129-compiled images run on the older driver. This is supported on datacenter GPUs (GH200).


Viewing results during a sweep

The benchmarker writes progress.json to capstor every few seconds. The coordinator reads it via FirecREST and shows the current rate level and last-completed rate metrics in the live display.

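To check the server's health and metrics from your laptop during a run:

# Find the node running the server job (run this on clariden, e.g. over SSH)
squeue -j <JOB_ID> -h -o "%N"

# Forward the server port to your laptop (blocks; leave it running in its own terminal)
ssh -N -L 8080:<node>:8080 clariden.cscs.ch

# In a second terminal, query the server through the tunnel
curl http://localhost:8080/metrics    # Prometheus metrics
curl http://localhost:8080/health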

Analysing results

After the sweep completes, results are in results/run_<id>.db (one file per completed experiment). The coordinator merges them automatically at the end of the sweep.
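
A rough sketch of what merging per-run SQLite files can look like, using only the standard library. The table name `requests` and the function `merge_run_dbs` are hypothetical, since the schema written by load_generator/db.py is not shown here; this is not the implementation in coordinator/results.py.

```python
import sqlite3
from pathlib import Path


def merge_run_dbs(results_dir: str, table: str = "requests") -> Path:
    """Append the rows of every run_*.db in results_dir into results.db.

    Assumption: every run file has a table named `table` with identical
    columns. Running the merge twice would duplicate rows.
    """
    results = Path(results_dir)
    merged_path = results / "results.db"
    merged = sqlite3.connect(merged_path)
    try:
        for run_db in sorted(results.glob("run_*.db")):
            merged.execute("ATTACH DATABASE ? AS run", (str(run_db),))
            # Create an empty table with the same columns on first use.
            merged.execute(
                f'CREATE TABLE IF NOT EXISTS main."{table}" AS '
                f'SELECT * FROM run."{table}" WHERE 0'
            )
            merged.execute(f'INSERT INTO main."{table}" SELECT * FROM run."{table}"')
            merged.commit()  # DETACH is not allowed inside an open transaction
            merged.execute("DETACH DATABASE run")
    finally:
        merged.close()
    return merged_path
```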

CLI summary

# Merge all run_*.db files in results/
benchmark results merge results/

# Print max-QPS-at-SLO table
benchmark results summary results/results.db --slo 500

Example output:

                    Max QPS at TTFT p95 SLO
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━┳━━━━━━━━━━┳━━━━━━━━━┓
┃ Model          ┃ Backend ┃ TP ┃ SLO (ms) ┃ Max QPS ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━╇━━━━━━━━━━╇━━━━━━━━━┩
│ Apertus-8B-... │ vllm    │ 1  │ 500.0    │    38.4 │
│ Apertus-8B-... │ sglang  │ 1  │ 500.0    │    41.2 │
└────────────────┴─────────┴────┴──────────┴─────────┘

Jupyter notebook

source .venv/bin/activate
jupyter lab analysis/results.ipynb

The notebook produces five plot types (with the results/results.db path configurable at the top):

1. Latency-vs-rate curves: TTFT p50/p95/p99 and E2E p95 plotted against offered rate λ for each experiment. The horizontal red dashed line is the SLO. The operating point is where p95 crosses the SLO.

TTFT (ms)
  500 ┤ ─ ─ ─ ─ ─ ─ ─ ─ ─ SLO ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
  200 ┤                                    ╭──── p99
  100 ┤                              ╭─────╯ p95
   50 ┤        ─────────────────────╯ p50
    0 └────┬───────┬───────┬───────┬───────┬───────
          10      20      40      60      80     100
                                        λ (req/s)

2. Max QPS summary bar chart: one bar per experiment, showing the highest λ at which TTFT p95 ≤ SLO.

3. KV cache saturation overlay: TTFT p95 and KV cache utilisation (%) on dual axes vs λ. The saturation knee (where the cache fills and latency spikes) is visible here.

4. Speculative decoding acceptance rate: for experiments with speculative decoding enabled, shows acceptance rate vs λ. This requires real (non-synthetic) prompts to be meaningful.

5. Output token throughput: output tokens/second vs λ, i.e. the sustained throughput delivered to users.


Adding a new model or backend

New model: add an entry to experiments: in the config YAML. No code changes needed.

New backend: make four changes:

  • Create templates/edf/<backend>.toml.j2 (container image + mounts)
  • Add a branch to templates/slurm/server.sbatch.j2
  • Add a branch to templates/k8s/deployment.yaml.j2
  • Register the K8s image in BACKEND_IMAGES in backends/k8s.py

New Dockerfile (for CXI support): copy example-docker-images/Dockerfile and replace the FROM line. The aws-ofi-nccl build is the same for all backends.


Repo layout

inference-benchmarking/
├── benchmark_config.yaml          # full reference config
├── benchmark_config_slurm.yaml    # SLURM-only sweep with extended rates
├── benchmark_config_test.yaml     # quick smoke-test (2 experiments, 3 rate levels)
├── coordinator/
│   ├── coordinator.py             # CLI: benchmark run / results merge / results summary
│   ├── config.py                  # Pydantic config models
│   ├── experiment.py              # per-experiment state machine
│   ├── display.py                 # Rich live terminal table
│   ├── results.py                 # DB merge + QPS-at-SLO analysis
│   └── prompt_sampler.py          # ShareGPT + CodeAlpaca sampling (or synthetic)
├── backends/
│   ├── mcp_client.py              # FirecREST MCP HTTP client
│   ├── slurm.py                   # SLURM backend (uploads, job submit/poll/teardown)
│   └── k8s.py                     # K8s backend (kubectl apply/delete)
├── load_generator/                # runs as a SLURM job on Alps
│   ├── __main__.py                # entry point
│   ├── generator.py               # open-loop Poisson sweep
│   ├── metrics.py                 # per-rate aggregation + Prometheus scrape
│   └── db.py                      # SQLite writer
├── templates/
│   ├── edf/                       # CSCS Container Engine TOML (Jinja2)
│   ├── slurm/                     # sbatch scripts (Jinja2)
│   └── k8s/                       # K8s manifests (Jinja2)
├── example-docker-images/
│   ├── Dockerfile                 # vllm-cxi (existing, aarch64 + Slingshot)
│   ├── Dockerfile.sglang          # sglang-cxi (to build)
│   └── Dockerfile.dynamo          # dynamo-cxi (to build)
├── analysis/
│   └── results.ipynb              # Jupyter notebook for all plots
├── firecrest-mcp/                 # local FirecREST MCP server
└── example-{k8s,slurm}-deployment/  # reference deployment examples

Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| Account parameter is required | FirecREST requires account for uploads | Set account: csstaff in config defaults |
| Requested time limit is invalid | Partition max walltime exceeded | Reduce server_time_limit (debug partition max ≈ 1 h) |
| QOSMaxJobsPerUserLimit on benchmarker | debug partition 1-job-per-user limit | Set benchmarker_partition: normal |
| K8s pod CrashLoopBackOff (CUDA driver error) | vllm-cxi:latest needs CUDA 12.6+, K8s nodes have 12.4 | VLLM_ENABLE_CUDA_COMPATIBILITY=1 is injected automatically |
| kubectl apply failed: failed to download openapi | Rancher API not reachable from laptop | --validate=false is added automatically |
| FirecREST 401 Access token expires too soon | Transient token refresh | Non-fatal; coordinator retries on next poll |
| outputSchema defined but no structured output | Large file triggered S3 transfer path | Coordinator verifies capstor file existence and proceeds |

About

A tool that sweeps through LLM engine configuration options to map the Pareto frontier on our infrastructure.
