Most AI clusters waste 40–60% of GPU spend. piqc finds it in one command.
Read-only · No agents · No sidecars · Nothing installed permanently · Runs as a Job, prints results, exits.
Quick Start • Features • Commands • Output Formats • Installation
Run `piqc scan` against your cluster and get an instant cost report:

```text
Discovered Inference Deployments
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Deployment ┃ Engine ┃ GPU ┃ Replicas ┃ GPU Util ┃ MFU ┃ $/1K tokens ┃ $/hr ┃ Idle $/day ┃ Tier Fit ┃ Namespace ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ meta-llama/Llama-3-70B-Inst │ vllm │ 8xH100-SXM4-80GB │ 2 │ 4% │ 3.1% │ $0.0842 │ $68.00 │ $1,566.72 │ ⚠ >A100-80GB │ production │
│ mistral-7b-instruct │ vllm │ 1xA100-SXM4-40GB │ 1 │ 11% │ 8.4% │ $0.0073 │ $2.50 │ $53.40 │ ⚠ >T4 │ production │
│ codellama-34b-staging │ vllm │ 4xH100-SXM4-80GB │ 1 │ 0% │ N/A │ N/A │ $17.00 │ $408.00 │ ⚠ >A100-40GB │ staging │
│ embedding-bge-large │ vllm │ 1xT4 │ 3 │ 82% │ N/A │ $0.0002 │ $1.35 │ $5.83 │ ✓ │ shared-services │
│ unknown-runtime-7f3a2 │ unknown │ 2xA100-SXM4-80GB │ 1 │ N/A │ N/A │ N/A │ $7.00 │ util unknown │ ? │ ml-platform │
└─────────────────────────────┴─────────┴──────────────────┴──────────┴──────────┴──────┴─────────────┴────────┴──────────────┴──────────────┴─────────────────┘
╭──────────────────────────────────── Cost Summary ──────────────────────────────────────╮
│ Total GPU spend rate : $95.85/hr │
│ │
│ Leased & idle (util <60%) : $2,033.95/day (pods running, GPUs underused) │
│ Unallocated nodes : $1,152.00/day (12 GPU(s) with no pods scheduled) │
│ Tier misplacement : $721.20/day (3 model(s) on oversized GPU tier) │
│ │
│ Total estimated leak : $3,907.15/day ($1,426,110/yr) │
│ │
│ Avg MFU (active deployments) : 15.7% (healthy range: 30–60%) │
╰────────────────────────────────────────────────────────────────────────────────────────╯
```
piqc surfaces three types of waste:
- Idle GPUs — pods running, GPUs sitting near-empty
- Tier misplacement — a 7B model on an H100 that only needs a T4
- Unallocated nodes — GPU nodes with no pods scheduled at all
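The idle figures in the sample report follow simple arithmetic: daily idle spend is the hourly rate times 24 hours times the unused fraction of the GPUs. A minimal sketch of that calculation (assuming a flat hourly rate; piqc's actual computation may differ):

```python
def idle_dollars_per_day(rate_per_hour: float, utilization: float) -> float:
    """Estimated daily spend on GPU capacity that is leased but not used."""
    return rate_per_hour * 24 * (1 - utilization)

# Consistent with the sample report: the 8xH100 Llama-3-70B deployment
# costs $68.00/hr and sits at 4% utilization.
assert round(idle_dollars_per_day(68.00, 0.04), 2) == 1566.72
```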
**Quick Start**

Option 1: run inside your cluster as a Kubernetes Job. No Docker auth or kubeconfig wrangling:

```bash
# Step 1 — Apply RBAC permissions (one-time setup)
kubectl apply -f https://raw.githubusercontent.com/paralleliq/piqc/main/deploy/rbac.yaml

# Step 2 — Run the scan
kubectl apply -f https://raw.githubusercontent.com/paralleliq/piqc/main/deploy/scan-job.yaml

# Step 3 — View the output
kubectl logs -f job/piqc-scan -n kube-system

# Clean up when done
kubectl delete job piqc-scan -n kube-system
```

The Job auto-deletes 10 minutes after it finishes (`ttlSecondsAfterFinished: 600`).
Option 2: run from your workstation with Docker:

```bash
# Export a static kubeconfig with embedded credentials
kubectl config view --raw --flatten > /tmp/piqc-kubeconfig.yaml

# Run the scan
docker run --rm \
  -v /tmp/piqc-kubeconfig.yaml:/root/.kube/config \
  ghcr.io/paralleliq/piqc:latest \
  scan --format table
```

Supports both `linux/amd64` and `linux/arm64`.
Or run from source:

```bash
git clone https://github.com/paralleliq/piqc.git
cd piqc
poetry install
poetry run piqc scan --format table
```

**Features**

- Auto-Detection: Automatically discovers vLLM inference deployments across all namespaces
- Weighted Confidence Scoring: Combines multiple signals (images, env vars, CLI args, labels) into a weighted score (see the sketch after this list)
- Framework Detection: Identifies vLLM with high accuracy using pattern matching and heuristics
- GPU Metrics: Real-time GPU utilization, memory, temperature, and power via `nvidia-smi`
- Runtime Metrics: Collects vLLM API metrics, including:
  - Request latency (P50, P95, P99)
  - Token throughput (prefill & decode)
  - KV cache utilization
  - Queue depth and active requests
  - Health status
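To make the weighted scoring idea concrete, here is a toy multi-signal scorer. The signal categories mirror the list above, but the weights and matchers are assumptions for illustration, not piqc's actual values:

```python
# Toy weighted scorer: each detection signal contributes its weight when it
# matches. The weights below are illustrative, not piqc's real ones.
SIGNALS = {
    "image":    (0.40, lambda pod: "vllm" in pod["image"]),
    "cli_args": (0.30, lambda pod: "--model" in pod["args"]),
    "env_vars": (0.20, lambda pod: any(k.startswith("VLLM_") for k in pod["env"])),
    "labels":   (0.10, lambda pod: pod["labels"].get("app") == "vllm"),
}

def vllm_confidence(pod: dict) -> float:
    return sum(weight for weight, matches in SIGNALS.values() if matches(pod))

pod = {
    "image": "vllm/vllm-openai:latest",
    "args": ["--model", "mistralai/Mistral-7B-Instruct-v0.2"],
    "env": {"VLLM_LOGGING_LEVEL": "INFO"},
    "labels": {},
}
print(round(vllm_confidence(pod), 2))  # 0.9: image, args, and env matched; labels did not
```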
Waste detection and cost reporting:

- GPU underutilization — Deployments below the 60% utilization threshold, with dollar waste per day and annualized
- Dark capacity — GPU nodes with no pods scheduled (paying for nodes sitting empty)
- Tier misplacement — Models running on an oversized GPU tier, with estimated cost delta per day
- Fragmentation — Nodes with free GPU slots too small to fit any running model
- Pending GPU pods — Workloads blocked from scheduling, shown with wait time
- Cost Summary panel — Total spend rate, all waste categories, total estimated leak per day and per year
- MFU (Model FLOPS Utilization) — Observed compute vs. theoretical GPU peak per deployment
- Cost per 1K tokens — GPU spend translated into a business metric comparable to API pricing
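Both headline metrics reduce to simple ratios. A back-of-envelope sketch (the throughput input here is an assumed measurement; piqc's internal computation may differ):

```python
def cost_per_1k_tokens(dollars_per_hour: float, tokens_per_second: float) -> float:
    """GPU spend expressed per 1,000 tokens, comparable to API pricing."""
    return dollars_per_hour / (tokens_per_second * 3600) * 1000

def mfu_percent(achieved_tflops: float, peak_tflops_per_gpu: float, gpu_count: int) -> float:
    """Model FLOPS Utilization: observed compute vs. theoretical GPU peak."""
    return achieved_tflops / (peak_tflops_per_gpu * gpu_count) * 100

# Consistent with the sample row: mistral-7b at $2.50/hr and an assumed
# ~95 tok/s works out to about $0.0073 per 1K tokens.
print(round(cost_per_1k_tokens(2.50, 95), 4))  # 0.0073
```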
Output formats:

| Format | Description |
|---|---|
| Table | Cost report with MFU, $/1K tokens, idle waste (default) |
| YAML | Kubernetes-style ModelSpec files |
| JSON | Machine-readable JSON output |
| PIQC Facts | Standardized facts bundle for control plane integration |
- Parallel Processing: Multi-threaded scanning with configurable workers (see the sketch below)
- RBAC Support: Pre-configured ClusterRole and ServiceAccount manifests
- Flexible Modes: Auto-detect, remote (kubeconfig), or in-cluster execution
- Timeout Controls: Configurable operation timeouts
- Docker Image: Pre-built multi-platform image (`linux/amd64` + `linux/arm64`) on GitHub Container Registry
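The worker pool maps naturally onto Python's standard thread pool. A sketch of the fan-out pattern implied by `--workers` (assumed structure, not piqc's actual orchestrator):

```python
from concurrent.futures import ThreadPoolExecutor

def collect_pod_metrics(pod_name: str) -> dict:
    # Placeholder: a real collector would exec nvidia-smi in the pod and
    # scrape the vLLM metrics endpoint.
    return {"pod": pod_name}

def scan_pods(pod_names: list[str], workers: int = 10) -> list[dict]:
    # One thread per in-flight pod, capped at --workers (default 10).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(collect_pod_metrics, pod_names))
```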
Roadmap:

- 🔴 AMD GPU Support: AMD Instinct and Radeon GPUs
- 🌐 LLM-D (LLM-Distributed): discovery and documentation for distributed LLM inference
**Commands**

`piqc scan`: scan your Kubernetes cluster for inference workloads and surface GPU waste.

```
piqc scan [OPTIONS]
```

Core options:

| Option | Default | Description |
|---|---|---|
| `--kubeconfig PATH` | `~/.kube/config` | Path to kubeconfig file |
| `--context TEXT` | current | Kubernetes context to use |
| `-n, --namespace TEXT` | all | Specific namespace to scan |
| `--format [yaml\|json\|table]` | `yaml` | Output format |
| `-o, --output PATH` | `./output` | Output directory for generated files |
Collection options:

| Option | Default | Description |
|---|---|---|
| `--collect-runtime` | `false` | Collect runtime metrics via vLLM API |
| `--no-exec` | `false` | Disable pod exec (skip GPU metrics) |
| `--no-logs` | `false` | Disable log reading |
| `--aggregate/--no-aggregate` | `aggregate` | Aggregate metrics across pod replicas |
| `--contribute-benchmarks` | `false` | Contribute anonymized GPU/model performance data to the Paralleliq benchmark dataset |
Output options:

| Option | Default | Description |
|---|---|---|
| `--combined` | `false` | Generate single combined output file |
| `--output-piqc` | `false` | Generate `piqc-facts.json` (PIQC v0.1 schema) |
Execution options:

| Option | Default | Description |
|---|---|---|
| `--timeout INT` | `30` | Operation timeout in seconds |
| `--workers INT` | `10` | Number of parallel workers |
| `--mode [auto\|remote\|incluster\|dry-run]` | `auto` | Execution mode |
| `-v, --verbose` | `false` | Enable verbose output |
| `--debug` | `false` | Enable debug mode with detailed trace |
Examples:

```bash
# Basic scan — discover all vLLM deployments and surface waste
piqc scan

# Scan specific namespace with JSON output
piqc scan -n production --format json

# Quick scan without GPU metrics (faster)
piqc scan --no-exec

# Collect runtime metrics from vLLM API
piqc scan --collect-runtime

# Generate PIQC facts bundle for control plane integration
piqc scan --output-piqc -o ./facts

# Table output to console (human-readable)
piqc scan --format table

# Custom kubeconfig and context
piqc scan --kubeconfig /path/to/config --context my-cluster

# Contribute anonymized GPU/model benchmarks to Paralleliq dataset
piqc scan --contribute-benchmarks
```

`piqc test-connection`: test the connection to a Kubernetes cluster and verify required permissions.
```
piqc test-connection [OPTIONS]
```

| Option | Default | Description |
|---|---|---|
| `--kubeconfig PATH` | `~/.kube/config` | Path to kubeconfig file |
| `--context TEXT` | current | Kubernetes context to use |
`piqc version`: print the piqc version.

**Output Formats**

Table: run `piqc scan --format table` (no flags required). See the output example above.
Tier Fit column:
| Symbol | Meaning |
|---|---|
| `✓` | Model is on an appropriate GPU tier for its size |
| `⚠ >T4` | Model is over-provisioned; the minimum sufficient tier is shown |
| `?` | Parameter count not parseable from model name |
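The `?` case arises because the parameter count is inferred from the model name itself. An illustrative (hypothetical) parser, not piqc's actual one:

```python
import re

def param_count_billions(model_name: str) -> float | None:
    """Extract a parameter count like '70B' or '7b' from a model name."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*[bB]\b", model_name)
    return float(match.group(1)) if match else None

print(param_count_billions("meta-llama/Llama-3-70B-Inst"))  # 70.0
print(param_count_billions("unknown-runtime-7f3a2"))        # None -> '?' in the Tier Fit column
```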
YAML: generates individual Kubernetes-style YAML files, one per deployment:
```yaml
apiVersion: modelspec/v1
kind: ModelSpec
metadata:
  name: vllm-llama-7b
  namespace: inference
  collectionTimestamp: "2024-01-07T12:00:00Z"
  collectorVersion: "1.0.0"
model:
  name: meta-llama/Llama-2-7b-hf
  architecture: llama
  parameters: "7B"
  identificationConfidence: 0.95
engine:
  name: vllm
  version: "0.4.0"
  detectionConfidence: 0.95
inference:
  precision: float16
  tensorParallelSize: 4
  maxModelLen: 4096
  gpuMemoryUtilization: 0.90
resources:
  replicas: 2
  gpuCount: 4
  gpus:
    - type: A100-SXM4-80GB
      memoryTotal: "80GB"
      utilization: 87
      memoryUsed: 72000
runtimeState:
  vllm:
    healthStatus: healthy
    kvCacheUsagePercent: 45.2
    avgPromptThroughput: 1250.5
    avgGenerationThroughput: 85.3
```
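Since the files are plain YAML, they are easy to post-process. A toy reader using Pydantic (piqc's own models live in `src/piqc/models/`; the classes below are a hypothetical subset, and the file path is illustrative):

```python
import yaml
from pydantic import BaseModel

class ModelInfo(BaseModel):
    name: str
    architecture: str
    parameters: str
    identificationConfidence: float

class ModelSpec(BaseModel):
    apiVersion: str
    kind: str
    model: ModelInfo  # other sections (engine, resources, ...) are ignored here

with open("output/vllm-llama-7b.yaml") as f:  # hypothetical file in the -o directory
    spec = ModelSpec(**yaml.safe_load(f))

print(spec.model.name, spec.model.identificationConfidence)  # meta-llama/Llama-2-7b-hf 0.95
```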
PIQC Facts: with `--output-piqc`, piqc generates a standardized facts bundle for integration with the Paralleliq control plane:

```json
{
  "schemaVersion": "piqc-scan.v0.1",
  "generatedAt": "2024-01-07T12:00:00Z",
  "tool": {
    "name": "piqc",
    "version": "1.0.0"
  },
  "cluster": {
    "context": "my-context",
    "name": "my-cluster"
  },
  "objects": [
    {
      "workloadId": "ns/inference/deployment/vllm-llama-7b",
      "facts": {
        "runtime.engineType": {"value": "vllm", "dataConfidence": "high"},
        "hardware.gpuType": {"value": "A100-SXM4-80GB", "dataConfidence": "high"},
        "hardware.gpuCount": {"value": 4, "dataConfidence": "high"},
        "observed.gpuUtilization": {"value": 87, "unit": "%", "dataConfidence": "high"},
        "observed.kvCacheUsage": {"value": 45.2, "unit": "%", "dataConfidence": "high"}
      }
    }
  ]
}
```
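Downstream tooling can consume the bundle directly. A hypothetical consumer that flags workloads under the 60% utilization threshold (the file path matches the `piqc scan --output-piqc -o ./facts` example above):

```python
import json

with open("facts/piqc-facts.json") as f:
    bundle = json.load(f)

assert bundle["schemaVersion"] == "piqc-scan.v0.1"

for obj in bundle["objects"]:
    util = obj["facts"].get("observed.gpuUtilization", {})
    if util.get("value", 100) < 60:  # same 60% threshold the report uses
        print(f'{obj["workloadId"]}: {util["value"]}% GPU utilization')
```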
**Installation**

Requirements:

- Python: 3.11 or higher
- Kubernetes Access: Valid kubeconfig with cluster access
- Poetry: For development installation
Install from source:

```bash
git clone https://github.com/paralleliq/piqc.git
cd piqc
poetry install
poetry run piqc --version
```

For development:

```bash
git clone https://github.com/paralleliq/piqc.git
cd piqc
poetry install --with dev
poetry run pytest tests/unit -v
```

piqc is read-only. It never creates, modifies, or deletes any resource in your cluster. The only write permission is `pods/exec` (used to run `nvidia-smi` inside pods for GPU metrics), and it can be disabled with `--no-exec`.
Apply the pre-configured RBAC manifests:

```bash
kubectl apply -f https://raw.githubusercontent.com/paralleliq/piqc/main/deploy/rbac.yaml
```

Permissions granted:

| Resource | Verbs | Purpose |
|---|---|---|
| `pods` | get, list | Discover inference workloads |
| `pods/exec` | create | Run `nvidia-smi` for GPU metrics |
| `pods/log` | get | Enhanced framework detection |
| `namespaces` | get, list | Scan multiple namespaces |
| `deployments` | get, list | Identify deployment metadata |
| `statefulsets` | get, list | Identify StatefulSet workloads |
| `services` | get, list | Endpoint detection |
Execution modes (`--mode`):

| Mode | Description |
|---|---|
| `auto` | Automatically detect if running in-cluster or remotely |
| `remote` | Force remote mode (uses kubeconfig) |
| `incluster` | Force in-cluster mode (uses ServiceAccount) |
| `dry-run` | Simulate scan without cluster access |
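Auto-detection typically hinges on whether a ServiceAccount token is mounted in the pod. A sketch of the common pattern with the official Kubernetes Python client (piqc's own logic may differ):

```python
from kubernetes import config
from kubernetes.config import ConfigException

def load_cluster_config() -> str:
    """Prefer in-cluster auth; fall back to the local kubeconfig."""
    try:
        config.load_incluster_config()  # uses the mounted ServiceAccount token
        return "incluster"
    except ConfigException:
        config.load_kube_config()       # uses ~/.kube/config (or $KUBECONFIG)
        return "remote"
```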
Troubleshooting: if your kubeconfig relies on an auth plugin that is not available inside the container, use the in-cluster Job approach (Option 1 in Quick Start), which runs inside the cluster and needs no auth plugins. Or export a static kubeconfig:

```bash
kubectl config view --raw --flatten > /tmp/piqc-kubeconfig.yaml
docker run --rm -v /tmp/piqc-kubeconfig.yaml:/root/.kube/config ghcr.io/paralleliq/piqc:latest scan
```

Verify that your credentials have the required permissions:

```bash
kubectl auth can-i list pods --all-namespaces
kubectl auth can-i create pods/exec -n <namespace>
```

If anything is missing, apply the RBAC manifests, or skip pod exec entirely:

```bash
kubectl apply -f https://raw.githubusercontent.com/paralleliq/piqc/main/deploy/rbac.yaml
piqc scan --no-exec
```

Project structure:

```
piqc/
├── src/piqc/
│   ├── cli/            # CLI commands (scan, test-connection, version)
│   ├── collectors/     # Data collectors (vLLM config, GPU metrics)
│   ├── core/           # Core logic (orchestrator, discovery, k8s client)
│   ├── generators/     # Output generators (YAML, JSON, Table, PIQC)
│   ├── models/         # Pydantic data models (ModelSpec, PIQC schema)
│   ├── parsers/        # Configuration parsers (vLLM)
│   └── utils/          # Utilities (logging, exceptions)
├── tests/
│   ├── unit/           # Unit tests
│   └── integration/    # Integration tests
├── rbac/               # Kubernetes RBAC manifests
├── docs/               # Documentation
└── examples/           # Example ModelSpec files
```
piqc tells you what's wrong. The Paralleliq control plane closes the loop — it ingests the piqc facts bundle and automatically remediates misplacement, underutilization, and OOM risk through human-approved Temporal workflows.
→ paralleliq.ai · info@paralleliq.ai
Apache License 2.0 — see LICENSE for details.