From 375da9f3c8066fbbaf5f9b772f2bcca068ce2d66 Mon Sep 17 00:00:00 2001
From: Mohammed Arif <marif@splunk.com>
Date: Thu, 30 Apr 2026 00:09:50 +0530
Subject: [PATCH 1/2] feat: concise documentation for k0s setup

---
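Note for reviewers: the `nodes` table in the new file treats `existingIPs.workers` as an ordered list, with the first `cpuWorkers` entries labeled CPU and the remaining entries labeled GPU. A minimal sketch of that positional split follows. `classify_workers` is a hypothetical helper written for illustration only, not a function taken from `k0s_cluster_with_stack.sh`, and the IPs are documentation-range placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper, NOT part of k0s_cluster_with_stack.sh.
# Splits an ordered worker IP list into roles: the first N entries
# are treated as CPU workers, every later entry as a GPU worker.
classify_workers() {
  local cpu_count=$1
  shift
  local index=0 ip
  for ip in "$@"; do
    if (( index < cpu_count )); then
      printf '%s cpu\n' "$ip"
    else
      printf '%s gpu\n' "$ip"
    fi
    index=$((index + 1))
  done
}

# Example: cpuWorkers=2 with three placeholder worker IPs.
classify_workers 2 192.0.2.11 192.0.2.12 192.0.2.13
```

If the script follows this positional rule as the table suggests, the order of `existingIPs.workers` matters, and GPU hosts belong at the end of the list.
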
 tools/cluster_setup/K0S_QUICKSTART.md | 295 ++++++++++++++++++++++++++
 1 file changed, 295 insertions(+)
 create mode 100644 tools/cluster_setup/K0S_QUICKSTART.md
diff --git a/tools/cluster_setup/K0S_QUICKSTART.md b/tools/cluster_setup/K0S_QUICKSTART.md
new file mode 100644
index 0000000..2426745
--- /dev/null
+++ b/tools/cluster_setup/K0S_QUICKSTART.md
@@ -0,0 +1,295 @@
+# k0s Cluster Setup — Quickstart
+
+Deploys the complete Splunk AI Platform stack on k0s Kubernetes using pre-provisioned bare-metal or VM nodes.
+
+> **Requires:** `existingIPs` in config YAML, external S3-compatible object storage, SSH access to all nodes.
+
+## 1. Prerequisites
+
+**Admin workstation:** `kubectl`, `helm`, `git`, `jq`, `yq`
+
+**Nodes (all):** RHEL 9/10, AL2023, or Debian/Ubuntu · passwordless SSH + sudo · Python 3.8+
+
+| Node Type | Min CPU | Min RAM | Min Disk | Notes |
+|-----------|---------|---------|----------|-------|
+| Controller | 4 | 8 GB | 100 GB | API server, etcd, scheduler |
+| CPU Worker | 8 | 32 GB | 200 GB | Weaviate, Ray head, Splunk |
+| GPU Worker | 8 | 32 GB | 500 GB | NVIDIA GPU required |
+
+**Ports between nodes:** 22 (SSH), 6443 (API), 2380 (etcd), 10250 (kubelet), 8132 (konnectivity), 4789/UDP (VXLAN), 179 (Calico BGP)
+
+**External storage:** Any S3-compatible endpoint (SeaweedFS, MinIO, AWS S3). Not deployed by the script.
+
+## 2. Quick Start
+
+```bash
+cd tools/cluster_setup
+cp k0s-cluster-config.yaml my-cluster.yaml
+# Edit my-cluster.yaml — set IPs, SSH key, images, storage endpoint
+CONFIG_FILE=./my-cluster.yaml ./k0s_cluster_with_stack.sh install
+```
+
+Verify:
+```bash
+export KUBECONFIG=~/.kube/k0s-<cluster-name>
+kubectl get nodes
+kubectl get aiplatform -n ai-platform
+```
+
+## 3. Commands
+
+| Command | Description |
+|---------|-------------|
+| `install` | Create k0s cluster + deploy full AI Platform stack |
+| `delete` | Stop k0s, remove services |
+| `clean-all` | Stop + reset + wipe all k0s state from every node |
+| `join-workers` | Add or rejoin worker nodes to an existing cluster |
+
+```bash
+CONFIG_FILE=./my-cluster.yaml ./k0s_cluster_with_stack.sh <command>
+```
+
+**Environment variables:**
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `CONFIG_FILE` | `./k0s-cluster-config.yaml` | Config file path |
+| `AUTO_APPROVE` | `false` | Skip confirmation prompts |
+| `USE_EXISTING` | from config | Override `cluster.useExisting` |
+| `LOG_DIR` | `./logs` | Session log directory |
+
+## 4. What `install` Does
+
+```
+1. Load config → validate images → patch RELATED_IMAGE_* in manifests
+2. Preflight checks (tools, SSH, disk space)
+3. Install k0s cluster (safety gate → clean state → controller → workers → labels)
+4. Phase 1 (parallel): cert-manager, kube-prometheus, NVIDIA host drivers
+5. Ensure S3 credentials secret
+6. Phase 2 (parallel): OTel operator, Ray operator, Splunk operator, NVIDIA device plugin
+7. Sequential: image pull secrets → Splunk standalone → AI operator → AIPlatform CR
+8. Health checks → access info
+```
+
+**Safety gate:** If the controller already has Ready nodes, `install` refuses to wipe. Use `useExisting: auto` or run `delete` first.
+
+**Session logging:** All output → `logs/k0s-install-YYYY-MM-DD_HH-MM-SS.log`
+
+## 5. Configuration Reference
+
+The config template is `k0s-cluster-config.yaml`. Copy it and edit. Key sections:
+
+### cluster
+
+| Field | Required | Default | Description |
+|-------|----------|---------|-------------|
+| `name` | Yes | — | Cluster name (kubeconfig, labels) |
+| `useExisting` | No | `never` | `auto` / `force` / `never` |
+| `sshUser` | Yes | `ubuntu` | SSH user for all nodes |
+| `sshKeyPath` | Yes | — | SSH private key path |
+
+### nodes
+
+| Field | Required | Default | Description |
+|-------|----------|---------|-------------|
+| `controllers` | No | `1` | Controller count (1 or 3 for HA) |
+| `cpuWorkers` | No | `2` | First N workers labeled CPU |
+| `gpuWorkers` | No | `1` | Remaining workers labeled GPU |
+| `existingIPs.controllers` | **Yes** | — | Controller IP list |
+| `existingIPs.workers` | **Yes** | — | Worker IP list |
+
+### storage
+
+| Field | Required | Default | Description |
+|-------|----------|---------|-------------|
+| `storageClass` | No | `local-path` | StorageClass for PVCs |
+| `vectorDbSize` | No | `50Gi` | Weaviate PV size |
+| `minimumDiskSpace.controller` | No | `100` | Preflight disk check (GB) |
+| `minimumDiskSpace.cpuWorker` | No | `200` | Preflight disk check (GB) |
+| `minimumDiskSpace.gpuWorker` | No | `500` | Preflight disk check (GB) |
+| `objectStore.type` | No | `minio` | `aws` / `s3compat` / `minio` / `seaweedfs` |
+| `objectStore.bucket` | No | `ai-platform-data` | Bucket name |
+| `objectStore.endpoint` | **Yes*** | — | S3 endpoint (*required for non-AWS) |
+| `objectStore.auth.rootUser` | Yes | — | Access key |
+| `objectStore.auth.rootPassword` | Yes | — | Secret key |
+
+### images
+
+Short paths auto-prefixed with `images.registry`. All marked **Yes** are required; others have defaults.
+
+| Field | Req | Default |
+|-------|-----|---------|
+| `registry` | No | `""` |
+| `operator.image` | **Yes** | — |
+| `splunk.image` | **Yes** | — |
+| `splunk.operatorImage` | No | `docker.io/splunk/splunk-operator:3.0.0` |
+| `ray.headImage` | **Yes** | — |
+| `ray.workerImage` | **Yes** | — |
+| `weaviate.image` | **Yes** | — |
+| `saia.apiImage` | **Yes** | — |
+| `saia.apiV2Image` | **Yes** | — |
+| `saia.dataLoaderImage` | **Yes** | — |
+| `nginx.image` | No | `docker.io/library/nginx:1.27-alpine` |
+| `fluentBit.image` | No | `fluent/fluent-bit:1.9.6` |
+| `otelCollector.image` | No | `otel/opentelemetry-collector-contrib:0.122.1` |
+
+### aiPlatform
+
+| Field | Required | Default | Description |
+|-------|----------|---------|-------------|
+| `name` | No | `${CLUSTER_NAME}-ai-platform` | CR name |
+| `defaultAcceleratorType` | No | `""` | `L40S` / `H100` / empty |
+| `workerGroupConfig.imageRegistry` | No | `""` | Ray worker image override |
+| `features[].name` | Yes | — | Feature name (e.g., `saia`) |
+| `features[].version` | Yes | — | Feature version |
+| `cpuScheduling` | No | auto | `nodeSelector` + `tolerations` for CPU pods |
+| `gpuScheduling` | No | auto | `nodeSelector` + `tolerations` for GPU pods |
+| `serviceTemplate.type` | No | — | `NodePort` / `LoadBalancer` for SAIA exposure |
+| `serviceTemplate.nodePort` | No | — | Port number (NodePort only) |
+
+### imagePullSecrets
+
+The `secrets[]` list is **not consumed**. The script auto-detects secrets by checking hardcoded names (`ecr-registry-secret`, `docker-hub-secret`, `gcr-secret`, `acr-secret`, `custom-registry-secret`).
+
+| Field | Description |
+|-------|-------------|
+| `autoCreateECR` | Create ECR secret from AWS creds |
+| `dockerHub.enabled` | Create Docker Hub secret |
+| `gcr.enabled` | Create GCR secret |
+| `acr.enabled` | Create ACR secret |
+| `custom.enabled` | Create custom registry secret |
+
+ECR tokens expire after 12 hours. Re-run `install` or set up a CronJob to refresh the secret.
+
+### ecr
+
+| Field | Description |
+|-------|-------------|
+| `account` | AWS account ID |
+| `region` | ECR region |
+
+## 6. Node Labels & GPU
+
+The script auto-labels nodes:
+
+| Node type | Key labels |
+|-----------|------------|
+| Controller | `splunk.ai/workload-type: control-plane` |
+| CPU Worker | `splunk.ai/workload-type: cpu`, `splunk.ai/instance-type: cpu-worker` |
+| GPU Worker | `splunk.ai/workload-type: gpu`, `nvidia.com/gpu: "true"`, taint `nvidia.com/gpu=true:NoSchedule` |
+
+**NVIDIA drivers** are installed directly on GPU nodes (not GPU Operator). Supported: RHEL 9/10, AL2023, Debian/Ubuntu. The script installs kernel headers, CUDA repo, `cuda-drivers`, NVIDIA Container Toolkit, then verifies with `nvidia-smi`.
+
+## 7. Troubleshooting
+
+**SSH failures:**
+```bash
+ssh -i ~/.ssh/key.pem user@node-ip hostname   # test connectivity
+chmod 600 ~/.ssh/key.pem                       # fix permissions
+```
+
+**Safety gate ("refusing to wipe"):**
+Set `useExisting: auto` in config, or run `delete` then `install`.
+
+**k0s issues:**
+```bash
+ssh user@controller-ip "sudo k0s status"
+ssh user@controller-ip "sudo journalctl -u k0scontroller -f"
+```
+
+**Worker join failures:**
+```bash
+CONFIG_FILE=./my-cluster.yaml ./k0s_cluster_with_stack.sh join-workers
+```
+
+**GPU not detected:**
+```bash
+kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds
+ssh user@gpu-node nvidia-smi
+```
+
+**AIPlatform not ready:**
+```bash
+kubectl describe aiplatform -n ai-platform
+kubectl logs -n splunk-ai-operator-system deployment/splunk-ai-operator-controller-manager
+```
+
+**Session logs:**
+```bash
+ls -lt tools/cluster_setup/logs/
+tail -f tools/cluster_setup/logs/k0s-install-*.log
+```
+
+## 8. Air-Gapped Deployment
+
+1. On a connected machine: download k0s binary, pull all container images (see table below), download Helm charts
+2. Transfer to air-gapped nodes: copy k0s binary, load images into local registry, copy manifests
+3. Set `images.registry` to your local registry, `autoCreateECR: false`
+4. Run `install`
+
+### Internet Dependencies (for pre-staging)
+
+**Binaries/charts downloaded by the script:**
+
+| What | Source |
+|------|--------|
+| k0s binary | `https://get.k0s.sh` |
+| cert-manager v1.13.0 | `github.com/cert-manager/cert-manager` |
+| kube-prometheus-stack | `prometheus-community` Helm repo |
+| opentelemetry-operator | `open-telemetry` Helm repo |
+| kuberay-operator v1.2.2 | `ray-project` Helm repo |
+| NVIDIA device plugin | `github.com/NVIDIA/k8s-device-plugin` |
+| local-path-provisioner | `github.com/rancher/local-path-provisioner` |
+
+**Container images pulled at runtime:**
+
+| Image | Default Source |
+|-------|---------------|
+| Splunk AI Operator, Ray Head/Worker, SAIA API v1/v2, Data Loader, Splunk Enterprise | ECR or configured registry |
+| Weaviate | `docker.io/semitechnologies/weaviate` |
+| Nginx | `docker.io/library/nginx:1.27-alpine` |
+| Fluent Bit | `docker.io/fluent/fluent-bit:1.9.6` |
+| OTel Collector | `docker.io/otel/opentelemetry-collector-contrib:0.122.1` |
+| Splunk Operator | `docker.io/splunk/splunk-operator:3.0.0` |
+| KubeRay Operator | `quay.io/kuberay/operator:v1.2.2` |
+| Prometheus, Grafana, cert-manager, NVIDIA plugin, local-path | Pulled by their respective Helm charts/manifests |
+
+**NVIDIA packages on GPU nodes (RHEL/AL2023/Ubuntu):**
+
+| Package | Source |
+|---------|--------|
+| CUDA drivers | `developer.download.nvidia.com/compute/cuda/repos/` |
+| Container Toolkit | `nvidia.github.io/libnvidia-container/` |
+| EPEL (RHEL 10 only) | `dl.fedoraproject.org/pub/epel/` |
+
+## 9. Architecture
+
+```
+┌──────────────────────────────────────────────────┐
+│            k0s Controller Node(s)                │
+│   API Server :6443 · etcd :2380 · Konnectivity  │
+└────────────────────┬─────────────────────────────┘
+                     │ Calico VXLAN (10.244.0.0/16)
+       ┌─────────────┼─────────────┐
+┌──────▼───────┐ ┌───▼──────────┐ ┌▼──────────────┐
+│ CPU Worker   │ │ CPU Worker   │ │ GPU Worker     │
+│ Ray Head     │ │ Weaviate     │ │ Ray GPU Pods   │
+│ Splunk       │ │ Ray CPU Pods │ │ AI Inference   │
+│ Monitoring   │ │ AI Services  │ │                │
+└──────────────┘ └──────────────┘ └────────────────┘
+                     │
+        ┌────────────▼─────────────┐
+        │ External Object Storage  │
+        │ (SeaweedFS / MinIO / S3) │
+        └──────────────────────────┘
+```
+
+**Operators deployed:** Splunk AI Operator, Splunk Operator, KubeRay v1.2.2, cert-manager v1.13.0, OTel Operator, NVIDIA device plugin
+
+**Resource hierarchy:** `AIPlatform CR → AIService → RayService → RayCluster → Ray Pods`
+
+**Secret propagation:** `AIPlatform CR → AIService → RayCluster/Jobs → Pods`
+
+---
+
+*Version 3.0 · April 2026 · Splunk AI Platform Team*

From 7ffcb9ec2f63d5a35b5c18afd149ee854002e5bc Mon Sep 17 00:00:00 2001
From: Mohammed Arif <marif@splunk.com>
Date: Thu, 30 Apr 2026 00:49:20 +0530
Subject: [PATCH 2/2] fix: updated k0s quick start readme

---
 tools/cluster_setup/K0S_QUICKSTART.md | 32 +++++++++++++--------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/tools/cluster_setup/K0S_QUICKSTART.md b/tools/cluster_setup/K0S_QUICKSTART.md
index 2426745..ab67e91 100644
--- a/tools/cluster_setup/K0S_QUICKSTART.md
+++ b/tools/cluster_setup/K0S_QUICKSTART.md
@@ -8,13 +8,13 @@ Deploys the complete Splunk AI Platform stack on k0s Kubernetes using pre-provis
 
 **Admin workstation:** `kubectl`, `helm`, `git`, `jq`, `yq`
 
-**Nodes (all):** RHEL 9/10, AL2023, or Debian/Ubuntu · passwordless SSH + sudo · Python 3.8+
+**Nodes (all):** RHEL 9 · passwordless SSH + sudo · Python 3.8+
 
 | Node Type | Min CPU | Min RAM | Min Disk | Notes |
 |-----------|---------|---------|----------|-------|
 | Controller | 4 | 8 GB | 100 GB | API server, etcd, scheduler |
 | CPU Worker | 8 | 32 GB | 200 GB | Weaviate, Ray head, Splunk |
-| GPU Worker | 8 | 32 GB | 500 GB | NVIDIA GPU required |
+| GPU Worker | 8 | 32 GB | 500 GB | NVIDIA GPU required (3 * H100, 3 * L40S) |
 
 **Ports between nodes:** 22 (SSH), 6443 (API), 2380 (etcd), 10250 (kubelet), 8132 (konnectivity), 4789/UDP (VXLAN), 179 (Calico BGP)
 
@@ -92,9 +92,9 @@ The config template is `k0s-cluster-config.yaml`. Copy it and edit. Key sections
 
 | Field | Required | Default | Description |
 |-------|----------|---------|-------------|
-| `controllers` | No | `1` | Controller count (1 or 3 for HA) |
-| `cpuWorkers` | No | `2` | First N workers labeled CPU |
-| `gpuWorkers` | No | `1` | Remaining workers labeled GPU |
+| `controllers` | **Yes** | `1` | Controller count (1 or 3 for HA) |
+| `cpuWorkers` | **Yes** | `2` | First N workers labeled CPU |
+| `gpuWorkers` | **Yes** | `1` | Remaining workers labeled GPU |
 | `existingIPs.controllers` | **Yes** | — | Controller IP list |
 | `existingIPs.workers` | **Yes** | — | Worker IP list |
 
@@ -102,14 +102,14 @@ The config template is `k0s-cluster-config.yaml`. Copy it and edit. Key sections
 
 | Field | Required | Default | Description |
 |-------|----------|---------|-------------|
-| `storageClass` | No | `local-path` | StorageClass for PVCs |
-| `vectorDbSize` | No | `50Gi` | Weaviate PV size |
+| `storageClass` | **Yes** | `local-path` | StorageClass for PVCs |
+| `vectorDbSize` | **Yes** | `50Gi` | Weaviate PV size |
 | `minimumDiskSpace.controller` | No | `100` | Preflight disk check (GB) |
 | `minimumDiskSpace.cpuWorker` | No | `200` | Preflight disk check (GB) |
 | `minimumDiskSpace.gpuWorker` | No | `500` | Preflight disk check (GB) |
-| `objectStore.type` | No | `minio` | `aws` / `s3compat` / `minio` / `seaweedfs` |
-| `objectStore.bucket` | No | `ai-platform-data` | Bucket name |
-| `objectStore.endpoint` | **Yes*** | — | S3 endpoint (*required for non-AWS) |
+| `objectStore.type` | **Yes** | `minio` | `aws` / `s3compat` / `minio` / `seaweedfs` |
+| `objectStore.bucket` | **Yes** | `ai-platform-data` | Bucket name |
+| `objectStore.endpoint` | **Yes** | — | S3 endpoint (required for non-AWS) |
 | `objectStore.auth.rootUser` | Yes | — | Access key |
 | `objectStore.auth.rootPassword` | Yes | — | Secret key |
 
@@ -137,15 +137,15 @@ Short paths auto-prefixed with `images.registry`. All marked **Yes** are require
 
 | Field | Required | Default | Description |
 |-------|----------|---------|-------------|
-| `name` | No | `${CLUSTER_NAME}-ai-platform` | CR name |
-| `defaultAcceleratorType` | No | `""` | `L40S` / `H100` / empty |
+| `name` | **Yes** | `${CLUSTER_NAME}-ai-platform` | CR name |
+| `defaultAcceleratorType` | **Yes** | `""` | `L40S` / `H100` / empty |
 | `workerGroupConfig.imageRegistry` | No | `""` | Ray worker image override |
 | `features[].name` | Yes | — | Feature name (e.g., `saia`) |
 | `features[].version` | Yes | — | Feature version |
 | `cpuScheduling` | No | auto | `nodeSelector` + `tolerations` for CPU pods |
 | `gpuScheduling` | No | auto | `nodeSelector` + `tolerations` for GPU pods |
-| `serviceTemplate.type` | No | — | `NodePort` / `LoadBalancer` for SAIA exposure |
-| `serviceTemplate.nodePort` | No | — | Port number (NodePort only) |
+| `serviceTemplate.type` | **Yes** | — | `NodePort` / `LoadBalancer` for SAIA exposure |
+| `serviceTemplate.nodePort` | **Yes** | — | Port number (NodePort only) |
 
 ### imagePullSecrets
 
@@ -178,7 +178,7 @@ The script auto-labels nodes:
 | CPU Worker | `splunk.ai/workload-type: cpu`, `splunk.ai/instance-type: cpu-worker` |
 | GPU Worker | `splunk.ai/workload-type: gpu`, `nvidia.com/gpu: "true"`, taint `nvidia.com/gpu=true:NoSchedule` |
 
-**NVIDIA drivers** are installed directly on GPU nodes (not GPU Operator). Supported: RHEL 9/10, AL2023, Debian/Ubuntu. The script installs kernel headers, CUDA repo, `cuda-drivers`, NVIDIA Container Toolkit, then verifies with `nvidia-smi`.
+**NVIDIA drivers** are installed directly on GPU nodes (not via the GPU Operator). Currently supported: RHEL 9. The script installs kernel headers, the CUDA repo, `cuda-drivers`, and the NVIDIA Container Toolkit, then verifies with `nvidia-smi`.
 
 ## 7. Troubleshooting
 
@@ -254,7 +254,7 @@ tail -f tools/cluster_setup/logs/k0s-install-*.log
 | KubeRay Operator | `quay.io/kuberay/operator:v1.2.2` |
 | Prometheus, Grafana, cert-manager, NVIDIA plugin, local-path | Pulled by their respective Helm charts/manifests |
 
-**NVIDIA packages on GPU nodes (RHEL/AL2023/Ubuntu):**
+**NVIDIA packages on GPU nodes (RHEL 9):**
 
 | Package | Source |
 |---------|--------|