Merged
17 changes: 10 additions & 7 deletions docs/kubernetes.md
@@ -1,6 +1,6 @@
# Kubernetes Architecture and Configuration

The Kubernetes deployment runs DSV as a leaderless app cluster with per-node Redis sidecars and a Kafka broker.
The Kubernetes deployment runs DSV as a leaderless app cluster with per-node Redis sidecars and a Kafka broker. All manifests use the **`dsv`** namespace.

## High-Level Architecture

@@ -46,18 +46,18 @@ This keeps shard storage local to the DSV pod while still allowing Kubernetes to

`dsv-app-headless` is a headless service that returns DNS records for app pods. ScaleCube uses:

- `SEED_DNS_HOST=dsv-app-headless.default.svc.cluster.local`
- `SEED_DNS_HOST=dsv-app-headless.dsv.svc.cluster.local`
- `SEED_DNS_PORT=4801`
- `CLUSTER_PORT=4801`

The app service `dsv-app-service` separately provides load-balanced HTTP traffic.
The app service `dsv-app-service` separately provides load-balanced HTTP traffic on port **9080** (container `SERVER_PORT=9080`, avoiding common **8080** conflicts).

## Kafka

Kafka runs as a single-broker KRaft StatefulSet in the current manifests. DSV app pods connect through:

```text
KAFKA_BOOTSTRAP_SERVERS=kafka.default.svc.cluster.local:9092
KAFKA_BOOTSTRAP_SERVERS=kafka.dsv.svc.cluster.local:9092
```

## Testing Environment
@@ -71,12 +71,15 @@ kubectl get pods -w

## Production Environment

`k8s/production` keeps scheduling controls for a multi-node target:
`k8s/production` targets a multi-node cluster (e.g. 3 control-plane + 10 workers):

- app pods avoid control-plane nodes
- app pods use pod anti-affinity to spread across worker nodes
- the app StatefulSet requests up to 12 replicas
- app pods use pod anti-affinity (one DSV pod per worker)
- **10 replicas** with `cluster.totalNodes=10`, `thresholdK=6`, `quorumM=6`

Full steps (scp manifests, import image on all workers, verify): [production-kubernetes-deploy.md](production-kubernetes-deploy.md).

```bash
kubectl apply -f k8s/production/ --dry-run=client
kubectl apply -f k8s/production/
```
6 changes: 3 additions & 3 deletions docs/local-crud-powershell.md
@@ -1,17 +1,17 @@
# Local CRUD Smoke Test (PowerShell)

These commands assume the local Kubernetes service is port-forwarded to `127.0.0.1:8080`.
These commands assume the local Kubernetes service is port-forwarded to `127.0.0.1:9080`.

Start the port-forward in one PowerShell terminal:

```powershell
kubectl port-forward service/dsv-app-service 8080:80
kubectl port-forward -n dsv service/dsv-app-service 9080:9080
```

Run the CRUD commands in another PowerShell terminal:

```powershell
$BASE = "http://127.0.0.1:8080"
$BASE = "http://127.0.0.1:9080"
```

## Health
178 changes: 178 additions & 0 deletions docs/production-kubernetes-deploy.md
@@ -0,0 +1,178 @@
# Production deployment on Kubernetes (k3s)

Guide for deploying Distributed Secrets Vault to a cluster with **3 control-plane nodes** and **10 worker nodes**, using manifests in `k8s/production/`.

All resources run in the **`dsv`** namespace (`namespace.yaml` is applied first).

## What gets deployed

| Resource | Count | Scheduling |
|----------|-------|------------|
| `dsv-app` StatefulSet | 10 pods | Workers only; one pod per host (anti-affinity) |
| `kafka` StatefulSet | 1 pod | Workers only |
| PVCs | 11 × RWO | 10 × Redis (5Gi) + 1 × Kafka (10Gi) |
| Services + Ingress | Traefik → `dsv-app-service` | HTTP API on port **9080** |

Cluster parameters (10 nodes, Shamir **k=6**, write quorum **m=6**) are set via `JAVA_TOOL_OPTIONS` in `app-statefulset.yaml`.

## Prerequisites

- `kubectl` on the machine you deploy from, with a valid kubeconfig for the cluster
- Built image tagged **`dsv-backend:latest`** on **every worker** that can run `dsv-app` (see [Image distribution](#image-distribution))
- Default **StorageClass** for dynamic PVCs (k3s: `local-path` is typical)
- Traefik ingress controller (bundled with k3s)

## 1. Build the application image (build machine)

```bash
cd DistributedSecretsVault
./mvnw clean package -DskipTests
mkdir -p target/dependency && (cd target/dependency && jar -xf ../*.jar)
docker build -t dsv-backend:latest .
docker save dsv-backend:latest | gzip -9 > dsv-backend.tar.gz
```

## 2. Copy manifests to the remote machine (scp)

From your laptop (or CI), upload the production YAML directory:

```bash
scp -r k8s/production user@REMOTE:/tmp/dsv-k8s/
```

Optional: upload the image tarball to a jump host or each worker:

```bash
scp dsv-backend.tar.gz user@REMOTE:/tmp/
```

You can apply with `kubectl` from any host that has cluster access; the YAML does not need to live on a control-plane node.

## 3. Load the image on all workers

DSV pods only schedule on **workers**. Import the image on **each of the 10 workers** (not required on control-plane nodes unless they run workloads).

On each worker (after `scp` of the tarball):

```bash
gunzip -c /tmp/dsv-backend.tar.gz | sudo k3s ctr images import -
sudo k3s ctr images ls | grep dsv-backend
```

Loop from your workstation (adjust hosts and SSH user):

```bash
for host in worker1 worker2 worker3 worker4 worker5 worker6 worker7 worker8 worker9 worker10; do
  scp dsv-backend.tar.gz "user@${host}:/tmp/"
  ssh "user@${host}" 'gunzip -c /tmp/dsv-backend.tar.gz | sudo k3s ctr images import -'
done
```
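
The loop above keeps going when a host fails. A minimal sketch of a stricter variant follows, assuming the same tarball name and `k3s ctr` import command; `import_image_on_hosts` and the `DRY_RUN` variable are hypothetical names introduced here for illustration, not part of the repository.

```bash
# Hypothetical wrapper around the loop above.
# With DRY_RUN=1 it only prints the commands it would run;
# otherwise it stops at the first host where scp or ssh fails.
import_image_on_hosts() {
  local user="$1"; shift
  local host
  for host in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "scp dsv-backend.tar.gz ${user}@${host}:/tmp/"
      echo "ssh ${user}@${host} 'gunzip -c /tmp/dsv-backend.tar.gz | sudo k3s ctr images import -'"
    else
      scp dsv-backend.tar.gz "${user}@${host}:/tmp/" || return 1
      ssh "${user}@${host}" 'gunzip -c /tmp/dsv-backend.tar.gz | sudo k3s ctr images import -' || return 1
    fi
  done
}

# Preview the commands for two example hosts without touching them.
DRY_RUN=1 import_image_on_hosts user worker1 worker2
```

Running once with `DRY_RUN=1` is a cheap way to confirm the host list and SSH user before copying the tarball to all ten workers.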

**Alternative:** push to a private registry, set `image:` in `app-statefulset.yaml`, and add `imagePullSecrets` if needed.

## 4. Validate manifests (before apply)

On the remote machine with `kubectl`:

```bash
kubectl apply -f /tmp/dsv-k8s/ --dry-run=client
kubectl diff -f /tmp/dsv-k8s/ # optional; shows changes if upgrading
```

Check cluster capacity:

```bash
kubectl get nodes -l '!node-role.kubernetes.io/control-plane,!node-role.kubernetes.io/master'
# Expect 10 Ready workers

kubectl get storageclass
# Expect a default StorageClass for PVCs
```
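
The "Expect 10 Ready workers" check can also be done by counting instead of by eye. This is a sketch only: `count_ready_nodes` is a hypothetical helper, and it assumes the usual `kubectl get nodes --no-headers` column layout, where the second column is the node status.

```bash
# Hypothetical helper: count nodes whose STATUS column is exactly "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin, so cordoned nodes
# ("Ready,SchedulingDisabled") and "NotReady" nodes are not counted.
count_ready_nodes() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Usage against a live cluster (same label selector as above):
# kubectl get nodes --no-headers \
#   -l '!node-role.kubernetes.io/control-plane,!node-role.kubernetes.io/master' \
#   | count_ready_nodes
```

A result below 10 means at least one `dsv-app` pod will stay Pending, because anti-affinity allows only one pod per worker.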

## 5. Deploy (order matters for first install)

```bash
kubectl apply -f /tmp/dsv-k8s/namespace.yaml
kubectl apply -f /tmp/dsv-k8s/kafka-service.yaml
kubectl apply -f /tmp/dsv-k8s/kafka-statefulset.yaml
kubectl wait -n dsv --for=condition=ready pod/kafka-0 --timeout=300s

kubectl apply -f /tmp/dsv-k8s/app-service.yaml
kubectl apply -f /tmp/dsv-k8s/app-statefulset.yaml
kubectl apply -f /tmp/dsv-k8s/ingress.yaml
```

Or apply everything at once (app pods may restart until Kafka is ready):

```bash
kubectl apply -f /tmp/dsv-k8s/
kubectl get pods -n dsv -w
```

Expect:

- `kafka-0` → Running on a worker
- `dsv-app-0` … `dsv-app-9` → Running (each: `dsv-app` + `redis-sidecar`)

## 6. Verify

```bash
kubectl get pods -n dsv -o wide
kubectl get pvc -n dsv
kubectl get ingress -n dsv dsv-ingress
```

Port-forward (if ingress is not exposed yet):

```bash
kubectl port-forward -n dsv svc/dsv-app-service 9080:9080
curl -s http://127.0.0.1:9080/actuator/health | jq .
curl -s http://127.0.0.1:9080/api/v1/cluster/status | jq .
curl -s http://127.0.0.1:9080/api/v1/cluster/nodes | jq .
```

`cluster/status` should report **10** healthy nodes once ScaleCube has formed the cluster.

Via Traefik (k3s default):

```bash
curl -s http://<any-worker-or-lb-ip>/actuator/health
```

## Headlamp

1. Open the cluster in Headlamp and select the **`dsv`** namespace.
2. **Workloads** → StatefulSets: confirm `dsv-app` (10/10) and `kafka` (1/1).
3. **Storage** → PVCs: all Bound.
4. **Network** → Services / Ingress: `dsv-app-service`, `dsv-ingress`.
5. Use **Pod logs** (`dsv-app` container) if a pod is not Ready.
6. **Port forward** `dsv-app-service` port **9080** for API tests without DNS.

## Tuning

| Setting | Location | Notes |
|---------|----------|--------|
| Replica count | `app-statefulset.yaml` `replicas` | Match worker count for one pod per node |
| Shamir / quorum | `JAVA_TOOL_OPTIONS` | For `N` nodes: `k = N/2+1`, `m = k` (see demo script) |
| Ingress host/TLS | `ingress.yaml` | Add `host` and `tls` for production DNS |
| Image registry | `app-statefulset.yaml` `image` | Replace `dsv-backend:latest` |
| Storage size | PVC templates | Redis 5Gi, Kafka 10Gi defaults |
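
The Shamir / quorum rule in the table can be worked through with shell integer arithmetic. This is a minimal sketch that assumes `N/2` means integer (floor) division; `shamir_params` is a hypothetical helper name, and the demo script referenced in the table remains the authoritative source for these values.

```bash
# Hypothetical helper: derive k and m from the node count N,
# following the k = N/2+1, m = k rule from the Tuning table.
# Assumes N/2 is integer (floor) division.
shamir_params() {
  local n="$1"
  local k=$(( n / 2 + 1 ))
  local m="$k"
  echo "totalNodes=${n} thresholdK=${k} quorumM=${m}"
}

shamir_params 10   # prints: totalNodes=10 thresholdK=6 quorumM=6
```

For the 10-worker target this gives the same `thresholdK=6` and `quorumM=6` as the production manifests.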

## Troubleshooting

| Symptom | Cause | Fix |
|---------|--------|-----|
| `ErrImagePull` / `ImagePullBackOff` | Image missing on that worker | Import `dsv-backend.tar.gz` on that node |
| `dsv-app-*` Pending | Anti-affinity or no workers | Need 10 schedulable workers; check `kubectl describe pod` |
| PVC Pending | No StorageClass | Install/configure provisioner (e.g. `local-path`) |
| 503 on writes | Cluster not fully up | Wait for 10 pods; check `/api/v1/cluster/nodes` |
| Ingress 404 | Wrong class | `kubectl get ingressclass`; set `ingressClassName: traefik` |

## Teardown

```bash
kubectl delete -f /tmp/dsv-k8s/
# PVCs are retained by default; delete manually if you need a clean slate:
# kubectl delete pvc -n dsv -l app=dsv-app
# kubectl delete pvc -n dsv -l app=kafka
```
17 changes: 12 additions & 5 deletions k8s/README.md
@@ -7,12 +7,14 @@ This directory contains Kubernetes manifests for Distributed Secrets Vault. The
```text
k8s/
├── production/
│ ├── namespace.yaml
│ ├── app-service.yaml
│ ├── app-statefulset.yaml
│ ├── ingress.yaml
│ ├── kafka-service.yaml
│ └── kafka-statefulset.yaml
├── testing/
│ ├── namespace.yaml
│ ├── app-service.yaml
│ ├── app-statefulset.yaml
│ ├── ingress.yaml
@@ -21,13 +23,15 @@ k8s/
└── README.md
```

All resources use the **`dsv`** namespace.

## Architecture

- `dsv-app` is a StatefulSet.
- Redis runs as a sidecar inside every `dsv-app` pod and persists data through a per-pod PVC.
- `dsv-app-headless` exposes pod DNS records for ScaleCube peer discovery.
- `dsv-app-service` load-balances HTTP traffic to healthy app pods.
- Kafka is available at `kafka.default.svc.cluster.local:9092`.
- `dsv-app-service` load-balances HTTP traffic to healthy app pods on port **9080** (avoids common **8080** conflicts).
- Kafka is available at `kafka.dsv.svc.cluster.local:9092`.

The production manifests keep the one-app-pod-per-worker-node placement strategy through node affinity and pod anti-affinity. The testing manifests remove those scheduling constraints for Docker Desktop, Minikube, or K3d.

@@ -45,18 +49,21 @@ Then deploy:

```bash
kubectl apply -f k8s/testing/
kubectl get pods -w
kubectl get pods -n dsv -w
```

The testing app manifest uses `imagePullPolicy: Never`, so the image must exist in the local cluster's Docker image store.

## Production

Tuned for **10 worker nodes** (one `dsv-app` pod per worker, Shamir k=6 / quorum m=6). See [docs/production-kubernetes-deploy.md](../docs/production-kubernetes-deploy.md) for scp, image import on all workers, and verification steps.

```bash
kubectl apply -f k8s/production/ --dry-run=client
kubectl apply -f k8s/production/
```

Before production use, replace placeholder image and ingress details with the registry image and hostnames for the target cluster.
Before production use, load `dsv-backend:latest` on every worker (or switch to a registry image) and set ingress host/TLS as needed.

## App Environment

@@ -75,7 +82,7 @@ Before production use, replace placeholder image and ingress details with the re

ScaleCube startup is DNS-based:

- `SEED_DNS_HOST=dsv-app-headless.default.svc.cluster.local`
- `SEED_DNS_HOST=dsv-app-headless.dsv.svc.cluster.local`
- `SEED_DNS_PORT=4801`
- `CLUSTER_PORT=4801`

14 changes: 8 additions & 6 deletions k8s/production/app-service.yaml
@@ -2,20 +2,21 @@ apiVersion: v1
kind: Service
metadata:
  name: dsv-app-service
  namespace: default
  namespace: dsv
spec:
  type: ClusterIP
  selector:
    app: dsv-app
  ports:
    - port: 80
      targetPort: 8080
    - name: http
      port: 9080
      targetPort: http
---
apiVersion: v1
kind: Service
metadata:
  name: dsv-app-headless
  namespace: default
  namespace: dsv
spec:
  # Headless service for DNS lookups of all agent nodes
  type: ClusterIP
@@ -26,5 +27,6 @@ spec:
    - name: cluster
      port: 4801
      targetPort: 4801
    - port: 8080
      targetPort: 8080
    - name: http
      port: 9080
      targetPort: http