From e360d6eb639c10ae7dc6d44c8383aa4d3a6cbcd1 Mon Sep 17 00:00:00 2001
From: hlts2
Date: Wed, 8 Apr 2026 10:47:47 +0900
Subject: [PATCH 1/2] fix: add CLAUDE.md

Signed-off-by: hlts2
---
 CLAUDE.md | 60 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 60 insertions(+)
 create mode 100644 CLAUDE.md

diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..64fa903
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,60 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Project Overview
+
+Kubernetes node agent for Civo cloud that monitors cluster nodes and triggers automatic hard reboots via the Civo API when nodes become NotReady or lose expected GPU capacity. Deployed as a single-replica Deployment in kube-system via Helm.
+
+## Build & Test Commands
+
+```bash
+# Build
+go build -o node-agent ./
+
+# Run all tests
+go test ./...
+
+# Run a single test
+go test ./pkg/watcher/ -run TestName
+
+# Build Docker image (dry-run)
+goreleaser release --snapshot --skip=publish --clean
+```
+
+No linter is configured in CI.
+
+## Architecture
+
+**Entrypoint** (`main.go`): Reads env vars, sets up JSON structured logging (slog), creates a Watcher, and runs it with graceful SIGTERM/SIGINT shutdown.
+
+**Core package** (`pkg/watcher/`):
+- `watcher.go` — Main loop polls every 10 seconds. For each node matching the node pool label (`kubernetes.civo.com/civo-node-pool={nodePoolID}`), checks if the node is NotReady or has fewer GPUs than desired. If a reboot is warranted (and cooldown window hasn't elapsed), calls `HardRebootInstance` via the Civo API.
+- `options.go` — Functional options pattern (`WithKubernetesClient`, `WithCivoClient`, etc.) for dependency injection and configuration.
+- `fake.go` — `FakeClient` implementing `civogo.Clienter` for testing.
+- `watcher_test.go` — Tests use fake Kubernetes client (`k8s.io/client-go/kubernetes/fake`) and `FakeClient` for Civo API.
+
+**Reboot safeguards**: Tracks last reboot time per node in a `sync.Map`. Skips reboot if the node's Ready/NotReady condition transitioned recently or a reboot command was sent within the configurable time window (default 40 minutes).
+
+## Required Environment Variables
+
+`CIVO_API_KEY`, `CIVO_REGION`, `CIVO_CLUSTER_ID`, `CIVO_NODE_POOL_ID` — see `.env.example`.
+
+Optional: `CIVO_API_URL`, `CIVO_NODE_DESIRED_GPU_COUNT`, `CIVO_NODE_REBOOT_TIME_WINDOW_MINUTES`.
+
+## Deployment
+
+Helm chart in `charts/`. Secrets are expected in `civo-node-agent` and `civo-api-access` Kubernetes secrets.
+
+```bash
+helm upgrade -n kube-system --install node-agent ./charts
+```
+
+## Key Dependencies
+
+- `github.com/civo/civogo` — Civo cloud API client
+- `k8s.io/client-go` — Kubernetes client (in-cluster config by default)
+
+## Release
+
+Tags matching `v*.*.*` trigger `.github/workflows/release-image.yaml`, which builds multi-arch Docker images via goreleaser and publishes to Docker Hub.

From 768e97571616633124088cf394f8ee71601a50ea Mon Sep 17 00:00:00 2001
From: hlts2
Date: Wed, 8 Apr 2026 10:55:07 +0900
Subject: [PATCH 2/2] docs: add AGENTS.md and symlink CLAUDE.md to it

Signed-off-by: hlts2
---
 AGENTS.md | 60 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 CLAUDE.md | 61 +------------------------------------------------------
 2 files changed, 61 insertions(+), 60 deletions(-)
 create mode 100644 AGENTS.md
 mode change 100644 => 120000 CLAUDE.md

diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 0000000..9ddd3b3
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,60 @@
+# AGENTS.md
+
+This file provides guidance to AI coding agents when working with code in this repository.
+
+## Project Overview
+
+Kubernetes node agent for Civo cloud that monitors cluster nodes and triggers automatic hard reboots via the Civo API when nodes become NotReady or lose expected GPU capacity. Deployed as a single-replica Deployment in kube-system via Helm.
+
+## Build & Test Commands
+
+```bash
+# Build
+go build -o node-agent ./
+
+# Run all tests
+go test ./...
+
+# Run a single test
+go test ./pkg/watcher/ -run TestName
+
+# Build Docker image (dry-run)
+goreleaser release --snapshot --skip=publish --clean
+```
+
+No linter is configured in CI.
+
+## Architecture
+
+**Entrypoint** (`main.go`): Reads env vars, sets up JSON structured logging (slog), creates a Watcher, and runs it with graceful SIGTERM/SIGINT shutdown.
+
+**Core package** (`pkg/watcher/`):
+- `watcher.go` — Main loop polls every 10 seconds. For each node matching the node pool label (`kubernetes.civo.com/civo-node-pool={nodePoolID}`), checks if the node is NotReady or has fewer GPUs than desired. If a reboot is warranted (and cooldown window hasn't elapsed), calls `HardRebootInstance` via the Civo API.
+- `options.go` — Functional options pattern (`WithKubernetesClient`, `WithCivoClient`, etc.) for dependency injection and configuration.
+- `fake.go` — `FakeClient` implementing `civogo.Clienter` for testing.
+- `watcher_test.go` — Tests use fake Kubernetes client (`k8s.io/client-go/kubernetes/fake`) and `FakeClient` for Civo API.
+
+**Reboot safeguards**: Tracks last reboot time per node in a `sync.Map`. Skips reboot if the node's Ready/NotReady condition transitioned recently or a reboot command was sent within the configurable time window (default 40 minutes).
+
+## Required Environment Variables
+
+`CIVO_API_KEY`, `CIVO_REGION`, `CIVO_CLUSTER_ID`, `CIVO_NODE_POOL_ID` — see `.env.example`.
+
+Optional: `CIVO_API_URL`, `CIVO_NODE_DESIRED_GPU_COUNT`, `CIVO_NODE_REBOOT_TIME_WINDOW_MINUTES`.
+
+## Deployment
+
+Helm chart in `charts/`. Secrets are expected in `civo-node-agent` and `civo-api-access` Kubernetes secrets.
+
+```bash
+helm upgrade -n kube-system --install node-agent ./charts
+```
+
+## Key Dependencies
+
+- `github.com/civo/civogo` — Civo cloud API client
+- `k8s.io/client-go` — Kubernetes client (in-cluster config by default)
+
+## Release
+
+Tags matching `v*.*.*` trigger `.github/workflows/release-image.yaml`, which builds multi-arch Docker images via goreleaser and publishes to Docker Hub.
diff --git a/CLAUDE.md b/CLAUDE.md
deleted file mode 100644
index 64fa903..0000000
--- a/CLAUDE.md
+++ /dev/null
@@ -1,60 +0,0 @@
-# CLAUDE.md
-
-This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
-
-## Project Overview
-
-Kubernetes node agent for Civo cloud that monitors cluster nodes and triggers automatic hard reboots via the Civo API when nodes become NotReady or lose expected GPU capacity. Deployed as a single-replica Deployment in kube-system via Helm.
-
-## Build & Test Commands
-
-```bash
-# Build
-go build -o node-agent ./
-
-# Run all tests
-go test ./...
-
-# Run a single test
-go test ./pkg/watcher/ -run TestName
-
-# Build Docker image (dry-run)
-goreleaser release --snapshot --skip=publish --clean
-```
-
-No linter is configured in CI.
-
-## Architecture
-
-**Entrypoint** (`main.go`): Reads env vars, sets up JSON structured logging (slog), creates a Watcher, and runs it with graceful SIGTERM/SIGINT shutdown.
-
-**Core package** (`pkg/watcher/`):
-- `watcher.go` — Main loop polls every 10 seconds. For each node matching the node pool label (`kubernetes.civo.com/civo-node-pool={nodePoolID}`), checks if the node is NotReady or has fewer GPUs than desired. If a reboot is warranted (and cooldown window hasn't elapsed), calls `HardRebootInstance` via the Civo API.
-- `options.go` — Functional options pattern (`WithKubernetesClient`, `WithCivoClient`, etc.) for dependency injection and configuration.
-- `fake.go` — `FakeClient` implementing `civogo.Clienter` for testing.
-- `watcher_test.go` — Tests use fake Kubernetes client (`k8s.io/client-go/kubernetes/fake`) and `FakeClient` for Civo API.
-
-**Reboot safeguards**: Tracks last reboot time per node in a `sync.Map`. Skips reboot if the node's Ready/NotReady condition transitioned recently or a reboot command was sent within the configurable time window (default 40 minutes).
-
-## Required Environment Variables
-
-`CIVO_API_KEY`, `CIVO_REGION`, `CIVO_CLUSTER_ID`, `CIVO_NODE_POOL_ID` — see `.env.example`.
-
-Optional: `CIVO_API_URL`, `CIVO_NODE_DESIRED_GPU_COUNT`, `CIVO_NODE_REBOOT_TIME_WINDOW_MINUTES`.
-
-## Deployment
-
-Helm chart in `charts/`. Secrets are expected in `civo-node-agent` and `civo-api-access` Kubernetes secrets.
-
-```bash
-helm upgrade -n kube-system --install node-agent ./charts
-```
-
-## Key Dependencies
-
-- `github.com/civo/civogo` — Civo cloud API client
-- `k8s.io/client-go` — Kubernetes client (in-cluster config by default)
-
-## Release
-
-Tags matching `v*.*.*` trigger `.github/workflows/release-image.yaml`, which builds multi-arch Docker images via goreleaser and publishes to Docker Hub.
diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 120000
index 0000000..47dc3e3
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1 @@
+AGENTS.md
\ No newline at end of file