Add /snapshot, Fleet, and OpenAPI for RL workloads #28

Closed
dantelex wants to merge 2 commits into main from feat/snapshot-and-fleet
Conversation

@dantelex

What this does

Two new things you can do with the apiserver, plus the OpenAPI contract
that documents them for SDKs.

1. GET /clusters/<id>/control-plane/snapshot

One HTTP call returns everything we're currently watching in a control
plane — straight from in-memory caches, no etcd hit. Fast enough to use
inside a hot RL/agent loop where you want to score a rollout after every
step.

GET /clusters/team-alpha/control-plane/snapshot?resource=pods,configmaps

Optional knobs: resource=, includeEmpty=, warm=.
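For anyone calling this from a rollout loop, here is a minimal sketch of building the request URL. The function name `snapshot_url` is hypothetical, and the `true`/`false` encoding for `includeEmpty` and `warm` is an assumption (only `warm=true` appears in this PR); it only builds the URL and leaves the HTTP call to whatever client you already use.

```python
from urllib.parse import quote, urlencode


def snapshot_url(base_url, cluster_id, resources=None, include_empty=None, warm=None):
    """Build the snapshot URL for one control plane (hypothetical helper)."""
    cluster = quote(cluster_id, safe="")
    path = f"{base_url.rstrip('/')}/clusters/{cluster}/control-plane/snapshot"

    params = {}
    if resources:
        # The example in this PR passes resources as one comma-separated value.
        params["resource"] = ",".join(resources)
    if include_empty is not None:
        # Assumption: booleans are sent as the strings "true" / "false".
        params["includeEmpty"] = "true" if include_empty else "false"
    if warm is not None:
        params["warm"] = "true" if warm else "false"

    if not params:
        return path
    # safe="," keeps the comma literal instead of percent-encoding it.
    return f"{path}?{urlencode(params, safe=',')}"


if __name__ == "__main__":
    print(snapshot_url("https://host", "team-alpha", ["pods", "configmaps"]))
```

Leaving a knob as `None` omits it from the query string, so the server's defaults apply.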

2. kind: Fleet (kplane.dev/v1)

Declarative N-plane provisioning. You write one YAML, an in-process
controller spins up that many control planes using the same bootstrap
path organic traffic uses, and status.readyReplicas ticks up.

apiVersion: kplane.dev/v1
kind: Fleet
metadata: { name: rl-rollout }
spec:
  replicas: 1000
  namePrefix: rl-

The Fleet CRD installs itself into the root control plane on startup,
so the API is just there once the server boots.
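A rough sketch of how a client might generate the manifest above and decide when a Fleet is fully up. The helper names `fleet_manifest` and `fleet_is_ready` are hypothetical; the readiness rule (compare `status.readyReplicas` against `spec.replicas`) is an assumption based on the fields named in this description, not something the controller defines.

```python
import json


def fleet_manifest(name, replicas, name_prefix):
    """Return a Fleet object matching the YAML example in this PR."""
    return {
        "apiVersion": "kplane.dev/v1",
        "kind": "Fleet",
        "metadata": {"name": name},
        "spec": {"replicas": replicas, "namePrefix": name_prefix},
    }


def fleet_is_ready(fleet):
    """True once status.readyReplicas has caught up with spec.replicas.

    Assumption: a missing status block counts as zero ready replicas.
    """
    desired = fleet.get("spec", {}).get("replicas", 0)
    ready = fleet.get("status", {}).get("readyReplicas", 0)
    return ready >= desired


if __name__ == "__main__":
    # kubectl accepts JSON as well as YAML, so this can be piped to `kubectl apply -f -`.
    print(json.dumps(fleet_manifest("rl-rollout", 1000, "rl-"), indent=2))
```

A caller would fetch the Fleet object periodically with its usual K8s client and pass it to `fleet_is_ready` before starting rollouts.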

3. OpenAPI

api/openapi/kplane.v1.yaml is the single source of truth for both
endpoints. It is embedded into the binary and served live at:

  • GET /openapi/kplane.yaml
  • GET /openapi/kplane.json

These are public on purpose so SDK generators and CI can fetch them
without a token. A drift-check test in api/openapi/spec_test.go fails
the build if any path or schema the SDK depends on gets removed.

The Python SDK generates against this spec; see the matching Initial
SDK commit there.
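The drift check described above can be sketched in a few lines. This is the same idea in Python, not the Go test in api/openapi/spec_test.go; the function name `missing_from_spec` and the path and schema names in the usage example are hypothetical placeholders. It assumes the spec has already been parsed into a dict (for example from the JSON variant served at /openapi/kplane.json).

```python
def missing_from_spec(spec, required_paths, required_schemas):
    """List every required path or schema that the OpenAPI 3 document lacks."""
    paths = spec.get("paths", {})
    # OpenAPI 3 keeps reusable schemas under components.schemas.
    schemas = spec.get("components", {}).get("schemas", {})

    missing = [f"path {p}" for p in required_paths if p not in paths]
    missing += [f"schema {s}" for s in required_schemas if s not in schemas]
    return missing


if __name__ == "__main__":
    # Hypothetical, hand-written fragment standing in for the real document.
    spec = {
        "openapi": "3.0.3",
        "paths": {"/clusters/{cluster}/control-plane/snapshot": {}},
        "components": {"schemas": {"Snapshot": {}}},
    }
    problems = missing_from_spec(
        spec,
        required_paths=["/clusters/{cluster}/control-plane/snapshot"],
        required_schemas=["Snapshot", "Fleet"],
    )
    # A CI job would fail the build when this list is non-empty.
    print(problems)
```

Checking only for presence keeps the test cheap; it would not catch a schema whose fields changed shape.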

Why now

The kplane density story (~3MB, ~47ms per plane) is interesting on its
own, but the actual unlock for AI workloads is being able to:

  1. Score a plane's state cheaply (snapshot)
  2. Spin up many planes declaratively (Fleet)

Together they turn kplane into an RL environment substrate. See
docs/snapshot-and-fleet.md for the full
design rationale.

What's intentionally out of scope (V0)

  • No scenario seeding — Fleets create empty planes; users apply
    manifests with their normal K8s client.
  • No TTL-based cleanup — deleting a Fleet leaves member VCPs alive
    (avoids surprise data loss while we settle on finalizer semantics).
  • No snapshots of CRD-defined types — only resources with a live MCI
    show up. warm=true forces creation for registered storages.
  • No async / streaming snapshots — full reads only.

See docs/snapshot-and-fleet.md for the full out-of-scope list.

Commits

  • 3f15822 Add /snapshot and Fleet APIs for RL workloads
  • c27cfeb Add OpenAPI doc for snapshot and Fleet

Test plan

  • go build ./... clean
  • go vet ./... clean
  • Unit tests pass (go test ./pkg/... ./cmd/... ./api/...)
  • OpenAPI drift-check test passes
  • Smoke tests (ETCD_ENDPOINTS=... go test -v ./test/smoke -timeout 10m) — wired in this PR but not run locally; CI's etcd service should cover it
  • Manual: curl https://host/openapi/kplane.yaml returns the spec without auth
  • Manual: kubectl --server=https://host/clusters/X/control-plane apply -f fleet.yaml provisions members

Related

dantelex added 2 commits May 21, 2026 15:49
Two ways to use kplane for AI agent training:

  GET /clusters/<id>/control-plane/snapshot
    Returns everything we're watching in a control plane, in one shot,
    straight from memory. No etcd hit. Use it to score an agent rollout
    or capture a trajectory step. Optional filters: resource=,
    includeEmpty=, warm=.

  kind: Fleet (kplane.dev/v1)
    Declare N control planes and a tiny built-in controller spins them
    up using the same path organic traffic uses. Status reports per-
    member readiness and an aggregate ready count.

Both endpoints sit inside the existing multicluster routing chain, so
they get the same auth, audit, and panic-recovery filters as a regular
K8s request. The Fleet controller installs its own CRD on startup, so
the API is just there once the server boots.

V0 is intentionally minimal: no scenario seeding, no TTL cleanup, no
snapshots of CRD-defined types yet. See docs/snapshot-and-fleet.md for
the full out-of-scope list and reasoning.

Smoke tests for both endpoints live under test/smoke/ and need a real
etcd to run.

Publishes a single OpenAPI 3 file that describes the snapshot endpoint
and the Fleet REST surface. Serves it live at two server-level URLs:

  GET /openapi/kplane.yaml
  GET /openapi/kplane.json

These are public on purpose so SDK generators and CI can fetch them
without a bearer token. They are not cluster-scoped (no /clusters/...
prefix in the path).

The YAML at api/openapi/kplane.v1.yaml is embedded into the binary at
build time, so the served document and the source file cannot drift
apart. A test in api/openapi/spec_test.go enforces that every path and
schema the Python SDK depends on exists in the document.

This unblocks https://github.com/kplane-dev/sdk-python, which embeds
the same YAML and uses it as its contract.
@dantelex dantelex requested a review from zachsmith1 May 21, 2026 19:55
@zachsmith1
Contributor

separate repo for experimental work like this would be ideal

@zachsmith1 zachsmith1 closed this May 21, 2026