Add /snapshot, Fleet, and OpenAPI for RL workloads #28
Closed
dantelex wants to merge 2 commits into
Two ways to use kplane for AI agent training:
GET /clusters/<id>/control-plane/snapshot
Returns everything we're watching in a control plane, in one shot,
straight from memory. No etcd hit. Use it to score an agent rollout
or capture a trajectory step. Optional filters: resource=,
includeEmpty=, warm=.
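A minimal client-side sketch of the snapshot call, using only Go's standard library. Only the path and the three filter names come from the description above. The SnapshotOptions type, the snapshotURL and fetchSnapshot helpers, the boolean field types, the example value pods for resource=, and the use of a bearer token are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

// SnapshotOptions mirrors the optional filters named above.
// The field types are assumptions; the source only lists the names.
type SnapshotOptions struct {
	Resource     string // resource= ; empty means "not set"
	IncludeEmpty bool   // includeEmpty=
	Warm         bool   // warm=
}

// snapshotURL builds the request URL for one control plane's snapshot.
func snapshotURL(base, clusterID string, opts SnapshotOptions) string {
	q := url.Values{}
	if opts.Resource != "" {
		q.Set("resource", opts.Resource)
	}
	if opts.IncludeEmpty {
		q.Set("includeEmpty", "true")
	}
	if opts.Warm {
		q.Set("warm", "true")
	}
	u := base + "/clusters/" + url.PathEscape(clusterID) + "/control-plane/snapshot"
	if enc := q.Encode(); enc != "" {
		u += "?" + enc
	}
	return u
}

// fetchSnapshot issues the GET and returns the raw response body.
// Bearer-token auth is an assumption about how the server is deployed.
func fetchSnapshot(c *http.Client, token, u string) ([]byte, error) {
	req, err := http.NewRequest(http.MethodGet, u, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	resp, err := c.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("snapshot: unexpected status %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	// Only prints the URL; call fetchSnapshot against a real server.
	fmt.Println(snapshotURL("https://host", "demo", SnapshotOptions{Resource: "pods", Warm: true}))
}
```

In an agent loop, fetchSnapshot would be called once per step and the returned body handed to whatever scoring function the rollout uses.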
kind: Fleet (kplane.dev/v1)
Declare N control planes and a tiny built-in controller spins them
up using the same path organic traffic uses. Status reports per-
member readiness and an aggregate ready count.
Both endpoints sit inside the existing multicluster routing chain, so
they get the same auth, audit, and panic-recovery filters as a regular
K8s request. The Fleet controller installs its own CRD on startup, so
the API is just there once the server boots.
V0 is intentionally minimal: no scenario seeding, no TTL cleanup, no
snapshots of CRD-defined types yet. See docs/snapshot-and-fleet.md for
the full out-of-scope list and reasoning.
Smoke tests for both endpoints live under test/smoke/ and need a real
etcd to run.
Publishes a single OpenAPI 3 file that describes the snapshot endpoint
and the Fleet REST surface. Serves it live at two server-level URLs:
GET /openapi/kplane.yaml
GET /openapi/kplane.json
These are public on purpose so SDK generators and CI can fetch them
without a bearer token. They are not cluster-scoped (no /clusters/...
prefix in the path).
The YAML at api/openapi/kplane.v1.yaml is embedded into the binary at
build time, so the served document and the source file cannot drift
apart. A test in api/openapi/spec_test.go enforces that every path and
schema the Python SDK depends on exists in the document.
This unblocks https://github.com/kplane-dev/sdk-python, which embeds
the same YAML and uses it as its contract.
Contributor
separate repo for experimental work like this would be ideal
What this does
Two new things you can do with the apiserver, plus the OpenAPI contract
that documents them for SDKs.
1. GET /clusters/<id>/control-plane/snapshot
One HTTP call returns everything we're currently watching in a control
plane, straight from in-memory caches, with no etcd hit. Fast enough to
use inside a hot RL/agent loop where you want to score a rollout after
every step.
Optional knobs: resource=, includeEmpty=, warm=.
2. kind: Fleet (kplane.dev/v1)
Declarative N-plane provisioning. You write one YAML, an in-process
controller spins up that many control planes using the same bootstrap
path organic traffic uses, and status.readyReplicas ticks up.
The Fleet CRD installs itself into the root control plane on startup,
so the API is just there once the server boots.
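As a rough illustration of how an aggregate ready count can be derived from per-member readiness, here is a small Go sketch. It is not the controller in this PR. The MemberStatus and FleetStatus types and the aggregate function are hypothetical; only the idea of per-member readiness rolling up into status.readyReplicas comes from the description.

```go
package main

import "fmt"

// MemberStatus is a hypothetical per-member entry in a Fleet's status.
type MemberStatus struct {
	Name  string
	Ready bool
}

// FleetStatus is a hypothetical status shape: the desired count, the
// aggregate ready count, and the per-member detail it was derived from.
type FleetStatus struct {
	Replicas      int
	ReadyReplicas int
	Members       []MemberStatus
}

// aggregate counts ready members and returns the rolled-up status.
func aggregate(desired int, members []MemberStatus) FleetStatus {
	ready := 0
	for _, m := range members {
		if m.Ready {
			ready++
		}
	}
	return FleetStatus{Replicas: desired, ReadyReplicas: ready, Members: members}
}

func main() {
	st := aggregate(3, []MemberStatus{
		{Name: "plane-0", Ready: true},
		{Name: "plane-1", Ready: false},
		{Name: "plane-2", Ready: true},
	})
	fmt.Printf("%d/%d ready\n", st.ReadyReplicas, st.Replicas)
}
```

A client waiting on a Fleet would poll or watch until the ready count equals the desired count before starting rollouts.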
3. OpenAPI
api/openapi/kplane.v1.yaml is the single source of truth for both
endpoints. Embedded into the binary and served live at:
GET /openapi/kplane.yaml
GET /openapi/kplane.json
These are public on purpose so SDK generators and CI can fetch them
without a token. A drift-check test in api/openapi/spec_test.go fails
the build if any path or schema the SDK depends on gets removed.
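The drift check can be pictured roughly as follows. This is a sketch of the general technique, not the contents of api/openapi/spec_test.go. It works on the JSON form of the document so it needs only the standard library. The missingEntries helper, the {id} path-parameter name, and the schema names Fleet and Snapshot are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// openAPIDoc captures only the parts of an OpenAPI 3 document that the
// check needs: the path keys and the component schema names.
type openAPIDoc struct {
	Paths      map[string]json.RawMessage `json:"paths"`
	Components struct {
		Schemas map[string]json.RawMessage `json:"schemas"`
	} `json:"components"`
}

// missingEntries reports every required path or schema that is absent
// from the spec. An empty result means the contract still holds.
func missingEntries(spec []byte, paths, schemas []string) ([]string, error) {
	var doc openAPIDoc
	if err := json.Unmarshal(spec, &doc); err != nil {
		return nil, fmt.Errorf("parse spec: %w", err)
	}
	var missing []string
	for _, p := range paths {
		if _, ok := doc.Paths[p]; !ok {
			missing = append(missing, "path "+p)
		}
	}
	for _, s := range schemas {
		if _, ok := doc.Components.Schemas[s]; !ok {
			missing = append(missing, "schema "+s)
		}
	}
	return missing, nil
}

func main() {
	// A toy spec standing in for the served document.
	spec := []byte(`{
	  "openapi": "3.0.3",
	  "paths": {"/clusters/{id}/control-plane/snapshot": {}},
	  "components": {"schemas": {"Fleet": {}}}
	}`)
	missing, err := missingEntries(spec,
		[]string{"/clusters/{id}/control-plane/snapshot"},
		[]string{"Fleet", "Snapshot"})
	if err != nil {
		panic(err)
	}
	fmt.Println(missing) // the toy spec lacks the hypothetical Snapshot schema
}
```

In a real test the same function would run against the embedded document and fail when the returned slice is non-empty.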
This is what the Python SDK generates against; see the matching
Initial SDK commit there.
Why now
The kplane density story (~3MB, ~47ms per plane) is interesting on its
own, but the actual unlock for AI workloads is being able to:
Together they turn kplane into an RL environment substrate. See
docs/snapshot-and-fleet.md for the full design rationale.
What's intentionally out of scope (V0)
manifests with their normal K8s client.
(avoids surprise data loss while we settle on finalizer semantics).
show up.
warm=true forces creation for registered storages.
See docs/snapshot-and-fleet.md for the full out-of-scope list.
Commits
3f15822 Add /snapshot and Fleet APIs for RL workloads
c27cfeb Add OpenAPI doc for snapshot and Fleet
Test plan
go build ./... clean
go vet ./... clean
go test ./pkg/... ./cmd/... ./api/...)
ETCD_ENDPOINTS=... go test -v ./test/smoke -timeout 10m), wired in this PR but not run locally; CI's etcd service should cover it
curl https://host/openapi/kplane.yaml returns the spec without auth
kubectl --server=https://host/clusters/X/control-plane apply -f fleet.yaml provisions members
Related