feat: harden scheduling/reconciliation observability, add safe VM artifact cleanup, and upgrade client + CI/CD workflows by miladhzzzz · Pull Request #3 · persys-dev/compute-agent

miladhzzzz · 2026-02-17T18:09:38Z

Summary

This PR improves reliability, traceability, and delivery workflows across the agent by:

- preserving failed scheduling intent for reconciliation
- exposing action history via API
- adding safe VM disk/cloud-init cleanup on delete
- expanding the example client into a full test harness (full + minimal specs)
- optimizing the Docker image build strategy
- expanding CI with linting, proto drift checks, tests, build verification, and Docker target builds

What Changed

1) Scheduling/Reconciliation Reliability

Failed applies no longer drop workload intent from state.
Failed status is persisted with metadata (error context, timestamps) so failures remain inspectable.
Reconciliation can now recreate workloads when runtime resources are missing (unknown/not found) instead of silently skipping.
Retry tracker behavior remains intact for transient failures, with clearer metadata.
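
To illustrate the behaviour described in this section, here is a minimal Go sketch. It is not the agent's actual code: `Store`, `Workload`, `Runtime`, `errNotFound`, and `fakeRuntime` are hypothetical names invented for the example. It shows a failed apply leaving a `failed` entry with error context in state, and a reconcile pass recreating the workload once the runtime reports it as missing.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical types for illustration; the agent's real state model may differ.
type Status string

const (
	StatusRunning Status = "running"
	StatusFailed  Status = "failed"
)

type Workload struct {
	ID        string
	Status    Status
	LastError string
	FailedAt  time.Time
}

// errNotFound stands in for a runtime "unknown/not found" response.
var errNotFound = errors.New("runtime resource not found")

type Runtime interface {
	Apply(id string) error
	Inspect(id string) error
}

type Store map[string]*Workload

// Apply records intent first, so a failed apply leaves a failed entry
// with error context instead of dropping the workload from state.
func (s Store) Apply(rt Runtime, id string) error {
	w := &Workload{ID: id, Status: StatusRunning}
	s[id] = w
	if err := rt.Apply(id); err != nil {
		w.Status, w.LastError, w.FailedAt = StatusFailed, err.Error(), time.Now()
		return err
	}
	return nil
}

// Reconcile recreates workloads whose runtime resource is missing.
func (s Store) Reconcile(rt Runtime) {
	for id, w := range s {
		if err := rt.Inspect(id); errors.Is(err, errNotFound) {
			if err := rt.Apply(id); err != nil {
				w.Status, w.LastError, w.FailedAt = StatusFailed, err.Error(), time.Now()
				continue
			}
			w.Status, w.LastError = StatusRunning, ""
		}
	}
}

// fakeRuntime fails the first Apply, then succeeds.
type fakeRuntime struct {
	failNext bool
	present  map[string]bool
}

func (f *fakeRuntime) Apply(id string) error {
	if f.failNext {
		f.failNext = false
		return errors.New("scheduler unavailable")
	}
	f.present[id] = true
	return nil
}

func (f *fakeRuntime) Inspect(id string) error {
	if !f.present[id] {
		return errNotFound
	}
	return nil
}

func main() {
	rt := &fakeRuntime{failNext: true, present: map[string]bool{}}
	s := Store{}
	_ = s.Apply(rt, "web-1")
	fmt.Println(s["web-1"].Status, s["web-1"].LastError)
	s.Reconcile(rt)
	fmt.Println(s["web-1"].Status)
}
```

The key design point is ordering: intent is written to state before the runtime call, so the reconciler always has something to act on.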

2) Action History API

Added ListActions RPC to return action/task history since startup.
Added filtering/sorting/limit support (workload_id, action_type, status, limit, newest_first).
Server maps task queue snapshots to API actions.
Example client now supports -action list-actions.
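
The filter semantics listed above could look roughly like the following Go sketch. The `Action` and `ActionFilter` structs are assumptions that only mirror the filter names in this description (workload_id, action_type, status, limit, newest_first); the generated proto messages and the server's mapping from task queue snapshots may differ.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Action and ActionFilter are hypothetical shapes for illustration only.
type Action struct {
	ID, WorkloadID, ActionType, Status string
	CreatedAt                          time.Time
}

type ActionFilter struct {
	WorkloadID, ActionType, Status string
	Limit                          int // 0 means "no limit" in this sketch
	NewestFirst                    bool
}

// ListActions filters, sorts by creation time, then applies the limit.
func ListActions(all []Action, f ActionFilter) []Action {
	out := make([]Action, 0, len(all))
	for _, a := range all {
		if f.WorkloadID != "" && a.WorkloadID != f.WorkloadID {
			continue
		}
		if f.ActionType != "" && a.ActionType != f.ActionType {
			continue
		}
		if f.Status != "" && a.Status != f.Status {
			continue
		}
		out = append(out, a)
	}
	sort.SliceStable(out, func(i, j int) bool {
		if f.NewestFirst {
			return out[i].CreatedAt.After(out[j].CreatedAt)
		}
		return out[i].CreatedAt.Before(out[j].CreatedAt)
	})
	if f.Limit > 0 && len(out) > f.Limit {
		out = out[:f.Limit]
	}
	return out
}

// sampleActions returns made-up history entries for the demo.
func sampleActions() []Action {
	t0 := time.Date(2026, 2, 17, 18, 0, 0, 0, time.UTC)
	return []Action{
		{"a1", "web-1", "apply", "failed", t0},
		{"a2", "web-1", "apply", "succeeded", t0.Add(time.Minute)},
		{"a3", "db-1", "delete", "succeeded", t0.Add(2 * time.Minute)},
	}
}

func main() {
	f := ActionFilter{WorkloadID: "web-1", NewestFirst: true, Limit: 10}
	for _, a := range ListActions(sampleActions(), f) {
		fmt.Println(a.ID, a.Status)
	}
}
```

Applying the limit after sorting means `limit` combined with `newest_first` returns the most recent entries rather than an arbitrary subset.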

3) Task Queue Correctness

Fixed enqueue failure behavior so rejected submissions do not leave ghost task records.
Added tests for full-queue rejection cleanup.
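
As a sketch of the ghost-record fix, assuming a queue that keeps a task map alongside a bounded channel (the `Queue`, `Task`, and `ErrQueueFull` names are hypothetical, and task IDs are assumed unique), the rollback on rejection could look like this in Go:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrQueueFull is a hypothetical sentinel error for this sketch.
var ErrQueueFull = errors.New("task queue full")

type Task struct{ ID, Status string }

type Queue struct {
	mu    sync.Mutex
	tasks map[string]*Task // records visible to history/status lookups
	ch    chan *Task       // bounded work channel
}

func NewQueue(capacity int) *Queue {
	return &Queue{tasks: make(map[string]*Task), ch: make(chan *Task, capacity)}
}

// Submit registers the task record, then attempts a non-blocking enqueue.
// If the channel is full, the record is removed again so no ghost remains.
func (q *Queue) Submit(id string) error {
	q.mu.Lock()
	defer q.mu.Unlock()
	t := &Task{ID: id, Status: "pending"}
	q.tasks[id] = t
	select {
	case q.ch <- t:
		return nil
	default:
		delete(q.tasks, id)
		return ErrQueueFull
	}
}

// Len reports how many task records are currently tracked.
func (q *Queue) Len() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	return len(q.tasks)
}

func main() {
	q := NewQueue(1)
	fmt.Println(q.Submit("t1")) // accepted
	fmt.Println(q.Submit("t2")) // rejected: capacity is 1 and nothing drains the queue
	fmt.Println(q.Len())        // only the accepted task is tracked
}
```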

4) VM Delete Safety (Disk Cleanup)

Implemented safe cleanup policy:
- remove deterministic cloud-init ISO artifacts
- remove only agent-managed VM disks (tracked via marker files)
- do not remove user-provided/external disks
Added parsing/cleanup helpers and focused unit tests for marker-based deletion behavior.
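
A minimal Go sketch of marker-based deletion follows. The marker naming (`.persys-managed` suffix next to the disk) and the helper names are assumptions made for illustration; this description only states that agent-managed disks are tracked via marker files.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// markerSuffix is a hypothetical naming convention for this sketch.
const markerSuffix = ".persys-managed"

// removeIfManaged deletes diskPath only when a sibling marker file exists.
// It reports whether the disk was treated as agent-managed.
func removeIfManaged(diskPath string) (bool, error) {
	marker := diskPath + markerSuffix
	if _, err := os.Stat(marker); err != nil {
		if errors.Is(err, fs.ErrNotExist) {
			return false, nil // no marker: assume user-provided, leave it alone
		}
		return false, err
	}
	if err := os.Remove(diskPath); err != nil && !errors.Is(err, fs.ErrNotExist) {
		return false, err
	}
	return true, os.Remove(marker)
}

// demo creates one managed and one external disk in a temp dir and
// runs the cleanup against both.
func demo() (managedRemoved, externalKept bool, err error) {
	dir, err := os.MkdirTemp("", "vm-disks")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	managed := filepath.Join(dir, "managed.qcow2")
	external := filepath.Join(dir, "external.qcow2")
	for _, p := range []string{managed, managed + markerSuffix, external} {
		if err := os.WriteFile(p, nil, 0o600); err != nil {
			return false, false, err
		}
	}
	if managedRemoved, err = removeIfManaged(managed); err != nil {
		return false, false, err
	}
	removedExternal, err := removeIfManaged(external)
	if err != nil {
		return false, false, err
	}
	_, statErr := os.Stat(external)
	externalKept = !removedExternal && statErr == nil
	return managedRemoved, externalKept, nil
}

func main() {
	managedRemoved, externalKept, err := demo()
	fmt.Println(managedRemoved, externalKept, err)
}
```

Defaulting to "keep" when no marker is found is what makes the policy safe: a missing or lost marker can leak a disk, but it cannot delete user data.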

5) Example Client Expansion

Added comprehensive runtime flags for real container/VM tests:
- container image/env/ports/volumes/resources/restart policy, etc.
- VM vCPU/memory/cloud-init/disks/networks/metadata, plus -spec-file
Added robust spec parsing helpers.
Added spec packs and scripts under examples/client/:
- full specs: container/compose/vm
- minimal specs: container/compose/vm
- raw helpers: compose*.yaml, cloud-init*.yaml
- scripts:
  - encode-compose.sh (base64 + optional JSON update)
  - quick-test.sh (smoke-tests apply/status/list-actions/delete for container and compose workloads, and optionally for VMs)

6) Dockerfile Optimization

Reworked the Dockerfile into a cache-friendly multi-stage build with BuildKit mounts.
Added runtime targets:
- runtime (optimized default)
- full-runtime (includes additional local daemon tooling)
Reduced unnecessary build/runtime overhead and improved CI build repeatability.

7) CI Workflow Upgrade

Replaced minimal CI with a comprehensive pipeline:
- lint-and-validate: gofmt, shell syntax, golangci-lint
- proto-drift-check: regenerate protobuf + fail on drift
- unit-tests: race + coverage artifact
- e2e-tests
- build-binaries: agent + client compile checks
- docker-build: validates both Docker targets with Buildx cache

8) Control Plane Communication via mTLS + gRPC

Added a control plane proto contract for Scheduler <-> Agent communication over gRPC, secured with mTLS.
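
For readers unfamiliar with mTLS, the Go sketch below shows how a client-side TLS configuration for such a channel might be assembled with the standard library. It is an assumption-laden illustration, not the agent's implementation: `newMTLSConfig`, its parameters, and the `scheduler.internal` server name are hypothetical.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
)

// newMTLSConfig (hypothetical helper) builds a client-side TLS config that
// presents the caller's certificate and verifies the peer against a private CA.
func newMTLSConfig(certPEM, keyPEM, caPEM []byte, serverName string) (*tls.Config, error) {
	cert, err := tls.X509KeyPair(certPEM, keyPEM)
	if err != nil {
		return nil, fmt.Errorf("load client key pair: %w", err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no CA certificates found in PEM input")
	}
	return &tls.Config{
		Certificates: []tls.Certificate{cert}, // client identity
		RootCAs:      pool,                    // trust anchor for the server
		ServerName:   serverName,
		MinVersion:   tls.VersionTLS12,
	}, nil
}

func main() {
	// Invalid input is rejected up front instead of failing at dial time.
	_, err := newMTLSConfig([]byte("not a cert"), []byte("not a key"), nil, "scheduler.internal")
	fmt.Println(err != nil)
}
```

With gRPC-Go, a config like this would typically be wrapped with `credentials.NewTLS` from `google.golang.org/grpc/credentials` and passed as a dial option; on the server side the equivalent settings are `ClientCAs` and `ClientAuth: tls.RequireAndVerifyClientCert`.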

Validation

Full test suite passes locally:
- go test ./...

Impact

Better operational confidence during scheduler failures
Stronger auditability/debuggability via action history
Safer VM lifecycle cleanup (no accidental external disk deletion)
Faster, more deterministic CI/CD feedback loops
Easier real-world workload testing from the example client

chatgpt-codex-connector · 2026-02-17T18:09:44Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

miladhzzzz added 9 commits February 17, 2026 21:27

Feat: Add Actions list endpint to gRPC

ba59a7c

Feat: Full Feauture Client Test with Spec file suppot for workload orchestration

220c407

Chore: Workload Spec-File Samples and how to use them

145df14

Feat: Add Action List Endpoint + Workload Tracking

940d7e0

Feat: Cleanup vm Disk After Workload Deletion

cbd11a8

Update: Better Task Tracking + List Actions

bf655ed

Feat: Add Failed Workload Tracking and management to avoid phantom workload situation

49dcfc9

Chore: Optimize Docker image to be nimble

f36504b

Feat: Better CI stages

614cede

miladhzzzz added 16 commits February 17, 2026 21:43

Chore: Ran Go FMT to fix CI error

d3c8f14

Chore: Disable Lint Stage temporarily

677a98d

Chore: Fix Dockerfile broken apt download

4d3fa3e

Chore: Update Makefile to include build for control plane proto files

6e4a065

Chore: Update Readme

4188db4

Feat: Add Control Plane Proto Contract

86ee79e

Feat: Add Agent Standalone Mode Flag to Binary

f1f6558

Chore: Add integration doc and update getting started

b04ddfe

Chore: Fix Health request to show proper output

f54302b

Chore: Add Spec Files for workload apply test via smoke client

d3a12ee

Chore: Add new envs and labels to agent config

d6b52a8

Feat: Add Control Plane gRPC Client

2328463

Fix: Garbage Collector Deleting non persys managed resources

c5362f7

Feat: Add Label Markers to persys workloads for GC

78e8030

Feat: proto for communicating to scheduler (control plane)

5007dc3

Fix: proto drift error in ci

da78de6

miladhzzzz merged commit 5b967af into main Feb 18, 2026
7 checks passed

miladhzzzz merged 25 commits into main from Improve-Client-Test-CI-Queue-Manager
