
feat: harden scheduling/reconciliation observability, add safe VM artifact cleanup, and upgrade client + CI/CD workflows #3

Merged
miladhzzzz merged 25 commits into main from Improve-Client-Test-CI-Queue-Manager on Feb 18, 2026

@miladhzzzz commented on Feb 17, 2026

Summary

This PR improves reliability, traceability, and delivery workflows across the agent by:

  • preserving failed scheduling intent for reconciliation
  • exposing action history via API
  • adding safe VM disk/cloud-init cleanup on delete
  • expanding the example client into a full test harness (full + minimal specs)
  • optimizing Docker image build strategy
  • expanding CI with linting, proto drift checks, tests, build verification, and Docker target builds

What Changed

1) Scheduling/Reconciliation Reliability

  • Failed applies no longer drop workload intent from state.
  • Failed status is persisted with metadata (error context, timestamps) so failures remain inspectable.
  • Reconciliation can now recreate workloads when runtime resources are missing (unknown/not found) instead of silently skipping.
  • Retry tracker behavior remains intact for transient failures, with clearer metadata.
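
The behavior described above can be illustrated with a minimal sketch. It is not the agent's actual code: the `Store`, `Runtime`, `WorkloadRecord`, and `ErrNotFound` names are hypothetical, and an in-memory map stands in for whatever state backend the agent uses. The point is that a failed apply keeps the desired spec alongside the error metadata, so a later reconciliation pass can recreate the workload when the runtime reports it as missing.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// WorkloadStatus is a hypothetical status type for this sketch.
type WorkloadStatus string

const (
	StatusRunning WorkloadStatus = "running"
	StatusFailed  WorkloadStatus = "failed"
)

// WorkloadRecord keeps the desired spec even when the last apply failed.
type WorkloadRecord struct {
	ID        string
	Spec      string // desired intent; kept on failure
	Status    WorkloadStatus
	LastError string
	FailedAt  time.Time
}

// ErrNotFound stands in for the runtime's "unknown/not found" answer.
var ErrNotFound = errors.New("runtime resource not found")

// Runtime is a hypothetical abstraction over the container/VM backend.
type Runtime interface {
	Exists(id string) error // nil if present, ErrNotFound if missing
	Create(id, spec string) error
}

// Store is an in-memory stand-in for the agent's persisted state.
type Store map[string]*WorkloadRecord

// RecordFailure persists the failure with metadata instead of dropping the record.
func (s Store) RecordFailure(id, spec string, err error) {
	s[id] = &WorkloadRecord{
		ID: id, Spec: spec, Status: StatusFailed,
		LastError: err.Error(), FailedAt: time.Now(),
	}
}

// Reconcile recreates workloads whose runtime resources are missing
// and returns the IDs it recreated.
func Reconcile(s Store, rt Runtime) []string {
	var recreated []string
	for id, rec := range s {
		if err := rt.Exists(id); !errors.Is(err, ErrNotFound) {
			continue // present, or a different error that is not handled here
		}
		if err := rt.Create(id, rec.Spec); err != nil {
			rec.Status, rec.LastError = StatusFailed, err.Error()
			continue
		}
		rec.Status, rec.LastError = StatusRunning, ""
		recreated = append(recreated, id)
	}
	return recreated
}

// fakeRuntime is a test double that tracks which IDs exist.
type fakeRuntime struct{ present map[string]bool }

func (f *fakeRuntime) Exists(id string) error {
	if f.present[id] {
		return nil
	}
	return ErrNotFound
}

func (f *fakeRuntime) Create(id, spec string) error {
	f.present[id] = true
	return nil
}

func main() {
	store := Store{}
	store.RecordFailure("web-1", "image=nginx", errors.New("scheduler timeout"))
	rt := &fakeRuntime{present: map[string]bool{}}
	fmt.Println(Reconcile(store, rt), store["web-1"].Status)
}
```

A second `Reconcile` call in this sketch is a no-op, because the runtime now reports the workload as present.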

2) Action History API

  • Added ListActions RPC to return action/task history since startup.
  • Added filtering/sorting/limit support (workload_id, action_type, status, limit, newest_first).
  • Server maps task queue snapshots to API actions.
  • Example client now supports -action list-actions.
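
As an illustration of the filtering, sorting, and limit semantics listed above, here is a small sketch. The `Action` and `Filter` types, the `CreatedUnix` field, and the status strings are assumptions made for the example; the real proto messages and server mapping may look different. Empty filter fields and a zero limit are treated as "no constraint".

```go
package main

import (
	"fmt"
	"sort"
)

// Action is a hypothetical in-memory view of one task-queue entry.
type Action struct {
	WorkloadID  string
	Type        string
	Status      string
	CreatedUnix int64 // creation time as Unix seconds (assumed field)
}

// Filter mirrors the request fields named in the PR description.
type Filter struct {
	WorkloadID  string
	ActionType  string
	Status      string
	Limit       int
	NewestFirst bool
}

// FilterActions applies the filter, sorts by creation time, then truncates.
func FilterActions(all []Action, f Filter) []Action {
	out := make([]Action, 0, len(all))
	for _, a := range all {
		if f.WorkloadID != "" && a.WorkloadID != f.WorkloadID {
			continue
		}
		if f.ActionType != "" && a.Type != f.ActionType {
			continue
		}
		if f.Status != "" && a.Status != f.Status {
			continue
		}
		out = append(out, a)
	}
	sort.SliceStable(out, func(i, j int) bool {
		if f.NewestFirst {
			return out[i].CreatedUnix > out[j].CreatedUnix
		}
		return out[i].CreatedUnix < out[j].CreatedUnix
	})
	if f.Limit > 0 && len(out) > f.Limit {
		out = out[:f.Limit]
	}
	return out
}

func main() {
	all := []Action{
		{WorkloadID: "a", Type: "apply", Status: "succeeded", CreatedUnix: 1},
		{WorkloadID: "b", Type: "apply", Status: "failed", CreatedUnix: 2},
		{WorkloadID: "a", Type: "delete", Status: "succeeded", CreatedUnix: 3},
	}
	for _, a := range FilterActions(all, Filter{WorkloadID: "a", NewestFirst: true}) {
		fmt.Println(a.WorkloadID, a.Type, a.Status)
	}
}
```

Applying the limit after sorting means that `newest_first` combined with `limit` returns the most recent N matches rather than an arbitrary N.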

3) Task Queue Correctness

  • Fixed enqueue failure behavior so rejected submissions do not leave ghost task records.
  • Added tests for full-queue rejection cleanup.
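
The ghost-record fix can be sketched as follows, assuming a bounded queue that keeps a record map next to a buffered channel. `Queue`, `Task`, `ErrQueueFull`, and `ErrDuplicateTask` are hypothetical names, not the agent's real types. The record is registered first and rolled back if the channel rejects the task.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var (
	ErrQueueFull     = errors.New("task queue is full")
	ErrDuplicateTask = errors.New("task ID already recorded")
)

// Task is a hypothetical unit of work.
type Task struct{ ID string }

// Queue pairs a bounded channel with a record map used for history lookups.
type Queue struct {
	mu      sync.Mutex
	records map[string]*Task
	pending chan *Task
}

func NewQueue(capacity int) *Queue {
	return &Queue{
		records: make(map[string]*Task),
		pending: make(chan *Task, capacity),
	}
}

// Enqueue registers the task and removes the record again if the
// bounded channel is full, so a rejected submission leaves no trace.
func (q *Queue) Enqueue(t *Task) error {
	q.mu.Lock()
	defer q.mu.Unlock()
	if _, exists := q.records[t.ID]; exists {
		return ErrDuplicateTask // never roll back someone else's record
	}
	q.records[t.ID] = t
	select {
	case q.pending <- t:
		return nil
	default:
		delete(q.records, t.ID) // roll back the ghost record
		return ErrQueueFull
	}
}

// Has reports whether a record exists for the given task ID.
func (q *Queue) Has(id string) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	_, ok := q.records[id]
	return ok
}

func main() {
	q := NewQueue(1)
	fmt.Println(q.Enqueue(&Task{ID: "t1"}))
	fmt.Println(q.Enqueue(&Task{ID: "t2"}), q.Has("t2"))
}
```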

4) VM Delete Safety (Disk Cleanup)

  • Implemented safe cleanup policy:
    • remove deterministic cloud-init ISO artifacts
    • remove only agent-managed VM disks (tracked via marker files)
    • do not remove user-provided/external disks
  • Added parsing/cleanup helpers and focused unit tests for marker-based deletion behavior.
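
A rough sketch of marker-based deletion is shown below. The `.agent-managed` suffix and the `MarkManaged` / `RemoveIfManaged` helpers are invented for illustration; the PR does not state how marker files are named or where they live. A disk is removed only when its marker exists, and a missing marker is treated as "external, leave it alone".

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// markerSuffix is a hypothetical naming convention for marker files.
const markerSuffix = ".agent-managed"

// MarkManaged records that the agent created the disk at diskPath.
func MarkManaged(diskPath string) error {
	return os.WriteFile(diskPath+markerSuffix, []byte("managed\n"), 0o600)
}

// RemoveIfManaged deletes the disk and its marker only when the marker exists.
// It reports whether anything was removed.
func RemoveIfManaged(diskPath string) (bool, error) {
	marker := diskPath + markerSuffix
	if _, err := os.Stat(marker); err != nil {
		if errors.Is(err, fs.ErrNotExist) {
			return false, nil // no marker: user-provided disk, keep it
		}
		return false, err
	}
	if err := os.Remove(diskPath); err != nil && !errors.Is(err, fs.ErrNotExist) {
		return false, err
	}
	if err := os.Remove(marker); err != nil && !errors.Is(err, fs.ErrNotExist) {
		return false, err
	}
	return true, nil
}

// runDemo creates one managed and one external disk in a temp dir, runs the
// cleanup on both, and reports (managed removed, external kept).
func runDemo() (bool, bool, error) {
	dir, err := os.MkdirTemp("", "vm-disks-")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	managed := filepath.Join(dir, "managed.qcow2")
	external := filepath.Join(dir, "external.qcow2")
	for _, p := range []string{managed, external} {
		if err := os.WriteFile(p, []byte("disk"), 0o600); err != nil {
			return false, false, err
		}
	}
	if err := MarkManaged(managed); err != nil {
		return false, false, err
	}
	for _, p := range []string{managed, external} {
		if _, err := RemoveIfManaged(p); err != nil {
			return false, false, err
		}
	}
	_, managedErr := os.Stat(managed)
	_, externalErr := os.Stat(external)
	return errors.Is(managedErr, fs.ErrNotExist), externalErr == nil, nil
}

func main() {
	managedGone, externalKept, err := runDemo()
	fmt.Println(managedGone, externalKept, err)
}
```

Defaulting to "keep" when the marker is absent means a lost marker leaks a disk rather than deleting user data, which matches the safety goal stated above.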

5) Example Client Expansion

  • Added comprehensive runtime flags for real container/VM tests:
    • container image/env/ports/volumes/resources/restart policy, etc.
    • VM vCPU/memory/cloud-init/disks/networks/metadata, plus -spec-file
  • Added robust spec parsing helpers.
  • Added spec packs and scripts under examples/client/:
    • full specs: container/compose/vm
    • minimal specs: container/compose/vm
    • raw helpers: compose*.yaml, cloud-init*.yaml
    • scripts:
      • encode-compose.sh (base64 + optional JSON update)
      • quick-test.sh (smoke test: apply/status/list-actions/delete for container and compose, optionally vm)

6) Dockerfile Optimization

  • Reworked the Dockerfile into a cache-friendly multi-stage build with BuildKit mounts.
  • Added runtime targets:
    • runtime (optimized default)
    • full-runtime (includes additional local daemon tooling)
  • Reduced unnecessary build/runtime overhead and improved CI build repeatability.

7) CI Workflow Upgrade

  • Replaced minimal CI with a comprehensive pipeline:
    • lint-and-validate: gofmt, shell syntax, golangci-lint
    • proto-drift-check: regenerate protobuf + fail on drift
    • unit-tests: race + coverage artifact
    • e2e-tests
    • build-binaries: agent + client compile checks
    • docker-build: validates both Docker targets with Buildx cache

8) Control Plane Communication via mTLS + gRPC

  • Added a control-plane proto for Scheduler <-> Agent communication.
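
For context, the mTLS side of such a channel could be configured roughly as below, using only the Go standard library. This is a sketch under assumptions: the PR does not say which side acts as the gRPC server or how certificates are loaded, and `NewMutualTLSServerConfig` is a hypothetical helper. In a gRPC server the resulting `*tls.Config` would typically be wrapped with `credentials.NewTLS` from `google.golang.org/grpc/credentials`.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
)

// NewMutualTLSServerConfig builds a server-side TLS config that presents the
// given certificate and requires clients to present one signed by clientCAPEM.
func NewMutualTLSServerConfig(certPEM, keyPEM, clientCAPEM []byte) (*tls.Config, error) {
	cert, err := tls.X509KeyPair(certPEM, keyPEM)
	if err != nil {
		return nil, fmt.Errorf("load server key pair: %w", err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(clientCAPEM) {
		return nil, errors.New("no client CA certificates found in PEM input")
	}
	return &tls.Config{
		Certificates: []tls.Certificate{cert},
		ClientCAs:    pool,
		ClientAuth:   tls.RequireAndVerifyClientCert, // this is what makes it mutual
		MinVersion:   tls.VersionTLS13,
	}, nil
}

func main() {
	// Invalid PEM input is rejected before any listener is created.
	_, err := NewMutualTLSServerConfig([]byte("not pem"), []byte("not pem"), []byte("not pem"))
	fmt.Println("rejected invalid input:", err != nil)
}
```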

Validation

  • Full test suite passes locally:
    • go test ./...

Impact

  • Better operational confidence during scheduler failures
  • Stronger auditability/debuggability via action history
  • Safer VM lifecycle cleanup (no accidental external disk deletion)
  • Faster, more deterministic CI/CD feedback loops
  • Easier real-world workload testing from the example client


@miladhzzzz merged commit 5b967af into main on Feb 18, 2026
7 checks passed