Last Updated: 2025-10-31 Version: 1.2
| Window | Theme | Highlights |
|---|---|---|
| 2025-10 | Completed foundations | Secret manager refactor (phases 1-4), CLI & route management hardening, QUIC/HTTP3 enablement, Wazuh SSO P0 fixes |
| 2025-11 | Immediate priorities | Hecate self-enrollment Phase 1, Wazuh SSO P1 hardening, Secret manager Phases 5-6, Config management Phase 0, Authentik client consolidation kickoff |
| 2025-12 → 2026-01 | Near-term delivery | Environment automation Phase 1, Authentik API migration (P2), Backup & restore program launch |
| 2026-02 → 2026-04 | Mid-term focus | Hecate authentication Phase 2, Environment automation Phases 2-3, Hecate Consul/Vault integration |
| 2026-Q2+ | Strategic backlog | Backup & restore advanced features, Caddy/Authentik automation backlog, observability & resilience investments |
Use the dated sections below for sequencing, dependencies, and detailed task lists. Completed work is preserved for traceability and informs risk posture for upcoming phases.
- Unified the `SecretStore` interface in `pkg/secrets/store.go`, added Vault/Consul/file adapters, and ensured every method accepts `context.Context`.
- Replaced the legacy `SecretManager` with `Manager` (`pkg/secrets/manager.go`), added context-aware helpers, and shipped deprecated aliases to preserve caller compatibility.
- Fixed Vault diagnostic path issues (`pkg/debug/bionicgpt/vault_config_diagnostic.go:45-47`) with regression guards in `pkg/secrets/vault_store.go#L78-L81`.
- Migrated seven services to the new API (`pkg/bionicgpt/install.go:256`, `cmd/create/umami.go:48`, `cmd/create/temporal.go:57`, `pkg/cephfs/client.go:68`, etc.); build, vet, and gofmt all green.
- Hecate-specific migration (Phase 4.2) deferred to November; deprecation notice scheduled for 2025-11 once Secret Manager Phases 5-6 are complete.
- Normalised flag experience for `sudo eos update hecate --add ...` (`cmd/update/hecate.go:19-134`), auto-appending known ports and rejecting positional args.
- Added `ValidateNoFlagLikeArgs()` guard (`cmd/update/hecate_add.go:74-77`) and telemetry flag to distinguish invocation modes (`pkg/hecate/add/types.go:24`).
- Removed duplicated logging and aligned both flag and subcommand paths with Admin API-first flow; ten automated tests cover common permutations with IPv6/port edge cases.
- Refined Cobra wiring so orchestration layers delegate business logic cleanly; logging follows CLAUDE.md guidance across `cmd/update/hecate.go` and related helpers.
- Ensured `isAdminAPIAvailable()` gating and fallback logic produce zero-downtime reloads before reverting to file-based updates.
- Opened UDP/443 in both UFW (`pkg/hecate/yaml_generator.go:979-1012`) and Hetzner Terraform rules (`pkg/hecate/terraform_templates.go:85-92`).
- Documented verification checklist (sysctl, `ufw status`, `ss -ulnp`, `curl --http3`) so platform teams can validate HTTP/3 reachability after `eos create hecate`.
- Replaced direct `fmt.Print*` usage with structured logging across interaction helpers, documenting explicit exceptions for `PromptSecret`.
- Added a 20-case unit suite for `validateYesNoResponse` and refreshed README/ADR notes clarifying when stdout is acceptable.
- Removed hardcoded paths/permissions, tracked rollback metadata, and codified constants in `pkg/wazuh/sso_sync.go`.
- Established clean baseline for P1 security improvements (see November plan).
- Retained `pkg/hecate/add/bionicgpt_fix.go.DEPRECATED` for historical context; Admin API-driven workflow now production-ready.
- Introduced declarative service command group (`cmd/service`) with list/init/health/status/reset/logs entry points wired through `eos.Wrap`.
- Added definition loader and discovery utilities (`internal/service/definition.go`) plus execution placeholder to retain compile-time coverage while downstream phases land.
- Published baseline Langfuse definition (`services/langfuse.yaml`) so dependency validation and CLI surfacing can be exercised immediately ahead of executor delivery.
- Verify critical Authentik behaviours against current source (v2025.10) rather than relying solely on documentation.
- Telemetry and fallbacks should ship alongside API migrations to keep rollouts observable.
- Strict logging policies need explicit, well-documented exceptions to remain sustainable.
Context: Three security hardening sprints conducted between January and November 2025 addressed token exposure, TLS validation, and input sanitization, and established a shift-left prevention framework.
- CVSS: 8.5 (High) → 0.0 (Fixed)
- Issue: Vault root tokens exposed in environment variables (`VAULT_TOKEN=<value>`), visible via `ps auxe` and `/proc/<pid>/environ`
- Fix: Created `pkg/vault/cluster_token_security.go` with temporary token file pattern (0400 permissions, immediate cleanup)
- Functions Fixed: 5 functions in `pkg/vault/cluster_operations.go` (ConfigureRaftAutopilot, GetAutopilotState, RemoveRaftPeer, TakeRaftSnapshot, RestoreRaftSnapshot)
- Pattern: `VAULT_TOKEN_FILE=/tmp/vault-token-<random>` instead of `VAULT_TOKEN=<value>`
- Tests: 6 test cases with 100% coverage of security-critical paths
- Compliance: NIST 800-53 SC-12, AC-3; PCI-DSS 3.2.1
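The temporary token file pattern above can be sketched roughly as follows. This is a minimal illustration, not the code in `pkg/vault/cluster_token_security.go`: the helper names `writeTokenFile` and `filePerm` are hypothetical, and how the child process consumes `VAULT_TOKEN_FILE` is outside the sketch.

```go
package main

import (
	"fmt"
	"os"
)

// writeTokenFile writes token to a new temp file, restricts it to owner
// read-only (0400), and returns the path with a cleanup function.
// Hypothetical helper: name and signature are illustrative only.
func writeTokenFile(token string) (string, func(), error) {
	f, err := os.CreateTemp("", "vault-token-*") // os.CreateTemp creates the file with mode 0600
	if err != nil {
		return "", nil, fmt.Errorf("create token file: %w", err)
	}
	path := f.Name()
	cleanup := func() { _ = os.Remove(path) }

	_, werr := f.WriteString(token)
	cerr := f.Close()
	if werr != nil || cerr != nil {
		cleanup()
		return "", nil, fmt.Errorf("write token file: write=%v close=%v", werr, cerr)
	}
	// Tighten to 0400 once the content is in place.
	if err := os.Chmod(path, 0o400); err != nil {
		cleanup()
		return "", nil, fmt.Errorf("chmod token file: %w", err)
	}
	return path, cleanup, nil
}

// filePerm reports the permission bits of path.
func filePerm(path string) (os.FileMode, error) {
	info, err := os.Stat(path)
	if err != nil {
		return 0, err
	}
	return info.Mode().Perm(), nil
}

func main() {
	path, cleanup, err := writeTokenFile("s.example-token")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer cleanup() // remove the file as soon as the child process has finished

	perm, _ := filePerm(path)
	// The child process would receive this entry instead of VAULT_TOKEN=<value>.
	fmt.Printf("VAULT_TOKEN_FILE=%s (mode %04o)\n", path, uint32(perm))
}
```

Only the file path appears in the child's environment, so `ps auxe` and `/proc/<pid>/environ` no longer reveal the token itself.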
- CVSS: 9.1 (Critical) → 0.0 (Fixed)
- Issue: `VAULT_SKIP_VERIFY=1` set unconditionally in `pkg/vault/phase2_env_setup.go:92`, enabling MitM attacks
- Fix: Implemented CA certificate discovery with informed consent framework
- Components:
  - `locateVaultCACertificate()` - searches `/etc/vault/tls/ca.crt`, `/etc/eos/ca.crt`, `/etc/ssl/certs/vault-ca.pem`
  - `handleTLSValidationFailure()` - requires explicit user consent or `Eos_ALLOW_INSECURE_VAULT=true`
  - `isInteractiveTerminal()` - TTY detection for safe prompting
- Behavior: TLS validation enabled by default, bypass only with consent (dev mode) or CA cert unavailable + user approval
- Compliance: NIST 800-53 SC-8, SC-13; PCI-DSS 4.1
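As a rough sketch of the CA discovery step described above: the function names `firstExistingFile` and `tlsConfigFromCA` are hypothetical stand-ins (the actual `locateVaultCACertificate()` may differ), and the consent prompt is left out. The candidate paths are the ones listed in the component description.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"os"
)

// firstExistingFile returns the first candidate that exists as a regular file.
func firstExistingFile(candidates []string) (string, bool) {
	for _, p := range candidates {
		if info, err := os.Stat(p); err == nil && info.Mode().IsRegular() {
			return p, true
		}
	}
	return "", false
}

// tlsConfigFromCA builds a TLS config that trusts the CA bundle at caPath.
// InsecureSkipVerify is left at its zero value, so verification stays on.
func tlsConfigFromCA(caPath string) (*tls.Config, error) {
	pem, err := os.ReadFile(caPath)
	if err != nil {
		return nil, fmt.Errorf("read CA certificate: %w", err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(pem) {
		return nil, errors.New("no valid PEM certificates in " + caPath)
	}
	return &tls.Config{RootCAs: pool, MinVersion: tls.VersionTLS12}, nil
}

func main() {
	candidates := []string{
		"/etc/vault/tls/ca.crt",
		"/etc/eos/ca.crt",
		"/etc/ssl/certs/vault-ca.pem",
	}
	path, ok := firstExistingFile(candidates)
	if !ok {
		// In the documented flow, this is where explicit consent would be required.
		fmt.Println("no CA certificate found; bypass would need explicit consent")
		return
	}
	cfg, err := tlsConfigFromCA(path)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("using CA %s (InsecureSkipVerify=%v)\n", path, cfg.InsecureSkipVerify)
}
```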
- Purpose: Prevent P0-1/P0-2 regression through automated validation
- Three-Layer Defense:
  - Pre-commit hook (`.git/hooks/pre-commit`): 6 security checks (hardcoded secrets, VAULT_SKIP_VERIFY, InsecureSkipVerify, VAULT_TOKEN env vars, hardcoded permissions, security TODOs)
  - CI/CD workflow (`.github/workflows/security.yml`): gosec, govulncheck, TruffleHog secret scanning, SARIF upload
  - Security review checklist (`docs/SECURITY_REVIEW_CHECKLIST.md`): Human-centric process for code reviews
- Philosophy: "Shift Left" - catch security issues at development time, not code review time
- Success Metrics: Zero P0-1/P0-2 regressions detected since implementation
- Issue: Invalid branch names and missing git identity caused repository creation failures
- Fixes:
  - `ValidateBranchName()` - implements all 10 git-check-ref-format rules
  - `sanitizeInput()` - defense against terminal escape sequence injection (CVE-2024-56803, CVE-2024-58251 class)
  - `ValidateRepoName()` - blocks 20+ Gitea reserved names, path traversal, SQL injection
  - Enhanced git identity check with RFC 5322 email validation
  - Forensic debug logging via `EOS_DEBUG_INPUT=1`
- Test Coverage: 63 test cases across branch validation (25), repo validation (28), input sanitization (10)
- Deployment: Added `make deploy` targets for atomic binary swap to production servers
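To illustrate the escape-sequence defence mentioned in the fixes above, here is a minimal sketch of one way to strip terminal control sequences from single-line input. It is an assumption about the general technique, not the actual `sanitizeInput()` implementation; the regular expression covers only CSI and OSC sequences.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"unicode"
)

// escapeSeq matches ANSI CSI sequences (ESC [ ... final byte) and
// OSC sequences (ESC ] ... terminated by BEL or ESC \).
var escapeSeq = regexp.MustCompile(`\x1b\[[0-?]*[ -/]*[@-~]|\x1b\][^\x07\x1b]*(?:\x07|\x1b\\)`)

// sanitizeInput removes escape sequences, then drops any remaining control
// characters (including stray ESC, CR, LF, TAB) and trims surrounding spaces.
// Illustrative only: intended for single-line values such as branch names.
func sanitizeInput(s string) string {
	s = escapeSeq.ReplaceAllString(s, "")
	s = strings.Map(func(r rune) rune {
		if unicode.IsControl(r) {
			return -1 // drop the rune
		}
		return r
	}, s)
	return strings.TrimSpace(s)
}

func main() {
	raw := "feature/\x1b[31mlogin\x1b[0m\r\n"
	fmt.Printf("%q -> %q\n", raw, sanitizeInput(raw))
}
```

Sanitizing before validation means later checks (branch or repository name rules) see only printable characters.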
Context: Comprehensive adversarial security analysis identified 8 categories of P0 violations across 363 command files, requiring systematic remediation in 4 prioritized phases.
Scope: Full codebase scan using OWASP, NIST 800-53, CIS Benchmarks, STRIDE methodology
Critical Issues Identified (P0-Breaking):
- Flag Bypass Vulnerability (CVE-worthy): Only 6/363 commands (1.7%) implement the `ValidateNoFlagLikeArgs()` security check
  - Attack: `eos delete env production -- --force` bypasses safety checks via the `--` separator
  - Impact: Production deletion, running VM deletion, emergency overrides can be bypassed
  - Remediation: Add validation to 357 unprotected commands (12 hours, scriptable)
- Hardcoded File Permissions (Compliance Risk): 732 violations → 0 violations (100% COMPLETE) ✅
  - Status: COMPLETED 2025-11-13 across 7 commits (b8fcabf, a22f4bf, 0276e75, c635bbd, 92a552d, 91c6bf1, 753bd37)
  - Coverage: 331/331 production violations fixed (732 original count included tests/comments - refined to 331 actual violations)
  - Breakdown: 255 generic packages, 49 vault package, 15 consul package, 9 nomad package, 2 vault constants array, 2 intentional exceptions documented
  - Architecture: TWO-TIER pattern implemented - shared constants (pkg/shared/permissions.go: 11 constants) + service-specific (pkg/vault/constants.go: 31 constants, pkg/consul/constants.go: 7 constants)
  - Compliance: SOC2 CC6.1, PCI-DSS 8.2.1, HIPAA 164.312(a)(1) - all permission constants include documented security rationale
  - Exceptions: 2 intentional bitwise operations documented (cmd/read/check.go:75, cmd/backup/restore.go:175) - excluded from remediation as they're dynamic mode modifications, not hardcoded permissions
  - Circular Imports: Resolved via local constant duplication in consul subpackages (validation, config, service, acl) with NOTE comments explaining avoidance strategy
  - Compliance Artifacts (commit 753bd37):
    - Evidence Matrix: pkg/vault/constants.go (lines 391-506) - maps 31 constants to SOC2/PCI-DSS/HIPAA controls, audit-ready documentation
    - Sync Verification: scripts/verify_constant_sync.sh - automated drift detection for 11 duplicated constants, CI/CD ready, all checks pass
  - Verification (commit 91c6bf1): golangci-lint v2 compatibility fixed, linter runs successfully (52 pre-existing issues documented), build produces 93M ELF executable
  - Test Status: Pre-existing failures documented (vault auth timeout, shared fuzz timeout) - unrelated to P0-2, all permission code compiles successfully
- Architecture Boundary Violations: 19 cmd/ files >100 lines (should be <100)
  - Worst: `cmd/debug/iris.go` (1507 lines, 15x over limit)
  - Issue: Business logic in orchestration layer, untestable, unreusable
  - Remediation: Refactor to pkg/ following Assess→Intervene→Evaluate pattern (76 hours)
- fmt.Print Violations (Telemetry Breaking): 298 violations in debug commands
  - Issue: Breaks telemetry, forensics, observability
  - Rule: CLAUDE.md P0 #1 - NEVER use fmt.Print/Println, ONLY otelzap.Ctx(rc.Ctx)
  - Remediation: Convert to structured logging (5 hours, semi-automated)
- Documentation Policy Violations: 5 forbidden standalone .md files
  - Files: P0-1_TOKEN_EXPOSURE_FIX_COMPLETE.md, P0-2_VAULT_SKIP_VERIFY_FIX_COMPLETE.md, P0-3_PRECOMMIT_HOOKS_COMPLETE.md, SECURITY_HARDENING_SESSION_COMPLETE.md, TECHNICAL_SUMMARY_2025-01-28.md
  - Remediation: Consolidate to ROADMAP.md + inline comments, delete standalone files (1 hour) ← COMPLETED 2025-11-13
- Missing Flag Fallback Chain (Human-Centric): Only 5/363 commands use the `interaction.GetRequiredString()` pattern
  - Philosophy Violation: "Technology serves humans" - missing flags should prompt with informed consent, not fail
  - Remediation: Add fallback chain (CLI flag → env var → prompt → default → error) to required flags (3-4 days)
- Insecure TLS Configuration: 19 files with `InsecureSkipVerify: true`
  - Attack: MitM via certificate bypass
  - Justification Required: Dev-only with clear marking, self-signed certs with pinning, or explicit user consent
  - Remediation: Security review + dev/prod split (9.5 hours)
- Command Injection Risk: 1329 direct `exec.Command()` calls bypassing the `execute.Run` wrapper
  - Issue: No argument sanitization, timeout enforcement, telemetry integration
  - Remediation: Migrate to secure wrapper (44 hours, requires security audit)
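For the flag bypass issue listed above, the sketch below shows the general shape of a guard that rejects positional arguments that look like flags. It is an assumption about the technique, not the real `ValidateNoFlagLikeArgs()`; the lowercase name `validateNoFlagLikeArgs` is hypothetical. The underlying behaviour is that everything after a `--` separator reaches the command as positional arguments, so `--force` in the attack example would typically arrive in `args` rather than being parsed as a flag.

```go
package main

import (
	"fmt"
	"strings"
)

// validateNoFlagLikeArgs rejects positional arguments that start with "-".
// A lone "-" (commonly used for stdin) is allowed.
// Hypothetical sketch; the real guard may apply different rules.
func validateNoFlagLikeArgs(args []string) error {
	for _, arg := range args {
		if len(arg) > 1 && strings.HasPrefix(arg, "-") {
			return fmt.Errorf("argument %q looks like a flag; pass flags before any \"--\" separator", arg)
		}
	}
	return nil
}

func main() {
	// Positional args a command would typically see for:
	//   eos delete env production -- --force
	args := []string{"production", "--force"}
	if err := validateNoFlagLikeArgs(args); err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Println("accepted")
}
```

Calling such a guard at the top of each command's run function is what makes the remediation scriptable across the 357 unprotected commands.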
Incomplete Infrastructure (Built but Unused):
- Evidence Collection (`pkg/remotedebug/evidence.go`): 265 lines, 0 users
- Debug Capture (`pkg/debug/capture.go`): 151 lines, 1/13 commands using it
- Unified Authentik Client: Built, but 47 callsites still use old clients
Technical Debt: 841 TODO/FIXME comments, 18 concentrated in cmd/create/wazuh.go alone
Phase 1: Security Critical (P0) - Week 1-2, 3-4 days
- Flag bypass vulnerability: Protect 357 commands with `ValidateNoFlagLikeArgs()` (12h, scriptable)
- InsecureSkipVerify audit: Justify or remove 19 violations (9.5h, manual review)
- Documentation policy: Consolidate 5 forbidden .md files to ROADMAP.md (1h) ← COMPLETED
Deliverables:
- All 357 commands protected
- TLS security audit complete
- CVE announcement: "Flag bypass vulnerability patched in eos v1.X"
Phase 2: Compliance & Architecture (P1) - Week 3-4, 7-10 days
- Hardcoded permissions: Automated replacement for 331 production violations ← COMPLETED 2025-11-13 (100% coverage)
- Architecture violations: Refactor 19 oversized cmd/ files to pkg/ (76h, manual)
- fmt.Print violations: Convert to structured logging (5h, semi-automated)
Deliverables:
- Permission security rationale matrix for SOC2 audit
- 100% of cmd/ files <100 lines
- All debug commands use structured logging
Phase 3: Technical Debt Reduction (P2) - Week 5-6, 5-7 days
- Required flag fallback: Add human-centric pattern to top 100 commands (3-4 days)
- Command injection audit: Migrate to execute.Run wrapper, 80%+ coverage (44h audit)
- HTTP client consolidation: Deprecate old Authentik clients, migration guide (2 days)
- Infrastructure adoption: Integrate evidence collection + debug capture (2 days)
Deliverables:
- Top 100 commands have human-centric UX
- exec.Command audit complete
- Authentik unified client migration guide published
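The required-flag fallback planned for Phase 3 follows the chain CLI flag → env var → prompt → default → error. A minimal sketch of that resolution order is shown below; it is not the `interaction.GetRequiredString()` implementation, and the `sources` type, `resolveRequired` function, and `EOS_APP` variable are hypothetical names used for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// sources bundles the lookups so they can be replaced in tests.
type sources struct {
	lookupEnv func(key string) (string, bool)
	prompt    func(question string) (string, error) // nil when not interactive
}

// resolveRequired walks flag → env var → prompt → default → error and
// reports which source supplied the value.
func resolveRequired(name, flagValue, envKey, def string, src sources) (string, string, error) {
	if flagValue != "" {
		return flagValue, "flag", nil
	}
	if src.lookupEnv != nil {
		if v, ok := src.lookupEnv(envKey); ok && v != "" {
			return v, "env", nil
		}
	}
	if src.prompt != nil {
		v, err := src.prompt(fmt.Sprintf("Enter %s: ", name))
		if err == nil && strings.TrimSpace(v) != "" {
			return strings.TrimSpace(v), "prompt", nil
		}
	}
	if def != "" {
		return def, "default", nil
	}
	return "", "", errors.New("missing required value for --" + name)
}

func main() {
	// Non-interactive demo: no prompt function, so the chain is
	// flag → env → default → error. EOS_APP is a hypothetical variable.
	value, source, err := resolveRequired("app", "", "EOS_APP", "bionicgpt", sources{lookupEnv: os.LookupEnv})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("app=%s (from %s)\n", value, source)
}
```

Returning the source alongside the value makes it possible to log where each setting came from, which fits the telemetry goals stated elsewhere in this roadmap.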
Phase 4: Optimization & Polish (P3) - Week 7-8, 3-5 days
- TODO/FIXME cleanup: Triage 841 comments (50% resolve, 25% → issues, 25% document) (2 days)
- Compliance docs: SOC2/PCI-DSS/HIPAA control matrix (1 day)
- AI alignment: Weekly CLAUDE.md review process (1 day)
- Migration tooling: `eos migrate check` for deprecated patterns (2 days)
Deliverables:
- TODO/FIXME reduced by 75%
- Compliance audit readiness achieved
- Automated pattern migration available
Pre-Remediation (Original State - 2025-11-13):
- Flag bypass: 357/363 commands vulnerable (98.3%)
- Hardcoded permissions: 732 violations → 0 violations (COMPLETED 2025-11-13) ✅
- Architecture violations: 19 files (6-15x over limit)
- fmt.Print violations: 298
- Human-centric flags: 5/363 commands (1.4%)
Current State (2025-11-13):
- Flag bypass: 357/363 commands vulnerable (98.3%) - IN PROGRESS
- Hardcoded permissions: 0 violations (100% COMPLETE) ✅
- Architecture violations: 19 files (6-15x over limit)
- fmt.Print violations: 298
- Human-centric flags: 5/363 commands (1.4%)
Target State (Post-Remediation):
- Flag bypass: 0 commands vulnerable (100% protected)
- Hardcoded permissions: 0 violations (100% ACHIEVED) ✅
- Architecture violations: 0 files >100 lines (100% refactored)
- fmt.Print violations: Debug commands only (with justification)
- Human-centric flags: Top 100 commands (100% Tier 1)
Timeline: 6-8 weeks for complete remediation with sustained focus
- Enrollment lives at the Authentik brand level; providers expose only `authentication_flow`, `authorization_flow`, and `invalidation_flow`.
- Current failures:
  - Self-registration disabled globally because brand `flow_enrollment` is unset.
  - BionicGPT bypasses its documented oauth2-proxy pattern, limiting token lifecycle control.
- Response strategy:
  - Phase 1: enable brand enrollment, pair with per-app authorization policies, deliver immediate self-service.
  - Phase 2: adopt oauth2-proxy + OIDC to match BionicGPT's reference architecture and improve session management.
```python
class Provider(SerializerModel):
    authentication_flow = ForeignKey("Flow", ...)
    authorization_flow = ForeignKey("Flow", ...)
    invalidation_flow = ForeignKey("Flow", ...)
    # enrollment_flow = ...  # ❌ only available on Source classes
```

- Authentik’s separation of enrollment (brand) vs authorization (application) is intentional.
- BionicGPT documentation: `Nginx → oauth2-proxy → External Identity Provider → Bionic Server`.
- Enable enrollment via CLI (Step 1.1, 5 min)
  - `sudo eos update hecate enable self-enrollment --app bionicgpt [--enable-captcha|--dry-run]`
  - Creates flow, prompts, password stage, user creation/login, optional captcha; links to brand and prints enrollment URL.
  - Stage order: captcha (optional) → prompt → password → user write → user login.
  - Command is idempotent and includes rollback guidance (clear brand enrollment flow).
- Bind per-app authorization policies (Step 1.2, 30–60 min)
  - Create `bionicgpt-users` group with attributes.
  - Bind group membership policy to BionicGPT’s application (authorization binding).
  - Ensure other apps (Umami/Grafana/Wazuh) rely on admin-only groups.
- Execute testing matrix (Step 1.3, 15–30 min)
  - New user enrollment success, BionicGPT positive access after group assignment, negative tests for restricted apps, idempotency verification.
- Publish documentation (Step 1.4, 30 min)
  - `/opt/hecate/README-enrollment.md` for end users.
  - `/opt/hecate/RUNBOOK-enrollment.md` for administrators (disable/re-enable, monitoring, audit).
- Self-service enrollment live with optional captcha; enrollment URL communicated.
- Authorization policies prevent lateral movement; group assignments gate access.
- Test matrix executed with verified outcomes; documentation in place.
- Existing user flows unaffected; no regressions in SSO behaviour.
- Expectation mismatch: Document clearly that enrollment remains brand-scoped; per-app gating uses policies.
- Spam enrollments: Encourage `--enable-captcha`; plan SMTP/email verification follow-up.
- Over-engineering Phase 2: Reassess oauth2-proxy migration after 3–6 months of data.
- `development`: Ephemeral, developer-managed, non-federated Authentik, debug logging, disposable state.
- `testing`: CI-driven, self-service disabled, debug logging, fixtures regenerated, auto-shutdown every 24 h.
- `staging`: Production-parity, gated self-service, info logging, config-only persistence, scheduled shutdown exceptions.
- `production`: Always-on with approvals, self-service enabled with audit hooks, persistent replicated volumes.
- `administration`: Restricted control plane (Consul/Vault/build tools) with break-glass workflows and audited logging.
- Store defaults per environment in Consul KV and hydrate during `eos promote`.
- Partition secrets with Vault namespaces or templated paths (`env/<name>/...`).
- Manage Authentik flows via Outpost/PromptFlow to toggle self-registration per environment.
- Tie promotion provenance to Git SHA and artifact digests.
- Automate DNS via Consul service discovery + external-dns pattern.
- Standardise Consul node metadata (`role`, `env`) and enforce via Nomad scheduling constraints.
- Drive log levels from Consul KV to maintain prod quietness vs dev verbosity.
- Default non-prod allocations to `ephemeral_disk`; scrub data on teardown.
- Enforce 24 h stop windows via Nomad periodic jobs and short-lived Vault tokens.
- Abuse-case catalog and environment policy matrix.
- RFC covering promotion prerequisites and audit log schema updates.
- Inventory of current Consul catalog highlighting worker/edge gaps.
- Risk: ensure admin environment segmentation (Consul ACL bootstrap rotation) precedes automation rollout.
- P1 #5 – Exchange key length: confirm SAML expectations, codify `SAMLExchangeKeyLengthBytes` in `pkg/wazuh/types.go`, regenerate keys accordingly.
- P1 #6 – Atomic writes: introduce `pkg/shared/atomic_write.go` to guarantee permissions before write; retrofit all five existing `os.WriteFile` uses.
- P1 #7 – Distributed locking: wrap `ConfigureAuthentication` with Consul-based locks, record KV marker `service/wazuh/sso/configured`, validate contention/timeouts.
- P1 #8 – URL validation: use `shared.SanitizeURL` + `shared.ValidateURL`, enforce HTTPS and public hostnames, reject localhost/invalid ports with actionable errors.
- P1 #9 – Read-only health check: add `GetSAMLProviderByName()` / `GetApplicationBySlug()` helpers so health checks never create resources; surface warnings when drift detected.
- P1 #10 – TLS trust posture: add `ServiceOptions.CustomCACert`, document preferred `--ca-cert` flag, only fall back to `--allow-insecure-tls` with explicit warnings.
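The atomic-write item (P1 #6) asks for permissions to be guaranteed before any data is written. One common way to do that is write-to-temp-then-rename, sketched below. This is an illustration under assumptions, not the planned `pkg/shared/atomic_write.go`; `atomicWriteFile` and `demoAtomicWrite` are hypothetical names, and the rename is atomic only when the temp file and target are on the same filesystem.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// atomicWriteFile writes data to path via a temp file in the same directory.
// Permissions are applied before any data is written, and the final rename
// replaces the target in one step on POSIX filesystems.
func atomicWriteFile(path string, data []byte, perm os.FileMode) (err error) {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".tmp-"+filepath.Base(path)+"-*")
	if err != nil {
		return fmt.Errorf("create temp file: %w", err)
	}
	tmpName := tmp.Name()
	defer func() {
		if err != nil { // on any failure, leave no temp file behind
			tmp.Close()
			os.Remove(tmpName)
		}
	}()

	if err = tmp.Chmod(perm); err != nil {
		return fmt.Errorf("set permissions: %w", err)
	}
	if _, err = tmp.Write(data); err != nil {
		return fmt.Errorf("write data: %w", err)
	}
	if err = tmp.Sync(); err != nil { // flush to disk before the rename
		return fmt.Errorf("sync: %w", err)
	}
	if err = tmp.Close(); err != nil {
		return fmt.Errorf("close: %w", err)
	}
	if err = os.Rename(tmpName, path); err != nil {
		return fmt.Errorf("rename into place: %w", err)
	}
	return nil
}

// demoAtomicWrite writes a file twice in a scratch directory and returns the
// final content and permission bits.
func demoAtomicWrite() (string, os.FileMode, error) {
	dir, err := os.MkdirTemp("", "atomic-demo-")
	if err != nil {
		return "", 0, err
	}
	defer os.RemoveAll(dir)

	target := filepath.Join(dir, "sso.conf")
	if err := atomicWriteFile(target, []byte("v1"), 0o600); err != nil {
		return "", 0, err
	}
	if err := atomicWriteFile(target, []byte("v2"), 0o640); err != nil {
		return "", 0, err
	}
	data, err := os.ReadFile(target)
	if err != nil {
		return "", 0, err
	}
	info, err := os.Stat(target)
	if err != nil {
		return "", 0, err
	}
	return string(data), info.Mode().Perm(), nil
}

func main() {
	content, perm, err := demoAtomicWrite()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("content=%q mode=%04o\n", content, uint32(perm))
}
```

Unlike a plain `os.WriteFile`, readers never observe a half-written file, and the file is never briefly visible with looser permissions than intended.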
Deployment stages:
- Non-breaking updates (key length, atomic writes, validation).
- Behavioural changes (locking, read-only health checks, TLS enhancements).

Rollback per item; full build/vet/test suites must pass before promotion.
- Phase 5 – Upgrade & Test
  - Bump Vault SDK to v1.22.0; run `go test` across `pkg/secrets`, `pkg/vault`, service packages, and build binaries.
  - Manual validation: `eos create vault`, `eos create bionicgpt`, `eos debug bionicgpt`, `eos create umami`, secrets rotation.
  - Pass criteria: automated tests green, manual checklist complete, no performance regression.
- Phase 5.4 Enhancements
  - Add capability verification helpers, context caching, UX-focused error messages, and token rate limiting for `vault_cluster` commands (`cmd/update/vault_cluster.go`, `pkg/vault/auth_cluster.go`).
- Phase 6 – Documentation & Migration Guide
  - Update `CLAUDE.md`, `CHANGELOG.md`, `pkg/secrets/README.md`.
  - Publish `docs/SECRET_MANAGEMENT.md` (architecture + examples) and `docs/MIGRATION_SECRET_MANAGER.md` (step-by-step).
  - Extend vault cluster documentation with detailed Godoc, UX prompts, troubleshooting, and testing requirements.
- Phase 1 (Nov 03 → Nov 14): deliver persisted state manager (`internal/service/state.go`), lock-file protection, and container/command/variable preflight checks surfaced via `eos service init --dry-run`. Include validation-focused unit tests plus operator docs covering the new workflow.
- Phase 2 (Nov 17 → Nov 28): implement executor loop with retry/backoff utilities, HTTP healthcheck + API call handlers, and structured logging to `~/.eos/logs/service-<name>.log`. Resume semantics should reach parity with scaffolding before December resilience work.
- Exit criteria: Langfuse definition can complete dry-run successfully, and CI covers state/preflight paths.
- Risks: Vault ACL alignment for state/log directories and potential scheduling conflicts with Secret Manager Phase 5 testing window.
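The retry/backoff utilities planned for the Phase 2 executor could look roughly like the sketch below. It assumes capped exponential backoff with context cancellation; the names `retry`, `backoffDelay`, and `flakyDemo` are hypothetical, and jitter is omitted for brevity.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// backoffDelay returns base doubled once per attempt, capped at max.
func backoffDelay(base, max time.Duration, attempt int) time.Duration {
	d := base
	for i := 0; i < attempt && d < max; i++ {
		d *= 2
	}
	if d > max {
		return max
	}
	return d
}

// retry calls fn up to attempts times, sleeping with capped exponential
// backoff between failures, and stops early if ctx is cancelled.
func retry(ctx context.Context, attempts int, base, max time.Duration, fn func() error) error {
	if attempts < 1 {
		return errors.New("attempts must be at least 1")
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i == attempts-1 {
			break // no sleep after the final attempt
		}
		select {
		case <-time.After(backoffDelay(base, max, i)):
		case <-ctx.Done():
			return fmt.Errorf("%w (last error: %v)", ctx.Err(), err)
		}
	}
	return fmt.Errorf("after %d attempts: %w", attempts, err)
}

// flakyDemo simulates a step that fails twice before succeeding and
// returns how many calls were made.
func flakyDemo() (int, error) {
	calls := 0
	err := retry(context.Background(), 5, time.Millisecond, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errors.New("service not ready")
		}
		return nil
	})
	return calls, err
}

func main() {
	calls, err := flakyDemo()
	fmt.Printf("calls=%d err=%v\n", calls, err)
}
```

A health-check handler would pass its HTTP probe as `fn`; wrapping the last error with `%w` keeps the original failure available to callers.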
Context: Adversarial security analysis (2025-01-27) identified 3 CRITICAL, 4 HIGH, 3 MEDIUM vulnerabilities requiring immediate remediation before production deployment.
Compliance Risk: Violates PCI-DSS 3.2.1, SOC2 CC6.1, HIPAA encryption requirements.
- Issue: Vault tokens in `VAULT_TOKEN=<value>` visible in `ps auxe`, `/proc/<pid>/environ`
- Location: `pkg/vault/cluster_operations.go` (5 functions)
- Fix: 2 hours - temporary token files with 0400 perms
- Reference: NIST 800-53 SC-12
- Issue: TLS validation disabled, enables MITM attacks
- Location: `pkg/vault/phase2_env_setup.go:92`
- Fix: 3 hours - CA certificate validation with user consent
- Reference: NIST 800-53 SC-8
- Issue: No automated checks prevent regressions
- Fix: 1 hour - `.git/hooks/pre-commit` + CI workflow
- P1-4: HTTP Client Consolidation (Wazuh) - 1 hour
- P1-5: Database Credential Sanitization - 30 min
- P1-6: Hardcoded Permissions Migration - 30 min
- P2-7: Secrets Rotation Framework - 4 weeks
- P2-8: Compliance Documentation - 2 weeks
- P3-9: Security Observability - 2 weeks
- P3-10: Threat Modeling - 1 week
- P3-11: DR Testing Enhancement - Ongoing
- P0 #1: Sanitised runtime export by redacting sensitive env vars via `sanitizeContainerSecrets()` (`pkg/hecate/authentik/export.go`).
- P0 #2: Established `UnifiedClient` scaffolding (`pkg/authentik/unified_client.go`) and migration guide (`pkg/authentik/MIGRATION.md`) for future consolidation.
- P1 #3: Added Authentik blueprint export (`pkg/authentik/blueprints.go`) alongside existing JSON outputs.
- P1 #5: Integrated PostgreSQL backups into export pipeline (`pkg/hecate/authentik/export.go`/`validation.go`).
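The env-var redaction in P0 #1 can be pictured with a small sketch like the one below. It is an assumption about the general approach, not the actual `sanitizeContainerSecrets()`; the marker list and the `redactEnv` name are hypothetical, and substring matching deliberately errs toward over-redaction.

```go
package main

import (
	"fmt"
	"strings"
)

// sensitiveMarkers are substrings that mark an env var name as secret.
// Hypothetical list; a real implementation may use a different set.
var sensitiveMarkers = []string{"SECRET", "PASS", "TOKEN", "KEY"}

const redacted = "***REDACTED***"

// isSensitive reports whether the variable name contains a sensitive marker.
func isSensitive(key string) bool {
	upper := strings.ToUpper(key)
	for _, m := range sensitiveMarkers {
		if strings.Contains(upper, m) {
			return true
		}
	}
	return false
}

// redactEnv returns a copy of env ("KEY=value" entries) with the values of
// sensitive variables replaced. Entries without "=" are passed through.
func redactEnv(env []string) []string {
	out := make([]string, 0, len(env))
	for _, entry := range env {
		key, _, found := strings.Cut(entry, "=")
		if found && isSensitive(key) {
			out = append(out, key+"="+redacted)
			continue
		}
		out = append(out, entry)
	}
	return out
}

func main() {
	env := []string{
		"AUTHENTIK_SECRET_KEY=abc123",
		"AUTHENTIK_HOST=https://auth.example.com",
		"PG_PASS=hunter2",
	}
	for _, e := range redactEnv(env) {
		fmt.Println(e)
	}
}
```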
- P2 #6 – Precipitate function: Decide on API→disk sync approach (recommended: embrace Caddy’s persistence and document template-only stance).
- P2 #7 – OpenAPI client generation: Adopt `oapi-codegen`, create wrapper aligning with `RuntimeContext`, automate schema refresh (weekly GitHub Action), and migrate callers incrementally.
- P3 Items (deferred): automation tooling, full migration of `pkg/hecate/authentik/` into unified client once wrappers stabilise.
- Container name alignment (`authentik-server`), `AUTHENTIK_HOST` env var, Caddy Admin API port binding, UDP/443 exposure, health-check addition.
- Validated via fresh VM `eos create hecate`.
- Self-service snippet generator.
- Flow slug auto-discovery with pagination/rate limiting.
- `ServiceOptions` extensions for self-service controls.
- Logout URL templating fixes.
- Integration testing in progress.
- Inject self-service handlers into SSO templates, test across multiple services, validate custom flow discovery, run end-to-end enrol/reset/logout flows, and update documentation.
| Phase | Priority | Timeline | Effort | Blocker | Dependencies |
|---|---|---|---|---|---|
| A: Option B (Drift Detection) | P0 | ✅ Complete | 8 h | None | None |
| B.1: Critical Template Fixes | P0 | 2025-11-01 → 2025-11-08 | 4 h | None | None |
| B.2: Self-Service Endpoints | P0 | 2025-11-08 → 2025-11-15 | 8 h | B.1 | Authentik API access |
| B.3: High-Priority Fixes | P1 | Parallel to B.2 | 3 h | None | None |
| C: Precipitate Pattern | P2 | | 100 h+ | Converter, comment handling, secrets | None |
| D: Redis Deprecation | P2 | 2026-02 → 2026-06 | 12 h | None | Eos v2.0 release |
| E: Worker Security Review | P1 | 2026-04 | 16 h | Authentik upstream research | None |
- Implement `eos promote --to testing` profile loader backed by Consul defaults and Vault path rewrites.
- Enforce Authentik self-service disabled via API push before Nomad submissions.
- Deploy Nomad periodic job `eos-gc-dev-testing` for 24 h shutdowns with notifications.
- Acceptance: CI promotes latest green build with deterministic defaults; rollback validated.
- Enforce node metadata (`role` constraints) across dev/testing; prohibit persistent volumes via policy pack.
- Monitor Authentik events, gather user feedback, refine policies, log issues for Phase 2 planning.
- Generate OpenAPI client, wrap with Eos conventions, and migrate high-impact callers (Hecate, Wazuh).
- Establish CI workflow for weekly schema diffs; add regression tests ensuring generated structs match live API responses.
- Current state: exports include Authentik secrets redaction, blueprint, Postgres dump; remaining gaps focus on automation and verification.
- Upcoming (Nov–Dec 2025):
- Automate backup scheduling, verification (SHA256 checks), and rotation.
- Document restore runbooks per environment.
- Success metrics: 100% verified backups, documented RTO/RPO, rehearsed restore for at least one production-like workload.
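The SHA256 verification step mentioned above amounts to hashing the backup and comparing against a recorded digest. A minimal sketch follows; the function names are hypothetical, and where the expected digest is stored (sidecar file, manifest, KV) is left open.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"strings"
)

// sha256Hex streams r through SHA-256 and returns the lowercase hex digest.
func sha256Hex(r io.Reader) (string, error) {
	h := sha256.New()
	if _, err := io.Copy(h, r); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// verifyReader compares the digest of r with expectedHex (case-insensitive).
func verifyReader(r io.Reader, expectedHex string) error {
	got, err := sha256Hex(r)
	if err != nil {
		return fmt.Errorf("hash backup: %w", err)
	}
	want := strings.TrimSpace(expectedHex)
	if !strings.EqualFold(got, want) {
		return fmt.Errorf("checksum mismatch: got %s, want %s", got, want)
	}
	return nil
}

// verifyFile verifies a backup file on disk without loading it into memory.
func verifyFile(path, expectedHex string) error {
	f, err := os.Open(path)
	if err != nil {
		return fmt.Errorf("open backup: %w", err)
	}
	defer f.Close()
	return verifyReader(f, expectedHex)
}

// verifyString is a convenience wrapper used for the demo below.
func verifyString(content, expectedHex string) error {
	return verifyReader(strings.NewReader(content), expectedHex)
}

func main() {
	// Well-known SHA-256 digest of the ASCII string "abc".
	const abcDigest = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
	if err := verifyString("abc", abcDigest); err != nil {
		fmt.Println("verification failed:", err)
		return
	}
	fmt.Println("backup verified")
}
```

Streaming through `io.Copy` keeps memory use flat, which matters for database dumps of arbitrary size.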
- Finalise guides, run manual migration dry-run using docs, ensure CLAUDE patterns reference new API.
- Phase 3 (Dec 01 → Dec 19): harden executor with idempotent checks, edge-case handlers, and persisted summary output. Introduce Vault write + env update + docker restart step handlers, plus regression tests covering resume and failure paths.
- Phase 4 (Jan 06 → Jan 17): migrate Langfuse bootstrap to the new executor, including integration test harness (`test/integration/langfuse_init.sh`) and operator docs. Retire legacy shell script once end-to-end validation completes.
- Exit criteria: `eos service init langfuse` completes end-to-end in staging, and roadmap sign-off is obtained to decommission ad-hoc scripts.
- Risks: coordination with BionicGPT releases for env updates, and ensuring Vault/Consul credentials align with production guardrails.
- Create Authentik OIDC provider for BionicGPT; manage credentials via Vault.
- Deploy oauth2-proxy sidecar (docker-compose) with token refresh validation and header passthrough.
- Update Caddy to route through oauth2-proxy; remove forward-auth configuration, add health checks.
- Execute blue/green migration, run end-to-end/regression/perf testing, and verify rollback plan.
- Update documentation and clean up deprecated file-based routes post-verification.
- Generalise service definitions for Authentik and BionicGPT, building shared step templates where possible.
- Extend executor to support database query handlers and remote state (Vault) options if warranted by production usage.
- Publish operator playbooks and ADR describing declarative service onboarding, and baseline monitoring dashboards for init flows.
- Exit criteria: at least three services running through the framework with integration tests; legacy per-service scripts deprecated.
- Risks: scope creep into full environment automation, ensuring Docs/Support teams trained before retiring old flows.
- Phase 2 (Testing → Staging, 2026-02-01 → 2026-03-15):
  - Add evidence collection (integration tests, vuln scans) as promotion prerequisites.
  - Require dual approvals (`eos promote approve --require-role`) aligned with CLAUDE governance.
  - Enable staging self-service flows, populate staging DNS via Consul catalog sync, extend 24 h shutdown scheduler with calendar exceptions.
  - Highlight drift between node metadata and workloads.
- Phase 3 (Staging → Production, 2026-03-16 → 2026-04-30):
  - Enforce change windows (PagerDuty API integration), implement canary/halt rules via Nomad `progress_deadline` and telemetry hooks.
  - Harden Vault automation (capability checks, admin token caching, rate limiting) per Secret Manager Phase 5.4 outcomes.
- Goals: encode environment defaults in Consul KV, hydrate Nomad templates, and align Vault secret paths per environment.
- Dependencies: Secret Manager Phase 5/6 completion, environment automation Phase 1 success.
- Milestones: KV schema design, template refactor, Vault namespace/path migration, testing across environments.
- Deliver automated restore validation in staging, integrate into quarterly DR exercises.
- Implement incremental backups, off-site replication, and automated restore drills.
- Target full feature completion by 2026-06-30 with scheduled DR rehearsals.
- P2 items: Admin API rate limiting, DNS validation strictness (`--dev`/`--prod` flags), backup integrity verification, `--remove` flag implementation.
- Q2 backlog: Authentik API circuit breaker, Caddy observability command (`eos read hecate metrics`).
- Documented need for automated API→disk sync or official stance on template usage.
- Evaluate Precipitate pattern and CLI UX enhancements once Phase B self-service stabilises.
- Complete migration of remaining callers after OpenAPI client adoption.
- Consider schema-driven policy enforcement and automatic drift detection once wrappers mature.
- Prioritised items for upcoming quarters:
  - P1 (Nov 2025): Admin API network segmentation, token discovery simplification.
  - P2 (Q1 2026): Backup verification, rate limiting, DNS strictness, `--remove` flag.
  - P3 (Q2 2026): Circuit breakers, metrics/observability.
- Success metrics:
  - November 2025: Admin API segmentation + token discovery fix.
  - Q1 2026: `--remove` flag, verified backups, rate limiting, DNS gating.
  - Q2 2026: Authentik circuit breaker, Caddy metrics visibility.
- Multiplayer CLI UX improvements triggered by user feedback or Q1 2026 sprint.
- Redis deprecation (P2, 2026-02 → 2026-06) aligned with Eos v2.0.
- Worker security review (P1, 2026-04) dependent on Authentik upstream research.
- User expectation mismatch (Hecate Phase 2): communicate that enrollment remains brand-level; rely on policies for app gating.
- Over-engineering oauth2-proxy: re-evaluate after Phase 1 data; defer if benefits limited.
- Authentik API schema drift: weekly OpenAPI regeneration, automated diff checks.
- Concurrent SSO provisioning: Consul-based locking plus KV markers prevent destructive overlap.
- Vault admin automation: capability verification and token rate limiting reduce blast radius; cache tokens per `RuntimeContext`.
- Rootless Docker vs permissions: evaluate feasibility, document risk acceptance if unavoidable, require explicit consent during `eos create hecate`.
- Self-Enrollment: Eligible services reachable within 60 s of signup; policy violations blocked with clear messaging; <1% enrolment failure rate.
- Secret Manager: All core commands (`eos create`, `eos debug`) succeed with new manager; documentation-guided migration validated by dry-run; zero regressions reported post-upgrade.
- Wazuh SSO: No unauthorized access during chaos testing; health checks detect missing resources without side effects; TLS validation supports custom CA without disabling verification.
- Environment Automation: Promotions produce deterministic configs; automated evidence attached to staging promotions; drift detection dashboards show zero critical discrepancies.
- Backup & Restore: 100% of scheduled backups pass verification; at least one quarterly restore exercise completed per environment tier.
- Authentik Client Migration: Generated client passes schema parity tests; wrapper preserves logging/context patterns; migration issues tracked/resolved within sprint.
- Weekly async updates in #eos-infra summarising progress against timeline buckets.
- Anchor documents (`docs/SECRET_MANAGEMENT.md`, forthcoming oauth2-proxy migration guide) shared in PR descriptions and linked from README.
- For cross-team dependencies (Product, SRE), use `eos promote` governance hooks (`--require-role`) and change calendar integrations.
- Publish Authentik schema diffs via automated PRs; review cadence weekly.
- Document risk acceptances and mitigation status in CLAUDE.md addenda.
- Primary contact: @henry
- File issues referencing roadmap area tags (e.g. `[auth-phase1]`, `[secret-manager]`, `[wazuh-sso]`).
- Supporting docs: `docs/SECRET_MANAGER_REFACTORING_PLAN.md`, future oauth2-proxy migration runbook.
- Authentik 2025.10 source (`authentik/core/models.py`, `authentik/providers/oauth2/models.py`).
- Authentik documentation: https://docs.goauthentik.io/docs/providers/oauth2/
- BionicGPT architecture: https://bionic-gpt.com/docs/running-a-cluster/running-authentication/
- Caddy Admin API docs: https://caddyserver.com/docs/api
- HashiCorp Nomad/Consul/Vault 2024.5 hardening guides.
- CLAUDE.md governance rules and recent adversarial analyses (2025-10-28, 2025-10-31).