diff --git a/enhancements/platform/project-addons/README.md b/enhancements/platform/project-addons/README.md new file mode 100644 index 0000000..d2c622d --- /dev/null +++ b/enhancements/platform/project-addons/README.md @@ -0,0 +1,556 @@ +--- +status: provisional +stage: alpha +latest-milestone: "v0.1" +--- + + + + +# Milo Project Addons + + + + + +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + +## Summary + + + +When a new `Project` is created today, the controller always ensures two things: +1. A per-project `default` namespace is created. +2. If the Gateway API CRD is installed, a hardcoded `GatewayClass` is created (`datum-external-global-proxy`). + +The first behavior (namespace) is reasonable as a bootstrap step, but the second is tied to a specific environment and not generally applicable. We need a configurable **Project Add-Ons** mechanism so operators can define what defaults should be applied for their projects. 
This way, environments like Datum Cloud can still install their required `GatewayClass` automatically, while other operators can choose different defaults or none at all. This makes projects more portable and extensible, and reduces the assumptions baked into the controller. + +## Motivation + + + +The motivation is to decouple environment-specific resources from the generic project lifecycle. Today, the controller creates the `GatewayClass` for every user whenever the Gateway API CRD exists, even if their cluster does not need it. By introducing a configurable add-ons mechanism, operators can declaratively choose what gets installed by default for new projects, avoiding unnecessary or unwanted resources. + +This also provides a natural extension point for other types of defaults (e.g., network policies, limit ranges, quotas, PodSecurity policies) without hardcoding them into the controller. Over time, all “default project stuff” should move into this mechanism so that project bootstrap is consistent and pluggable. + +### Goals + + + +- Provide a way for operators to declare project add-ons (sets of resources/templates). +- Allow default add-ons to be applied automatically when a project is created. +- Make add-ons **opt-in** or **profile-based** so different environments can select what’s right for them. +- Preserve idempotency and keep project status clear (showing which add-ons are applied and whether they are ready). +- Replace the hardcoded `GatewayClass` logic with a declarative add-on. + +### Non-Goals + + + +- We are not building a full package manager (e.g., Helm, Argo CD). The scope is limited to applying predefined resources into projects. +- We are not changing the finalizer or purge semantics beyond ensuring add-on resources are cleaned up properly. +- We are not handling CRD installation itself; add-ons will assume their target CRDs already exist. 
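As an illustration of the goals listed above (declared add-ons, defaults applied automatically, profile-based opt-in, and an idempotent per-project status), the selection logic might be sketched as follows. This is only a sketch under assumptions: the `AddOn` and `AddOnStatus` types, the `SelectAddOns` and `Reconcile` functions, the add-on names, and the `datum-cloud` profile name are all hypothetical and are not part of the proposal or of any existing API.

```go
package main

import (
	"fmt"
	"sort"
)

// AddOn is a hypothetical declaration of a set of resources/templates
// that can be applied to a project. Not an existing API type.
type AddOn struct {
	Name     string
	Default  bool     // applied to every new project
	Profiles []string // profiles that opt in to this add-on
}

// AddOnStatus is a hypothetical per-project status entry recording
// that an add-on is applied and whether it is ready.
type AddOnStatus struct {
	Name  string
	Ready bool
}

func contains(xs []string, x string) bool {
	for _, v := range xs {
		if v == x {
			return true
		}
	}
	return false
}

// SelectAddOns returns the sorted, de-duplicated names of the add-ons
// that apply to a project with the given profile: every default add-on
// plus those that list the profile explicitly.
func SelectAddOns(catalog []AddOn, profile string) []string {
	seen := map[string]bool{}
	for _, a := range catalog {
		if a.Default || contains(a.Profiles, profile) {
			seen[a.Name] = true
		}
	}
	names := make([]string, 0, len(seen))
	for n := range seen {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

// Reconcile builds the desired status list from the selected add-ons,
// keeping existing entries (and their readiness) and adding new ones as
// not ready. Running it again on its own output changes nothing.
func Reconcile(status []AddOnStatus, selected []string) []AddOnStatus {
	byName := map[string]AddOnStatus{}
	for _, s := range status {
		byName[s.Name] = s
	}
	out := make([]AddOnStatus, 0, len(selected))
	for _, n := range selected {
		if s, ok := byName[n]; ok {
			out = append(out, s)
		} else {
			out = append(out, AddOnStatus{Name: n})
		}
	}
	return out
}

func main() {
	// Hypothetical catalog: one environment-specific add-on, one default.
	catalog := []AddOn{
		{Name: "gateway-class", Profiles: []string{"datum-cloud"}},
		{Name: "network-policy", Default: true},
	}
	selected := SelectAddOns(catalog, "datum-cloud")
	fmt.Println(selected)
	fmt.Println(Reconcile(nil, selected))
}
```

In this sketch, a project without a matching profile would receive only the default add-ons, which mirrors the intent that the `GatewayClass` becomes something an environment opts into rather than something the controller always creates.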
+ +## Proposal + + + +### User Stories (Optional) + + + +#### Story 1 + +#### Story 2 + +### Notes/Constraints/Caveats (Optional) + + + +### Risks and Mitigations + + + +## Design Details + + + +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + + + +#### How can this feature be enabled / disabled in a live cluster? + + + +- [ ] Feature gate + - Feature gate name: + - Components depending on the feature gate: +- [ ] Other + - Describe the mechanism: + - Will enabling / disabling the feature require downtime of the control plane? + - Will enabling / disabling the feature require downtime or reprovisioning of a node? + +#### Does enabling the feature change any default behavior? + + + +#### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + + + +#### What happens if we reenable the feature if it was previously rolled back? + +#### Are there any tests for feature enablement/disablement? + +### Rollout, Upgrade and Rollback Planning + + + +#### How can a rollout or rollback fail? Can it impact already running workloads? + + + +#### What specific metrics should inform a rollback? + + + +#### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + + + +#### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + +### Monitoring Requirements + + + +#### How can an operator determine if the feature is in use by workloads? + + + +#### How can someone using this feature know that it is working for their instance? + + + +- [ ] Events + - Event Reason: +- [ ] API .status + - Condition name: + - Other field: +- [ ] Other (treat as last resort) + - Details: + +#### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +#### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? 
+ + + +- [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: +- [ ] Other (treat as last resort) + - Details: + +#### Are there any missing metrics that would be useful to have to improve observability of this feature? + + + +### Dependencies + + + +#### Does this feature depend on any specific services running in the cluster? + + + +### Scalability + + + +#### Will enabling / using this feature result in any new API calls? + + + +#### Will enabling / using this feature result in introducing new API types? + + + +#### Will enabling / using this feature result in any new calls to the cloud provider? + + + +#### Will enabling / using this feature result in increasing size or count of the existing API objects? + + + +#### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + + +#### Will enabling / using this feature result in non-negligible increase of resource usage in any components? + + + +#### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + + + +### Troubleshooting + + + +#### How does this feature react if the API server is unavailable? + +#### What are other known failure modes? + + + +#### What steps should be taken if SLOs are not being met to determine the problem? + +## Implementation History + + + +## Drawbacks + + + +## Alternatives + + + +## Infrastructure Needed (Optional) + +