
feat(observability): split into composable per-component tasks #28

Open
scotwells wants to merge 3 commits into main from feat/composable-observability

Conversation

@scotwells
Contributor

Summary

  • Split the monolithic observability stack into per-component kustomization directories and Taskfile tasks, so downstream repos can install only the components they need (e.g. Victoria Metrics + OTel Collector for resource-metrics e2e tests)
  • Preserved backward compatibility: install-observability is now a thin composite that calls all sub-tasks, producing the same result as before
  • Added webhook retry logic to the OTel Collector task to handle the v1beta1 CRD race condition that causes intermittent failures
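A retry of this shape works around the webhook race (a sketch only: the attempt count, backoff, and manifest path are illustrative assumptions, not the repo's exact Taskfile contents):

```yaml
version: '3'

tasks:
  install-otel-collector:
    cmds:
      - |
        # The operator's admission/conversion webhook may not be serving yet
        # when the first v1beta1 OpenTelemetryCollector apply arrives, so
        # retry a few times before failing (5 attempts / 10s are assumptions).
        for i in 1 2 3 4 5; do
          kubectl apply -k components/observability/otel-collector && exit 0
          echo "apply failed (webhook not ready?), retry $i/5"; sleep 10
        done
        exit 1
```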

New tasks

| Task | Description | Dependencies |
| --- | --- | --- |
| install-prometheus-crds | Prometheus Operator CRDs | — |
| install-victoria-metrics | Victoria Metrics (vmagent + vmsingle) | install-prometheus-crds |
| install-otel-collector | OpenTelemetry Operator + Collector | — |
| install-grafana | Grafana Operator + instance + datasources | install-victoria-metrics |
| install-loki | Loki for log aggregation | — |
| install-tempo | Tempo for distributed tracing | — |
| install-observability | Full stack (unchanged interface) | all of the above |
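The backward-compatible composite can be as simple as delegating to the sub-tasks in dependency order (a sketch assuming Taskfile v3 syntax; the real task bodies may differ):

```yaml
version: '3'

tasks:
  install-observability:
    desc: Full observability stack (same interface as before the split)
    cmds:
      - task: install-prometheus-crds
      - task: install-victoria-metrics   # needs the Prometheus CRDs
      - task: install-otel-collector
      - task: install-loki
      - task: install-tempo
      - task: install-grafana            # needs Victoria Metrics for datasources
```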

Structure

Each component has its own kustomization.yaml under components/observability/<component>/ that is self-contained (includes namespace + helm-repositories). The root kustomization.yaml references individual files from subdirectories to avoid resource duplication.

```
components/observability/
├── kustomization.yaml              # root - references all components (backward compat)
├── namespace.yaml                  # shared
├── helm-repositories.yaml          # shared
├── prometheus-crds/
├── victoria-metrics/
├── otel-collector/
├── grafana/
│   └── datasources/
├── loki/
└── tempo/
```
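After the follow-up commit that removes the duplicated shared files, the root kustomization can compose the six subdirectories directly (a minimal sketch, not the exact file):

```yaml
# components/observability/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - prometheus-crds
  - victoria-metrics
  - otel-collector
  - grafana
  - loki
  - tempo
```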

Test plan

  • kustomize build components/observability produces the same resource set as before
  • kustomize build components/observability/<component> works for each component individually
  • task --list shows all new tasks
  • task install-victoria-metrics install-otel-collector deploys only VM + OTel on a constrained cluster
  • task install-observability still deploys the full stack

🤖 Generated with Claude Code

scotwells and others added 2 commits April 15, 2026 20:54
The monolithic install-observability task deploys the entire telemetry
stack as a single blob, which doesn't fit on resource-constrained CI
runners. Downstream repos like resource-metrics only need Victoria
Metrics + OTel Collector for their e2e tests.

Split the observability stack into per-component kustomization
directories and Taskfile tasks so consumers can install only what they
need. The existing install-observability task is preserved as a thin
composite that calls all sub-tasks, maintaining full backward
compatibility.

New tasks:
- install-prometheus-crds
- install-victoria-metrics (depends on prometheus-crds)
- install-otel-collector (with webhook retry logic)
- install-grafana (depends on victoria-metrics)
- install-loki
- install-tempo

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…mponent its own namespace

Problem 1: the composable split had copy-pasted namespace.yaml,
helm-repositories.yaml, and datasources/ into each component
subdirectory while leaving the originals at the root. That duplicated
~200 lines and caused the root kustomize build to fail on conflicting
resources.

- Deleted the root-level namespace.yaml, helm-repositories.yaml, and
  datasources/ directory.
- Kept a single copy of the datasources under grafana/datasources/ (where
  the Grafana instance lives).
- Pared each component's helm-repositories.yaml to just the repo that
  component actually consumes. Gave loki/tempo distinct HelmRepository
  names (loki-charts, tempo-charts) so the composed root kustomize build
  does not fail on duplicate source.toolkit.fluxcd.io resources.
- Root components/observability/kustomization.yaml now references the six
  component subdirectories only.
- Dropped prometheus-crds/namespace.yaml entirely — the kustomization
  only installs cluster-scoped CRDs so no namespace is needed.

Problem 2: everything still deployed to telemetry-system, which defeats
the point of per-component composition. Each component now has its own
namespace so kubectl delete ns <x> cleanly uninstalls it:

- victoria-metrics-system (was telemetry-system)
- grafana-system
- loki-system
- tempo-system
- otel-collector-system

Cross-component references are now fully qualified service DNS names:

- Grafana datasources point at vmsingle/vmalertmanager in
  victoria-metrics-system, loki-system-loki in loki-system, and
  tempo-system-tempo in tempo-system.
- The OTel Collector's otlp, loki, and prometheusremotewrite exporters
  point at the new FQDNs.
- VMAlert's datasource, notifier, and remoteWrite URLs use the new
  victoria-metrics-system service names.
- VM defaultDashboards.grafanaOperator.allowCrossNamespaceImport is now
  true so dashboards created in victoria-metrics-system can target the
  Grafana CR in grafana-system.

Taskfile's per-component waits updated to reference the new namespaces
(vmagent/vmsingle in victoria-metrics-system, otel-collector-collector
DaemonSet in otel-collector-system).

README refreshed to document the subcomponent layout, namespaces, and
removal procedure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@scotwells
Contributor Author

Follow-up commit b9515a4 addresses both problems from the review.

Problem 1 — removed duplicated shared files

  • Deleted root-level namespace.yaml, helm-repositories.yaml, and datasources/ (they're now only in the components that own them).
  • datasources/ lives under grafana/datasources/ (the Grafana instance owns the datasource CRs).
  • Each component's helm-repositories.yaml now only contains the single repo it actually consumes. To let the root kustomize build compose everything without duplicate-resource errors, Loki and Tempo use distinct HelmRepository names (loki-charts, tempo-charts) — same URL, different Kubernetes object — and their HRs reference those names.
  • Dropped prometheus-crds/namespace.yaml (CRDs are cluster-scoped; kustomization verified).
  • Root components/observability/kustomization.yaml now references only the six component subdirectories.
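Two distinct HelmRepository objects for the same chart URL might look like this (illustrative manifests; the Flux apiVersion and repo URL depend on your Flux and chart setup):

```yaml
# loki/helm-repositories.yaml (sketch)
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: loki-charts
  namespace: loki-system
spec:
  url: https://grafana.github.io/helm-charts
  interval: 1h
---
# tempo/helm-repositories.yaml (sketch) -- same URL, different object,
# so the composed root build has no duplicate-resource conflict.
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: tempo-charts
  namespace: tempo-system
spec:
  url: https://grafana.github.io/helm-charts
  interval: 1h
```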

Problem 2 — per-component namespaces

Each component deploys to its own namespace so kubectl delete ns <x> cleanly uninstalls it:

| Component | Namespace |
| --- | --- |
| prometheus-crds | (cluster-scoped) |
| victoria-metrics | victoria-metrics-system |
| otel-collector | otel-collector-system |
| loki | loki-system |
| tempo | tempo-system |
| grafana | grafana-system |

Cross-component references are now FQDNs:

  • Grafana datasources → vmsingle-victoria-metrics-system-vm.victoria-metrics-system.svc.cluster.local:8428, vmalertmanager-…:9093, loki-system-loki.loki-system.svc.cluster.local:3100, tempo-system-tempo.tempo-system.svc.cluster.local:3100.
  • OTel Collector exporters (otlphttp/tempo, prometheusremotewrite, otlphttp/loki) use the new FQDNs.
  • VMAlert's datasource, notifier, remoteWrite URLs updated.
  • defaultDashboards.grafanaOperator.allowCrossNamespaceImport flipped to true so VM-created dashboards can target the Grafana CR in grafana-system.
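The exporter section of the collector config would then take roughly this shape (a sketch: the OTLP/HTTP port 4318 and URL paths are assumptions; the service DNS names follow the scheme above):

```yaml
# OTel Collector config excerpt (sketch)
exporters:
  otlphttp/tempo:
    endpoint: http://tempo-system-tempo.tempo-system.svc.cluster.local:4318
  otlphttp/loki:
    endpoint: http://loki-system-loki.loki-system.svc.cluster.local:3100/otlp
  prometheusremotewrite:
    endpoint: http://vmsingle-victoria-metrics-system-vm.victoria-metrics-system.svc.cluster.local:8428/api/v1/write
```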

Taskfile's kubectl wait calls updated for the new namespaces (vmagent/vmsingle in victoria-metrics-system, otel-collector-collector DaemonSet in otel-collector-system). README refreshed.

Verification

  • kustomize build components/observability → OK
  • kustomize build components/observability/<each> → OK for all six

Diff stats

git diff main...HEAD --shortstat is now 31 files changed, 261 insertions(+), 89 deletions(-) — down from ~394 additions. Most of the remaining delta is the Taskfile expansion (necessary for the new composable tasks) and the new README.md section.

The VM HelmRelease had defaultDashboards.grafanaOperator.enabled set to
true, which generates GrafanaDashboard resources. When installed on its
own (e.g. `task install-prometheus-crds install-victoria-metrics`) the
grafana-operator CRDs are not present and Helm fails with
"no matches for kind GrafanaDashboard in version
grafana.integreatly.org/v1beta1".

Flip the default to false so per-component installs succeed, and patch
it back to true in the root observability kustomization so the full
stack continues to ship dashboards via the operator.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@scotwells
Contributor Author

Stacked fix in b3453bd: flipped defaultDashboards.grafanaOperator.enabled to false in the VM HelmRelease and added a kustomize patch in the root components/observability to flip it back to true for full-stack installs; standalone install-victoria-metrics no longer requires the grafana-operator CRDs.
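The root-level override can be a JSON 6902 patch along these lines (a sketch: the HelmRelease name is an assumption, and the values path follows the chart key mentioned above):

```yaml
# components/observability/kustomization.yaml excerpt (sketch)
patches:
  - target:
      kind: HelmRelease
      name: victoria-metrics   # assumed name of the VM HelmRelease
    patch: |-
      - op: replace
        path: /spec/values/defaultDashboards/grafanaOperator/enabled
        value: true
```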
