
feat(observability): split into composable per-component tasks #28

Open
scotwells wants to merge 3 commits into main from feat/composable-observability

Conversation

@scotwells
Contributor

Summary

  • Split the monolithic observability stack into per-component kustomization directories and Taskfile tasks, so downstream repos can install only the components they need (e.g. Victoria Metrics + OTel Collector for resource-metrics e2e tests)
  • Preserved backward compatibility: install-observability is now a thin composite that calls all sub-tasks, producing the same result as before
  • Added webhook retry logic to the OTel Collector task to handle the v1beta1 CRD race condition that causes intermittent failures
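A retry of this shape works around the webhook race (a sketch only: the attempt count, backoff, and manifest path are illustrative assumptions, not the repo's exact Taskfile contents):

```yaml
version: '3'

tasks:
  install-otel-collector:
    cmds:
      - |
        # The operator's admission/conversion webhook may not be serving yet
        # when the first v1beta1 OpenTelemetryCollector apply arrives, so
        # retry a few times before failing (5 attempts / 10s are assumptions).
        for i in 1 2 3 4 5; do
          kubectl apply -k components/observability/otel-collector && exit 0
          echo "apply failed (webhook not ready?), retry $i/5"; sleep 10
        done
        exit 1
```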

New tasks

| Task | Description | Dependencies |
| --- | --- | --- |
| install-prometheus-crds | Prometheus Operator CRDs | — |
| install-victoria-metrics | Victoria Metrics (vmagent + vmsingle) | install-prometheus-crds |
| install-otel-collector | OpenTelemetry Operator + Collector | — |
| install-grafana | Grafana Operator + instance + datasources | install-victoria-metrics |
| install-loki | Loki for log aggregation | — |
| install-tempo | Tempo for distributed tracing | — |
| install-observability | Full stack (unchanged interface) | all of the above |
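The backward-compatible composite can be as simple as delegating to the sub-tasks in dependency order (a sketch assuming Taskfile v3 syntax; the real task bodies may differ):

```yaml
version: '3'

tasks:
  install-observability:
    desc: Full observability stack (same interface as before the split)
    cmds:
      - task: install-prometheus-crds
      - task: install-victoria-metrics   # needs the Prometheus CRDs
      - task: install-otel-collector
      - task: install-loki
      - task: install-tempo
      - task: install-grafana            # needs Victoria Metrics for datasources
```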

Structure

Each component has its own kustomization.yaml under components/observability/<component>/ that is self-contained (includes namespace + helm-repositories). The root kustomization.yaml references individual files from subdirectories to avoid resource duplication.

```
components/observability/
├── kustomization.yaml              # root - references all components (backward compat)
├── namespace.yaml                  # shared
├── helm-repositories.yaml          # shared
├── prometheus-crds/
├── victoria-metrics/
├── otel-collector/
├── grafana/
│   └── datasources/
├── loki/
└── tempo/
```
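After the follow-up commit that removes the duplicated shared files, the root kustomization can compose the six subdirectories directly (a minimal sketch, not the exact file):

```yaml
# components/observability/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - prometheus-crds
  - victoria-metrics
  - otel-collector
  - grafana
  - loki
  - tempo
```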

Test plan

  • kustomize build components/observability produces the same resource set as before
  • kustomize build components/observability/<component> works for each component individually
  • task --list shows all new tasks
  • task install-victoria-metrics install-otel-collector deploys only VM + OTel on a constrained cluster
  • task install-observability still deploys the full stack

🤖 Generated with Claude Code

scotwells and others added 2 commits April 15, 2026 20:54
The monolithic install-observability task deploys the entire telemetry
stack as a single blob, which doesn't fit on resource-constrained CI
runners. Downstream repos like resource-metrics only need Victoria
Metrics + OTel Collector for their e2e tests.

Split the observability stack into per-component kustomization
directories and Taskfile tasks so consumers can install only what they
need. The existing install-observability task is preserved as a thin
composite that calls all sub-tasks, maintaining full backward
compatibility.

New tasks:
- install-prometheus-crds
- install-victoria-metrics (depends on prometheus-crds)
- install-otel-collector (with webhook retry logic)
- install-grafana (depends on victoria-metrics)
- install-loki
- install-tempo

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…mponent its own namespace

Problem 1: the composable split had copy-pasted namespace.yaml,
helm-repositories.yaml, and datasources/ into each component
subdirectory while leaving the originals at the root. That duplicated
~200 lines and caused the root kustomize build to fail on conflicting
resources.

- Deleted the root-level namespace.yaml, helm-repositories.yaml, and
  datasources/ directory.
- Kept a single copy of the datasources under grafana/datasources/ (where
  the Grafana instance lives).
- Pared each component's helm-repositories.yaml to just the repo that
  component actually consumes. Gave loki/tempo distinct HelmRepository
  names (loki-charts, tempo-charts) so the composed root kustomize build
  does not fail on duplicate source.toolkit.fluxcd.io resources.
- Root components/observability/kustomization.yaml now references the six
  component subdirectories only.
- Dropped prometheus-crds/namespace.yaml entirely — the kustomization
  only installs cluster-scoped CRDs so no namespace is needed.

Problem 2: everything still deployed to telemetry-system, which defeats
the point of per-component composition. Each component now has its own
namespace so kubectl delete ns <x> cleanly uninstalls it:

- victoria-metrics-system (was telemetry-system)
- grafana-system
- loki-system
- tempo-system
- otel-collector-system

Cross-component references are now fully qualified service DNS names:

- Grafana datasources point at vmsingle/vmalertmanager in
  victoria-metrics-system, loki-system-loki in loki-system, and
  tempo-system-tempo in tempo-system.
- The OTel Collector's otlp, loki, and prometheusremotewrite exporters
  point at the new FQDNs.
- VMAlert's datasource, notifier, and remoteWrite URLs use the new
  victoria-metrics-system service names.
- VM defaultDashboards.grafanaOperator.allowCrossNamespaceImport is now
  true so dashboards created in victoria-metrics-system can target the
  Grafana CR in grafana-system.

Taskfile's per-component waits updated to reference the new namespaces
(vmagent/vmsingle in victoria-metrics-system, otel-collector-collector
DaemonSet in otel-collector-system).

README refreshed to document the subcomponent layout, namespaces, and
removal procedure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@scotwells
Contributor Author

Follow-up commit b9515a4 addresses both problems from the review.

Problem 1 — removed duplicated shared files

  • Deleted root-level namespace.yaml, helm-repositories.yaml, and datasources/ (they're now only in the components that own them).
  • datasources/ lives under grafana/datasources/ (the Grafana instance owns the datasource CRs).
  • Each component's helm-repositories.yaml now only contains the single repo it actually consumes. To let the root kustomize build compose everything without duplicate-resource errors, Loki and Tempo use distinct HelmRepository names (loki-charts, tempo-charts) — same URL, different Kubernetes object — and their HRs reference those names.
  • Dropped prometheus-crds/namespace.yaml (CRDs are cluster-scoped; kustomization verified).
  • Root components/observability/kustomization.yaml now references only the six component subdirectories.
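Two distinct HelmRepository objects for the same chart URL might look like this (illustrative manifests; the Flux apiVersion and repo URL depend on your Flux and chart setup):

```yaml
# loki/helm-repositories.yaml (sketch)
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: loki-charts
  namespace: loki-system
spec:
  url: https://grafana.github.io/helm-charts
  interval: 1h
---
# tempo/helm-repositories.yaml (sketch) -- same URL, different object,
# so the composed root build has no duplicate-resource conflict.
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: tempo-charts
  namespace: tempo-system
spec:
  url: https://grafana.github.io/helm-charts
  interval: 1h
```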

Problem 2 — per-component namespaces

Each component deploys to its own namespace so kubectl delete ns <x> cleanly uninstalls it:

| Component | Namespace |
| --- | --- |
| prometheus-crds | (cluster-scoped) |
| victoria-metrics | victoria-metrics-system |
| otel-collector | otel-collector-system |
| loki | loki-system |
| tempo | tempo-system |
| grafana | grafana-system |

Cross-component references are now FQDNs:

  • Grafana datasources → vmsingle-victoria-metrics-system-vm.victoria-metrics-system.svc.cluster.local:8428, vmalertmanager-…:9093, loki-system-loki.loki-system.svc.cluster.local:3100, tempo-system-tempo.tempo-system.svc.cluster.local:3100.
  • OTel Collector exporters (otlphttp/tempo, prometheusremotewrite, otlphttp/loki) use the new FQDNs.
  • VMAlert's datasource, notifier, remoteWrite URLs updated.
  • defaultDashboards.grafanaOperator.allowCrossNamespaceImport flipped to true so VM-created dashboards can target the Grafana CR in grafana-system.
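The exporter section of the collector config would then take roughly this shape (a sketch: the OTLP/HTTP port 4318 and URL paths are assumptions; the service DNS names follow the scheme above):

```yaml
# OTel Collector config excerpt (sketch)
exporters:
  otlphttp/tempo:
    endpoint: http://tempo-system-tempo.tempo-system.svc.cluster.local:4318
  otlphttp/loki:
    endpoint: http://loki-system-loki.loki-system.svc.cluster.local:3100/otlp
  prometheusremotewrite:
    endpoint: http://vmsingle-victoria-metrics-system-vm.victoria-metrics-system.svc.cluster.local:8428/api/v1/write
```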

Taskfile's kubectl wait calls updated for the new namespaces (vmagent/vmsingle in victoria-metrics-system, otel-collector-collector DaemonSet in otel-collector-system). README refreshed.

Verification

  • kustomize build components/observability → OK
  • kustomize build components/observability/<each> → OK for all six

Diff stats

git diff main...HEAD --shortstat is now 31 files changed, 261 insertions(+), 89 deletions(-) — down from ~394 additions. Most of the remaining delta is the Taskfile expansion (necessary for the new composable tasks) and the new README.md section.

The VM HelmRelease had defaultDashboards.grafanaOperator.enabled set to
true, which generates GrafanaDashboard resources. When installed on its
own (e.g. `task install-prometheus-crds install-victoria-metrics`) the
grafana-operator CRDs are not present and Helm fails with
"no matches for kind GrafanaDashboard in version
grafana.integreatly.org/v1beta1".

Flip the default to false so per-component installs succeed, and patch
it back to true in the root observability kustomization so the full
stack continues to ship dashboards via the operator.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@scotwells
Contributor Author

Stacked fix in b3453bd: flipped defaultDashboards.grafanaOperator.enabled to false in the VM HelmRelease and added a kustomize patch in the root components/observability to flip it back to true for full-stack installs; standalone install-victoria-metrics no longer requires the grafana-operator CRDs.
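The root-level override can be a JSON 6902 patch along these lines (a sketch: the HelmRelease name is an assumption, and the values path follows the chart key mentioned above):

```yaml
# components/observability/kustomization.yaml excerpt (sketch)
patches:
  - target:
      kind: HelmRelease
      name: victoria-metrics   # assumed name of the VM HelmRelease
    patch: |-
      - op: replace
        path: /spec/values/defaultDashboards/grafanaOperator/enabled
        value: true
```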
