
fix(infra): fix DR gaps discovered during Stage cluster recovery test #3452

Merged
tasoo-oos merged 12 commits into main from fix/infra-dr-gaps on Apr 4, 2026

@manamana32321 (Member) commented Feb 24, 2026

Description

Fixes infrastructure gaps found during the Stage cluster DR (Disaster Recovery) test.

Bootstrap script fixes

  • Unify the AWS Secrets Manager secret name (Prod → Production)
  • Fix the sealed-secrets deployment/service names to match the Helm chart release name
  • Reference the ArgoCD application-controller correctly as a StatefulSet
  • Add an initial ArgoCD Helm install step (resolves the Application CRD chicken-and-egg problem)
  • Add a SKIP_ARGOCD option for Stage clusters
  • Enable auto-sync on the ArgoCD root app
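
The naming and option handling in the list above can be sketched roughly as follows. This is an illustration only, not the actual bootstrap script: the function names are hypothetical, and treating `SKIP_ARGOCD=true` as the "skip" value is an assumption. The commit message describes the secret name as following a `${ENVIRONMENT^}` convention (bash 4+ first-letter capitalisation); the sketch uses `cut`/`tr` to get the same result in plain sh.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical. ENVIRONMENT, CLUSTER_CONTEXT
# and SKIP_ARGOCD are the variables named in this PR.

# Capitalise the first letter of a word. In bash 4+ this is what
# ${ENVIRONMENT^} does; cut/tr keeps the sketch portable to plain sh.
capitalize() {
  first=$(printf '%s' "$1" | cut -c1 | tr '[:lower:]' '[:upper:]')
  rest=$(printf '%s' "$1" | cut -c2-)
  printf '%s%s\n' "$first" "$rest"
}

# Secret name following the Codedang-Sealed-Secrets-<Environment> pattern.
sealed_secret_name() {
  printf 'Codedang-Sealed-Secrets-%s\n' "$(capitalize "$1")"
}

# Print "--kube-context <ctx>" only when CLUSTER_CONTEXT is set, so each
# helm call can append the result.
kube_context_args() {
  if [ -n "${CLUSTER_CONTEXT:-}" ]; then
    printf '%s %s\n' '--kube-context' "$CLUSTER_CONTEXT"
  fi
}

# Succeeds when ArgoCD should be installed on this cluster.
# Assumption: SKIP_ARGOCD=true means "skip".
should_install_argocd() {
  [ "${SKIP_ARGOCD:-false}" != "true" ]
}
```

With this shape, `sealed_secret_name production` yields `Codedang-Sealed-Secrets-Production`, which is why the AWS secret had to be renamed from the `-Prod` suffix.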

Move operators that were outside ArgoCD management to declarative management

Create ArgoCD ApplicationSets for six operators that were previously installed only via manual kubectl apply:

  • redis-operator (Helm 0.23.0)
  • rabbitmq-cluster-operator (upstream git v2.19.1)
  • rabbitmq-topology-operator (upstream git v1.18.3)
  • minio-operator (upstream kustomize v7.1.1)
  • otel-operator (Helm 0.106.0)
  • reflector (Helm 10.0.10)

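Guaranteed deployment order (sync-wave)

Sync-waves are layered so that CRD providers are deployed before their consumers:

| wave | targets |
| --- | --- |
| -3 | sealed-secrets, cert-manager |
| -2 | operators (redis, rabbitmq, minio, otel, reflector) |
| 0 | ARC controller, k8s-internal, app services |
| 1 | arc-runner-scale-set |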
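
As a rough illustration of the layering above, the mapping from component to wave can be written as a small lookup. ArgoCD syncs lower waves first and reads the wave from the `argocd.argoproj.io/sync-wave` annotation. The helper names below are hypothetical, and the component names are taken from the lists in this description rather than from the manifests themselves.

```shell
#!/bin/sh
# Sketch only: sync_wave_for and sync_wave_annotation are hypothetical helpers
# that mirror the wave table in this PR description.

# Map a component name to its sync-wave. Lower waves sync first.
sync_wave_for() {
  case "$1" in
    sealed-secrets|cert-manager) printf '%s\n' -3 ;;
    redis-operator|minio-operator|otel-operator|reflector) printf '%s\n' -2 ;;
    rabbitmq-cluster-operator|rabbitmq-topology-operator) printf '%s\n' -2 ;;
    arc-runner-scale-set) printf '%s\n' 1 ;;
    *) printf '%s\n' 0 ;;  # ARC controller, k8s-internal, app services
  esac
}

# Render the annotation line as it would appear in an Application manifest.
# The value is quoted because Kubernetes annotation values must be strings.
sync_wave_annotation() {
  printf 'argocd.argoproj.io/sync-wave: "%s"\n' "$(sync_wave_for "$1")"
}
```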

Add Prometheus persistent storage

  • Switch both stage and production from emptyDir to a 10Gi local-path PVC
  • Preserve TSDB metric history across Pod restarts and DR

Other

  • Add ServerSideApply=true to ARC (CRD exceeds 262KB)
  • Include the github-app-secret SealedSecret in the arc-runner-scale-set multi-source
  • Remove the unused kubernetes-dashboard Application (replaced by headlamp)

closes TAS-2598

Additional context

Operational procedure issues also found during the DR test (outside the code):

  • The ArgoCD API cache goes stale after a cluster reset → server/repo-server/controller must be restarted
  • ArgoCD operationState backoff → operationState must be removed with kubectl patch
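
The two manual steps above might look roughly like the sketch below. It assumes ArgoCD runs in the `argocd` namespace with the resource names a Helm release called `argocd` would typically produce (`argocd-server`, `argocd-repo-server`, `argocd-application-controller`); those names, the `DRY_RUN` switch, and the helper functions are assumptions, not taken from this repository.

```shell
#!/bin/sh
# Sketch only: run, restart_argocd and clear_operation_state are hypothetical.

# Print the command instead of executing it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Restart the components that hold a stale view of the reset cluster.
# The application-controller is a StatefulSet, the other two are Deployments.
restart_argocd() {
  ns="${1:-argocd}"
  run kubectl -n "$ns" rollout restart \
    deployment/argocd-server deployment/argocd-repo-server
  run kubectl -n "$ns" rollout restart statefulset/argocd-application-controller
}

# Remove a stuck operationState so the next sync is not held back by backoff.
# A JSON-patch "remove" fails if the field is absent, which is harmless here.
clear_operation_state() {
  app="$1"
  ns="${2:-argocd}"
  run kubectl -n "$ns" patch applications.argoproj.io "$app" --type json \
    -p '[{"op":"remove","path":"/status/operationState"}]'
}
```

Running with `DRY_RUN=1` first prints the commands without touching the cluster, which is a cheap way to confirm the resource names before a real recovery.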


manamana32321 and others added 7 commits February 25, 2026 02:56
- Rename AWS secret from Codedang-Sealed-Secrets-Prod to
  Codedang-Sealed-Secrets-Production to match bootstrap script's
  ${ENVIRONMENT^} convention (fixes bootstrap failure on production DR)
- Add SKIP_ARGOCD option to bootstrap script for stage clusters
  managed by production ArgoCD
- Fix helm commands missing --kube-context when CLUSTER_CONTEXT is set
- Add DR test plan document

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ler type

- sealed-secrets Helm chart uses release name as deployment/service name
  (`sealed-secrets`), not `sealed-secrets-controller`
- ArgoCD application-controller is a StatefulSet, not a Deployment

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Application CRD doesn't exist until ArgoCD is installed, so
kubectl apply -f argocd.yaml would fail on a fresh cluster.
Now bootstraps ArgoCD via Helm first, then applies the self-management
Application for GitOps takeover.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
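
The ordering described in the commit above (Helm install first, then the self-management Application) could be sketched like this. The chart repository and chart name are the upstream Argo Helm ones; the release name, namespace, timeout, `DRY_RUN` switch, helper names, and the manifest path are assumptions and may differ from what `bootstrap-cluster.sh` actually does.

```shell
#!/bin/sh
# Sketch only: run and bootstrap_argocd are hypothetical names.

# Print the command instead of executing it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

bootstrap_argocd() {
  # 1. Install ArgoCD with Helm so that the Application CRD exists.
  run helm repo add argo https://argoproj.github.io/argo-helm
  run helm upgrade --install argocd argo/argo-cd \
    --namespace argocd --create-namespace --wait
  # 2. Wait until the API server reports the CRD as established.
  run kubectl wait --for=condition=Established \
    crd/applications.argoproj.io --timeout=120s
  # 3. Apply the self-management Application so GitOps takes over.
  #    Assumed path; the commit message only says "argocd.yaml".
  run kubectl apply -f infra/k8s/argocd/applications/argocd.yaml
}
```

Applying the Application before step 1 fails on a fresh cluster because the API server does not yet know the `Application` kind, which is the chicken-and-egg problem the commit refers to.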
DR-TEST-PLAN.md is for local reference only, not for the repository.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Without automated sync, bootstrap requires manual sync trigger
before ArgoCD creates child applications. This blocks full
automated DR recovery.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add ArgoCD-managed ApplicationSets for operators that were previously
  installed manually: redis-operator, rabbitmq cluster/topology operators,
  minio-operator, otel-operator, and reflector
- Add ServerSideApply=true to ARC to handle large CRDs (>262KB)
- Add sync-wave '-1' to all CRD providers (operators, cert-manager,
  sealed-secrets) so they deploy before consumers
- Remove unused kubernetes-dashboard Application (replaced by headlamp)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Adjust sync-waves: sealed-secrets/cert-manager to -3 (highest
  priority), operators to -2, app services remain at 0 (default)
- Include github-app-secret SealedSecret in arc-runner-scale-set
  Application via multi-source directory include, so DR restores
  ARC runners without manual kubectl apply

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f8f0c8b573

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread infra/k8s/argocd/applications/cert-manager.yaml Outdated
manamana32321 and others added 2 commits February 25, 2026 05:23
Upgrade from v2.12.0 to v2.19.1 (latest stable). v2.12.0 is the
official baseline upgrade version, so direct upgrade is supported.

Note: this will cause a rolling update of RabbitMQ StatefulSets.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prometheus was using emptyDir, losing all TSDB data on pod restart.
With 90d retention configured but no PVC, metrics history was wiped
on every DR or node restart. Add 50Gi local-path PVC for both
stage and production.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sync-wave was on ApplicationSet metadata, but ArgoCD reads it from
the generated Application objects. Move annotations to
spec.template.metadata.annotations so wave ordering actually works.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
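
To make the placement in the commit above concrete, here is a minimal sketch that renders an ApplicationSet with the sync-wave annotation under `spec.template.metadata.annotations`, so that it lands on each generated Application. The function name, the list generator with `env` elements, the `default` project, and the placeholder chart repository and version are all assumptions for illustration, not the manifests from this PR.

```shell
#!/bin/sh
# Sketch only: render_appset is a hypothetical helper; repoURL, targetRevision
# and the generator elements are placeholders.

render_appset() {
  name="$1"
  wave="$2"
  cat <<EOF
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: ${name}
  namespace: argocd
spec:
  generators:
    - list:
        elements:
          - env: stage
          - env: production
  template:
    metadata:
      name: '${name}-{{env}}'
      # Read by ArgoCD from the generated Application, not the ApplicationSet.
      annotations:
        argocd.argoproj.io/sync-wave: "${wave}"
    spec:
      project: default
      source:
        repoURL: https://example.invalid/charts  # placeholder
        chart: ${name}
        targetRevision: 0.0.0                    # placeholder
      destination:
        name: '{{env}}'
        namespace: ${name}
EOF
}
```

For example, `render_appset redis-operator -2` prints a manifest whose only `annotations:` block sits inside `spec.template.metadata`.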
@manamana32321 (Member, Author) commented

@codex do additional review

@chatgpt-codex-connector commented
Codex Review: Didn't find any major issues. Chef's kiss.

Comment thread infra/k8s/argocd/applications/argocd.yaml
Comment thread infra/bootstrap-cluster.sh Outdated
@tasoo-oos (Contributor) left a comment

LGTM!

@tasoo-oos tasoo-oos enabled auto-merge April 4, 2026 19:23
@tasoo-oos tasoo-oos added this pull request to the merge queue Apr 4, 2026
Merged via the queue into main with commit bb78567 Apr 4, 2026
11 checks passed
@tasoo-oos tasoo-oos deleted the fix/infra-dr-gaps branch April 4, 2026 19:27