fix(infra): fix DR gaps discovered during Stage cluster recovery test #3452
Merged
- Rename AWS secret from Codedang-Sealed-Secrets-Prod to
Codedang-Sealed-Secrets-Production to match bootstrap script's
${ENVIRONMENT^} convention (fixes bootstrap failure on production DR)
- Add SKIP_ARGOCD option to bootstrap script for stage clusters
managed by production ArgoCD
- Fix helm commands missing --kube-context when CLUSTER_CONTEXT is set
- Add DR test plan document
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
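The naming convention and the context handling described in this commit can be sketched as follows. This is a minimal illustration, not the actual bootstrap script: the helper names `sealed_secret_name` and `helm_context_args` and the array `HELM_CONTEXT_ARGS` are hypothetical. Only the `Codedang-Sealed-Secrets-` prefix, `${ENVIRONMENT^}`, `CLUSTER_CONTEXT`, and `--kube-context` come from the commit message. The `${VAR^}` expansion needs bash 4 or later.

```bash
#!/usr/bin/env bash
set -eo pipefail

# Hypothetical helper: derive the AWS secret name for an environment.
# "${environment^}" upper-cases only the first character (bash 4+),
# so "production" becomes "Production" while "prod" stays "Prod".
sealed_secret_name() {
  local environment="$1"
  printf 'Codedang-Sealed-Secrets-%s\n' "${environment^}"
}

# Hypothetical helper: fill the HELM_CONTEXT_ARGS array with
# --kube-context only when CLUSTER_CONTEXT is set and non-empty.
helm_context_args() {
  HELM_CONTEXT_ARGS=()
  if [[ -n "${CLUSTER_CONTEXT:-}" ]]; then
    HELM_CONTEXT_ARGS=(--kube-context "$CLUSTER_CONTEXT")
  fi
}

sealed_secret_name production
sealed_secret_name stage
```

With `ENVIRONMENT=production` the derived name only matches a secret whose suffix is `Production`, which is why the old `-Prod` suffix broke the lookup. A helm call would then append the array, for example `helm list "${HELM_CONTEXT_ARGS[@]}"`, so the flag is passed only when a context was requested.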
…ler type
- sealed-secrets Helm chart uses release name as deployment/service name (`sealed-secrets`), not `sealed-secrets-controller`
- ArgoCD application-controller is a StatefulSet, not a Deployment
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
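A readiness check that uses the corrected resource names and kinds might look like the sketch below. The function name `rollout_checks`, both namespaces, and the full StatefulSet name `argocd-application-controller` are assumptions for illustration. The commit only states that the Deployment is called `sealed-secrets` and that the application controller is a StatefulSet. The commands are printed rather than executed so the sketch runs without a cluster.

```bash
#!/usr/bin/env bash
set -eo pipefail

# Print the rollout checks instead of running them.
rollout_checks() {
  local sealed_ns="${1:-kube-system}"   # assumed namespace
  local argocd_ns="${2:-argocd}"        # assumed namespace
  # The chart names the Deployment after the Helm release: sealed-secrets.
  echo "kubectl rollout status deployment/sealed-secrets -n ${sealed_ns} --timeout=120s"
  # The application controller is a StatefulSet; its full name is assumed here.
  echo "kubectl rollout status statefulset/argocd-application-controller -n ${argocd_ns} --timeout=120s"
}

rollout_checks
```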
Application CRD doesn't exist until ArgoCD is installed, so kubectl apply -f argocd.yaml would fail on a fresh cluster. Now bootstraps ArgoCD via Helm first, then applies the self-management Application for GitOps takeover.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
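The ordering described in this commit can be sketched as a dry-run script. Everything except `kubectl apply -f argocd.yaml` and the `SKIP_ARGOCD` variable name is an assumption: the `run` and `bootstrap_argocd` helpers, the `DRY_RUN` switch, the release name `argocd`, the chart reference `argo/argo-cd` (which presumes a Helm repo added under the alias `argo`), the `argocd` namespace, and treating `SKIP_ARGOCD=true` as the skip value.

```bash
#!/usr/bin/env bash
set -eo pipefail

DRY_RUN="${DRY_RUN:-1}"   # default: only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [[ "$DRY_RUN" == "1" ]]; then
    echo "+ $*"
  else
    "$@"
  fi
}

bootstrap_argocd() {
  # Assumption: SKIP_ARGOCD=true marks a stage cluster managed by production ArgoCD.
  if [[ "${SKIP_ARGOCD:-false}" == "true" ]]; then
    echo "skipping ArgoCD bootstrap"
    return 0
  fi
  # 1. Install ArgoCD with Helm so the Application CRD gets registered.
  run helm upgrade --install argocd argo/argo-cd --namespace argocd --create-namespace --wait
  # 2. Wait until the API server reports the CRD as established.
  run kubectl wait --for=condition=Established crd/applications.argoproj.io --timeout=120s
  # 3. Apply the self-management Application so GitOps takes over.
  run kubectl apply -f argocd.yaml
}

bootstrap_argocd
```

The explicit wait in step 2 is a design choice of this sketch: it keeps step 3 from racing the CRD registration, which is the failure mode the commit message describes.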
DR-TEST-PLAN.md is for local reference only, not for the repository.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Without automated sync, bootstrap requires a manual sync trigger before ArgoCD creates child applications. This blocks fully automated DR recovery.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add ArgoCD-managed ApplicationSets for operators that were previously installed manually: redis-operator, rabbitmq cluster/topology operators, minio-operator, otel-operator, and reflector
- Add ServerSideApply=true to ARC to handle large CRDs (>262KB)
- Add sync-wave '-1' to all CRD providers (operators, cert-manager, sealed-secrets) so they deploy before consumers
- Remove unused kubernetes-dashboard Application (replaced by headlamp)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Adjust sync-waves: sealed-secrets/cert-manager to -3 (highest priority), operators to -2, app services remain at 0 (default)
- Include github-app-secret SealedSecret in arc-runner-scale-set Application via multi-source directory include, so DR restores ARC runners without manual kubectl apply
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f8f0c8b573
Upgrade from v2.12.0 to v2.19.1 (latest stable). v2.12.0 is the official baseline upgrade version, so direct upgrade is supported. Note: this will cause a rolling update of RabbitMQ StatefulSets.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prometheus was using emptyDir, losing all TSDB data on pod restart. With 90d retention configured but no PVC, metrics history was wiped on every DR or node restart. Add 50Gi local-path PVC for both stage and production.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
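As a rough sketch of what such a storage setting can look like, the script below prints a Helm values fragment. It assumes the kube-prometheus-stack chart layout (`prometheus.prometheusSpec.storageSpec.volumeClaimTemplate`), which the PR does not confirm. The function name `prometheus_storage_values` is hypothetical, and the default size is simply the figure from the commit message above, passed as a parameter so it can be changed.

```bash
#!/usr/bin/env bash
set -eo pipefail

# Emit a values fragment that replaces emptyDir with a PVC.
# Assumption: kube-prometheus-stack value paths.
prometheus_storage_values() {
  local size="${1:-50Gi}"   # size taken from the commit message
  cat <<EOF
prometheus:
  prometheusSpec:
    retention: 90d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: local-path
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: ${size}
EOF
}

prometheus_storage_values
```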
6ce02eb to bdb61f8
sync-wave was on ApplicationSet metadata, but ArgoCD reads it from the generated Application objects. Move annotations to spec.template.metadata.annotations so wave ordering actually works.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
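The placement fix can be illustrated with a small generator that prints an ApplicationSet manifest. This is a sketch under stated assumptions, not the manifest from the PR: the function name `render_operator_appset`, the list generator with `stage` and `production` elements, the `default` project, the placeholder chart repository `https://example.com/charts`, and the chart version are all hypothetical. The point it shows is that the `argocd.argoproj.io/sync-wave` annotation sits under `spec.template.metadata.annotations`, so it ends up on each generated Application.

```bash
#!/usr/bin/env bash
set -eo pipefail

# Print an ApplicationSet whose wave annotation lives on the template,
# not on the ApplicationSet's own metadata.
render_operator_appset() {
  local name="$1" wave="$2"
  cat <<EOF
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: ${name}
  namespace: argocd
spec:
  generators:
    - list:
        elements:
          - cluster: stage
          - cluster: production
  template:
    metadata:
      name: '${name}-{{cluster}}'
      annotations:
        argocd.argoproj.io/sync-wave: "${wave}"
    spec:
      project: default
      source:
        repoURL: https://example.com/charts
        chart: ${name}
        targetRevision: 1.0.0
      destination:
        name: '{{cluster}}'
        namespace: ${name}
EOF
}

# Operators are placed in wave -2 by a later commit in this PR.
render_operator_appset redis-operator -2
```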
@codex do additional review
Codex Review: Didn't find any major issues. Chef's kiss.
tasoo-oos reviewed Mar 5, 2026
Description
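This PR fixes the infrastructure gaps found during the Stage cluster DR (Disaster Recovery) test.
Bootstrap script fixes
- Rename the AWS secret to match the bootstrap script's naming convention (Prod→Production)
- Fix the `sealed-secrets` deployment/service name to follow the Helm chart release name
- Reference `application-controller` correctly as a StatefulSet
- Add the `SKIP_ARGOCD` option
Move operators that were outside ArgoCD management to declarative management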
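Create ArgoCD ApplicationSets for the 6 operators that were previously installed only through manual `kubectl apply`:
- redis-operator
- rabbitmq cluster operator
- rabbitmq topology operator
- minio-operator
- otel-operator
- reflector
Guarantee deployment order (sync-wave)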
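Layer the sync-waves so that CRD providers deploy before their consumers:
- wave -3: sealed-secrets, cert-manager
- wave -2: operators
- wave 0 (default): app services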
Add Prometheus persistent storage
- Switch from `emptyDir` to a 10Gi `local-path` PVC
Other
- Add `ServerSideApply=true` (CRDs exceeding 262KB)
- Include the `github-app-secret` SealedSecret in the arc-runner-scale-set multi-source
- Delete the `kubernetes-dashboard` Application (replaced by headlamp)
closes TAS-2598
Additional context
Operational procedure issues additionally found during the DR test (outside the code):
- `operationState` has to be removed with `kubectl patch`
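One possible shape of the manual `kubectl patch` step is sketched below. The description does not give the exact command, so the JSON Patch `remove` on `/status/operationState`, the `argocd` namespace, and the helper name `clear_operation_state_cmd` are assumptions. The command is only printed so it can be reviewed before being run against a cluster, and a `remove` operation fails if the field is not present.

```bash
#!/usr/bin/env bash
set -eo pipefail

# Build (not run) a command that drops status.operationState
# from an ArgoCD Application. The patch shape is an assumption.
clear_operation_state_cmd() {
  local app="$1" ns="${2:-argocd}"
  local patch='[{"op":"remove","path":"/status/operationState"}]'
  printf "kubectl patch applications.argoproj.io %s -n %s --type json -p '%s'\n" \
    "$app" "$ns" "$patch"
}

# "my-app" is a placeholder Application name.
clear_operation_state_cmd my-app
```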