fix: resolve first-time deployment reliability issues#53
fix: resolve first-time deployment reliability issues#53agullon wants to merge 6 commits intoopenshift-eng:mainfrom
Conversation
|
@agullon: This pull request explicitly references no jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: agullon The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
ddab363 to
ae79ba7
Compare
The 'timeout' command is not available on macOS by default, causing 'make deploy' to fail during instance creation. Remove the dependency by relying on the AWS CLI built-in waiter which already polls every 15s for up to 40 attempts (~10 minutes) by default. pre-commit.check-secrets: ENABLED
Replace bash 4+ syntax (${var,,}) with portable 'tr' alternative.
macOS ships with bash 3.2 which does not support this syntax,
causing 'make redeploy-cluster' to fail with 'bad substitution'.
pre-commit.check-secrets: ENABLED
On AWS RHUI-managed instances, 'subscription-manager repos --enable codeready-builder-*' fails silently because repos are managed by RHUI configuration, not subscription-manager. This causes libvirt-devel to be unavailable, breaking the dev-scripts requirements installation. Use '/usr/bin/crb enable' which handles both RHUI and non-RHUI environments correctly. pre-commit.check-secrets: ENABLED
The 'Set OCP project' handler fires even when the cluster deployment fails, producing a confusing secondary error about missing 'oc' or kubeconfig that masks the actual failure. Add failed_when: false so the handler does not error when kubeconfig does not exist. pre-commit.check-secrets: ENABLED
When the config uses a CI registry image (registry.ci.openshift.org) but the pull secret lacks CI credentials, the deployment runs for ~20 minutes before failing with an unclear 'unauthorized' error. Add an early check that fails immediately with a clear message explaining how to obtain CI registry credentials. pre-commit.check-secrets: ENABLED
ae79ba7 to
3eeb854
Compare
…aiter" This reverts commit 328d618.
fonta-rh
left a comment
There was a problem hiding this comment.
This are some great additions to make it smoother, specially the CI credentials check! I've left some comments on minor improvements for readability and maintanability
| - name: Set OCP project | ||
| command: oc --kubeconfig="{{kubeconfig_path}}" project openshift-machine-api | ||
| command: oc --kubeconfig="{{ kubeconfig_path }}" project openshift-machine-api | ||
| when: kubeconfig_path is defined |
There was a problem hiding this comment.
Kubeconfig path is always defined (it's on vars/main.yml), but the intention is good. We should check for file existence, not variable definition. I would either use a "stat" to check for the file or add a debug task to tell the user that the OCP project could not be set
| sudo /usr/bin/crb enable | ||
| else | ||
| echo "WARNING: 'crb' command not found, attempting subscription-manager fallback" | ||
| sudo subscription-manager repos --enable "codeready-builder-for-rhel-9-$(uname -m)-rpms" || true |
There was a problem hiding this comment.
This is a great idea, but the || true at the end can cause silent failures, which will cause hard-to-explain issues down the line. I would fail explicitly on the fallback
if ! sudo subscription-manager repos --enable "codeready-builder-for-rhel-9-$(uname -m)-rpms"; then echo "ERROR: Failed to enable CRB repository. libvirt-devel will be unavailable." exit 1 fi
|
|
||
| - name: Read pull secret to check for CI registry auth | ||
| set_fact: | ||
| pull_secret_content: "{{ lookup('file', 'pull-secret.json') | from_json }}" |
There was a problem hiding this comment.
NIT: If the json is malformed, we will sadly skip the warning below. Might want to wrap it in a block/rescue
| # On RHUI instances (like AWS), subscription-manager repos --enable doesn't work | ||
| # for CRB because repos are managed by RHUI configuration. The 'crb' command | ||
| # handles both RHUI and non-RHUI environments correctly. | ||
| if command -v crb &>/dev/null; then |
There was a problem hiding this comment.
I'm not sure why we're using "crb" here but "/usr/bin/crb" later. I would change the later to just "crb"
| ansible-playbook redeploy.yml -i inventory.ini \ | ||
| --extra-vars "topology=${topology}" \ | ||
| --extra-vars "method=${current_installation_method,,}" \ | ||
| --extra-vars "method=$(echo "${current_installation_method}" | tr '[:upper:]' '[:lower:]')" \ |
There was a problem hiding this comment.
This is a great addition. I would add a comment to explain it, though
# Uses tr instead of ${var,,} for bash 3.2 (macOS) compatibility.
Summary
Addresses multiple issues encountered during first-time deployment that cause failures
at various stages, often with unclear error messages.
Changes
a5dd376 - Fix bash 4+ syntax for macOS compatibility (
redeploy-cluster.sh): Replace${var,,}(bash 4+) with portabletr '[:upper:]' '[:lower:]'since macOS shipswith bash 3.2.
b458662 - Enable CRB repository on AWS RHUI instances (
configure.sh): On AWS RHUI-managedinstances,
subscription-manager repos --enable codeready-builder-*fails silentlybecause repos are managed by RHUI configuration. This causes
libvirt-develto beunavailable, breaking the dev-scripts requirements installation. The fix uses
/usr/bin/crb enablewhich works on both RHUI and non-RHUI environments.30a1f81 - Make handler resilient to missing kubeconfig (
handlers/main.yml): The "Set OCPproject" handler fires even when the cluster deployment fails, producing a confusing
secondary error about missing
ocor kubeconfig that masks the actual failure.3eeb854 - Add CI registry pull secret pre-flight validation (
config.yml): When the configuses a CI registry image (
registry.ci.openshift.org) but the pull secret lacks CIcredentials, the deployment fails ~20 minutes in with an unclear "unauthorized" error.
This adds an early check that fails immediately with a clear message explaining how
to obtain CI registry credentials.
Test plan
make deploy arbiter-ipi(verifies bash compatibility fix)