fix: resolve first-time deployment reliability issues #53

Open
agullon wants to merge 6 commits into openshift-eng:main from agullon:NO-JIRA-fix-first-time-deployment-issues

Conversation

@agullon

@agullon agullon commented Mar 6, 2026

Summary

Addresses multiple issues encountered during first-time deployment that cause failures
at various stages, often with unclear error messages.

Changes

  • a5dd376 - Fix bash 4+ syntax for macOS compatibility (redeploy-cluster.sh): Replace
    ${var,,} (bash 4+) with portable tr '[:upper:]' '[:lower:]' since macOS ships
    with bash 3.2.

  • b458662 - Enable CRB repository on AWS RHUI instances (configure.sh): On AWS RHUI-managed
    instances, subscription-manager repos --enable codeready-builder-* fails silently
    because repos are managed by RHUI configuration. This causes libvirt-devel to be
    unavailable, breaking the dev-scripts requirements installation. The fix uses
    /usr/bin/crb enable which works on both RHUI and non-RHUI environments.

  • 30a1f81 - Make handler resilient to missing kubeconfig (handlers/main.yml): The "Set OCP
    project" handler fires even when the cluster deployment fails, producing a confusing
    secondary error about missing oc or kubeconfig that masks the actual failure.

  • 3eeb854 - Add CI registry pull secret pre-flight validation (config.yml): When the config
    uses a CI registry image (registry.ci.openshift.org) but the pull secret lacks CI
    credentials, the deployment fails ~20 minutes in with an unclear "unauthorized" error.
    This adds an early check that fails immediately with a clear message explaining how
    to obtain CI registry credentials.
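
The pre-flight check in the last item can be illustrated with a small shell sketch. This is an illustration under stated assumptions, not the Ansible task from this PR: the function name `has_ci_registry_auth` is hypothetical, and the pull secret is assumed to follow the usual `auths` layout keyed by registry hostname. It uses `jq` when available and otherwise falls back to a crude textual match.

```shell
# Hypothetical helper: succeeds if the pull secret has an entry for the CI registry.
# Assumes the usual layout: {"auths": {"<registry host>": {"auth": "..."}}}
has_ci_registry_auth() {
  pull_secret_file=$1
  registry="registry.ci.openshift.org"
  if command -v jq >/dev/null 2>&1; then
    # jq -e exits non-zero when the result is false or null, or when the JSON is malformed.
    jq -e --arg r "$registry" '.auths[$r] != null' "$pull_secret_file" >/dev/null 2>&1
  else
    # Crude fallback: fixed-string match on the quoted hostname, without parsing the JSON.
    grep -qF "\"$registry\"" "$pull_secret_file"
  fi
}

# Possible usage before starting a long deployment:
# if ! has_ci_registry_auth pull-secret.json; then
#   echo "ERROR: pull secret has no registry.ci.openshift.org credentials" >&2
#   exit 1
# fi
```

Failing at this point costs seconds instead of the roughly 20 minutes described above.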

Test plan

  • Deploy from macOS using make deploy arbiter-ipi (verifies bash compatibility fix)
  • Deploy on a fresh AWS RHUI instance without CRB pre-enabled (verifies CRB fix)
  • Deploy with CI registry image but without CI credentials in pull secret (verifies early validation)
  • Trigger a deployment failure and verify the handler doesn't produce secondary errors

@openshift-ci-robot

@agullon: This pull request explicitly references no jira issue.

Details

In response to this:

Summary

Addresses multiple issues encountered during first-time deployment that cause failures
at various stages, often with unclear error messages.

Changes

  • Replace timeout with AWS CLI --waiter-config (create.sh): The timeout command
    is not available on macOS by default. Instead of requiring users to install coreutils,
    use the AWS CLI built-in waiter configuration to explicitly set the 10-minute timeout
    (40 attempts × 15s intervals).

  • Fix bash 4+ syntax for macOS compatibility (redeploy-cluster.sh): Replace
    ${var,,} (bash 4+) with portable tr '[:upper:]' '[:lower:]' since macOS ships
    with bash 3.2.

  • Enable CRB repository on AWS RHUI instances (configure.sh): On AWS RHUI-managed
    instances, subscription-manager repos --enable codeready-builder-* fails silently
    because repos are managed by RHUI configuration. This causes libvirt-devel to be
    unavailable, breaking the dev-scripts requirements installation. The fix uses
    /usr/bin/crb enable which works on both RHUI and non-RHUI environments.

  • Add CI registry pull secret pre-flight validation (config.yml): When the config
    uses a CI registry image (registry.ci.openshift.org) but the pull secret lacks CI
    credentials, the deployment fails ~20 minutes in with an unclear "unauthorized" error.
    This adds an early check that fails immediately with a clear message explaining how
    to obtain CI registry credentials.

  • Make handler resilient to missing kubeconfig (handlers/main.yml): The "Set OCP
    project" handler fires even when the cluster deployment fails, producing a confusing
    secondary error about missing oc or kubeconfig that masks the actual failure.

Test plan

  • Deploy from macOS using make deploy arbiter-ipi (verifies timeout and bash fixes)
  • Deploy on a fresh AWS RHUI instance without CRB pre-enabled (verifies CRB fix)
  • Deploy with CI registry image but without CI credentials in pull secret (verifies early validation)
  • Trigger a deployment failure and verify the handler doesn't produce secondary errors

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 6, 2026
@openshift-ci openshift-ci bot requested review from jaypoulz and slintes March 6, 2026 10:26
@openshift-ci

openshift-ci bot commented Mar 6, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: agullon
Once this PR has been reviewed and has the lgtm label, please assign jerpeter1 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@agullon agullon force-pushed the NO-JIRA-fix-first-time-deployment-issues branch from ddab363 to ae79ba7 Compare March 6, 2026 10:34
@agullon agullon marked this pull request as draft March 6, 2026 10:34
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 6, 2026
@agullon agullon marked this pull request as ready for review March 10, 2026 10:09
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 10, 2026
@openshift-ci openshift-ci bot requested a review from eggfoobar March 10, 2026 10:10
@agullon agullon changed the title from "NO-JIRA: fix: resolve first-time deployment reliability issues" to "fix: resolve first-time deployment reliability issues" Mar 10, 2026
agullon added 5 commits March 10, 2026 11:35
The 'timeout' command is not available on macOS by default, causing
'make deploy' to fail during instance creation. Remove the dependency
by relying on the AWS CLI built-in waiter which already polls every
15s for up to 40 attempts (~10 minutes) by default.

pre-commit.check-secrets: ENABLED
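
The polling behaviour described in the commit message above (every 15s, up to 40 attempts, about 10 minutes) can be mimicked without `timeout`. The sketch below is a generic bounded polling loop in portable shell, not the AWS CLI waiter implementation; `wait_until` and the probe `instance_is_running` in the usage comment are hypothetical names.

```shell
# wait_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# Returns 0 on the first success, 1 if every attempt failed.
wait_until() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep only between attempts, not after the last one.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Possible usage (instance_is_running is a hypothetical probe):
# 40 attempts with 15-second pauses bounds the wait at roughly 10 minutes.
# wait_until 40 15 instance_is_running || echo "instance did not come up" >&2
```
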
Replace bash 4+ syntax (${var,,}) with portable 'tr' alternative.
macOS ships with bash 3.2 which does not support this syntax,
causing 'make redeploy-cluster' to fail with 'bad substitution'.

pre-commit.check-secrets: ENABLED
On AWS RHUI-managed instances, 'subscription-manager repos --enable
codeready-builder-*' fails silently because repos are managed by
RHUI configuration, not subscription-manager. This causes
libvirt-devel to be unavailable, breaking the dev-scripts
requirements installation. Use '/usr/bin/crb enable' which handles
both RHUI and non-RHUI environments correctly.

pre-commit.check-secrets: ENABLED
The 'Set OCP project' handler fires even when the cluster deployment
fails, producing a confusing secondary error about missing 'oc' or
kubeconfig that masks the actual failure. Add failed_when: false so
the handler does not error when kubeconfig does not exist.

pre-commit.check-secrets: ENABLED
When the config uses a CI registry image (registry.ci.openshift.org)
but the pull secret lacks CI credentials, the deployment runs for
~20 minutes before failing with an unclear 'unauthorized' error.
Add an early check that fails immediately with a clear message
explaining how to obtain CI registry credentials.

pre-commit.check-secrets: ENABLED
@agullon agullon force-pushed the NO-JIRA-fix-first-time-deployment-issues branch from ae79ba7 to 3eeb854 Compare March 10, 2026 10:36
Contributor

@fonta-rh fonta-rh left a comment

These are some great additions to make it smoother, especially the CI credentials check! I've left some comments on minor improvements for readability and maintainability.

  - name: Set OCP project
-   command: oc --kubeconfig="{{kubeconfig_path}}" project openshift-machine-api
+   command: oc --kubeconfig="{{ kubeconfig_path }}" project openshift-machine-api
+   when: kubeconfig_path is defined

Kubeconfig path is always defined (it's in vars/main.yml), so this condition is always true, but the intention is good. We should check for file existence, not variable definition. I would either use a "stat" task to check for the file or add a debug task to tell the user that the OCP project could not be set.
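
The file-existence guard suggested here can be sketched in shell for illustration. The handler itself is an Ansible task, where a `stat` check or a `debug` task would take the place of the `[ -f ... ]` test below; `set_ocp_project` is a hypothetical wrapper name.

```shell
# Hypothetical wrapper: only call 'oc' when the kubeconfig file actually exists.
set_ocp_project() {
  kubeconfig_path=$1
  if [ ! -f "$kubeconfig_path" ]; then
    # Tell the user why the step was skipped instead of raising a secondary error.
    echo "Skipping 'Set OCP project': kubeconfig not found at $kubeconfig_path" >&2
    return 0
  fi
  oc --kubeconfig="$kubeconfig_path" project openshift-machine-api
}
```

Testing the file rather than the variable is the point: the variable is always set, while the file only appears after a successful deployment.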

sudo /usr/bin/crb enable
else
echo "WARNING: 'crb' command not found, attempting subscription-manager fallback"
sudo subscription-manager repos --enable "codeready-builder-for-rhel-9-$(uname -m)-rpms" || true

This is a great idea, but the || true at the end can cause silent failures, which will cause hard-to-explain issues down the line. I would fail explicitly on the fallback:

    if ! sudo subscription-manager repos --enable "codeready-builder-for-rhel-9-$(uname -m)-rpms"; then
        echo "ERROR: Failed to enable CRB repository. libvirt-devel will be unavailable."
        exit 1
    fi


- name: Read pull secret to check for CI registry auth
set_fact:
pull_secret_content: "{{ lookup('file', 'pull-secret.json') | from_json }}"

NIT: If the JSON is malformed, we will sadly skip the warning below. Might want to wrap it in a block/rescue.

# On RHUI instances (like AWS), subscription-manager repos --enable doesn't work
# for CRB because repos are managed by RHUI configuration. The 'crb' command
# handles both RHUI and non-RHUI environments correctly.
if command -v crb &>/dev/null; then

I'm not sure why we're using "crb" here but "/usr/bin/crb" later. I would change the latter to just "crb".
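
Combined with the explicit-failure suggestion earlier in this review, the CRB step might look like the sketch below. It is an assumption about how the two comments fit together, not the code in this PR; `enable_crb` is a hypothetical function name, and the messages are the ones quoted in this review.

```shell
# Hypothetical helper: enable the CRB repository, failing loudly if neither path works.
enable_crb() {
  # Resolve 'crb' through PATH in both the check and the call.
  if command -v crb >/dev/null 2>&1; then
    sudo crb enable
    return
  fi
  echo "WARNING: 'crb' command not found, attempting subscription-manager fallback" >&2
  if ! sudo subscription-manager repos --enable "codeready-builder-for-rhel-9-$(uname -m)-rpms"; then
    echo "ERROR: Failed to enable CRB repository. libvirt-devel will be unavailable." >&2
    return 1
  fi
}

# Possible usage in a setup script:
# enable_crb || exit 1
```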

  ansible-playbook redeploy.yml -i inventory.ini \
      --extra-vars "topology=${topology}" \
-     --extra-vars "method=${current_installation_method,,}" \
+     --extra-vars "method=$(echo "${current_installation_method}" | tr '[:upper:]' '[:lower:]')" \

This is a great addition. I would add a comment to explain it, though
# Uses tr instead of ${var,,} for bash 3.2 (macOS) compatibility.
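
The substitution discussed in this thread can be tried in isolation. The sketch below wraps the `tr` call in a helper; `to_lower` is a hypothetical name and the sample value `IPI` is only an illustration.

```shell
# Lowercase a string without ${var,,}, which bash 3.2 (macOS) rejects with "bad substitution".
to_lower() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]'
}

current_installation_method="IPI"   # sample value for illustration
method=$(to_lower "$current_installation_method")
echo "$method"   # prints "ipi"
```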
