
CORS-4334: Konnectivity#10344

Open
patrickdillon wants to merge 3 commits into openshift:main from patrickdillon:konnectivity

Conversation

@patrickdillon
Contributor

@patrickdillon patrickdillon commented Mar 2, 2026

Continuation of #10280:

  • Refactored to reduce in-lining in bootkube.sh
  • Added some gating (needs port opening on some or all platforms)

Will break the API vendoring into a separate PR to get that merged sooner rather than later.

Not tested. Opening this now as a /WIP to continue discussion of #10280 with #9628
/cc @JoelSpeed @mdbooth

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 2, 2026
@openshift-ci openshift-ci bot requested a review from JoelSpeed March 2, 2026 04:08
@openshift-ci-robot
Contributor

openshift-ci-robot commented Mar 2, 2026

@patrickdillon: This pull request references CORS-4334 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Continuation of #10280:

  • Refactored to reduce in-lining in bootkube.sh
  • Added some gating (needs port opening on some or all platforms)

Will break the API changes into a separate PR.

Not tested. Opening this now as a /WIP to continue discussion of #10280 with #9628
/cc @JoelSpeed @mdbooth

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from mdbooth March 2, 2026 04:08
@patrickdillon
Contributor Author

patrickdillon commented Mar 2, 2026

/wip
/hold

@openshift-ci
Contributor

openshift-ci bot commented Mar 2, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tthvo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Contributor

openshift-ci-robot commented Mar 2, 2026

@patrickdillon: This pull request references CORS-4334 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Continuation of #10280:

  • Refactored to reduce in-lining in bootkube.sh
  • Added some gating (needs port opening on some or all platforms)

Will break the API vendoring into a separate PR to get that merged sooner rather than later.

Not tested. Opening this now as a /WIP to continue discussion of #10280 with #9628
/cc @JoelSpeed @mdbooth

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

patrickdillon and others added 3 commits March 2, 2026 17:13
Enables kube-apiserver running on the bootstrap node to access the pod network,
specifically to enable access to webhooks running in the cluster.

Changes:

* Adds a new static Konnectivity server pod running on the bootstrap node
* Configures the bootstrap KAS to use its local Konnectivity server for
outbound cluster traffic
* Add a daemonset deployed into the cluster to run Konnectivity agent on every
cluster node
* Removes daemonset automatically in bootstrap teardown

Co-authored-by: Matthew Booth <mbooth@redhat.com>
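The bootstrap KAS wiring described above follows the upstream apiserver-network-proxy pattern: the API server reads an EgressSelectorConfiguration that routes "cluster" egress through the local Konnectivity server. A minimal sketch of such a configuration, assuming a local server listening on port 8090 and illustrative certificate paths (both are assumptions, not taken from this PR):

```yaml
apiVersion: apiserver.k8s.io/v1beta1
kind: EgressSelectorConfiguration
egressSelections:
- name: cluster
  connection:
    proxyProtocol: HTTPConnect
    transport:
      tcp:
        # Hypothetical local Konnectivity server endpoint on the bootstrap node.
        url: "https://127.0.0.1:8090"
        tlsConfig:
          # Illustrative paths; the PR's actual certificate layout may differ.
          caBundle: /etc/kubernetes/konnectivity/ca.crt
          clientKey: /etc/kubernetes/konnectivity/client.key
          clientCert: /etc/kubernetes/konnectivity/client.crt
```

The file is passed to kube-apiserver via --egress-selector-config-file; traffic for the "cluster" egress selection (webhooks, aggregated APIs) then flows through the proxy instead of the host network.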
Adds error handling to report konnectivity specific failures
when running gather bootstrap or analyze.
This updates all platforms to open the konnectivity port. Baremetal
and on-prem platforms have user-provisioned networks, so those
will need to be handled up front.
@patrickdillon
Contributor Author

/test e2e-vsphere-ovn e2e-nutanix-ovn
/test ?

@patrickdillon
Contributor Author

/test e2e-metal-ipi-ovn
/test e2e-agent-compact-ipv4

@patrickdillon
Contributor Author

We probably want to clean up the konnectivity ports on bootstrap destroy as well.

@patrickdillon
Contributor Author

I have experimented with adding a feature gate to control this and it is possible.

@patrickdillon
Contributor Author

Need to not deploy this on a true single node cluster.
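One way to gate this, sketched below as a hypothetical helper (not from the PR): decide whether to deploy Konnectivity based on the control-plane replica count. The `CONTROL_PLANE_REPLICAS` variable and the function name are illustrative assumptions; the real gating would presumably be rendered into bootkube.sh from the install config.

```shell
#!/usr/bin/env bash
# Hypothetical gating sketch: on a true single-node cluster there are no
# separate nodes for agents to connect back from, so skip deployment.
should_deploy_konnectivity() {
  local replicas="${1:-3}"
  if [ "${replicas}" -le 1 ]; then
    echo "skip"    # single node: bootstrap KAS can reach pods directly
  else
    echo "deploy"  # multi-node: agents on cluster nodes dial the bootstrap server
  fi
}

should_deploy_konnectivity "${CONTROL_PLANE_REPLICAS:-3}"
```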

@JoelSpeed
Contributor

Have read through the changes and the scripts all seem reasonable to me. I'll open a PR to CAPIO that switches us back to Fail webhook policy to test this with

@patrickdillon
Contributor Author

/retest-required

2 similar comments
@patrickdillon
Contributor Author

/retest-required

@patrickdillon
Contributor Author

/retest-required

@tthvo
Member

tthvo commented Mar 12, 2026

/cc @sadasu @jhixson74

@openshift-ci openshift-ci bot requested review from jhixson74 and sadasu March 12, 2026 20:47
@tthvo
Member

tthvo commented Mar 12, 2026

/retest

@openshift-ci
Contributor

openshift-ci bot commented Mar 13, 2026

@patrickdillon: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gcp-custom-endpoints 836e8d2 link false /test e2e-gcp-custom-endpoints
ci/prow/e2e-aws-ovn-shared-vpc-edge-zones 836e8d2 link false /test e2e-aws-ovn-shared-vpc-edge-zones
ci/prow/e2e-aws-ovn-heterogeneous 836e8d2 link false /test e2e-aws-ovn-heterogeneous
ci/prow/e2e-aws-byo-subnet-role-security-groups 836e8d2 link false /test e2e-aws-byo-subnet-role-security-groups
ci/prow/gcp-custom-endpoints-proxy-wif 836e8d2 link false /test gcp-custom-endpoints-proxy-wif
ci/prow/e2e-openstack-ovn 836e8d2 link true /test e2e-openstack-ovn
ci/prow/e2e-gcp-custom-dns 836e8d2 link false /test e2e-gcp-custom-dns
ci/prow/e2e-openstack-proxy 836e8d2 link false /test e2e-openstack-proxy
ci/prow/e2e-azurestack 836e8d2 link false /test e2e-azurestack
ci/prow/e2e-gcp-xpn-custom-dns 836e8d2 link false /test e2e-gcp-xpn-custom-dns

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Member

@tthvo tthvo left a comment


This is pretty cool 😎🔥! I just have some questions and comments while learning/reading about this :D

ToPort: 8091,
SourceSecurityGroupRoles: []capa.SecurityGroupRole{"controlplane", "node"},
},
{
Member

We need to remove this rule when destroying bootstrap, right? This probably means patching the awscluster CR and waiting for the rule to disappear...

💡 Another idea: since this is scoped to only bootstrap node, the installer may pre-create a security group specifically for bootstrap with this rule? This SG can be attached via AdditionalSecurityGroups.

Member

Whoops, I just saw #10344 (comment) so we do need to clean up the rule :D

egressSelections:
- name: "cluster"
connection:
proxyProtocol: "HTTPConnect"
Member

Just curious: Any reason to choose HTTPConnect over gRPC, which should be (theoretically) faster? From the docs 👇

# This controls the protocol between the API Server and the Konnectivity
# server. Supported values are "GRPC" and "HTTPConnect". There is no
# end user visible difference between the two modes. You need to set the
# Konnectivity server to work in the same mode.
proxyProtocol: GRPC

My guess is that it's fine to use gRPC? If so, we need to adjust --mode=grpc in konnectivity-server-pod.yaml

Comment on lines +64 to +66
oc delete namespace openshift-bootstrap-konnectivity \
--kubeconfig=/opt/openshift/auth/kubeconfig \
--ignore-not-found=true || true
Member

Suggested change
oc delete namespace openshift-bootstrap-konnectivity \
--kubeconfig=/opt/openshift/auth/kubeconfig \
--ignore-not-found=true || true
oc delete namespace openshift-bootstrap-konnectivity \
--kubeconfig=/opt/openshift/auth/kubeconfig \
--ignore-not-found=true

I guess we should fail if the cleanup somehow fails (except for not-found), right? Otherwise, resources will be left behind and could potentially "break" the resulting OpenShift cluster?

Comment on lines +10 to +14
{{- if .UseIPv6ForNodeIP }}
BOOTSTRAP_NODE_IP=$(ip -6 -j route get 2001:4860:4860::8888 | jq -r '.[0].prefsrc')
{{- else }}
BOOTSTRAP_NODE_IP=$(ip -j route get 1.1.1.1 | jq -r '.[0].prefsrc')
{{- end }}
Member

@tthvo tthvo Mar 14, 2026

We should also honour the field .BootstrapNodeIP if set via the environment variable OPENSHIFT_INSTALL_BOOTSTRAP_NODE_IP, right?

Tracing back to the commit, it may be necessary for assisted installer 🤔?

bootstrapNodeIP := os.Getenv("OPENSHIFT_INSTALL_BOOTSTRAP_NODE_IP")
if bootstrapNodeIP != "" && net.ParseIP(bootstrapNodeIP) == nil {
logrus.Warnf("OPENSHIFT_INSTALL_BOOTSTRAP_NODE_IP must have valid ip address, given %s. Skipping it", bootstrapNodeIP)
bootstrapNodeIP = ""
}

BootstrapNodeIP string

Member

Maybe we can do something like 👇 WDYT?

{{- if .BootstrapNodeIP }}
      # Use explicitly configured bootstrap node IP
      BOOTSTRAP_NODE_IP="{{.BootstrapNodeIP}}"
      echo "Using configured bootstrap node IP: ${BOOTSTRAP_NODE_IP}"
{{- else }}
      # Detect bootstrap node IP at runtime using the default route source address.
      # Konnectivity agents use this to connect back to the bootstrap server.
  {{- if .UseIPv6ForNodeIP }}
      BOOTSTRAP_NODE_IP=$(ip -6 -j route get 2001:4860:4860::8888 | jq -r '.[0].prefsrc')
  {{- else }}
      BOOTSTRAP_NODE_IP=$(ip -j route get 1.1.1.1 | jq -r '.[0].prefsrc')
  {{- end }}
      echo "Detected bootstrap node IP: ${BOOTSTRAP_NODE_IP}"
{{- end }}

Comment on lines +27 to +31
containers:
- name: konnectivity-agent
image: ${KONNECTIVITY_IMAGE}
command:
- /usr/bin/proxy-agent
Member

nit: we should give this agent container a resource request so that it won't be the first to get evicted if the node is under pressure (theoretically).

As reference, Hypershift sets the following values 👀

Comment on lines +12 to +15
- name: konnectivity-server
image: ${KONNECTIVITY_IMAGE}
command:
- /usr/bin/proxy-server
Member

nit: we should give this server container a resource request so that it won't be the first to get evicted if the node is under pressure (theoretically).

As reference, Hypershift sets the following values 👀

Comment on lines +13 to +16
updateStrategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 10%
Member

nit: these pods only run during bootstrap and will "never?" get updated, so we can just ignore this setting, right 🤔?

Besides, I guess 10% of 3 control plane nodes is ~1 node; thus it is equivalent to maxUnavailable: 1, which is already the default that k8s sets (according to the docs).


Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
