1 change: 1 addition & 0 deletions .wordlist.txt
Original file line number Diff line number Diff line change
Expand Up @@ -82,6 +82,7 @@ hostnames
HSIO
HTTPS
iet
imageRegistrySecrets
IfNotPresent
IgnoreDaemonSets
IgnoreNamespaces
Expand Down
2 changes: 1 addition & 1 deletion Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,7 @@ COPY LICENSE LICENSE
COPY helm-charts-k8s helm-charts-k8s
# need to decompress nfd subchart for k8s chart, in preparation for copying out CRD
RUN cd helm-charts-k8s/charts && \
tar -xvzf node-feature-discovery-chart-0.16.1.tgz
tar -xvzf node-feature-discovery-chart-0.18.3.tgz

ARG TARGET

Expand Down
7 changes: 7 additions & 0 deletions api/v1alpha1/deviceconfig_types.go
Original file line number Diff line number Diff line change
Expand Up @@ -944,6 +944,13 @@ type CommonConfigSpec struct {
// +optional
InitContainerImage string `json:"initContainerImage,omitempty"`

// ImageRegistrySecrets are global secrets used for pull/push images from/to private registries.
// These secrets will be applied to all component pods (device plugin, metrics exporter,
// test runner, config manager, DRA driver, node labeller) in addition to component-specific secrets.
//+operator-sdk:csv:customresourcedefinitions:type=spec,displayName="ImageRegistrySecrets",xDescriptors={"urn:alm:descriptor:com.amd.deviceconfigs:imageRegistrySecrets"}
// +optional
ImageRegistrySecrets []v1.LocalObjectReference `json:"imageRegistrySecrets,omitempty"`

// UtilsContainer contains parameters to configure operator's utils container
//+operator-sdk:csv:customresourcedefinitions:type=spec,displayName="UtilsContainer",xDescriptors={"urn:alm:descriptor:com.amd.deviceconfigs:utilsContainer"}
// +optional
Expand Down
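The new `ImageRegistrySecrets` field is a list of object references tagged `omitempty`, so an unset list should not appear in the serialized spec at all. The following is a minimal sketch of that wire shape, not the operator's code: `LocalObjectReference` is a local stand-in for the Kubernetes `v1.LocalObjectReference` type so the example compiles without the `k8s.io/api` module, the trimmed `CommonConfigSpec` keeps only two of the real struct's fields, and `encode` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// LocalObjectReference is a local stand-in (assumption) for the Kubernetes
// core/v1 type of the same name; only the name field matters here.
type LocalObjectReference struct {
	Name string `json:"name,omitempty"`
}

// CommonConfigSpec mirrors only the two fields visible in the hunk above;
// the real struct in api/v1alpha1 has more fields.
type CommonConfigSpec struct {
	InitContainerImage   string                 `json:"initContainerImage,omitempty"`
	ImageRegistrySecrets []LocalObjectReference `json:"imageRegistrySecrets,omitempty"`
}

// encode is a hypothetical helper that returns the JSON form of a spec.
func encode(spec CommonConfigSpec) string {
	b, err := json.Marshal(spec)
	if err != nil {
		panic(err) // not expected for these plain structs
	}
	return string(b)
}

func main() {
	// With no secrets configured the field is omitted entirely.
	fmt.Println(encode(CommonConfigSpec{}))

	// With one secret the field becomes a list of {"name": ...} objects,
	// which is the same shape the CRD schema describes.
	fmt.Println(encode(CommonConfigSpec{
		ImageRegistrySecrets: []LocalObjectReference{{Name: "my-registry-secret"}},
	}))
}
```

Because of `omitempty`, existing `DeviceConfig` objects that never set the field should serialize exactly as before.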
5 changes: 5 additions & 0 deletions api/v1alpha1/zz_generated.deepcopy.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

8 changes: 8 additions & 0 deletions bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -86,6 +86,14 @@ spec:
path: commonConfig
x-descriptors:
- urn:alm:descriptor:com.amd.deviceconfigs:commonConfig
- description: ImageRegistrySecrets are global secrets used for pull/push images
from/to private registries. These secrets will be applied to all component
pods (device plugin, metrics exporter, test runner, config manager, DRA
driver, node labeller) in addition to component-specific secrets.
displayName: ImageRegistrySecrets
path: commonConfig.imageRegistrySecrets
x-descriptors:
- urn:alm:descriptor:com.amd.deviceconfigs:imageRegistrySecrets
- description: InitContainerImage is being used for the operands pods, i.e.
metrics exporter, test runner, device plugin, device config manager and
node labeller
Expand Down
22 changes: 22 additions & 0 deletions bundle/manifests/amd.com_deviceconfigs.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -49,6 +49,28 @@ spec:
commonConfig:
description: common config
properties:
imageRegistrySecrets:
description: |-
ImageRegistrySecrets are global secrets used for pull/push images from/to private registries.
These secrets will be applied to all component pods (device plugin, metrics exporter,
test runner, config manager, DRA driver, node labeller) in addition to component-specific secrets.
items:
description: |-
LocalObjectReference contains enough information to let you locate the
referenced object inside the same namespace.
properties:
name:
default: ""
description: |-
Name of the referent.
This field is effectively required, but due to backwards compatibility is
allowed to be empty. Instances of this type with an empty value here are
almost certainly wrong.
More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
type: string
type: object
x-kubernetes-map-type: atomic
type: array
initContainerImage:
description: InitContainerImage is being used for the operands
pods, i.e. metrics exporter, test runner, device plugin, device
Expand Down
22 changes: 22 additions & 0 deletions config/crd/bases/amd.com_deviceconfigs.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -45,6 +45,28 @@ spec:
commonConfig:
description: common config
properties:
imageRegistrySecrets:
description: |-
ImageRegistrySecrets are global secrets used for pull/push images from/to private registries.
These secrets will be applied to all component pods (device plugin, metrics exporter,
test runner, config manager, DRA driver, node labeller) in addition to component-specific secrets.
items:
description: |-
LocalObjectReference contains enough information to let you locate the
referenced object inside the same namespace.
properties:
name:
default: ""
description: |-
Name of the referent.
This field is effectively required, but due to backwards compatibility is
allowed to be empty. Instances of this type with an empty value here are
almost certainly wrong.
More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
type: string
type: object
x-kubernetes-map-type: atomic
type: array
initContainerImage:
description: InitContainerImage is being used for the operands
pods, i.e. metrics exporter, test runner, device plugin, device
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -53,6 +53,14 @@ spec:
path: commonConfig
x-descriptors:
- urn:alm:descriptor:com.amd.deviceconfigs:commonConfig
- description: ImageRegistrySecrets are global secrets used for pull/push images
from/to private registries. These secrets will be applied to all component
pods (device plugin, metrics exporter, test runner, config manager, DRA
driver, node labeller) in addition to component-specific secrets.
displayName: ImageRegistrySecrets
path: commonConfig.imageRegistrySecrets
x-descriptors:
- urn:alm:descriptor:com.amd.deviceconfigs:imageRegistrySecrets
- description: InitContainerImage is being used for the operands pods, i.e.
metrics exporter, test runner, device plugin, device config manager and
node labeller
Expand Down
22 changes: 21 additions & 1 deletion docs/installation/kubernetes-helm.md
Original file line number Diff line number Diff line change
Expand Up @@ -139,7 +139,26 @@ Installation Options
```{tip}
1. Before v1.3.0, the GPU Operator Helm chart does not provide a default ```DeviceConfig```; you need to take an extra step to create one.

2. Starting from v1.3.0, the ```helm install``` command supports one-step installation and configuration: it creates a default ```DeviceConfig``` with default values. These defaults may not suit every deployment scenario; refer to {ref}`typical-deployment-scenarios` for more information and the corresponding ```helm install``` commands.

3. Global Image Pull Secrets (v1.5.0+): If you need to pull images from private registries or avoid Docker Hub rate limits, you can configure global image pull secrets that will be automatically applied to all components:

```bash
# Create your image pull secret first
kubectl create secret docker-registry my-registry-secret \
--docker-server=<your-registry-server> \
--docker-username=<your-username> \
--docker-password=<your-password/token> \
--namespace=kube-amd-gpu

# Install with global secret
helm install amd-gpu-operator rocm/gpu-operator-charts \
--namespace kube-amd-gpu \
--create-namespace \
--version=v1.5.0 \
--set global.imagePullSecrets[0].name=my-registry-secret
```

```

### 3. Helm Chart Customization Parameters
Expand Down Expand Up @@ -171,6 +190,7 @@ The following parameters are able to be configued when using the Helm Chart. In
| controllerManager.manager.resources.requests.memory | string | `"256Mi"` | Memory requests for the controller manager. Adjust based on observed memory usage |
| controllerManager.nodeAffinity.nodeSelectorTerms | list | `[{"key":"node-role.kubernetes.io/control-plane","operator":"Exists"},{"key":"node-role.kubernetes.io/master","operator":"Exists"}]` | Node affinity selector terms config for the AMD GPU operator controller manager, set it to [] if you want to make affinity config empty |
| controllerManager.nodeSelector | object | `{}` | Node selector for AMD GPU operator controller manager deployment |
| global.imagePullSecrets | list | `[]` | Global image pull secret(s) applied to all component pods. Automatically inherited by controller, hooks, DeviceConfig components, and KMM. Format: `[{"name": "mySecret"}]` |
| installdefaultNFDRule | bool | `true` | Set to true to install default NFD rule for detecting AMD GPU hardware based on pci vendor ID and device ID |
| kmm.enabled | bool | `true` | Set to true/false to enable/disable the installation of kernel module management (KMM) operator |
| kmm.watch | bool | `true` | Set to true/false to enable/disable GPU operator watching and using KMM resources |
Expand Down
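The `kubectl create secret docker-registry` step in the tip above stores the registry credentials as a `.dockerconfigjson` payload. As a rough sketch of what that payload looks like, the Go program below builds the same `auths` structure by hand. The server, username, and password values are placeholders, and `dockerConfigJSON` and `decodeAuth` are hypothetical helper names used only for this illustration.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// authEntry is one registry's credentials inside the "auths" map.
type authEntry struct {
	Username string `json:"username"`
	Password string `json:"password"`
	Auth     string `json:"auth"` // base64 of "username:password"
}

// dockerConfig is the top-level document stored under .dockerconfigjson.
type dockerConfig struct {
	Auths map[string]authEntry `json:"auths"`
}

// dockerConfigJSON (hypothetical helper) builds the payload for one registry.
func dockerConfigJSON(server, user, password string) (string, error) {
	cfg := dockerConfig{Auths: map[string]authEntry{
		server: {
			Username: user,
			Password: password,
			Auth:     base64.StdEncoding.EncodeToString([]byte(user + ":" + password)),
		},
	}}
	b, err := json.Marshal(cfg)
	return string(b), err
}

// decodeAuth (hypothetical helper) reads the payload back and returns the
// decoded "username:password" pair for the given server.
func decodeAuth(payload, server string) (string, error) {
	var cfg dockerConfig
	if err := json.Unmarshal([]byte(payload), &cfg); err != nil {
		return "", err
	}
	entry, ok := cfg.Auths[server]
	if !ok {
		return "", fmt.Errorf("no auth entry for %q", server)
	}
	raw, err := base64.StdEncoding.DecodeString(entry.Auth)
	return string(raw), err
}

func main() {
	// Placeholder values, not real credentials.
	payload, err := dockerConfigJSON("registry.example.com", "alice", "s3cret")
	if err != nil {
		panic(err)
	}
	fmt.Println(payload)
}
```

In practice `kubectl` generates this document for you; the sketch is only meant to make clear that the secret named in `global.imagePullSecrets` must be of the docker-registry kind and must live in the same namespace as the pods that use it.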
3 changes: 3 additions & 0 deletions docs/releasenotes.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,9 @@
- Supports multiple deployment scenarios: use existing KMM installations (`enabled=false, watch=true`), skip KMM entirely for alternative driver solutions (`enabled=false, watch=false`), or install KMM without asking for GPU Operator to use it (`enabled=true, watch=false`)
- Fully backward compatible: existing configurations with `kmm.enabled=false` continue to work without changes

- **Node Feature Discovery (NFD) Upgrade**
- Upgraded NFD helm chart dependency from v0.16.1 to v0.18.3

## GPU Operator v1.4.1 Release Notes

The AMD GPU Operator v1.4.1 release extends platform support to OpenShift v4.20 and Debian 12, and introduces the ability to build `amdgpu` kernel modules directly within air-gapped OpenShift clusters.
Expand Down
6 changes: 3 additions & 3 deletions docs/specialized_networks/airgapped-install.md
Original file line number Diff line number Diff line change
Expand Up @@ -38,7 +38,7 @@ docker.io/ubuntu:<Ubuntu OS version>
docker.io/busybox:1.36

# Node Feature Discovery
registry.k8s.io/nfd/node-feature-discovery:v0.16.1
registry.k8s.io/nfd/node-feature-discovery:v0.18.3

# Cert-Manager Images
quay.io/jetstack/cert-manager-controller:v1.15.1
Expand Down Expand Up @@ -84,7 +84,7 @@ INTERNAL_REGISTRY="internal-registry.example.com"
OPERATOR_VERSION="v1.4.1" # GPU operator version, e.g., "v1.5.0"
UBUNTU_VERSION="22.04" # e.g., "22.04"
KANIKO_VERSION="v1.23.2"
NFD_VERSION="v0.16.1"
NFD_VERSION="v0.18.3"
CERT_MANAGER_VERSION="v1.15.1"
BUSYBOX_VERSION="1.36"

Expand Down Expand Up @@ -225,7 +225,7 @@ deviceConfig:
node-feature-discovery:
image:
repository: internal-registry.example.com/nfd/node-feature-discovery
tag: v0.16.1
tag: v0.18.3

# KMM (Kernel Module Management) image configuration
kmm:
Expand Down
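The air-gapped guide lists upstream images such as `registry.k8s.io/nfd/node-feature-discovery:v0.18.3` and later points the chart at `internal-registry.example.com/nfd/node-feature-discovery`. One way to derive the internal reference from the upstream one is sketched below. `mirrorRef` is a hypothetical helper, and the rule it uses (drop the upstream registry host, keep the rest of the path) is an assumption that happens to match the NFD example; the guide's own mirroring script may lay out paths differently.

```go
package main

import (
	"fmt"
	"strings"
)

// mirrorRef (hypothetical helper) rewrites an image reference so that it
// points at the internal registry. The first path component is treated as a
// registry host when it contains a dot or a colon, or equals "localhost".
func mirrorRef(internalRegistry, ref string) string {
	first, rest, found := strings.Cut(ref, "/")
	if found && (strings.ContainsAny(first, ".:") || first == "localhost") {
		// Replace the upstream registry host with the internal one.
		return internalRegistry + "/" + rest
	}
	// No registry host present: just prefix the internal registry.
	return internalRegistry + "/" + ref
}

func main() {
	internal := "internal-registry.example.com"
	for _, ref := range []string{
		"registry.k8s.io/nfd/node-feature-discovery:v0.18.3",
		"quay.io/jetstack/cert-manager-controller:v1.15.1",
		"docker.io/busybox:1.36",
	} {
		fmt.Println(ref, "->", mirrorRef(internal, ref))
	}
}
```

Whatever mapping is used, the `repository` and `tag` values passed to the chart must match the references that were actually pushed to the internal registry.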
2 changes: 1 addition & 1 deletion example/gpu-validation-cluster/build/Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ FROM ${BASE_IMAGE}

ARG K3S_VERSION=v1.35.0+k3s1
ARG MULTUS_CNI_VERSION=v4.2.2
ARG NFD_VERSION=v0.16.1
ARG NFD_VERSION=v0.18.3

# Install required system packages
RUN apt-get update && apt-get install -y \
Expand Down
6 changes: 4 additions & 2 deletions hack/k8s-patch/k8s-kmm-patch/metadata-patch/values.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -13,9 +13,11 @@ controller:
relatedImageWorker: docker.io/rocm/kernel-module-management-worker:latest
# -- Image pull secret name for pulling KMM kaniko builder image if registry needs credential to pull image
relatedImageBuildPullSecret: ""
# -- Image pull secret name for pulling KMM signer image if registry needs credential to pull image
# -- Image pull secret name for pulling KMM signer image if registry needs credential to pull image.
# If not set and global.imagePullSecrets is configured, the first global secret will be used automatically.
relatedImageSignPullSecret: ""
# -- Image pull secret name for pulling KMM worker image if registry needs credential to pull image
# -- Image pull secret name for pulling KMM worker image if registry needs credential to pull image.
# If not set and global.imagePullSecrets is configured, the first global secret will be used automatically.
relatedImageWorkerPullSecret: ""
image:
# -- KMM controller manager image repository
Expand Down
26 changes: 18 additions & 8 deletions hack/k8s-patch/k8s-kmm-patch/template-patch/deployment.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -45,17 +45,17 @@ spec:
value: {{ quote .Values.controller.manager.env.relatedImageSign }}
- name: KUBERNETES_CLUSTER_DOMAIN
value: {{ quote .Values.kubernetesClusterDomain }}
{{- if .Values.controller.manager.env.relatedImageBuildPullSecret }}
{{- if or .Values.controller.manager.env.relatedImageBuildPullSecret .Values.global.imagePullSecrets }}
- name: RELATED_IMAGE_BUILD_PULL_SECRET
value: {{ .Values.controller.manager.env.relatedImageBuildPullSecret }}
value: {{ .Values.controller.manager.env.relatedImageBuildPullSecret | default (index .Values.global.imagePullSecrets 0).name | default "" }}
{{- end}}
{{- if .Values.controller.manager.env.relatedImageSignPullSecret }}
{{- if or .Values.controller.manager.env.relatedImageSignPullSecret .Values.global.imagePullSecrets }}
- name: RELATED_IMAGE_SIGN_PULL_SECRET
value: {{ .Values.controller.manager.env.relatedImageSignPullSecret }}
value: {{ .Values.controller.manager.env.relatedImageSignPullSecret | default (index .Values.global.imagePullSecrets 0).name | default "" }}
{{- end}}
{{- if .Values.controller.manager.env.relatedImageWorkerPullSecret }}
{{- if or .Values.controller.manager.env.relatedImageWorkerPullSecret .Values.global.imagePullSecrets }}
- name: RELATED_IMAGE_WORKER_PULL_SECRET
value: {{ .Values.controller.manager.env.relatedImageWorkerPullSecret }}
value: {{ .Values.controller.manager.env.relatedImageWorkerPullSecret | default (index .Values.global.imagePullSecrets 0).name | default "" }}
{{- end}}
{{- if .Values.global.proxy.env | default dict}}
{{- range $key, $value := .Values.global.proxy.env }}
Expand Down Expand Up @@ -90,9 +90,14 @@ spec:
- mountPath: /controller_config.yaml
name: manager-config
subPath: controller_config.yaml
{{- if .Values.controller.manager.imagePullSecrets }}
{{- if or .Values.global.imagePullSecrets .Values.controller.manager.imagePullSecrets }}
imagePullSecrets:
{{- range .Values.global.imagePullSecrets }}
- {{ toYaml . | nindent 8 }}
{{- end }}
{{- if .Values.controller.manager.imagePullSecrets }}
- name: {{ .Values.controller.manager.imagePullSecrets }}
{{- end }}
{{- end}}
securityContext:
runAsNonRoot: true
Expand Down Expand Up @@ -184,9 +189,14 @@ spec:
- mountPath: /controller_config.yaml
name: manager-config
subPath: controller_config.yaml
{{- if .Values.webhookServer.webhookServer.imagePullSecrets }}
{{- if or .Values.global.imagePullSecrets .Values.webhookServer.webhookServer.imagePullSecrets }}
imagePullSecrets:
{{- range .Values.global.imagePullSecrets }}
- {{ toYaml . | nindent 8 }}
{{- end }}
{{- if .Values.webhookServer.webhookServer.imagePullSecrets }}
- name: {{ .Values.webhookServer.webhookServer.imagePullSecrets }}
{{- end }}
{{- end}}
securityContext:
runAsNonRoot: true
Expand Down
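One thing worth checking in the `RELATED_IMAGE_*_PULL_SECRET` lines above: Go templates evaluate function arguments before calling the function, so `default (index .Values.global.imagePullSecrets 0).name` runs `index` even when the component-specific secret is set. If that reading is right, a release that sets only `relatedImageBuildPullSecret` and leaves `global.imagePullSecrets` at `[]` could fail to render. The sketch below reproduces the pattern with the standard `text/template` package. `defaultFn` is a simplified stand-in for Sprig's `default`, the `.override` and `.global` keys are hypothetical substitutes for the chart values, and the guarded template is one possible alternative rather than the chart's actual fix.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// defaultFn is a simplified stand-in (assumption) for Sprig's `default`:
// it returns def when the piped-in value is nil or an empty string.
func defaultFn(def any, given any) any {
	if given == nil {
		return def
	}
	if s, ok := given.(string); ok && s == "" {
		return def
	}
	return given
}

// unguarded mirrors the pattern in the diff: index is evaluated eagerly.
const unguarded = `{{ .override | default (index .global 0).name }}`

// guarded only touches the list after checking that it is non-empty.
const guarded = `{{ if .override }}{{ .override }}{{ else if .global }}{{ (index .global 0).name }}{{ end }}`

// render executes a template with the given override and global list.
func render(tmplText, override string, global []any) (string, error) {
	t, err := template.New("env").Funcs(template.FuncMap{"default": defaultFn}).Parse(tmplText)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	data := map[string]any{"override": override, "global": global}
	if err := t.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	none := []any{}
	oneGlobal := []any{map[string]any{"name": "global-secret"}}

	_, err := render(unguarded, "component-secret", none)
	fmt.Println("unguarded, override set, no globals, failed:", err != nil)

	out, err := render(guarded, "component-secret", none)
	fmt.Println("guarded, override set, no globals:", out, err)

	out, err = render(guarded, "", oneGlobal)
	fmt.Println("guarded, no override, one global:", out, err)
}
```

The guarded form trades a one-line pipeline for an explicit `if`/`else if`, which keeps the fallback to the first global secret while avoiding the lookup on an empty list.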
2 changes: 1 addition & 1 deletion hack/k8s-patch/metadata-patch/Chart.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@ appVersion: "v1.4.0"

dependencies:
- name: node-feature-discovery
version: v0.16.1
version: v0.18.3
repository: "https://kubernetes-sigs.github.io/node-feature-discovery/charts"
condition: node-feature-discovery.enabled
- name: kmm
Expand Down
19 changes: 19 additions & 0 deletions hack/k8s-patch/metadata-patch/values.yaml
Original file line number Diff line number Diff line change
@@ -1,5 +1,7 @@
# NFD related configs
# schema reference: https://github.com/kubernetes-sigs/node-feature-discovery/blob/release-0.16/deployment/helm/node-feature-discovery/values.yaml
# Note: To use global secrets, set imagePullSecrets of NFD subchart itself, global.imagePullSecrets will not be automatically inherited by NFD subchart.
# Example: node-feature-discovery.imagePullSecrets: [{"name": "my-secret"}]
node-feature-discovery:
# -- Set to true/false to enable/disable the installation of node feature discovery (NFD) operator
enabled: true
Expand All @@ -16,6 +18,9 @@ node-feature-discovery:
# -- Set nodeSelector for NFD worker daemonset
nodeSelector: {}
# KMM related configs
# Note: KMM automatically inherits global.imagePullSecrets. You can override or supplement
# with component-specific secrets using controller.manager.imagePullSecrets and
# webhookServer.webhookServer.imagePullSecrets
kmm:
# -- Set to true/false to enable/disable the installation of kernel module management (KMM) operator subchart
enabled: true
Expand Down Expand Up @@ -372,5 +377,19 @@ utilsContainer:
serviceAccount:
annotations: {}
global:
# -- Global image pull secret(s) applied to all component pods and to the KMM subchart (the NFD subchart is not covered; see the note below).
# If specified, these secrets will be used by:
# - GPU operator controller manager deployment
# - Remediation workflow controller
# - All helm hooks (pre-upgrade, pre-delete, post-delete)
# - DeviceConfig-managed components (via commonConfig.imageRegistrySecrets)
# - KMM controller and webhook pods (automatically inherited)
# - KMM builder/signer/worker pods (automatically uses first secret as fallback)
#
# Format: [{"name": "myGlobalSecret"}] or [{"name": "secret1"}, {"name": "secret2"}]
#
# Note: For NFD subchart, you must manually set the field to match global secrets:
# node-feature-discovery.imagePullSecrets: [{"name": "myGlobalSecret"}]
imagePullSecrets: []
proxy:
env: {}
10 changes: 10 additions & 0 deletions hack/k8s-patch/template-patch/default-deviceconfig.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -107,6 +107,16 @@ spec:
initContainerImage: {{ . }}
{{- end }}

{{- if or .imageRegistrySecrets $.Values.global.imagePullSecrets }}
imageRegistrySecrets:
{{- if $.Values.global.imagePullSecrets }}
{{- toYaml $.Values.global.imagePullSecrets | nindent 6 }}
{{- end }}
{{- if .imageRegistrySecrets }}
{{- toYaml .imageRegistrySecrets | nindent 6 }}
{{- end }}
{{- end }}

{{- with .utilsContainer }}
utilsContainer:
{{- with .image }}
Expand Down
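The template above renders the global secrets first and then appends the component-level `imageRegistrySecrets`, so a secret named in both places would appear twice in the resulting list. The following sketch shows the same ordering with duplicates and empty names filtered out. It is an illustration only: `mergePullSecrets` is a hypothetical helper, `LocalObjectReference` is a local stand-in for the Kubernetes type, and the de-duplication step is an assumption, not something the diff shows the chart or the operator doing.

```go
package main

import "fmt"

// LocalObjectReference is a local stand-in (assumption) for the Kubernetes
// core/v1 type of the same name.
type LocalObjectReference struct {
	Name string
}

// mergePullSecrets (hypothetical helper) returns the global secrets followed
// by the component-specific ones, skipping empty names and names that were
// already added.
func mergePullSecrets(global, component []LocalObjectReference) []LocalObjectReference {
	seen := make(map[string]bool, len(global)+len(component))
	merged := make([]LocalObjectReference, 0, len(global)+len(component))
	for _, list := range [][]LocalObjectReference{global, component} {
		for _, ref := range list {
			if ref.Name == "" || seen[ref.Name] {
				continue
			}
			seen[ref.Name] = true
			merged = append(merged, ref)
		}
	}
	return merged
}

func main() {
	global := []LocalObjectReference{{Name: "my-registry-secret"}}
	component := []LocalObjectReference{{Name: "my-registry-secret"}, {Name: "exporter-secret"}}
	for _, ref := range mergePullSecrets(global, component) {
		fmt.Println(ref.Name)
	}
}
```

Keeping the global entries first matches the order used by the template, so the first entry in the merged list is still the first global secret.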
7 changes: 6 additions & 1 deletion hack/k8s-patch/template-patch/deployment.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -67,9 +67,14 @@ spec:
- mountPath: /controller_manager_config.yaml
name: manager-config
subPath: controller_manager_config.yaml
{{- if .Values.controllerManager.manager.imagePullSecrets }}
{{- if or .Values.global.imagePullSecrets .Values.controllerManager.manager.imagePullSecrets }}
imagePullSecrets:
{{- range .Values.global.imagePullSecrets }}
- {{ toYaml . | nindent 8 }}
{{- end }}
{{- if .Values.controllerManager.manager.imagePullSecrets }}
- name: {{ .Values.controllerManager.manager.imagePullSecrets }}
{{- end }}
{{- end}}
securityContext:
runAsNonRoot: true
Expand Down
7 changes: 6 additions & 1 deletion hack/k8s-patch/template-patch/post-delete-hook.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -131,10 +131,15 @@ spec:
kubectl delete crds nodemodulesconfigs.kmm.sigs.x-k8s.io
fi
{{- end }}
{{- if .Values.controllerManager.manager.imagePullSecrets }}
{{- if or .Values.global.imagePullSecrets .Values.controllerManager.manager.imagePullSecrets }}
imagePullSecrets:
{{- range .Values.global.imagePullSecrets }}
- {{ toYaml . | nindent 8 }}
{{- end }}
{{- if .Values.controllerManager.manager.imagePullSecrets }}
- name: {{ .Values.controllerManager.manager.imagePullSecrets }}
{{- end }}
{{- end }}
{{- with .Values.controllerManager.manager.tolerations }}
tolerations:
{{- toYaml . | nindent 8 }}
Expand Down