Hello Team,
We are currently facing an issue with PingDirectory pod termination in our test environment following the recent upgrade from PD 2307‑9.2.0.1 to 2601‑10.3.0.2.
Post‑upgrade, each pod takes approximately 5 minutes to terminate, and the replacement pod needs more than 3 minutes to start. Every pod is consistently force‑terminated at the 5‑minute mark. Our deployment consists of 3 replicas, and all exhibit the same pattern.
On reviewing the pod events, we noticed that each pod begins the termination process immediately but only completes after the full 5 minutes. Please note that a utility sidecar runs alongside the main container.
To troubleshoot, we tried the following:
- Adjusting readiness and liveness probe settings
- Adding a preStop lifecycle hook
- Tuning other probe thresholds
However, none of these changes altered the termination duration. A full rolling restart of the 3 replicas now takes approximately 25 minutes, significantly longer than what we observed before the upgrade.
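For context on what we believe is happening, the kubelet sequence (SIGTERM, wait out the grace period, then SIGKILL) can be reproduced locally with the sketch below. It is purely illustrative: the 2-second grace period and the child process that ignores SIGTERM are stand-ins, not our actual configuration.

```shell
#!/bin/sh
# Local sketch of the kubelet termination sequence:
# SIGTERM -> wait terminationGracePeriodSeconds -> SIGKILL.

GRACE=2   # stand-in for terminationGracePeriodSeconds

# A "container" process that ignores SIGTERM, like a server that
# does not shut down on the polite signal.
sh -c 'trap "" TERM; sleep 60' &
PID=$!

kill -TERM "$PID"     # step 1: polite shutdown request
sleep "$GRACE"        # step 2: wait out the grace period

# step 3: if the process is still alive, force-kill it
if kill -0 "$PID" 2>/dev/null; then
  echo "still running after grace period -> SIGKILL"
  kill -KILL "$PID"
fi
wait "$PID" 2>/dev/null
echo "terminated"
```

A pod that is always killed exactly at the deadline, as ours is at 5 minutes, behaves like the ignored-SIGTERM case above: the main process never exits on its own within the grace period.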
We would appreciate your support in analyzing and resolving this issue.
These are the events we observed while a pod is terminating:
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Topology Spread Constraints: topology.kubernetes.io/zone:DoNotSchedule when max skew 1 is exceeded for selector app.kubernetes.io/instance=directory-1,app.kubernetes.io/name=pingdirectory
Events:
Type     Reason     Age                  From     Message
----     ------     ---                  ----     -------
Normal   Killing    4m59s                kubelet  Stopping container pingdirectory
Normal   Killing    4m59s                kubelet  Stopping container utility-sidecar
Warning  Unhealthy  91s (x2 over 3m31s)  kubelet  Readiness probe errored and resulted in unknown state: rpc error: code = Unknown desc = failed to exec in container: container is in CONTAINER_EXITED state
These are the events we observed when the replacement pod is created, right after the 5 minutes:
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Topology Spread Constraints: topology.kubernetes.io/zone:DoNotSchedule when max skew 1 is exceeded for selector app.kubernetes.io/instance=directory-1,app.kubernetes.io/name=pingdirectory
Events:
Type    Reason     Age  From               Message
----    ------     ---  ----               -------
Normal  Scheduled  3s   default-scheduler  Successfully assigned ciam-test/directory-1-pingdirectory-2 to ip-10-186-0-123.eu-central-1.compute.internal
This is our pod definition (Helm values):
global:
  annotations:
    application_service: "CIAM - test"
    spoc: "Marta Miszczyk - mmis@nuuday.dk"
  workload:
    annotations:
      application_service: "CIAM - test"
      spoc: "Marta Miszczyk - mmis@nuuday.dk"
    topologySpreadConstraints:
      - topologyKey: topology.kubernetes.io/zone
        maxSkew: 1
        whenUnsatisfiable: DoNotSchedule
  ingress:
    addReleaseNameToHost: none
    defaultDomain: test2.ciam.non-prod.managed-eks.aws.nuuday.nu
    annotations:
      nginx.ingress.kubernetes.io/backend-protocol: HTTPS
      nginx.ingress.kubernetes.io/proxy-body-size: 10m
      cert-manager.io/issuer: letsencrypt
  container:
    terminationGracePeriodSeconds: 120
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 15"]
  probes:
    livenessProbe:
      exec:
        command:
          - /opt/liveness.sh
      failureThreshold: 4
      initialDelaySeconds: 180
      periodSeconds: 30
      successThreshold: 1
      timeoutSeconds: 5
    readinessProbe:
      exec:
        command:
          - /opt/readiness.sh
      failureThreshold: 4
      initialDelaySeconds: 180
      periodSeconds: 120
      successThreshold: 1
      timeoutSeconds: 5
    startupProbe:
      exec:
        command:
          - /opt/liveness.sh
      failureThreshold: 4
      initialDelaySeconds: 180
      periodSeconds: 60
      timeoutSeconds: 10
pingdirectory:
  rbac:
    serviceAccountName: pd-backup-test
  cronjob:
    enabled: true
    spec:
      schedule: "0 1 * * *"
      failedJobsHistoryLimit: 1
      jobTemplate:
        spec:
          backoffLimit: 3
          template:
            spec:
              volumes:
                - name: backup-script-trigger
                  configMap:
                    name: pingdirectory-backup-trigger-1
                    defaultMode: 0755
              restartPolicy: OnFailure
              serviceAccountName: pd-backup-test
              containers:
                - name: pingdirectory-backup-cronjob
                  image: heyvaldemar/aws-kubectl:latest
                  command: ["/bin/sh"]
                  args: ["-c", "/opt/in/backup-trigger.sh"]
                  volumeMounts:
                    - name: backup-script-trigger
                      mountPath: /opt/in/backup-trigger.sh
                      subPath: backup-trigger.sh
                  resources:
                    requests:
                      cpu: 250m
                      memory: 1Gi
                    limits:
                      cpu: 500m
                      memory: 1Gi
                  securityContext:
                    allowPrivilegeEscalation: false
                    capabilities:
                      drop:
                        - ALL
  volumes:
    - name: temp
      emptyDir: {}
    - name: backup-script
      configMap:
        name: pingdirectory-backup
        defaultMode: 0755
  utilitySidecar:
    enabled: true
    volumes:
      - name: backup-script
        mountPath: /opt/in/backup.sh
        subPath: backup.sh
      - name: pingdirectory-secrets
        mountPath: /ciam-secrets/backup.pin
        subPath: backup.pin
      - name: temp
        mountPath: /opt/backup
    env:
      - name: BUCKET
        value: "dk-nuuday-ciam-backup-test"
      - name: BACKUP
        value: "backup-1"
      - name: EXPORTLDIF
        value: "export-ldif-1"
  image:
    repository: 435576480396.dkr.ecr.eu-north-1.amazonaws.com
    name: ciam-pd
    tag: latest
  enabled: true
  container:
    resources:
      requests:
        cpu: 3.5
        memory: 20Gi
      limits:
        cpu: 3.5
        memory: 20Gi
  probes:
    readinessProbe:
      failureThreshold: 20
      periodSeconds: 120
  replicaCount: 3
  envFrom:
    - secretRef:
        name: pingdirectory-secrets
        optional: false
  envs:
    ROOT_USER_PASSWORD_FILE: /ciam-secrets/root-user-password
    ENCRYPTION_PASSWORD_FILE: /ciam-secrets/encryption-password
    ADMIN_USER_PASSWORD_FILE: /ciam-secrets/admin-user-password
    USER_BASE_DN: dc=nuuday,dc=dk
    MAKELDIF_USERS: "10"
    PD_REBUILD_ON_RESTART: "false"
    PF_ENGINE_NODE: "federate-engine.test2.ciam.non-prod.managed-eks.aws.nuuday.nu"
    MAX_HEAP_SIZE: "14g"
    random: ${RANDOM_PLACEHOLDER}
    CIAM_ACA_REDIRECT_URI: https://aca.test2.ciam.non-prod.managed-eks.aws.nuuday.nu/invite
  secretVolumes:
    pingdirectory-secrets:
      items:
        CIAM_ROOT_USER_PASSWORD: /ciam-secrets/root-user-password
        CIAM_ADMIN_USER_PASSWORD: /ciam-secrets/admin-user-password
        CIAM_ENCRYPTION_PASSWORD: /ciam-secrets/encryption-password
        CIAM_BACKUP_PASSWORD: /ciam-secrets/backup.pin
    newping-license:
      items:
        pd-license-v10.lic: /opt/in/pd.profile/server-root/pre-setup/PingDirectory.lic
  ingress:
    enabled: true
    tls:
      - hosts:
          - pd.test2.ciam.nuuday.dk
          - directory._defaultDomain_
        secretName: pf-directory-2-ssl
    hosts:
      - host: pd.test2.ciam.nuuday.dk
        paths:
          - path: /scim/v2
            pathType: Prefix
            backend:
              serviceName: https
          - path: /directory/v1
            pathType: Prefix
            backend:
              serviceName: https
      - host: directory._defaultDomain_
        paths:
          - path: /scim/v2
            pathType: Prefix
            backend:
              serviceName: https
    annotations:
      nginx.ingress.kubernetes.io/backend-protocol: HTTPS
      nginx.ingress.kubernetes.io/proxy-body-size: 10m
      cert-manager.io/issuer: letsencrypt
  workload:
    type: StatefulSet
    statefulSet:
      persistentvolume:
        volumes:
          out-dir:
            persistentVolumeClaim:
              storageClassName: gp3
              accessModes:
                - ReadWriteOnce
              resources:
                requests:
                  storage: 100Gi
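One observation on the values above: the current preStop hook only sleeps for 15 seconds and never asks the directory server to shut down, so termination still depends entirely on the main process reacting to SIGTERM. A variant we are considering, sketched below, would invoke PingDirectory's stop-server tool in the preStop hook instead. The /opt/out/instance path is the default SERVER_ROOT_DIR in Ping's DevOps images and is an assumption here; it would need to be verified against our ciam-pd image before use.

```yaml
container:
  terminationGracePeriodSeconds: 120
  lifecycle:
    preStop:
      exec:
        # Assumption: /opt/out/instance is SERVER_ROOT_DIR in this image;
        # stop-server performs a graceful PingDirectory shutdown.
        command: ["sh", "-c", "/opt/out/instance/bin/stop-server"]
```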