Issue Observed with PingDirectory Pod Termination After Upgrade #612

@arunplm109083

Hello Team,
We are facing an issue with PingDirectory pod termination in our test environment following the recent upgrade from PD 2307‑9.2.0.1 to 2601‑10.3.0.2.
Since the upgrade, each pod takes approximately 5 minutes to terminate, and a new pod then needs more than 3 minutes to start. Each pod is consistently force‑terminated at the 5‑minute mark. Our deployment consists of 3 replicas, and all of them show the same pattern.
The pod events show that each pod begins the termination process but only completes it after the full 5 minutes. Please note that a utility sidecar runs alongside the main container.
To troubleshoot, we tried the following:

Adjusting readiness and liveness probe settings
Adding a preStop lifecycle hook
Tuning other probe thresholds

However, none of these changes altered the termination duration. The overall restart cycle across all three replicas now takes approximately 25 minutes, which is significantly longer than before the upgrade.
We request your support in analyzing and resolving this issue.
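The reported figures can be cross-checked with a rough calculation. This is only a sketch: it assumes the three pods are replaced strictly one after another, and it uses 180 s as a lower bound for start-up, taken from the "more than 3 minutes" mentioned above. The function and constant names are illustrative.

```python
# Rough estimate of one full restart cycle, using the figures reported above.
REPLICAS = 3
TERMINATION_SECONDS = 5 * 60   # each pod is force-terminated at the 5-minute mark
STARTUP_SECONDS = 3 * 60       # "more than 3 minutes", so this is a lower bound


def cycle_seconds(replicas: int, terminate_s: int, startup_s: int) -> int:
    """Total time if pods are replaced sequentially, one after another (assumption)."""
    return replicas * (terminate_s + startup_s)


total = cycle_seconds(REPLICAS, TERMINATION_SECONDS, STARTUP_SECONDS)
print(f"{total} s = {total / 60:.0f} min")
```

Under these assumptions the cycle takes at least 24 minutes, which is in line with the roughly 25 minutes we observe. The termination wait alone accounts for 15 of those minutes.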

These are the events we observed while a pod was terminating:

Node-Selectors:               <none>
Tolerations:                  node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                              node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Topology Spread Constraints:  topology.kubernetes.io/zone:DoNotSchedule when max skew 1 is exceeded for selector app.kubernetes.io/instance=directory-1,app.kubernetes.io/name=pingdirectory
Events:
  Type     Reason     Age                  From     Message
  ----     ------     ----                 ----     -------
  Normal   Killing    4m59s                kubelet  Stopping container pingdirectory
  Normal   Killing    4m59s                kubelet  Stopping container utility-sidecar
  Warning  Unhealthy  91s (x2 over 3m31s)  kubelet  Readiness probe errored and resulted in unknown state: rpc error: code = Unknown desc = failed to exec in container: container is in CONTAINER_EXITED state
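To make the mismatch explicit, the event ages above can be converted to seconds and compared with the `terminationGracePeriodSeconds: 120` set in the pod definition further down. This is a minimal sketch: `age_to_seconds` is a hypothetical helper written for this comparison, not part of kubectl, and it assumes the events were captured just before the pod disappeared.

```python
import re

CONFIGURED_GRACE_SECONDS = 120  # terminationGracePeriodSeconds from our pod definition


def age_to_seconds(age: str) -> int:
    """Convert a kubectl-style age string such as '4m59s' or '91s' to seconds."""
    units = {"d": 86400, "h": 3600, "m": 60, "s": 1}
    parts = re.findall(r"(\d+)([dhms])", age)
    if not parts:
        raise ValueError(f"unrecognised age: {age!r}")
    return sum(int(value) * units[unit] for value, unit in parts)


killing_age = age_to_seconds("4m59s")  # age of the "Killing" events
overrun = killing_age - CONFIGURED_GRACE_SECONDS
print(f"Killing event age: {killing_age} s, beyond configured grace period: {overrun} s")
```

The "Killing" events are 299 s old while the pod still exists, so the pod outlives the 120 s grace period we configured by about 3 minutes. That suggests the effective grace period on the running pod may differ from the value we set, although we have not confirmed this.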

These are the events we observed when the replacement pod was created, right after the 5 minutes:

Node-Selectors:               <none>
Tolerations:                  node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                              node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Topology Spread Constraints:  topology.kubernetes.io/zone:DoNotSchedule when max skew 1 is exceeded for selector app.kubernetes.io/instance=directory-1,app.kubernetes.io/name=pingdirectory
Events:
  Type     Reason              Age   From                     Message
  ----     ------              ----  ----                     -------
  Normal   Scheduled           3s    default-scheduler        Successfully assigned ciam-test/directory-1-pingdirectory-2 to ip-10-186-0-123.eu-central-1.compute.internal


This is our pod definition (Helm values file):

global:
  annotations:
    application_service: "CIAM - test"
    spoc: "Marta Miszczyk - mmis@nuuday.dk"
  workload:
    annotations:
      application_service: "CIAM - test"
      spoc: "Marta Miszczyk - mmis@nuuday.dk"
    topologySpreadConstraints:
      - topologyKey: topology.kubernetes.io/zone
        maxSkew: 1
        whenUnsatisfiable: DoNotSchedule
  ingress:
    addReleaseNameToHost: none
    defaultDomain: test2.ciam.non-prod.managed-eks.aws.nuuday.nu
    annotations:
      nginx.ingress.kubernetes.io/backend-protocol: HTTPS
      nginx.ingress.kubernetes.io/proxy-body-size: 10m
      cert-manager.io/issuer: letsencrypt
  container:
    terminationGracePeriodSeconds: 120
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 15"]
    probes:
      livenessProbe:
        exec:
          command:
            - /opt/liveness.sh
        failureThreshold: 4
        initialDelaySeconds: 180
        periodSeconds: 30
        successThreshold: 1
        timeoutSeconds: 5
      readinessProbe:
        exec:
          command:
            - /opt/readiness.sh
        failureThreshold: 4
        initialDelaySeconds: 180
        periodSeconds: 120
        successThreshold: 1
        timeoutSeconds: 5
      startupProbe:
        exec:
          command:
            - /opt/liveness.sh
        failureThreshold: 4
        initialDelaySeconds: 180
        periodSeconds: 60
        timeoutSeconds: 10
pingdirectory:
  rbac:
    serviceAccountName: pd-backup-test
  cronjob:
    enabled: true
    spec:
      schedule: "0 1 * * *"
      failedJobsHistoryLimit: 1
      jobTemplate:
        spec:
          backoffLimit: 3
          template:
            spec:
              volumes:
                - name: backup-script-trigger
                  configMap:
                    name: pingdirectory-backup-trigger-1
                    defaultMode: 0755
              restartPolicy: OnFailure
              serviceAccountName: pd-backup-test
              containers:
                - name: pingdirectory-backup-cronjob
                  image: heyvaldemar/aws-kubectl:latest
                  command: ["/bin/sh"]
                  args: ["-c", "/opt/in/backup-trigger.sh"]
                  volumeMounts:
                    - name: backup-script-trigger
                      mountPath: /opt/in/backup-trigger.sh
                      subPath: backup-trigger.sh
                  resources:
                    requests:
                      cpu: 250m
                      memory: 1Gi
                    limits:
                      cpu: 500m
                      memory: 1Gi
                  securityContext:
                    allowPrivilegeEscalation: false
                    capabilities:
                      drop:
                      - ALL
  volumes:
    - name: temp
      emptyDir: {}
    - name: backup-script
      configMap:
        name: pingdirectory-backup
        defaultMode: 0755
  utilitySidecar:
    enabled: true
    volumes:
      - name: backup-script
        mountPath: /opt/in/backup.sh
        subPath: backup.sh
      - name: pingdirectory-secrets
        mountPath: /ciam-secrets/backup.pin
        subPath: backup.pin
      - name: temp
        mountPath: /opt/backup
    env:
      - name: BUCKET
        value: "dk-nuuday-ciam-backup-test"
      - name: BACKUP
        value: "backup-1"
      - name: EXPORTLDIF
        value: "export-ldif-1"
  image:
    repository: 435576480396.dkr.ecr.eu-north-1.amazonaws.com
    name: ciam-pd
    tag: latest
  enabled: true
  container:
    resources:
      requests:
        cpu: 3.5
        memory: 20Gi
      limits:
        cpu: 3.5
        memory: 20Gi
    probes:
      readinessProbe:
        failureThreshold: 20
        periodSeconds: 120
    replicaCount: 3
    envFrom:
      - secretRef:
          name: pingdirectory-secrets
          optional: false
  envs:
    ROOT_USER_PASSWORD_FILE: /ciam-secrets/root-user-password
    ENCRYPTION_PASSWORD_FILE: /ciam-secrets/encryption-password
    ADMIN_USER_PASSWORD_FILE: /ciam-secrets/admin-user-password
    USER_BASE_DN: dc=nuuday,dc=dk
    MAKELDIF_USERS: "10"
    PD_REBUILD_ON_RESTART: "false"
    PF_ENGINE_NODE: "federate-engine.test2.ciam.non-prod.managed-eks.aws.nuuday.nu"
    MAX_HEAP_SIZE: "14g"
    random: ${RANDOM_PLACEHOLDER}
    CIAM_ACA_REDIRECT_URI: https://aca.test2.ciam.non-prod.managed-eks.aws.nuuday.nu/invite
  secretVolumes:
    pingdirectory-secrets:
      items:
        CIAM_ROOT_USER_PASSWORD: /ciam-secrets/root-user-password
        CIAM_ADMIN_USER_PASSWORD: /ciam-secrets/admin-user-password
        CIAM_ENCRYPTION_PASSWORD: /ciam-secrets/encryption-password
        CIAM_BACKUP_PASSWORD: /ciam-secrets/backup.pin
    newping-license:
      items:
        pd-license-v10.lic: /opt/in/pd.profile/server-root/pre-setup/PingDirectory.lic
  ingress:
    enabled: true
    tls:
      - hosts:
          - pd.test2.ciam.nuuday.dk
          - directory._defaultDomain_
        secretName: pf-directory-2-ssl
    hosts:
      - host: pd.test2.ciam.nuuday.dk
        paths:
          - path: /scim/v2
            pathType: Prefix
            backend:
              serviceName: https
          - path: /directory/v1
            pathType: Prefix
            backend:
              serviceName: https
      - host: directory._defaultDomain_
        paths:
          - path: /scim/v2
            pathType: Prefix
            backend:
              serviceName: https
    annotations:
      nginx.ingress.kubernetes.io/backend-protocol: HTTPS
      nginx.ingress.kubernetes.io/proxy-body-size: 10m
      cert-manager.io/issuer: letsencrypt
  workload:
    type: StatefulSet
    statefulSet:
      persistentvolume:
        volumes:
          out-dir:
            persistentVolumeClaim:
              storageClassName: gp3
              accessModes:
                - ReadWriteOnce
              resources:
                requests:
                  storage: 100Gi
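The `startupProbe` settings under `global.container.probes` above put a floor on how quickly a new pod can start. The sketch below works this out, assuming these global values reach the pod spec unchanged. It is an approximation that ignores probe timeouts and kubelet scheduling jitter, and the `Probe` class is illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    initial_delay: int      # initialDelaySeconds
    period: int             # periodSeconds
    failure_threshold: int  # failureThreshold

    def first_attempt(self) -> int:
        """Earliest time (seconds after container start) the probe can run."""
        return self.initial_delay

    def give_up_after(self) -> int:
        """Approximate time after which the kubelet gives up on the probe."""
        return self.initial_delay + self.failure_threshold * self.period


# Values copied from global.container.probes.startupProbe above.
startup = Probe(initial_delay=180, period=60, failure_threshold=4)
print(f"first startup probe at {startup.first_attempt()} s")
print(f"startup window ends at about {startup.give_up_after()} s")
```

With `initialDelaySeconds: 180`, the first startup probe cannot run before 180 s, so a new pod cannot become ready in under 3 minutes regardless of how fast the server actually starts. This matches the start-up time of more than 3 minutes that we see, and is separate from the 5-minute termination delay.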

Labels: bug (Something isn't working)
