postStart hook commands timeout #1440

Merged: akurinnoy merged 22 commits into devfile:main from akurinnoy:postStartHookTimeout on Aug 15, 2025

@akurinnoy
Collaborator

What does this PR do?

This PR addresses postStart hook failures in DevWorkspaces: when hook commands do not exit within the timeout period, the workspace pod gets stuck in the Terminating state and is never deleted.

This PR resolves the issue by:

  • Introducing a timeout for the postStart hook. User-provided commands are now wrapped with the timeout utility, so postStart hook commands are terminated if they exceed a configurable duration. The duration is set in the DevWorkspaceOperatorConfig (a value of 0 means no timeout):
    # DevWorkspaceOperatorConfig
    # ...
    config:
      workspace:
        postStartTimeout: 30 # Timeout in seconds
  • Adding parsing logic that interprets various kubelet messages to extract the exact reason or exit code for lifecycle hook failures.
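
For readers unfamiliar with the timeout utility, here is a minimal shell sketch of the wrapping idea. It is an illustration only, not the script the operator generates: the `run_with_timeout` helper is hypothetical, it assumes GNU coreutils `timeout`, and the `--preserve-status` flag is an assumption chosen so that a timed-out command reports 143 (128 plus 15 for SIGTERM) instead of `timeout`'s default 124.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): run a command line under GNU
# coreutils `timeout`. A limit of 0 skips the wrapper entirely, mirroring
# the "0 means no timeout" rule described above.
run_with_timeout() {
  limit="$1"; shift
  if [ "$limit" -eq 0 ]; then
    sh -c "$*"
  else
    # --preserve-status: exit with the command's own status, so a command
    # killed by SIGTERM reports 128 + 15 = 143.
    timeout --preserve-status "$limit" sh -c "$*"
  fi
}

# A command that outlives its 1-second limit.
code=0
run_with_timeout 1 'sleep 5' || code=$?
echo "exit code: $code"
```

With GNU coreutils this should print `exit code: 143`; other `timeout` implementations may not support `--preserve-status`.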

What issues does this PR fix or reference?

https://issues.redhat.com/browse/CRW-8329

Is it tested? How?

  1. Install DWO from this PR:
oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: devworkspace-operator-catalog
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/okurinny/devworkspace-operator-index:postStartHookTimeout
  publisher: Red Hat
  displayName: DevWorkspace Operator Catalog
  updateStrategy:
    registryPoll:
      interval: 5m
EOF
oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: devworkspace-operator
  namespace: openshift-operators
spec:
  channel: next
  installPlanApproval: Automatic
  name: devworkspace-operator
  source: devworkspace-operator-catalog
  sourceNamespace: openshift-marketplace
EOF
  2. Create DevWorkspaceOperatorConfig with the postStart hook timeout duration (in seconds):
oc apply -f - <<EOF
apiVersion: controller.devfile.io/v1alpha1
kind: DevWorkspaceOperatorConfig
metadata:
  name: devworkspace-operator-config
  namespace: openshift-operators
config:
  workspace:
    postStartTimeout: 30
EOF
  3. Create a problematic DevWorkspace designed to have its postStart hook time out:
oc apply -f - <<EOF
apiVersion: workspace.devfile.io/v1alpha2
kind: DevWorkspace
metadata:
  name: problematic-workspace
spec:
  started: true
  template:
    components:
      - name: tools
        container:
          image: quay.io/devfile/universal-developer-image:ubi9-latest
          memoryLimit: "1Gi"
          memoryRequest: "512Mi"
          cpuRequest: "250m"
          cpuLimit: "1000m"
    commands:
      - id: sleep-infinity-cmd
        exec:
          component: tools
          commandLine: "echo 'PostStart: Starting infinite sleep...'; sleep infinity; echo 'PostStart: Sleep finished (should not be reached)'"
    events:
      postStart:
        - sleep-infinity-cmd
EOF
  4. Watch the DevWorkspace:
oc get dw problematic-workspace -w
  5. The DevWorkspace should eventually enter a Failed phase.
  6. The status.message of the DevWorkspace should provide a reason for the failure, indicating a timeout. For example: Error creating DevWorkspace deployment: Container tools has state [postStart hook] Commands terminated by SIGTERM (likely timed out after 30s). Exit code 143.
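
The 143 in the example message follows the common shell convention of reporting 128 plus the signal number when a process is killed by a signal. A small sketch for decoding such codes is below; the `describe_exit_code` function is hypothetical and not part of the operator.

```shell
#!/bin/sh
# The message itself can be read from the DevWorkspace status, e.g.:
#   oc get dw problematic-workspace -o jsonpath='{.status.message}'

# Hypothetical helper: turn an exit code into a short description.
describe_exit_code() {
  code="$1"
  if [ "$code" -eq 0 ]; then
    echo "success"
  elif [ "$code" -gt 128 ]; then
    # Codes above 128 conventionally mean "killed by signal (code - 128)".
    sig=$((code - 128))
    echo "killed by signal $sig (SIG$(kill -l "$sig"))"
  else
    echo "exited with status $code"
  fi
}

describe_exit_code 143   # killed by signal 15 (SIGTERM)
```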

PR Checklist

  • E2E tests pass (when PR is ready, comment /test v8-devworkspace-operator-e2e, v8-che-happy-path to trigger)
    • v8-devworkspace-operator-e2e: DevWorkspace e2e test
    • v8-che-happy-path: Happy path for verification integration with Che

@akurinnoy akurinnoy self-assigned this May 29, 2025
@akurinnoy akurinnoy requested review from dkwon17 and ibuziuk as code owners May 29, 2025 13:06
@akurinnoy akurinnoy requested a review from rohanKanojia May 29, 2025 13:07
@akurinnoy akurinnoy force-pushed the postStartHookTimeout branch from e90b773 to c342798 Compare May 29, 2025 13:56
Comment thread pkg/library/status/check.go Outdated
@rohanKanojia
Member

I tried the steps above and saw the problematic workspace fail with a [postStart hook] message:

oc get pods -w
NAME                                               READY   STATUS              RESTARTS   AGE
devworkspace-controller-manager-6c948bbf56-k6262   2/2     Running             0          32m
devworkspace-webhook-server-8597b84fc4-kglmf       2/2     Running             0          32m
devworkspace-webhook-server-8597b84fc4-m9rc5       2/2     Running             0          32m
workspace35712747d3d64d73-5c6dcd54dc-gvrd5         0/1     ContainerCreating   0          6s
workspace35712747d3d64d73-5c6dcd54dc-gvrd5         0/1     ContainerCreating   0          9s
workspace35712747d3d64d73-5c6dcd54dc-gvrd5         0/1     PostStartHookError   0          15s
workspace35712747d3d64d73-5c6dcd54dc-gvrd5         0/1     Terminating          0          15s
workspace35712747d3d64d73-5c6dcd54dc-gvrd5         0/1     Terminating          1 (14s ago)   28s
workspace35712747d3d64d73-5c6dcd54dc-gvrd5         0/1     Error                1             28s
workspace35712747d3d64d73-5c6dcd54dc-gvrd5         0/1     Error                1             29s
workspace35712747d3d64d73-5c6dcd54dc-gvrd5         0/1     Error                1             29s

oc get dw
NAME                    DEVWORKSPACE ID             PHASE    INFO
problematic-workspace   workspace35712747d3d64d73   Failed   Error creating DevWorkspace deployment: Container tools has state [postStart hook] Commands failed (Kubelet reported exit code 1)

@openshift-ci

openshift-ci Bot commented Jun 12, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: akurinnoy, rohanKanojia
Once this PR has been reviewed and has the lgtm label, please assign dkwon17 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Comment thread pkg/library/lifecycle/poststart.go Outdated
akurinnoy added 13 commits July 10, 2025 15:05
Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
…rate_all

Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
…ds generate_all

Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
… commands

Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
…rt hook commands

Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
@akurinnoy akurinnoy force-pushed the postStartHookTimeout branch from 61d8918 to 85046e5 Compare July 14, 2025 12:54
@openshift-ci openshift-ci Bot removed the lgtm label Jul 14, 2025
@openshift-ci

openshift-ci Bot commented Jul 14, 2025

New changes are detected. LGTM label has been removed.

Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
@akurinnoy akurinnoy force-pushed the postStartHookTimeout branch from 85046e5 to 3b0e379 Compare July 14, 2025 13:13
@akurinnoy akurinnoy marked this pull request as ready for review July 16, 2025 12:26
@openshift-ci openshift-ci Bot requested a review from dkwon17 July 16, 2025 12:26
Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
@akurinnoy akurinnoy force-pushed the postStartHookTimeout branch from cf174a0 to 2813408 Compare July 16, 2025 12:26
…disabled

Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
@codecov

codecov Bot commented Jul 16, 2025

Codecov Report

Attention: Patch coverage is 74.85380% with 43 lines in your changes missing coverage. Please review.

Project coverage is 40.00%. Comparing base (6e8009c) to head (443872c).
Report is 12 commits behind head on main.

Files with missing lines                 Patch %   Lines
pkg/library/lifecycle/poststart.go       77.89%    21 Missing ⚠️
pkg/library/status/check.go              65.00%    21 Missing ⚠️
pkg/library/container/container.go       50.00%    0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1440      +/-   ##
==========================================
+ Coverage   39.57%   40.00%   +0.42%     
==========================================
  Files         160      160              
  Lines       13186    13333     +147     
==========================================
+ Hits         5219     5334     +115     
- Misses       7590     7622      +32     
  Partials      377      377              

@dkwon17
Collaborator

dkwon17 commented Jul 21, 2025

@akurinnoy thank you for the update, but I get the fallback status message when starting problematic-workspace:

'Error creating DevWorkspace deployment: Detected unrecoverable event FailedPostStartHook: [postStart hook] failed with an unknown error (see pod events or container logs for more details)'

instead of the expected message:

Error creating DevWorkspace deployment: Container tools has state [postStart hook] Commands terminated by SIGTERM (likely timed out after 30s). Exit code 143.

Events: (screenshot omitted)

Is it expected?

Comment thread pkg/library/lifecycle/poststart.go Outdated
Comment thread pkg/library/lifecycle/poststart.go Outdated
…out is disabled

Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
@akurinnoy
Collaborator Author

@dkwon17 Hi,

> Is it expected?

I couldn't find anything in the Kubernetes docs confirming that this is expected behavior, but it seems to be the case. I also ran into it while testing this PR: for the first few runs of problematic-workspace I got the message with the exact exit code, but subsequent runs reported "failed with an unknown error."
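
The two outcomes discussed here (an exact exit code versus the generic fallback) can be pictured with a minimal sketch. Everything in it is illustrative: the `hook_failure_reason` helper is hypothetical, the "exited with N" pattern is an assumed shape for a kubelet event message, and the output strings are only modeled on the messages quoted in this thread. The operator's real parsing lives in Go and may differ.

```shell
#!/bin/sh
# Hypothetical helper: look for "exited with N" in an event message and
# fall back to a generic reason when no exit code can be found.
hook_failure_reason() {
  msg="$1"
  code=$(printf '%s\n' "$msg" \
    | sed -n 's/.*exited with \([0-9][0-9]*\).*/\1/p' \
    | head -n 1)
  if [ -z "$code" ]; then
    echo "[postStart hook] failed with an unknown error"
  elif [ "$code" -eq 143 ]; then
    echo "[postStart hook] Commands terminated by SIGTERM. Exit code 143."
  else
    echo "[postStart hook] Commands failed. Exit code $code."
  fi
}

# Made-up sample messages, for illustration only.
hook_failure_reason "hook failed - error: command exited with 143: "
hook_failure_reason "hook failed - error: rpc error"
```

The second call has no exit code to extract, so it produces the fallback text, which matches the situation described above where the event carries no usable code.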

@akurinnoy
Collaborator Author

/retest

1 similar comment
@dkwon17
Collaborator

dkwon17 commented Jul 28, 2025

/retest

Comment thread pkg/library/lifecycle/poststart.go Outdated
Comment thread apis/controller/v1alpha1/devworkspaceoperatorconfig_types.go Outdated
Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
@akurinnoy akurinnoy force-pushed the postStartHookTimeout branch from 434a115 to 2d2ad37 Compare August 4, 2025 09:14
Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
@akurinnoy
Collaborator Author

/retest

@dkwon17
Collaborator

dkwon17 commented Aug 6, 2025

After more testing, I noticed that this DW fails when the postStartTimeout is set, but succeeds when it is not set:

apiVersion: workspace.devfile.io/v1alpha2
kind: DevWorkspace
metadata:
  name: problematic-workspace
spec:
  started: false
  template:
    components:
      - name: tools
        container:
          image: quay.io/dkwon17/test:test-dir
          memoryLimit: "1Gi"
          memoryRequest: "512Mi"
          cpuRequest: "250m"
          cpuLimit: "1000m"
    commands:
      - id: test
        exec:
          workingDir: '/projects/test dir'
          component: tools
          commandLine: "mkdir mydir"
    events:
      postStart:
        - test

@akurinnoy are you able to reproduce the problem?

@akurinnoy
Collaborator Author

@dkwon17 I can reproduce this problem. It occurs because the cd command in the fallback is not being quoted correctly. Do you think we should change this behavior?

If so, I see two ways we could fix it:

  1. Correctly quote the cd ... command in the processCommandsWithoutTimeoutFallback;
  2. Unify script generation by getting rid of processCommandsWithoutTimeout.

How would you like to proceed?

@dkwon17
Collaborator

dkwon17 commented Aug 11, 2025

As discussed with @akurinnoy

  • a postStart workingDir that contains spaces should be quoted by the user, for example /projects/'test dir', so that the workspace starts and the postStart command runs successfully
  • we should keep the current behaviour so that it stays consistent with existing workspaces
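
To illustrate why the quoting matters, the sketch below assumes that workingDir is spliced verbatim into a `cd` command in the generated script. That is a reading of the discussion above, not something verified against the operator code. The `try_cd` helper is hypothetical; it creates nothing itself and only reports whether a shell can `cd` into the given text.

```shell
#!/bin/sh
# Hypothetical helper: splice $1 unquoted into a script, the way a
# generated "cd <workingDir>" line would see it.
try_cd() {
  if sh -c "cd $1" 2>/dev/null; then echo ok; else echo failed; fi
}

# Throwaway directory whose name contains a space.
base=$(mktemp -d)
mkdir "$base/test dir"

try_cd "$base/test dir"      # split into two words, so cd fails
try_cd "$base/'test dir'"    # inner quotes keep the path as one word

rm -rf "$base"
```

The first call prints `failed` and the second prints `ok`, which is the behaviour the user-side quoting in /projects/'test dir' relies on.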

@dkwon17
Collaborator

dkwon17 commented Aug 11, 2025

/retest

@akurinnoy akurinnoy merged commit 5a0fc87 into devfile:main Aug 15, 2025
11 checks passed
3 participants