auto node remediation fixes and npd documentation update #466

Merged
sajmera-pensando merged 2 commits into ROCm:main from biluriuday:cp-anr5 on Mar 11, 2026


@biluriuday biluriuday commented Mar 11, 2026

Motivation

Technical Details

Cherry-picked the following PRs from the internal repo:
https://github.com/pensando/gpu-operator/pull/1168
https://github.com/pensando/gpu-operator/pull/1179

Test Plan

Test Result

Submission Checklist

* update node problem detector documentation

* update NPD documentation page

(cherry picked from commit fa9da15)

* anr fixes

* enable sim e2es

* modify maxparallel workflows data type

* handle delete remediation reconcile when enable is false

* address review comments

(cherry picked from commit 84d0752)
@sajmera-pensando sajmera-pensando merged commit 184875c into ROCm:main Mar 11, 2026
3 checks passed
@biluriuday biluriuday deleted the cp-anr5 branch March 12, 2026 06:32
