Skip to content

Add fixes for Delta PrgEnv-nvidia#1258

Merged
sbryngelson merged 2 commits intomasterfrom
DeltaFix
Feb 24, 2026
Merged

Add fixes for Delta PrgEnv-nvidia#1258
sbryngelson merged 2 commits intomasterfrom
DeltaFix

Conversation

@wilfonba
Copy link
Contributor

@wilfonba wilfonba commented Feb 24, 2026

User description

Description

Fix support for Delta's "new" PrgEnv-nvidia environment.

Type of change

  • Bug fix

Testing

Ran examples/3D_performance_test with GPUs using the updated CMake and template.


CodeAnt-AI Description

Fix Delta PrgEnv-nvidia support and NVHPC LTO/IPO handling

What Changed

  • When running with GPUs on the Delta template, MPICH GPU-Direct support is disabled so MPI runs no longer attempt GPU-Direct behavior that caused failures on PrgEnv-nvidia systems.
  • The Delta module list now selects the PrgEnv-nvidia environment for GPU builds and uses the default python module instead of a hard-coded python version, aligning module selection with Delta's environment.
  • Build-time detection now treats NVHPC 24.11 through 25.9 as incompatible with two-pass LTO/IPO: users receive a clear message and two-pass IPO is turned off for those compiler versions to avoid degraded or failing builds.

Impact

✅ Fewer runtime MPI failures on Delta GPU nodes
✅ Fewer build failures with NVHPC 24.x compilers
✅ Correct module selection for Delta GPU workflows

💡 Usage Guide

Checking Your Pull Request

Every time you make a pull request, our system automatically looks through it. We check for security issues, mistakes in how you're setting up your infrastructure, and common code problems. We do this to make sure your changes are solid and won't cause any trouble later.

Talking to CodeAnt AI

Got a question or need a hand with something in your pull request? You can easily get in touch with CodeAnt AI right here. Just type the following in a comment on your pull request, and replace "Your question here" with whatever you want to ask:

@codeant-ai ask: Your question here

This lets you have a chat with CodeAnt AI about your pull request, making it easier to understand and improve your code.

Example

@codeant-ai ask: Can you suggest a safer alternative to storing this secret?

Preserve Org Learnings with CodeAnt

You can record team preferences so CodeAnt AI applies them in future reviews. Reply directly to the specific CodeAnt AI suggestion (in the same thread) and replace "Your feedback here" with your input:

@codeant-ai: Your feedback here

This helps CodeAnt AI learn and adapt to your team's coding style and standards.

Example

@codeant-ai: Do not flag unused imports.

Retrigger review

Ask CodeAnt AI to review the PR again, by typing:

@codeant-ai: review

Check Your Repository Health

To analyze the health of your code repository, visit our dashboard at https://app.codeant.ai. This tool helps you identify potential issues and areas for improvement in your codebase, ensuring your repository maintains high standards of code health.

Summary by CodeRabbit

  • Chores
    • Updated GPU toolchain environment and Python module dependencies
    • Enhanced NVHPC compiler version compatibility handling for optimization settings (versions 24.11–25.9)
    • Modified GPU-Direct MPI configuration for GPU-enabled builds

Copilot AI review requested due to automatic review settings February 24, 2026 15:33
@codeant-ai
Copy link
Contributor

codeant-ai bot commented Feb 24, 2026

CodeAnt AI is reviewing your PR.


Thanks for using CodeAnt! 🎉

We're free for open-source projects. if you're enjoying it, help us grow by sharing.

Share on X ·
Reddit ·
LinkedIn

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 24, 2026

📝 Walkthrough

Walkthrough

Modifies NVHPC compiler configuration to disable LTO/IPO for specific versions, updates Python and GPU toolchain module specifications, and adjusts Delta environment setup to conditionally disable GPU-Direct MPI when GPU acceleration is enabled.

Changes

Cohort / File(s) Summary
Build System Configuration
CMakeLists.txt
Adds version-specific conditional logic for NVHPC versions 24.11–25.9 to disable two-pass IPO when using Unified memory, indicating LTO/IPO is unsupported in that range.
Toolchain Module Specification
toolchain/modules
Replaces pinned Python 3.11.6 with generic python entry and swaps GPU toolchain from nvhpc/24.1 cuda/12.3.0 openmpi/4.1.5+cuda to PrgEnv-nvidia/8.6.0, affecting loaded GPU environment.
Delta Environment Template
toolchain/templates/delta.mako
Removes hard-coded Delta library path export and adds conditional block to disable GPU-Direct MPI (MPICH_GPU_SUPPORT_ENABLED=0) when GPU acceleration is enabled.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

Suggested labels

Review effort 3/5, size:S

Suggested reviewers

  • sbryngelson

Poem

🐰 The toolchain hops with nimble care,
NVHPC's quirks we now declare,
GPU Direct sleeps when enabled with grace,
Delta's libraries find their new place,
A rabbit's build now runs so fair! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Description check ❓ Inconclusive The pull request description provides a clear summary and identifies it as a bug fix, but lacks required documentation updates and GPU testing confirmations from the template. Confirm whether GPU results match CPU results, specify which GPU(s) were tested (NVIDIA/AMD), and clarify if documentation updates are needed for the Delta environment changes.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately reflects the main objective of the PR—fixing support for Delta's PrgEnv-nvidia environment across CMake, modules, and templates.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch DeltaFix

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codeant-ai codeant-ai bot added the size:XS This PR changes 0-9 lines, ignoring generated files label Feb 24, 2026
@codeant-ai
Copy link
Contributor

codeant-ai bot commented Feb 24, 2026

Nitpicks 🔍

🔒 No security issues identified
⚡ Recommended areas for review

  • Possible Bug
    The template unconditionally exports MPICH_GPU_SUPPORT_ENABLED=0 whenever GPUs are enabled. That disables MPICH GPU-Direct globally in the job environment even if the loaded MPI implementation is not MPICH (or if no MPI is used). This can unexpectedly change runtime behavior for non-MPICH stacks or mixed environments.

  • Module Resolution / Runtime Libraries
    The GPU entry switches to PrgEnv-nvidia/8.6.0 without explicitly listing CUDA or an MPI implementation. Depending on how PrgEnv-nvidia configures the environment, required CUDA or MPI runtime libraries (and LD_LIBRARY_PATH adjustments) might not be present, causing runtime/migration issues that the removed LD_LIBRARY_PATH workaround previously hid.

  • Compatibility Risk
    The module list now references a generic python (no version). Relying on an unpinned python can load an incompatible interpreter across environments and may break scripts that expect a specific minimum Python version.

@codeant-ai
Copy link
Contributor

codeant-ai bot commented Feb 24, 2026

CodeAnt AI finished reviewing your PR.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates MFC's configuration for NCSA Delta's newer PrgEnv-nvidia environment, migrating from manual module loading to a Cray-style programming environment and adding workarounds for NVHPC compiler issues.

Changes:

  • Migrate Delta GPU configuration from explicit module loading (nvhpc/24.1, cuda/12.3.0, openmpi/4.1.5+cuda) to PrgEnv-nvidia/8.6.0 meta-module
  • Disable GPU-Direct MPI support on Delta to work around stability issues
  • Add NVHPC version check to disable broken LTO/IPO in versions 24.11-25.9

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
toolchain/templates/delta.mako Replace manual LD_LIBRARY_PATH export with MPICH_GPU_SUPPORT_ENABLED=0 for GPU builds
toolchain/modules Switch Delta GPU configuration to PrgEnv-nvidia/8.6.0 and simplify python module loading
CMakeLists.txt Add version check to disable LTO/IPO for NVHPC versions 24.11 through 25.9

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
toolchain/templates/delta.mako (1)

39-41: Scope GPU-Direct disablement to MPI runs and confirm MPI flavor.

Line 39-41 sets an MPICH-specific knob for all GPU jobs. Consider gating this behind mpi (and/or detected MPI flavor) to avoid no-op or unintended effects in non-MPICH/non-MPI runs, and please confirm this is the correct control for Delta’s PrgEnv-nvidia MPI stack.

🔧 Example gating
-% if gpu_enabled:
-    export MPICH_GPU_SUPPORT_ENABLED=0 # Disable GPU-Direct MPI
-% endif
+% if gpu_enabled and mpi:
+    export MPICH_GPU_SUPPORT_ENABLED=0 # Disable GPU-Direct MPI
+% endif
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@toolchain/templates/delta.mako` around lines 39 - 41, The template currently
unconditionally exports MPICH_GPU_SUPPORT_ENABLED when gpu_enabled is true;
change this so the MPICH-specific export is only emitted for MPI runs (e.g.,
when an mpi flag/variable is set or when detected MPI flavor equals "mpich"),
and verify that MPICH_GPU_SUPPORT_ENABLED is the correct knob for Delta’s
PrgEnv-nvidia MPI stack; update the delta.mako template to check both
gpu_enabled and an mpi/detected_mpi_flavor variable before exporting
MPICH_GPU_SUPPORT_ENABLED and add a comment noting the MPI flavor check and that
this is specific to MPICH to avoid affecting non-MPICH or non-MPI jobs.
toolchain/modules (1)

60-60: Ensure the default Python module meets the 3.10+ requirement.

Line 60 now loads an unpinned python module. If the cluster default is <3.10, toolchain scripts may break; consider pinning to a known-good module or adding a version check in the load path.

Based on learnings: toolchain/**/*.py must be compatible with Python 3.10+; do not use from future imports or other backwards-compatibility shims.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@toolchain/modules` at line 60, The unpinned module load ("d-all python") can
pick a system Python older than 3.10; change this to load a pinned interpreter
or add a runtime version guard: either replace the unpinned load with a specific
module (e.g., "d-all python/3.10" or higher) or keep the unpinned load but add a
check in the toolchain bootstrap that inspects sys.version_info and fails with a
clear error if Python < (3,10); update any bootstrap/init logic that calls
"d-all python" and document the required module version for toolchain/**/*.py
compatibility.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@CMakeLists.txt`:
- Around line 241-244: The version check for NVHPC IPO currently uses exclusive
bounds but the message claims an inclusive range; update the elseif condition to
use inclusive comparisons so the logic matches the text: change the condition on
CMAKE_Fortran_COMPILER_VERSION to VERSION_GREATER_EQUAL "24.11" AND
VERSION_LESS_EQUAL "25.9" (leaving the message and set(NVHPC_USE_TWO_PASS_IPO
FALSE) as-is), or alternatively adjust the message text to say "greater than
24.11 and less than 25.9" if you prefer to keep exclusive bounds.

---

Nitpick comments:
In `@toolchain/modules`:
- Line 60: The unpinned module load ("d-all python") can pick a system Python
older than 3.10; change this to load a pinned interpreter or add a runtime
version guard: either replace the unpinned load with a specific module (e.g.,
"d-all python/3.10" or higher) or keep the unpinned load but add a check in the
toolchain bootstrap that inspects sys.version_info and fails with a clear error
if Python < (3,10); update any bootstrap/init logic that calls "d-all python"
and document the required module version for toolchain/**/*.py compatibility.

In `@toolchain/templates/delta.mako`:
- Around line 39-41: The template currently unconditionally exports
MPICH_GPU_SUPPORT_ENABLED when gpu_enabled is true; change this so the
MPICH-specific export is only emitted for MPI runs (e.g., when an mpi
flag/variable is set or when detected MPI flavor equals "mpich"), and verify
that MPICH_GPU_SUPPORT_ENABLED is the correct knob for Delta’s PrgEnv-nvidia MPI
stack; update the delta.mako template to check both gpu_enabled and an
mpi/detected_mpi_flavor variable before exporting MPICH_GPU_SUPPORT_ENABLED and
add a comment noting the MPI flavor check and that this is specific to MPICH to
avoid affecting non-MPICH or non-MPI jobs.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c0da7ca and afe6058.

📒 Files selected for processing (3)
  • CMakeLists.txt
  • toolchain/modules
  • toolchain/templates/delta.mako

@github-actions
Copy link

Claude Code Review

Files changed: 3

  • CMakeLists.txt
  • toolchain/modules
  • toolchain/templates/delta.mako

Summary

  • Replaces hardcoded NVHPC 24.1 + OpenMPI 4.1.5 Delta GPU modules with PrgEnv-nvidia/8.6.0 (Cray PE module that loads NVHPC, MPICH, and CUDA together).
  • Removes the stale hardcoded LD_LIBRARY_PATH workaround and replaces it with MPICH_GPU_SUPPORT_ENABLED=0 to prevent GPU-Direct MPI failures under PrgEnv-nvidia.
  • Adds a new CMake guard to disable two-pass LTO/IPO for NVHPC versions 24.11–25.9 (exclusive) which are reportedly broken with -Mextract/-Minline.
  • Switches Delta's Python module from pinned python/3.11.6 to the system default python.
  • Drops the explicit cmake module from the Delta GPU module list.

Findings

1. CMakeLists.txt:241 — Version boundary is off by one (lower bound)

elseif (CMAKE_Fortran_COMPILER_VERSION VERSION_GREATER "24.11" AND CMAKE_Fortran_COMPILER_VERSION VERSION_LESS "25.9")
    message(STATUS "LTO/IPO is not supported in NVHPC Version 24.11 to 25.9. ...")

VERSION_GREATER "24.11" is strictly greater than 24.11, so NVHPC 24.11 itself falls through to the else() branch and gets NVHPC_USE_TWO_PASS_IPO TRUE. The status message says "24.11 to 25.9" and the PR description says "24.11 through 25.9 as incompatible." If 24.11 is actually affected, the condition should be VERSION_GREATER_EQUAL "24.11". If 24.11 is known good (consistent with the fallback text "Use <=24.11 && > 23.11"), then the message should say "strictly greater than 24.11 through < 25.9" — not "24.11 to 25.9."

Similarly, VERSION_LESS "25.9" excludes 25.9 itself (it goes to else(), gets IPO enabled), which matches the guidance "Use >=25.9." The upper bound is correct.

Recommendation: Decide definitively whether 24.11 is broken. If broken: change VERSION_GREATERVERSION_GREATER_EQUAL. If not: fix the status message to say > 24.11 instead of 24.11.


2. toolchain/modules — cmake dropped from d-gpu without a replacement

Before:

d-gpu nvhpc/24.1 cuda/12.3.0 openmpi/4.1.5+cuda cmake

After:

d-gpu PrgEnv-nvidia/8.6.0

The cmake module that was previously loaded explicitly is no longer listed. Other MFC systems that use Cray PEs (e.g., cc-cpu on Carpenter-Cray, tuo-gpu) include cmake separately, and non-PE systems like Oscar (o-gpu cmake/3.26.3) list it explicitly. If PrgEnv-nvidia/8.6.0 does not pull in cmake, MFC builds will fail with "cmake not found." Please confirm cmake is available in the environment after loading this PE, or add cmake to d-gpu as a safety net.


3. CMakeLists.txt:242 — Status message is unclear

"LTO/IPO is not supported in NVHPC Version 24.11 to 25.9. Use >=25.9 or (<=24.11 && > 23.11) Performance will be degraded."

The parenthetical (<=24.11 && > 23.11) uses programming syntax in a user-facing message and is missing a period before "Performance." Consider:

"LTO/IPO is disabled for NVHPC 24.11–25.9 (exclusive) due to known IPO bugs. Upgrade to >=25.9 or use 23.11–24.11 for best performance."

4. delta.mako — MPICH_GPU_SUPPORT_ENABLED=0 disables GPU-Direct with no comment on performance impact

The new block disables GPU-Direct MPI for all GPU runs:

export MPICH_GPU_SUPPORT_ENABLED=0 # Disable GPU-Direct MPI

This forces CPU-staging of GPU buffers during MPI communication, which can significantly hurt strong-scaling on multi-node GPU runs. The inline comment is terse. Consider expanding it:

# PrgEnv-nvidia on Delta uses MPICH without GPU-Direct support; disabling prevents
# runtime errors at the cost of extra host<->device copies during MPI communication.

Improvement opportunities (no blocking issues)

  • Pinned PE vs. pinned components: PrgEnv-nvidia/8.6.0 pins the PE version but hides which exact NVHPC/CUDA/MPI versions are active. If NCSA updates PE 8.6.0's backing modules, behavior may change silently. Consider adding a comment in toolchain/modules documenting which NVHPC version PE 8.6.0 provides (e.g., # PrgEnv-nvidia/8.6.0 → nvhpc 24.7, cuda 12.x, cray-mpich).

  • python without version (toolchain/modules): Other Delta-adjacent systems use pinned Python versions. While unversioned python reduces maintenance burden, it could silently break if NCSA changes the default. This is a minor risk and acceptable, but worth documenting.

  • NVHPC_USE_TWO_PASS_IPO FALSE is redundant (CMakeLists.txt:243): The variable starts unset (falsy) in all the other non-enabling branches (MFC_Unified, < 23.11). Explicitly setting FALSE here is consistent with intent but inconsistent with the other branches. Either set it explicitly in all branches for clarity, or drop it from the new branch to match the existing style.

@github-actions
Copy link

Claude Code Review

3 files changed: CMakeLists.txt, toolchain/modules, toolchain/templates/delta.mako

Summary

  • Adds a CMake version-range guard to disable LTO/IPO for NVHPC 24.11–25.9, which have a known compiler defect
  • Upgrades Delta GPU environment from individual modules (nvhpc/24.1 cuda/12.3.0 openmpi/4.1.5+cuda cmake) to the unified PrgEnv-nvidia/8.6.0 meta-module
  • Relaxes the Python module pin from python/3.11.6 to the unversioned python default
  • Replaces a hardcoded, path-specific LD_LIBRARY_PATH workaround (referencing the now-removed nvhpc-24.1 spack installation) with a conditional MPICH_GPU_SUPPORT_ENABLED=0 to disable GPU-Direct MPI on Delta GPU builds
  • All changes are system-configuration fixes; no Fortran source, solver logic, or shared src/common/ code is touched

Findings

No blocking issues found. Checked for bugs and CLAUDE.md compliance.

Three non-blocking improvement opportunities:

1. Slightly imprecise version-range message — CMakeLists.txt

The new elseif uses strict CMake comparators (VERSION_GREATER "24.11" and VERSION_LESS "25.9"), so NVHPC 24.11 and 25.9 themselves fall through to the working else() branch and get LTO enabled. The code behaviour is correct and consistent with the recommendation text ("Use >=25.9 or (<=24.11 && > 23.11)"). However, the leading phrase "LTO/IPO is not supported in NVHPC Version 24.11 to 25.9" implies those endpoints are also broken. Consider rephrasing to "NVHPC Versions after 24.11 and before 25.9" to eliminate ambiguity.

Relevant lines:

MFC/CMakeLists.txt

Lines 239 to 241 in afe6058

if (MFC_Unified)
message(STATUS "LTO/IPO is not available with NVHPC using Unified Memory")
elseif (CMAKE_Fortran_COMPILER_VERSION VERSION_GREATER "24.11" AND CMAKE_Fortran_COMPILER_VERSION VERSION_LESS "25.9")

2. Asymmetric MPICH_GPU_SUPPORT_ENABLED handling — toolchain/templates/delta.mako

The new block sets MPICH_GPU_SUPPORT_ENABLED=0 only for GPU builds and is silent for CPU builds. Other templates (frontier.mako, tuo.mako) set this variable explicitly for both branches (1 for GPU, 0 for CPU). Since CPU runs on Delta never need GPU-Direct MPI, the omission is functionally fine, but adding an explicit % else: branch would make the intent self-documenting and consistent with the rest of the template codebase.

% if gpu_enabled:
    export MPICH_GPU_SUPPORT_ENABLED=0 # Disable GPU-Direct MPI
% else:
    export MPICH_GPU_SUPPORT_ENABLED=0
% endif

(Or simply unconditionally export MPICH_GPU_SUPPORT_ENABLED=0 if it should always be disabled on Delta.)

Relevant lines:

% if gpu_enabled:
export MPICH_GPU_SUPPORT_ENABLED=0 # Disable GPU-Direct MPI

3. Possibly redundant compiler env vars — toolchain/modules

The CC=nvc CXX=nvc++ FC=nvfortran lines were retained verbatim from the pre-PrgEnv-nvidia era. PrgEnv-nvidia/8.6.0 is a Cray-style programming-environment meta-module that typically sets these variables itself. Worth verifying (during the next Delta session) that these explicit overrides are still necessary and don't conflict with what PrgEnv-nvidia sets, to keep the module file minimal.

Relevant lines:

MFC/toolchain/modules

Lines 61 to 62 in afe6058

d-cpu gcc/11.4.0 openmpi
d-gpu PrgEnv-nvidia/8.6.0


🤖 Reviewed by Claude Code (claude-sonnet-4-6)

@codecov
Copy link

codecov bot commented Feb 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 44.05%. Comparing base (df28255) to head (eee7a24).
⚠️ Report is 17 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1258   +/-   ##
=======================================
  Coverage   44.05%   44.05%           
=======================================
  Files          70       70           
  Lines       20496    20496           
  Branches     1989     1989           
=======================================
  Hits         9030     9030           
  Misses      10328    10328           
  Partials     1138     1138           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@sbryngelson sbryngelson merged commit d5123c3 into master Feb 24, 2026
27 of 33 checks passed
@sbryngelson sbryngelson deleted the DeltaFix branch February 24, 2026 19:43
@github-actions
Copy link

Claude Code Review

No issues found. Checked for bugs and CLAUDE.md compliance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:XS This PR changes 0-9 lines, ignoring generated files

Development

Successfully merging this pull request may close these issues.

3 participants