Notes and scripts for AMD profiling of dycore by iomaganaris · Pull Request #1047 · C2SM/icon4py

iomaganaris · 2026-02-05T16:19:24Z

This Pull Request includes scripts to benchmark and profile the dycore granule as well as one of its most time-consuming GT4Py Programs, the vertically_implicit_solver_at_predictor_step.

We'll keep this PR open for interaction and keep it up-to-date with improvements.

The PR includes the following important files:

AMD_INTRODUCTION.md: Includes (hopefully) all the information necessary to run the benchmark scripts for the dycore granule and the vertically_implicit_solver_at_predictor_step, as well as an introduction to icon4py, GT4Py and DaCe. There are also some suggestions on how to view and understand the generated code.
amd_scripts/install_icon4py_venv.sh: Script to install icon4py along with all the dependencies necessary to run the profilers
amd_scripts/benchmark_dycore.sh: Sbatch script for Beverin to run and time the GT4Py Programs of the dycore
amd_scripts/benchmark_solver.sh: Sbatch script for Beverin to benchmark and profile the vertically_implicit_solver_at_predictor_step. The profiles of the kernels generated by this GT4Py Program are the most interesting place to look, as improvements there should also improve performance across most of the other dycore GT4Py Programs

Currently based on #1018, which points to GT4Py/main (which will become GT4Py v1.1.4 within the next week).

havogt · 2026-02-06T10:14:32Z

+fi
+
+# Install icon4py, gt4py, DaCe and other basic dependencies using uv
+uv sync --extra all --python $(which python3.12)


I would not install all the extras; instead, maybe we should properly add cupy-rocm7 as an extra to avoid line 29. I can work on that.

…osure_vars to fix the caching of the dycore programs

havogt · 2026-02-09T08:04:54Z

+    --benchmark-warmup=on \
+    --benchmark-warmup-iterations=30 \
+    --backend=dace_gpu \
+    --grid=icon_benchmark_regional \


Suggested change

-    --grid=icon_benchmark_regional \
+    --grid=icon_benchmark_global \

Since global is our main target for now, maybe we can switch to that.

sfantao · 2026-04-13T17:28:00Z

@iomaganaris, I am getting:

Resolved 258 packages in 6ms
  × Failed to download `amd-cupy==13.5.1`
  ╰─▶ Hash mismatch for `amd-cupy==13.5.1`

      Expected:
        sha256:de3138281e2711e06efaf49a31310d0d4824998e18d43e13e288a0e52ca75ec0

      Computed:
        sha256:c3f586d3786fb02a606148a6cc5662411bb143fa6f2fc7533ce1d98b77a1ed1c
  help: `amd-cupy` (v13.5.1) was included because `icon4py[rocm7-0]` (v0.0.6) depends on `icon4py-common[rocm7-0]` (v0.0.6) which depends on `amd-cupy`

If I do:

    sed -i 's/de3138281e2711e06efaf49a31310d0d4824998e18d43e13e288a0e52ca75ec0/c3f586d3786fb02a606148a6cc5662411bb143fa6f2fc7533ce1d98b77a1ed1c/g' uv.lock
    bash amd_scripts/install_icon4py_venv.sh

... the install succeeds, but I get CuPy errors while testing:

============================= test session starts ==============================
platform linux -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0 -- /capstor/store/cscs/director2/g174/sfantao/icon4py/icon4py/.venv/bin/python3
cachedir: .pytest_cache
benchmark: 5.1.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=100 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=True warmup_iterations=30)
rootdir: /capstor/store/cscs/director2/g174/sfantao/icon4py/icon4py
configfile: pyproject.toml
plugins: factoryboy-2.7.0, dash-4.1.0, Faker-33.1.0, xdist-3.6.1, devtools-0.12.2, cov-6.0.0, mpi-0.6, benchmark-5.1.0, unused-fixtures-0.2.0
collecting ... collected 1 item

model/atmosphere/dycore/tests/dycore/integration_tests/test_benchmark_solve_nonhydro.py::test_benchmark_solve_nonhydro[True-False] ERROR

==================================== ERRORS ====================================
_________ ERROR at setup of test_benchmark_solve_nonhydro[True-False] __________

self = <cupy.cuda.compiler._NVRTCProgram object at 0x14ab043be0f0>
options = ('--std=c++14', '-I/capstor/store/cscs/director2/g174/sfantao/icon4py/icon4py/.venv/lib/python3.12/site-packages/cupy/...rdu63qmj6/lib/clang/20/include/cuda_wrappers', '-I/usr/lib64/gcc/x86_64-suse-linux/14/../../../../include/c++/14', ...)
log_stream = None

  def compile(self, options=(), log_stream=None):
      try:
          if self.name_expressions:
              for ker in self.name_expressions:
                  nvrtc.addNameExpression(self.ptr, ker)
>           nvrtc.compileProgram(self.ptr, options)

.venv/lib/python3.12/site-packages/cupy/cuda/compiler.py:757: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
cupy_backends/cuda/libs/nvrtc.pyx:125: in cupy_backends.cuda.libs.nvrtc.compileProgram
  ???
cupy_backends/cuda/libs/nvrtc.pyx:138: in cupy_backends.cuda.libs.nvrtc.compileProgram
  ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   cupy_backends.cuda.libs.nvrtc.NVRTCError: HIPRTC_ERROR_COMPILATION (6)

cupy_backends/cuda/libs/nvrtc.pyx:53: NVRTCError

During handling of the above exception, another exception occurred:

request = <SubRequest 'grid_manager' for <Function test_benchmark_solve_nonhydro[True-False]>>
backend_like = {'backend_factory': <function make_custom_dace_backend at 0x14ab0a1767a0>, 'device': <DeviceType.ROCM: 10>}

  @pytest.fixture(scope="session")
  def grid_manager(
      request: pytest.FixtureRequest, backend_like: model_backends.BackendLike
  ) -> gm.GridManager | None:
      """
      Fixture for providing a grid_manager instance.
  
      The provided grid instance is based on the configuration specified in the
      pytest command line option `--grid <grid_name>:<grid_levels>`, where `<grid_name>`
      might refer to a known grid configuration or to an existing ICON NetCDF grid file,
      and `<grid_levels>` specifies the number of vertical levels to use (optional).
      """
      name, num_levels = _evaluate_grid_option(request)
  
      allocator = model_backends.get_allocator(backend_like)
  
      if name in VALID_GRID_PRESETS:
>           grid_manager = _get_grid_manager_from_preset(
              name, num_levels=num_levels, allocator=allocator
          )

model/testing/src/icon4py/model/testing/fixtures/stencil_tests.py:91: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
model/testing/src/icon4py/model/testing/fixtures/stencil_tests.py:57: in _get_grid_manager_from_preset
  return grid_utils.get_grid_manager_from_identifier(
model/testing/src/icon4py/model/testing/grid_utils.py:49: in get_grid_manager_from_identifier
  return get_grid_manager(
model/testing/src/icon4py/model/testing/grid_utils.py:74: in get_grid_manager
  manager(allocator=allocator, keep_skip_values=keep_skip_values)
model/common/src/icon4py/model/common/grid/grid_manager.py:133: in __call__
  self._geometry = self._read_geometry_fields(allocator)
model/common/src/icon4py/model/common/grid/grid_manager.py:278: in _read_geometry_fields
  gridfile.GeometryName.EDGE_CELL_DISTANCE.value: gtx.as_field(
.venv/lib/python3.12/site-packages/gt4py/next/constructors.py:281: in as_field
  field[...] = field.array_ns.asarray(data)
.venv/lib/python3.12/site-packages/gt4py/next/embedded/nd_array_field.py:365: in __setitem__
  self._ndarray[target_slice] = value  # type: ignore[index] # np and cp allow index assignment, jax overrides
cupy/_core/core.pyx:1688: in cupy._core.core._ndarray_base.__setitem__
  ???
cupy/_core/_routines_indexing.pyx:51: in cupy._core._routines_indexing._ndarray_setitem
  ???
cupy/_core/_routines_indexing.pyx:1027: in cupy._core._routines_indexing._scatter_op
  ???
cupy/_core/_kernel.pyx:1374: in cupy._core._kernel.ufunc.__call__
  ???
cupy/_core/_kernel.pyx:1401: in cupy._core._kernel.ufunc._get_ufunc_kernel
  ???
cupy/_core/_kernel.pyx:1082: in cupy._core._kernel._get_ufunc_kernel
  ???
cupy/_core/_kernel.pyx:94: in cupy._core._kernel._get_simple_elementwise_kernel
  ???
cupy/_core/_kernel.pyx:82: in cupy._core._kernel._get_simple_elementwise_kernel_from_code
  ???
cupy/_core/core.pyx:2396: in cupy._core.core.compile_with_cache
  ???
.venv/lib/python3.12/site-packages/cupy/cuda/compiler.py:558: in _compile_module_with_cache
  return _compile_with_cache_hip(
.venv/lib/python3.12/site-packages/cupy/cuda/compiler.py:979: in _compile_with_cache_hip
  binary, mapping = compile_using_nvrtc(
.venv/lib/python3.12/site-packages/cupy/cuda/compiler.py:397: in compile_using_nvrtc
  return _compile(source, options, cu_path,
.venv/lib/python3.12/site-packages/cupy/cuda/compiler.py:381: in _compile
  compiled_obj, mapping = prog.compile(options, log_stream)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <cupy.cuda.compiler._NVRTCProgram object at 0x14ab043be0f0>
options = ('--std=c++14', '-I/capstor/store/cscs/director2/g174/sfantao/icon4py/icon4py/.venv/lib/python3.12/site-packages/cupy/...rdu63qmj6/lib/clang/20/include/cuda_wrappers', '-I/usr/lib64/gcc/x86_64-suse-linux/14/../../../../include/c++/14', ...)
log_stream = None

  def compile(self, options=(), log_stream=None):
      try:
          if self.name_expressions:
              for ker in self.name_expressions:
                  nvrtc.addNameExpression(self.ptr, ker)
          nvrtc.compileProgram(self.ptr, options)
          mapping = None
          if self.name_expressions:
              mapping = {}
              for ker in self.name_expressions:
                  mapping[ker] = nvrtc.getLoweredName(self.ptr, ker)
          if log_stream is not None:
              log_stream.write(nvrtc.getProgramLog(self.ptr))
          # This is to ensure backwards compatibility with nvrtc
          if self.method == 'cubin':
              return nvrtc.getCUBIN(self.ptr), mapping
          elif self.method == 'ptx':
              return nvrtc.getPTX(self.ptr), mapping
          # TODO(leofang): support JIT LTO using nvrtc.getNVVM()?
          # need -dlto and -arch=compute_XX
          else:
              raise RuntimeError('Unknown NVRTC compile method')
      except nvrtc.NVRTCError:
          log = nvrtc.getProgramLog(self.ptr)
>           raise CompileException(log, self.src, self.name, options,
                                 'nvrtc' if not runtime.is_hip else 'hiprtc')
E           cupy.cuda.compiler.CompileException: In file included from /tmp/comgr-433362/input/tmp/tmpzr6gi4m6/f6c932f8af819fb6fd038aa82113ecb87630145a.hsaco.cu:2:
E           In file included from /capstor/store/cscs/director2/g174/sfantao/icon4py/icon4py/.venv/lib/python3.12/site-packages/cupy/_core/include/cupy/carray.cuh:6:
E           /usr/lib64/gcc/x86_64-suse-linux/14/../../../../include/c++/14/initializer_list:41:15: error: expected '{'
E              41 | namespace std _GLIBCXX_VISIBILITY(default)
E                 |               ^
E           /usr/lib64/gcc/x86_64-suse-linux/14/../../../../include/c++/14/initializer_list:41:15: error: a type specifier is required for all declarations
E           /usr/lib64/gcc/x86_64-suse-linux/14/../../../../include/c++/14/initializer_list:41:35: error: expected expression
E              41 | namespace std _GLIBCXX_VISIBILITY(default)
E                 |                                   ^
E           /usr/lib64/gcc/x86_64-suse-linux/14/../../../../include/c++/14/initializer_list:41:43: error: expected ';' after top level declarator
E              41 | namespace std _GLIBCXX_VISIBILITY(default)
E                 |                                           ^
E           In file included from /tmp/comgr-433362/input/tmp/tmpzr6gi4m6/f6c932f8af819fb6fd038aa82113ecb87630145a.hsaco.cu:2:
E           /capstor/store/cscs/director2/g174/sfantao/icon4py/icon4py/.venv/lib/python3.12/site-packages/cupy/_core/include/cupy/carray.cuh:438:32: error: no template named 'initializer_list' in namespace 'std'
E             438 |                     const std::initializer_list<Int> shape)
E                 |                           ~~~~~^
E           /capstor/store/cscs/director2/g174/sfantao/icon4py/icon4py/.venv/lib/python3.12/site-packages/cupy/_core/include/cupy/carray.cuh:446:32: error: no template named 'initializer_list' in namespace 'std'
E             446 |                     const std::initializer_list<Int1> shape,
E                 |                           ~~~~~^
E           /capstor/store/cscs/director2/g174/sfantao/icon4py/icon4py/.venv/lib/python3.12/site-packages/cupy/_core/include/cupy/carray.cuh:447:32: error: no template named 'initializer_list' in namespace 'std'
E             447 |                     const std::initializer_list<Int2> strides)
E                 |                           ~~~~~^
E           /capstor/store/cscs/director2/g174/sfantao/icon4py/icon4py/.venv/lib/python3.12/site-packages/cupy/_core/include/cupy/carray.cuh:485:39: error: no template named 'initializer_list' in namespace 'std'
E             485 |   __device__ T& operator[](const std::initializer_list<Int> idx_) {
E                 |                                  ~~~~~^
E           /capstor/store/cscs/director2/g174/sfantao/icon4py/icon4py/.venv/lib/python3.12/site-packages/cupy/_core/include/cupy/carray.cuh:493:45: error: no template named 'initializer_list' in namespace 'std'
E             493 |   __device__ const T& operator[](const std::initializer_list<Int> idx_) const {
E                 |                                        ~~~~~^
E           /capstor/store/cscs/director2/g174/sfantao/icon4py/icon4py/.venv/lib/python3.12/site-packages/cupy/_core/include/cupy/carray.cuh:635:32: error: no template named 'initializer_list' in namespace 'std'
E             635 |                     const std::initializer_list<int> shape,
E                 |                           ~~~~~^
E           /capstor/store/cscs/director2/g174/sfantao/icon4py/icon4py/.venv/lib/python3.12/site-packages/cupy/_core/include/cupy/carray.cuh:636:32: error: no template named 'initializer_list' in namespace 'std'
E             636 |                     const std::initializer_list<int> strides)
E                 |                           ~~~~~^
E           /capstor/store/cscs/director2/g174/sfantao/icon4py/icon4py/.venv/lib/python3.12/site-packages/cupy/_core/include/cupy/carray.cuh:708:34: error: no template named 'initializer_list' in namespace 'std'
E             708 |   __device__ CIndexer(const std::initializer_list<Int> shape)
E                 |                             ~~~~~^
E           /capstor/store/cscs/director2/g174/sfantao/icon4py/icon4py/.venv/lib/python3.12/site-packages/cupy/_core/include/cupy/carray.cuh:715:34: error: no template named 'initializer_list' in namespace 'std'
E             715 |   __device__ CIndexer(const std::initializer_list<Int1> shape,
E                 |                             ~~~~~^
E           /capstor/store/cscs/director2/g174/sfantao/icon4py/icon4py/.venv/lib/python3.12/site-packages/cupy/_core/include/cupy/carray.cuh:716:34: error: no template named 'initializer_list' in namespace 'std'
E             716 |                       const std::initializer_list<Int2> index)
E                 |                             ~~~~~^
E           /capstor/store/cscs/director2/g174/sfantao/icon4py/icon4py/.venv/lib/python3.12/site-packages/cupy/_core/include/cupy/carray.cuh:794:34: error: no template named 'initializer_list' in namespace 'std'
E             794 |   __device__ CIndexer(const std::initializer_list<int> shape)
E                 |                             ~~~~~^
E           /capstor/store/cscs/director2/g174/sfantao/icon4py/icon4py/.venv/lib/python3.12/site-packages/cupy/_core/include/cupy/carray.cuh:800:34: error: no template named 'initializer_list' in namespace 'std'
E             800 |   __device__ CIndexer(const std::initializer_list<int> shape,
E                 |                             ~~~~~^
E           /capstor/store/cscs/director2/g174/sfantao/icon4py/icon4py/.venv/lib/python3.12/site-packages/cupy/_core/include/cupy/carray.cuh:801:34: error: no template named 'initializer_list' in namespace 'std'
E             801 |                       const std::initializer_list<int> index)
E                 |                             ~~~~~^
E           17 errors generated when compiling for gfx942.

.venv/lib/python3.12/site-packages/cupy/cuda/compiler.py:776: CompileException
=============================== warnings summary ===============================
.venv/lib/python3.12/site-packages/_pytest/config/__init__.py:1278
/capstor/store/cscs/director2/g174/sfantao/icon4py/icon4py/.venv/lib/python3.12/site-packages/_pytest/config/__init__.py:1278: PytestConfigWarning: assertions not in test modules or plugins will be ignored because assert statements are not executed by the underlying Python interpreter (are you using python -O?)

  self._warn_about_missing_assertion(mode)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
ERROR model/atmosphere/dycore/tests/dycore/integration_tests/test_benchmark_solve_nonhydro.py::test_benchmark_solve_nonhydro[True-False]
========================= 1 warning, 1 error in 9.28s ==========================

I guess we need to use a specific version of CuPy whose dependency chain is broken?
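To separate a CuPy installation problem from an icon4py one, a tiny standalone check may help. The sketch below is an assumption about what exercises the same path as the traceback (an index assignment on a device array, which makes CuPy compile an elementwise kernel at runtime); the function name `cupy_smoke_test` is hypothetical and not part of icon4py or CuPy.

```python
def cupy_smoke_test():
    """Return a short status string instead of raising, so it can run anywhere."""
    try:
        import cupy as cp
    except ImportError as exc:
        return f"cupy not importable: {exc}"
    try:
        # Index assignment is the operation that failed in the traceback
        # (gt4py's as_field ends up in ndarray.__setitem__).
        field = cp.zeros(4, dtype=cp.float64)
        field[...] = cp.asarray([1.0, 2.0, 3.0, 4.0])
        total = float(field.sum())
    except Exception as exc:  # e.g. CompileException or a runtime/driver error
        return f"cupy failed: {type(exc).__name__}: {exc}"
    return "ok" if total == 10.0 else f"unexpected result: {total}"


if __name__ == "__main__":
    print(cupy_smoke_test())
```

If this already fails inside the venv, the problem is in the CuPy/ROCm setup rather than in the benchmark itself.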

philip-paul-mueller · 2026-04-14T17:08:25Z

@sfantao
I saw the download issue as well; after a while I realized that the problem is the index itself.
I found out that, depending on where (CSCS or my home) I visited https://pypi.amd.com/rocm-7.0.2/simple from, I got different pages.
I suspect that there are multiple servers (load balancing) and one of them contains a bad (for whatever reason) index.
However, after modifying the lock (as you did) I was able to install and run it normally.

However, I tried it again and now I get the same error as you.
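One way to tell which copy of the wheel a given server handed out is to download it manually and hash it locally. The following is a minimal sketch, not part of the PR: the two digests are copied from the uv error message above, while the helper names and the wheel path are hypothetical.

```python
import hashlib
from pathlib import Path

# Digests copied from the uv "Hash mismatch" message earlier in this thread.
LOCKED_DIGEST = "de3138281e2711e06efaf49a31310d0d4824998e18d43e13e288a0e52ca75ec0"
COMPUTED_DIGEST = "c3f586d3786fb02a606148a6cc5662411bb143fa6f2fc7533ce1d98b77a1ed1c"


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large wheels do not need to fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def classify_wheel(path):
    """Say which of the two known digests (if any) a downloaded file matches."""
    actual = sha256_of(path)
    if actual == LOCKED_DIGEST:
        return "matches uv.lock"
    if actual == COMPUTED_DIGEST:
        return "matches the digest uv computed"
    return f"unknown digest: {actual}"


# Usage (hypothetical file name):
# print(classify_wheel("amd_cupy-13.5.1-...whl"))
```

Running this on wheels fetched from different networks would show whether the servers really serve different files, which is only a suspicion at this point.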

dganellari · 2026-04-15T07:59:29Z

@sfantao

The fix in this case seems to be switching from amd-cupy to cupy-rocm-7-0==14.0.1.

Unpatched, CuPy 14.0.1 crashes on ROCm 7.0+ with "__shfl_xor_sync: mask must be 64-bit", so it needs a small patch after installation.
I will soon update the amd_scripts/install_icon4py_venv.sh script to apply the patch automatically after uv sync.

But if you already have a working venv, the fix is to switch to CuPy 14 and patch it:

pip uninstall amd-cupy -y
pip install cupy-rocm-7-0==14.0.1

sed -i 's/#if (HIP_VERSION < 60200000) || defined(HIP_DISABLE_WARP_SYNC_BUILTINS)/#if 1/' \
  .venv/lib/python3.12/site-packages/cupy/_core/include/cupy/hip_workaround.cuh

This strips the warp mask for all ROCm versions, which is safe: AMD wavefronts
execute in lock-step, so the mask is unused. Upstream fix: cupy/cupy#9748
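Note that sed exits successfully even when the pattern does not match, so a header that has changed upstream would be left unpatched without any warning. As a hedged alternative, the same replacement can be sketched in Python with an explicit check; the function name `patch_hip_workaround` is hypothetical, and the guard string is copied from the sed command above.

```python
from pathlib import Path

# Preprocessor guard targeted by the sed command above, and its replacement.
OLD_GUARD = "#if (HIP_VERSION < 60200000) || defined(HIP_DISABLE_WARP_SYNC_BUILTINS)"
NEW_GUARD = "#if 1"


def patch_hip_workaround(header_path):
    """Replace the version guard in the header; return True if anything changed.

    Unlike the sed command, this replaces every occurrence in the file, which
    is equivalent as long as the guard appears at most once per line.
    """
    path = Path(header_path)
    text = path.read_text()
    if OLD_GUARD not in text:
        # Already patched, or the header differs from what the patch expects.
        return False
    path.write_text(text.replace(OLD_GUARD, NEW_GUARD))
    return True


# Usage, with the header path taken from the sed command above:
# header = ".venv/lib/python3.12/site-packages/cupy/_core/include/cupy/hip_workaround.cuh"
# if not patch_hip_workaround(header):
#     print("guard not found; header already patched or changed")
```

Returning a boolean lets an install script decide whether a missing guard should be treated as an error.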

iomaganaris · 2026-04-15T09:36:17Z

Hi @sfantao,

I have updated this PR with @dganellari's changes to the icon4py installation scripts, so installing cupy-rocm-7-0==14.0.1 and patching it works out of the box.
I have also updated this branch to the icon4py main branch and did the same for the gt4py branch it uses.

Co-authored-by: Ioannis Magkanaris <ioannis.magkanaris@cscs.ch>

github-actions · 2026-04-23T09:54:15Z

Mandatory Tests

Please make sure you run these tests via comment before you merge!

cscs-ci run default
cscs-ci run distributed

Optional Tests

To run benchmarks you can use:

cscs-ci run benchmark-bencher

To run tests and benchmarks with the DaCe backend you can use:

cscs-ci run dace

To run test levels ignored by the default test suite (mostly simple datatest for static fields computations) you can use:

cscs-ci run extra

For more detailed information please look at CI in the EXCLAIM universe.

edopao and others added 11 commits January 29, 2026 10:52

update gt4py version

be4cee4

switch gt4py branch

6ecff32

update uv lock

1c9c744

edit import metrics

a1e753f

switch gt4py branch

b45b9b1

edit import metrics

517d122

edit import metrics

672b4f0

Merge branch 'main' into update_dace_version

9b2662d

Update DaCe version

532c125

Update the gt4py commit

991b6b8

Initial amd notes and scripts

f194d83

iomaganaris requested a review from havogt February 5, 2026 16:19

havogt and others added 5 commits February 5, 2026 17:20

Pre-compilation fix with_backend

1eb4708

Fixes to the notes

30fe86c

Additional comments in the scripts

4d13d82

Fix gtx_metrics

81e7a24

Clean up setup script

47e5e48