Notes and scripts for AMD profiling of dycore#1047
iomaganaris wants to merge 64 commits into main
fi

# Install icon4py, gt4py, DaCe and other basic dependencies using uv
uv sync --extra all --python $(which python3.12)
I would not install all the extras. Maybe we could instead add cupy-rocm7 as a proper extra, which would make line 29 unnecessary. I can work on that.
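To make the suggestion above concrete, here is a minimal dry-run sketch of a `uv sync` call that enables only selected extras instead of `all`. The extra names `dace` and `cupy-rocm7` are placeholders for illustration; which extras the project actually defines is not confirmed in this thread. The function only prints the command and does not run it.

```shell
# Build a `uv sync` command line that enables only the named extras.
# $1 is the Python interpreter to use; the remaining arguments are extra names.
# The extra names used below are hypothetical, not taken from the project.
build_sync_cmd() {
    python_bin=$1
    shift
    cmd="uv sync"
    for extra in "$@"; do
        # `--extra` may be given several times, once per extra
        cmd="$cmd --extra $extra"
    done
    printf '%s\n' "$cmd --python $python_bin"
}

# Dry run: print the command instead of executing it.
build_sync_cmd python3.12 dace cupy-rocm7
```

If the printed command looks right, it can be pasted into the install script in place of the `--extra all` line.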
…osure_vars to fix the caching of the dycore programs
--benchmark-warmup=on \
--benchmark-warmup-iterations=30 \
--backend=dace_gpu \
--grid=icon_benchmark_regional \

Suggested change:
- --grid=icon_benchmark_regional \
+ --grid=icon_benchmark_global \
Since global is our main target for now, maybe we can switch to that.
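One way to make the grid switch cheap is to take the grid name as a parameter with the global grid as the default. This is only a sketch: the flag values are the ones from the diff above, while the function name `benchmark_flags` and the idea of parameterizing the script are assumptions, not something the PR does.

```shell
# Print the benchmark flags from the diff, one per line.
# $1 is the grid name; it defaults to the global grid suggested in the review.
benchmark_flags() {
    grid=${1:-icon_benchmark_global}
    printf '%s\n' \
        "--benchmark-warmup=on" \
        "--benchmark-warmup-iterations=30" \
        "--backend=dace_gpu" \
        "--grid=$grid"
}

# Default (global) grid, then the regional grid for comparison.
benchmark_flags
benchmark_flags icon_benchmark_regional
```

The printed flags could then be appended to the existing benchmark invocation in the Sbatch script.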
@iomaganaris, I am getting: if I do: ... install succeeds but I get CuPy errors while testing: I guess we need to use a specific version of CuPy whose dependency chain is broken?
@sfantao However, I tried it again and I get the same error as you now.
The fix in this case seems to be switching from amd-cupy to cupy-rocm-7-0==14.0.1. CuPy 14.0.1 crashes on ROCm 7.0+ with "__shfl_xor_sync: mask must be 64-bit". But if you already have a working venv, the fix is to switch to CuPy 14 and patch it: This strips the warp mask for all ROCm versions, which is safe; AMD wavefronts
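Before switching packages, it may help to check which CuPy distribution the active environment actually contains. The sketch below is not the patch mentioned in this comment; it only queries package metadata for the two distribution names mentioned above, and it assumes `python3` on the PATH is the interpreter of the venv in question. The helper name `dist_version` is made up for this example.

```shell
# Print the installed version of a distribution, or return non-zero if absent.
# Assumes `python3` resolves to the interpreter of the venv being inspected.
dist_version() {
    python3 - "$1" <<'EOF'
import sys
from importlib.metadata import PackageNotFoundError, version

try:
    print(version(sys.argv[1]))
except PackageNotFoundError:
    sys.exit(1)
EOF
}

# Distribution names as mentioned in this thread.
for dist in amd-cupy cupy-rocm-7-0; do
    if v=$(dist_version "$dist"); then
        echo "$dist $v"
    else
        echo "$dist not installed"
    fi
done
```

If both distributions show up as installed, the environment probably needs to be cleaned up before any patching is attempted.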
Hi @sfantao, I have updated this PR with the changes from @dganellari in the
Co-authored-by: Ioannis Magkanaris <ioannis.magkanaris@cscs.ch>
Mandatory Tests
Please make sure you run these tests via comment before you merge!

Optional Tests
To run benchmarks you can use:
To run tests and benchmarks with the DaCe backend you can use:
To run test levels ignored by the default test suite (mostly simple datatest for static fields computations) you can use:
For more detailed information please look at CI in the EXCLAIM universe.
This Pull Request includes scripts to benchmark and profile the dycore granule as well as one of its most time-consuming GT4Py Programs, the vertically_implicit_solver_at_predictor_step. We'll keep this PR open for interaction and keep it up to date with improvements.

The PR includes the following important files:

- AMD_INTRODUCTION.md: Includes (hopefully) all the information necessary to run the benchmark scripts for the dycore granule and the vertically_implicit_solver_at_predictor_step, as well as an introduction to icon4py, GT4Py and DaCe. There are also some suggestions on how to view and understand the generated code.
- amd_scripts/install_icon4py_venv.sh: Script to install icon4py along with all the dependencies necessary to run the profilers.
- amd_scripts/benchmark_dycore.sh: Sbatch script for Beverin to run and time the GT4Py Programs of the dycore.
- amd_scripts/benchmark_solver.sh: Sbatch script for Beverin to benchmark and profile the vertically_implicit_solver_at_predictor_step. The profiles of the kernels generated by this GT4Py Program are the most interesting to look at, as improvements there should carry over to most of the other dycore GT4Py Programs as well.

Currently based on #1018, which points to GT4Py/main (which will become GT4Py v1.1.4 in the next week).
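As a quick orientation for the two Sbatch scripts named in the description, the sketch below prints the `sbatch` command for each one that exists in the current checkout and flags any that are missing. It is a dry run only and assumes it is started from the repository root; the function name `plan_submissions` is made up for this example, and any extra `sbatch` options Beverin may require are not covered here.

```shell
# Print the sbatch command for each benchmark script that is present.
# Nothing is submitted; this only checks paths relative to the current directory.
plan_submissions() {
    for script in amd_scripts/benchmark_dycore.sh amd_scripts/benchmark_solver.sh; do
        if [ -f "$script" ]; then
            echo "sbatch $script"
        else
            echo "missing: $script"
        fi
    done
}

plan_submissions
```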