Skip to content

Add support for nonuniform data structures#2464

Merged
dennisYatunin merged 1 commit intomainfrom
dy/nonuniform_structs
Apr 3, 2026
Merged

Add support for nonuniform data structures#2464
dennisYatunin merged 1 commit intomainfrom
dy/nonuniform_structs

Conversation

@dennisYatunin
Copy link
Copy Markdown
Member

@dennisYatunin dennisYatunin commented Mar 7, 2026

Purpose

This PR refactors src/DataLayouts/struct.jl to support nonuniform data structures.

This will allow us to store combinations of 64-bit integers and 32-bit floats within a single Field, so that we can proceed with CliMA/ClimaAtmos.jl#4336 (which addresses CliMA/ClimaParams.jl#280). More generally, this will simplify the process of generating Fields of boolean masks, integer indices, or any other data structure needed in ClimaAtmos, ClimaLand, etc.

Content

  • Avoid potential invalid basetype errors by defining default_basetype and checked_valid_basetype
    • Can now select an appropriate array element type for the result of any ClimaCore operation
  • Define bitcast_struct as a recursively unrolled extension of Core.bitcast for composite types
    • This will serve as a GPU-compatible form of reinterpret, since the built-in form only works with heap memory
  • Merge IndexLinear and IndexCartesian forms of get_struct and set_struct! into a unified implementation
  • Simplify get_struct and set_struct! by separating unrolled type manipulations from memory operations
    • All type manipulations are delegated to bitcast_struct
    • All memory reads/writes are performed in a single loop at the top level
    • This may improve inlining by the GPUCompiler, potentially fixing some related performance/latency issues
  • Simplify getindex and setindex! for DataLayouts, dropping the unused F = 1 from each CartesianIndex
  • Simplify getproperty for DataLayouts, reducing the amount of duplicate code and generated functions
  • Simplify signatures of parent_array_type and promote_parent_array_type by avoiding unused TypeVars
  • Rename typesize to num_basetypes, making it consistent with the other internals in struct.jl
  • Rename low-level method of replace_basetype to replace_type_parameter, and move it to Utilities
  • Add a few new unit tests, and reduce code duplication in preexisting unit tests
  • Fix some bugs that were uncovered after removing spurious @inbounds annotations in getindex/setindex!
    1. missing boundary conditions and unnecessary operators in the examples
    2. a missing @inbounds annotation in the 2D weak curl operator

Performance Notes

After exploring many different ways of implementing get_struct/set_struct!/bitcast_struct, I have kept the version that came closest to matching our current performance:

  • The coarse AMIP SYPD is up by around half a percent, from 0.1406 with the main branch to 0.1414 with this PR
  • Comparing ClimaAtmos's CI for its main branch against a branch that uses this PR, a few examples have deteriorated in performance, but most have improved; for example,
    1. the SYPD of Prognostic EDMF TRMM in a column (implicit) has worsened from 2.064 to 1.47
    2. the SYPD of held suarez has improved from 0.383 to 0.403
    3. the SYPD of GPU: Schar Mountain Test (2D, Float32) has improved from 0.178 to 0.195
    4. the SYPD of GPU: Prognostic EDMF aquaplanet has improved from 0.262 to 0.274
    • However, most these differences could potentially be explained by the changes in FD Operator CUDAExt #2466, so they may not be very significant
  • Comparing ClimaCore's CI for its main branch against this PR, there is a similar mix of positive and negative cases; for example,
    1. the walltimes of all Unit: operator matrices (GPU) unit tests have improved by around 10%
    2. the walltimes of most Unit: operator matrices (CPU) unit tests have improved by 10--20%, while the walltimes of a few others have worsened by nearly a factor of 2
    3. the walltime of Schar Mountain total energy (topography mesh interface) has worsened from 221 to 369
    4. the walltime of 3D sphere dry Held-Suarez (ρe) has improved from 625 to 586

The implementation I've chosen is limited to a single generated function (excluding compile-time string interpolations), bitcast_struct, which is defined in a way that forces the LLVM compiler to inline array reads without too much of a latency penalty. This seems to be a key feature of our original implementation, which was full of recursive generated functions that attempt to force inlining. For the specific case of get_struct and set_struct!, inlining of array read instructions is needed to avoid unreasonably large stack memory allocations, since we rely on the LLVM compiler to eliminate read instructions and corresponding allocations for unused values (e.g., most of the LocalGeometry struct fields).

There is probably a better way to force inlining than the generated function I used here, but I think we'll need to implement more rigorous unit tests before we can figure out what that is. We could drawn inspiration from Julia's own unit tests for inlining, or from the analogous tests in CUDA.jl. I compared around a dozen different alternatives for this PR, and I chose the one that happened to produce the best overall results without an excessive latency penalty. But, if we want to guarantee optimal performance across all examples, we'll need to get a better understanding of the compiler's inlining heuristics.

Checklist

  • Code follows the style guidelines OR N/A.
  • Unit tests are included OR N/A.
  • Code is exercised in an integration test OR N/A.
  • Documentation has been added/updated OR N/A.

@dennisYatunin dennisYatunin force-pushed the dy/nonuniform_structs branch 17 times, most recently from d8b7216 to 03c322e Compare March 14, 2026 05:42
@dennisYatunin dennisYatunin force-pushed the dy/nonuniform_structs branch 11 times, most recently from b923c85 to 972935f Compare March 18, 2026 17:46
@dennisYatunin dennisYatunin marked this pull request as ready for review March 18, 2026 20:34
@dennisYatunin dennisYatunin force-pushed the dy/nonuniform_structs branch 2 times, most recently from c492a62 to 5e2bcbf Compare March 20, 2026 18:19
Copy link
Copy Markdown
Member

@nefrathenrici nefrathenrici left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I missed them, but it would be good to exercise bitcast_struct for non-uniform structs before merging.

Comment thread ext/cuda/operators_sem_shmem.jl Outdated
Comment thread ext/cuda/operators_sem_shmem.jl Outdated
Comment thread src/DataLayouts/bitcast_struct.jl Outdated
Comment thread src/DataLayouts/bitcast_struct.jl Outdated
@dennisYatunin dennisYatunin force-pushed the dy/nonuniform_structs branch 24 times, most recently from 5826347 to 4ab281d Compare March 28, 2026 21:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants