
MIBM Performance Optimizations#1157

Open
danieljvickers wants to merge 59 commits into MFlowCode:master from danieljvickers:gpu-optimizations

Conversation


@danieljvickers danieljvickers commented Feb 17, 2026

User description

Description

Following the refactor of the levelset, several performance optimizations remained to be made to the code. This PR introduces optimizations that make multi-particle MIBM viable. It also raises the upper bound on the allowed number of immersed boundaries to 1000. Performance was measured on 1-4 ranks of ACC GPU compute using A100 GPUs.

This PR also extends the optimizations to STL IBs, which should significantly improve accuracy, performance, and code cleanliness. The primary optimizations are as follows:

  • IB compute is now fully supported on the GPU. Processing of the mesh into memory has been parallelized, as have IB marker generation and levelset compute.
  • Interpolation of IB markers was removed in favor of a projection-based distance search. This means we no longer need to interpolate (saving significant pre-processing time and memory), we iterate over significantly fewer vertices (by a factor of 50 for the coarsest grids), and we get exact values for the levelset instead of approximations limited by how fine the interpolated mesh was.
  • Unified distance and normal compute. We no longer perform a distance search followed by a separate normal-vector search; both are handled in a single subroutine, cutting search time in half.
  • Deleted about 1000 lines of code related to the maintenance and compute of STL models, leaving a cleaner base to build on.
  • Modified STL generation to be cell-center based instead of volume-fraction based. This prevents edge cases where the cell center is outside the model but the point is labeled as inside the model. That error produced levelsets pointing into the immersed boundaries and caused instabilities.
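The projection-based distance search described above can be sketched in a few lines. This is an illustrative Python sketch of exact point-to-triangle projection (the actual implementation is Fortran, and these helper names are hypothetical); the levelset for a cell center is then the minimum of this distance over the mesh triangles:

```python
import math

# small vector helpers on 3-tuples
def sub(u, v): return (u[0]-v[0], u[1]-v[1], u[2]-v[2])
def add(u, v): return (u[0]+v[0], u[1]+v[1], u[2]+v[2])
def dot(u, v): return u[0]*v[0] + u[1]*v[1] + u[2]*v[2]
def scale(u, s): return (u[0]*s, u[1]*s, u[2]*s)

def closest_point_on_triangle(p, a, b, c):
    """Exact closest point on triangle abc to point p (standard region test)."""
    ab, ac, ap = sub(b, a), sub(c, a), sub(p, a)
    d1, d2 = dot(ab, ap), dot(ac, ap)
    if d1 <= 0.0 and d2 <= 0.0:
        return a                                  # vertex region a
    bp = sub(p, b)
    d3, d4 = dot(ab, bp), dot(ac, bp)
    if d3 >= 0.0 and d4 <= d3:
        return b                                  # vertex region b
    if d1*d4 - d3*d2 <= 0.0 and d1 >= 0.0 and d3 <= 0.0:
        return add(a, scale(ab, d1 / (d1 - d3)))  # edge region ab
    cp = sub(p, c)
    d5, d6 = dot(ab, cp), dot(ac, cp)
    if d6 >= 0.0 and d5 <= d6:
        return c                                  # vertex region c
    if d5*d2 - d1*d6 <= 0.0 and d2 >= 0.0 and d6 <= 0.0:
        return add(a, scale(ac, d2 / (d2 - d6)))  # edge region ac
    va = d3*d6 - d5*d4
    if va <= 0.0 and (d4 - d3) >= 0.0 and (d5 - d6) >= 0.0:
        w = (d4 - d3) / ((d4 - d3) + (d5 - d6))
        return add(b, scale(sub(c, b), w))        # edge region bc
    vb, vc = d5*d2 - d1*d6, d1*d4 - d3*d2
    denom = va + vb + vc
    return add(a, add(scale(ab, vb / denom), scale(ac, vc / denom)))  # face interior

def distance_to_triangle(p, a, b, c):
    q = closest_point_on_triangle(p, a, b, c)
    return math.sqrt(dot(sub(p, q), sub(p, q)))
```

Because the projection is exact, the levelset value carries no error from a refined interpolated mesh, which is why the search can iterate over the raw STL vertices directly.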

Type of change

  • New feature
  • Refactor
  • Other: Performance Tuning

Testing

All changes pass the IBM section of the test suite on GPUs with the NVHPC compiler. Performance was measured with a case of 1000 particles with viscosity enabled. The particles are all resolved 3D spheres given random non-overlapping positions generated by the following case file:

import json

num_cells = [240, 240, 240]
dim = [8., 8., 8.]
num_particles = 1000

# create a stationary fluid
case = {
    "run_time_info": "T",
    "parallel_io": "T",
    "m": num_cells[0]-1,
    "n": num_cells[1]-1,
    "p": num_cells[2]-1,
    "dt": 0.005,
    "t_step_start": 0,
    "t_step_stop": 2,
    "t_step_save": 1,
    "num_patches": 1,
    "model_eqns": 2,
    "alt_soundspeed": "F",
    "num_fluids": 1,
    "mpp_lim": "F",
    "mixture_err": "T",
    "time_stepper": 3,
    "recon_type": 1,
    "weno_eps": 1e-16,
    "riemann_solver": 2,
    "wave_speeds": 1,
    "avg_state": 2,
    "precision": 2,
    "format": 1,
    "prim_vars_wrt": "T",
    "E_wrt": "T",
    "viscous": "T",
    "x_domain%beg": -0.5*dim[0],
    "x_domain%end": 0.5*dim[0],
    "y_domain%beg": -0.5*dim[1],
    "y_domain%end": 0.5*dim[1],
    "z_domain%beg": -0.5*dim[2],
    "z_domain%end": 0.5*dim[2],
    "bc_x%beg": -3,
    "bc_x%end": -3,
    "bc_y%beg": -3,
    "bc_y%end": -3,
    "bc_z%beg": -3,
    "bc_z%end": -3,
    "patch_icpp(1)%geometry": 9,
    "patch_icpp(1)%z_centroid": 0.0,
    "patch_icpp(1)%length_z": dim[2],
    "patch_icpp(1)%y_centroid": 0.0,
    "patch_icpp(1)%length_y": dim[1],
    "patch_icpp(1)%x_centroid": 0.0,
    "patch_icpp(1)%length_x": dim[0],
    "weno_order": 5,
    "patch_icpp(1)%pres": 1.0,
    "patch_icpp(1)%alpha_rho(1)": 1.0,
    "patch_icpp(1)%alpha(1)": 1.0,
    "patch_icpp(1)%vel(1)": 0.0,
    "patch_icpp(1)%vel(2)": 0.0,
    "patch_icpp(1)%vel(3)": 0.0,
    "fluid_pp(1)%gamma": 2.5000000000000004,
    "fluid_pp(1)%pi_inf": 0.0,
    "fluid_pp(1)%Re(1)": 2500000,
}

import random
random.seed(42)

dx = [float(dim[i]) / float(num_cells[i]) for i in range(3)]

# set particle properties
particle_radius_cells = 5  # particle radius in grid cells
particle_cell_spacing = particle_radius_cells*2 + 5  # double the radius plus 5 cells to guarantee no image points land in other IBs
mpi_cell_spacing = particle_radius_cells + 5  # stay safely away from MPI halo regions to prevent out-of-bounds errors
generation_bounds = [6., 6., 6.]  # generate particles in this box, safely away from the boundary
velocity_magnitude = 1.

# convert non-dimensional grid-cell units to the units of the simulation
radius_units = float(particle_radius_cells) * dx[0]
particle_units_spacing = float(particle_cell_spacing) * dx[0]
particle_units_spacing_squared = particle_units_spacing**2
mpi_units_spacing = [float(mpi_cell_spacing) * dx[i] for i in range(3)]

# generate an array of xyz positions that guarantees non-overlapping particles
particles = []
while len(particles) < num_particles:
  # generate a completely random position
  position = [random.random() for i in range(3)]
  position = [(position[i] - 0.5) * generation_bounds[i] for i in range(3)]

  # optionally reject particles too close to the MPI halo regions
  valid = True
  # for i in range(3):
  #   valid = valid and abs(position[i]) >= mpi_units_spacing[i]

  # check the minimum spacing against all accepted particles, exiting early on a violation
  for particle in particles:
    distance_squared = sum((particle[i] - position[i])**2 for i in range(3))
    valid = valid and distance_squared >= particle_units_spacing_squared
    if not valid:
      break

  if valid:
    particles.append(position)
    # print(f"\rProgress: {100.*float(len(particles))/float(num_particles)}%", end="", flush=True)

# print()
# convert our array of positions to valid JSON entries for the immersed boundaries
ib_properties = {"ib": "T", "num_ibs": num_particles,}
for i in range(len(particles)):
  ib_properties[f'patch_ib({i+1})%radius'] = radius_units
  ib_properties[f'patch_ib({i+1})%slip'] = 'F'
  ib_properties[f'patch_ib({i+1})%geometry'] = 8
  ib_properties[f'patch_ib({i+1})%moving_ibm'] = 2
  ib_properties[f'patch_ib({i+1})%mass'] = 1.
  ib_properties[f'patch_ib({i+1})%x_centroid'] = particles[i][0]
  ib_properties[f'patch_ib({i+1})%y_centroid'] = particles[i][1]
  ib_properties[f'patch_ib({i+1})%z_centroid'] = particles[i][2]
  # move each particle radially outward to guarantee they never touch during the simulation
  position_mag = (sum([(particles[i][j])**2 for j in range(3)]))**0.5
  for j in range(3):
    ib_properties[f'patch_ib({i+1})%vel({j+1})'] = particles[i][j] * velocity_magnitude / position_mag

print(json.dumps({**case, **ib_properties}))

These optimizations yield a nearly 1000x speedup in the moving-IBM propagation and generation code. Prior to these optimizations, the benchmark case profiled with NVIDIA Nsight took about 45 seconds to run a single RK substep:

[Screenshot: Nsight profile before optimization (2026-02-16)]

Following these optimizations, the same profile achieves almost 50 ms per RK substep:
[Screenshot: Nsight profile after optimization (2026-02-17)]

For STLs, the optimizations were tested on an 822,000-vertex mesh of a Mach 0.4 corgi, given by this STL:
https://www.thingiverse.com/thing:4721563/files

The final simulation finished in a total of 25 minutes on a 200^3 grid for 4k time steps on a single A100 GPU. All of the code related to the STL model (file reading, preprocessing, IB marker generation, and levelset compute) took only 20 seconds of the run time. The result of that simulation can be viewed here:
https://www.youtube.com/watch?v=h44BNCKo0Hs

Checklist

  • I updated documentation if user-facing behavior changed

See the developer guide for full coding standards.

GPU changes (expand if you modified src/simulation/)
  • GPU results match CPU results
  • Tested on NVIDIA GPU or AMD GPU

Summary by CodeRabbit

  • New Features

    • GPU-accelerated, 3D-capable processing for STL/model levelset and immersed-boundary workflows.
  • Enhancements

    • Raised maximum immersed-boundary patches to 1000.
    • Packed GPU-friendly model data for faster distance/normals and inside tests.
    • Narrowed per-patch work via bounding-index domain slicing.
    • Added runtime profiling markers for performance analysis.
  • Behavior Changes

    • STL interpolation workflow removed (interpolation no longer performed).
    • MPI send/receive routine for IB buffers removed.
  • Tests/Validation

    • Updated parameter-count expectations and validation limits for larger model capacity.

CodeAnt-AI Description

GPU-accelerate STL immersed-boundary compute and support up to 1000 IBs

What Changed

  • STL models are read, transformed, packed into GPU-friendly flat arrays and uploaded so immersed-boundary (IB) marker and levelset checks run on the device instead of on CPU-side model structures
  • Levelset/inside-model queries now use GPU-ready routines (including a GPU-safe random-number step and a flattened triangle intersection path) so per-cell STL inside/distance tests execute on the GPU
  • IB patch routines and levelset application limit loops to grid regions that overlap each patch (binary-searched bounding indices), reducing the number of cells inspected for every patch
  • API and bookkeeping now handle many more IBs: the global patch limit is raised to 1000, and model metadata (counts, bounding boxes) is kept in GPU-updatable arrays
  • 2D/3D model distance and normal computation changed to projection-based nearest-point checks (exact triangle/edge projections) and triangle normals are normalized when reading STLs
  • Minor runtime observability: NVTX profiling ranges added around immersed-boundary propagation
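The bounding-index slicing described above can be illustrated with a short sketch. This is illustrative Python assuming a sorted array of cell-center coordinates (the actual implementation is Fortran, and `patch_bounds` is a hypothetical name); each patch loop then runs only over the returned index range rather than the full grid:

```python
from bisect import bisect_left, bisect_right

def patch_bounds(cell_centers, lo, hi):
    """Return the inclusive index range of cells whose centers fall in [lo, hi]."""
    i_beg = bisect_left(cell_centers, lo)       # first center >= lo
    i_end = bisect_right(cell_centers, hi) - 1  # last center <= hi
    return i_beg, i_end

# 240 cells on [-4, 4], matching the benchmark case above
x = [-4.0 + (i + 0.5) * (8.0 / 240) for i in range(240)]
i_beg, i_end = patch_bounds(x, -0.2, 0.2)  # only the cells overlapping a small sphere patch
```

For a small sphere on a 240^3 grid, this shrinks the per-patch loop from 240 cells per direction to roughly a dozen, which is where most of the per-patch savings come from.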

Impact

✅ Faster IB marker generation
✅ Lower CPU during IB setup and levelset evaluation
✅ Support up to 1000 immersed boundaries


@danieljvickers danieljvickers marked this pull request as ready for review February 19, 2026 18:39

@MFlowCode MFlowCode deleted a comment from codeant-ai bot Feb 26, 2026
@sbryngelson
Member

Claude Code Review

Head SHA: 7abaffc747841846b0e65447ef86f64bd8a70503
Files changed: 21


Summary

The PR makes significant and valuable performance improvements by moving IBM computations to the GPU with proper Fypp macro usage. The new projection-based distance computation is mathematically cleaner than the interpolation approach. GPU macro usage is correct throughout — all GPU parallelism uses GPU_PARALLEL_LOOP/END_GPU_PARALLEL_LOOP, GPU_ATOMIC, GPU_ROUTINE, GPU_DECLARE, GPU_UPDATE. No raw !$acc or !$omp directives found. Precision discipline is respected.

However, there are several issues that should be addressed before merging.


Critical Issues

1. stl_bounding_boxes allocated inside a loop — will crash with multiple STL patches

src/common/m_model.fpp, in s_instantiate_STL_models:

stl_bounding_boxes is allocated inside the do patch_id = 1, num_ibs loop with allocate(stl_bounding_boxes(patch_id, 1:3, 1:3)). On the first STL patch, this allocates a (1, 3, 3) array. When a second STL patch is encountered, the array is already allocated, causing a Fortran runtime error.

Fix: Allocate stl_bounding_boxes(1:num_ibs, 1:3, 1:3) once before the loop.

2. num_patches_max raised from 10 to 1000 — massive unintended memory inflation

src/common/m_constants.fpp, line 26.

num_patches_max controls both IB patches and initial condition patches. It sizes patch_icpp(num_patches_max) and patch_ib(num_patches_max) in both pre_process and simulation. Each ic_patch_parameters contains alter_patch(0:num_patches_max-1) (now 1000 booleans), plus many reals and strings. With 1000 entries, patch_icpp alone balloons from ~40KB to ~4MB of static data per MPI rank — even for cases with zero IBs.

Fix: Introduce a separate num_ibs_max = 1000 constant for IB patches, leaving num_patches_max at 10.

3. Division by zero in s_distance_normals_2D when point coincides with vertex

src/common/m_model.fpp, in s_distance_normals_2D, the branches for t < 0 and t > 1:

dist = sqrt((point(1) - v1(1))**2 + (point(2) - v1(2))**2)
norm = norm/dist   ! Division by zero if dist == 0

The 3D counterpart (s_distance_normals_3D) correctly guards with if (norm_mag > 0._wp), but the 2D version does not.

Fix: Add if (dist > 0._wp) guard before normalizing.
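The guard can be illustrated with a small Python sketch of the same clamped-projection logic (illustrative only; the real routine is Fortran, and `segment_distance_normal` is a hypothetical name):

```python
import math

def segment_distance_normal(point, v1, v2):
    """Distance from `point` to segment v1-v2, plus a unit vector toward the point.

    Guards against division by zero when the point coincides with the closest
    vertex, mirroring the fix suggested for s_distance_normals_2D.
    """
    ex, ey = v2[0] - v1[0], v2[1] - v1[1]
    t = ((point[0] - v1[0]) * ex + (point[1] - v1[1]) * ey) / (ex * ex + ey * ey)
    t = max(0.0, min(1.0, t))            # clamp: t < 0 -> v1, t > 1 -> v2
    cx, cy = v1[0] + t * ex, v1[1] + t * ey
    nx, ny = point[0] - cx, point[1] - cy
    dist = math.hypot(nx, ny)
    if dist > 0.0:                       # the missing guard: normalize only if nonzero
        nx, ny = nx / dist, ny / dist
    return dist, (nx, ny)
```

Without the `dist > 0.0` check, a grid point sitting exactly on a boundary vertex would divide by zero, which is the failure mode flagged in the 2D routine.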

4. Missing @:DEALLOCATE for GPU arrays allocated in s_instantiate_STL_models

src/common/m_model.fpp allocates these module-level GPU arrays with @:ALLOCATE:

  • gpu_ntrs, gpu_trs_v, gpu_trs_n, gpu_boundary_edge_count, gpu_total_vertices, gpu_boundary_v

None are deallocated anywhere, violating the @:ALLOCATE/@:DEALLOCATE pairing rule. Additionally, stl_bounding_boxes and models(:) are allocated with plain allocate but never deallocated.

Fix: Add a s_finalize_model_module routine that @:DEALLOCATEs all GPU arrays, called from s_finalize_ibm_module.


Important Issues

5. error stop used instead of call s_mpi_abort()

src/simulation/m_ibm.fpp, ~line 504:

error stop "Ghost Point and Image Point on Different Processors"

Per project rules, use call s_mpi_abort() instead.

6. Stale error message in case_validator.py

toolchain/mfc/case_validator.py, line 687:

self.prohibit(ib and (num_ibs <= 0 or num_ibs > 1000),
             "num_ibs must be between 1 and num_patches_max (10)")

Limit changed to 1000 but message still says "(10)".

7. Unbounded do while on GPU in s_compute_image_points

src/simulation/m_ibm.fpp, in s_compute_image_points:

The bounds check that prevents index from going out of range is #ifdef-disabled on GPU builds. If an image point falls outside the domain, this creates an infinite loop on the GPU or out-of-bounds memory access.

Fix: Add a bounded iteration limit and exit if exceeded.
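The shape of the suggested fix can be sketched as follows (illustrative Python; `find_cell_index` and its arguments are hypothetical names, not the actual s_compute_image_points code):

```python
def find_cell_index(coord, faces, max_iter=10_000):
    """Walk cell faces until `coord` is bracketed, with a hard iteration cap
    so a point outside the domain terminates instead of looping forever."""
    i = 0
    for _ in range(max_iter):                    # bounded: replaces an unbounded `do while`
        if faces[i] <= coord <= faces[i + 1]:
            return i
        i += 1
        if i >= len(faces) - 1:                  # ran off the end of the domain
            raise RuntimeError("image point outside the domain")
    raise RuntimeError("iteration limit exceeded in cell search")
```

On the GPU the `raise` would instead become an error flag checked after the kernel, but the key point is the bounded loop and explicit out-of-range exit.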


Minor Notes

  • Commented-out GPU_ROUTINE directives in f_model_random_number and f_model_is_inside (src/common/m_model.fpp, lines 511, 531) — clarify whether these are intentionally CPU-only.
  • s_find_ghost_points ghost point ordering is nondeterministic on GPU due to atomic capture — acceptable if downstream code is order-insensitive.
  • s_check_boundary uses copy clause on large local arrays — may be slow for large 2D models but only runs at initialization.

Verdict

The GPU offloading approach is correct and the Fypp macro usage follows project conventions. The four critical issues (loop allocation crash, memory inflation, division by zero, missing deallocations) should be fixed before merging. The important issues are lower priority but worth addressing.


Labels

size:XXL This PR changes 1000+ lines, ignoring generated files
