
add explicit CC parameters to DeviceKernel.compile_and_load #45

Open
hgt312 wants to merge 1 commit into aws-neuron:main from hgt312:feat/explicit-cc

Conversation

@hgt312 (Contributor) commented Mar 20, 2026

Description:

  • Add cc_enabled, rank_id, and world_size parameters to DeviceKernel.compile_and_load for explicit collective communication control
  • Support MPMD workloads where each rank traces/compiles independently (no rank-0 broadcast or barrier)
  • Support non-torch-distributed runtimes that manage their own ranks
  • Namespace build directories by rank in explicit CC mode to avoid concurrent write collisions
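The per-rank build-directory namespacing mentioned in the last bullet could be sketched as follows. This is illustrative only: `rank_build_dir` is a hypothetical helper, not the actual function in the PR.

```python
from pathlib import Path

def rank_build_dir(build_dir, cc_enabled, rank_id):
    """Suffix the build dir with the rank so concurrently compiling
    ranks don't write to the same directory (hypothetical sketch)."""
    base = Path(build_dir)
    if cc_enabled and rank_id is not None:
        # explicit CC mode: each rank gets its own subdirectory
        return base / f"rank_{rank_id}"
    return base
```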

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

…ile_and_load

Support MPMD workloads and non-torch-distributed runtimes by allowing
callers to pass CC parameters explicitly. When cc_enabled is set, every
rank traces and compiles independently (no rank-0 broadcast or barrier).
Build directories are namespaced by rank to avoid concurrent write
collisions.

- cc_enabled=None (default): auto-detect from torch.distributed (SPMD)
- cc_enabled=True: explicit CC with per-rank compilation (MPMD)
- cc_enabled=False: disable CC even in distributed settings

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
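The three `cc_enabled` modes in the commit message could be resolved along these lines. A minimal, self-contained sketch: `resolve_cc` and its defaults are illustrative, not the actual aws-neuron implementation; in the real SPMD path `rank_id` and `world_size` would come from `torch.distributed.get_rank()` and `torch.distributed.get_world_size()`.

```python
def resolve_cc(cc_enabled=None, distributed=False, rank_id=None, world_size=None):
    """Resolve collective-communication settings (hypothetical sketch).

    cc_enabled=None  -> auto-detect from the distributed runtime (SPMD)
    cc_enabled=True  -> explicit CC, caller supplies rank_id/world_size (MPMD)
    cc_enabled=False -> CC disabled even in distributed settings
    """
    if cc_enabled is None:
        cc_enabled = distributed  # auto-detect (SPMD path)
    if cc_enabled and rank_id is None:
        rank_id = 0  # stand-in for torch.distributed.get_rank()
    if cc_enabled and world_size is None:
        world_size = 1  # stand-in for torch.distributed.get_world_size()
    return cc_enabled, rank_id, world_size
```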
@hgt312 hgt312 requested a review from a team March 20, 2026 05:08
@vgene (Contributor) left a comment


See the idea in the function signature; it's just an idea, open to discussion.

else:
    mpmd_build_dir = build_dir

if distributed and cc_enabled is None:

Is cc_enabled the best name here? It can be confusing, since it's for controlling MPMD.

if distributed:
    # Resolve CC parameters: explicit args take priority, then torch.distributed.
    if cc_enabled is None and distributed:
        cc_enabled = True

This assignment is quite confusing.

    use_cached_if_exists=True,
    build_dir=None,
    target=CompilationTarget.DEFAULT,
    cc_enabled=None,

One idea to make the usage more explicit:

  • change rank_id and world_size to rank_id_override and world_size_override.
  • change cc_enabled to enable_cc_override?
  • add another is_mpmd flag which defaults to False

Then, in the main logic

cc_enabled = enable_cc_override or distributed
rank_id = rank_id_override or dist.get_rank()
world_size = dist.get_world_size()

The above controls which core the NEFF is loaded to.

Whether all cores compile different NEFFs, or all use rank 0's NEFF, is controlled only by is_mpmd.
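The reviewer's override scheme could be sketched as below. All names (`enable_cc_override`, `rank_id_override`, `world_size_override`, `is_mpmd`) come from the suggestion; `dist_rank`/`dist_world_size` stand in for `torch.distributed` calls. Note one detail: `rank_id_override or dist.get_rank()` would misread rank 0 as falsy, so the sketch uses an explicit `is not None` check instead.

```python
def resolve_overrides(enable_cc_override=None, rank_id_override=None,
                      world_size_override=None, is_mpmd=False,
                      distributed=False, dist_rank=0, dist_world_size=1):
    """Resolve CC settings per the proposed *_override scheme (sketch)."""
    cc_enabled = bool(enable_cc_override) or distributed
    # `is not None` rather than `or`, so rank_id_override=0 is honored
    rank_id = rank_id_override if rank_id_override is not None else dist_rank
    world_size = (world_size_override if world_size_override is not None
                  else dist_world_size)
    # is_mpmd alone decides whether every rank compiles its own NEFF
    # or all ranks reuse rank 0's NEFF
    compile_per_rank = is_mpmd
    return cc_enabled, rank_id, world_size, compile_per_rank
```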

