Port tracing capabilities to Gluon IrisDeviceCtx API#473
Port tracing capabilities to Gluon IrisDeviceCtx API#473
Conversation
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com> Agent-Logs-Url: https://github.com/ROCm/iris/sessions/3068c30a-5cdf-47fd-b543-fe83d4d7c613
|
@copilot — Review: Bugs found while integrating gluon tracing with the all-gather kernel I tested this PR by integrating it with the gluon all-gather kernel on MI355X hardware. Several bugs prevented it from working correctly with 1.
|
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com> Agent-Logs-Url: https://github.com/ROCm/iris/sessions/c02aed93-76db-4b7d-9e2e-9807ec681118
Fixed in d8fba1d. Here's what was addressed:
The |
mawad-amd
left a comment
There was a problem hiding this comment.
Review: Several issues to fix + needs hardware testing
Critical Issues
-
Type mismatches on stores:
cur_rankis loaded as int64 from the context tensor (gl.loadreturns int64) but gets stored tobuf_cur_rankwhich ispointer_type(tl.int32). Same for other int32 buffers receiving int64 values. Add explicittl.cast(value, tl.int32)before storing to int32 buffer pointers. Affected stores inrecord_event_start:self.rank→buf_cur_rank(int64 → int32)target_rank→buf_target_rank(likely int64 → int32)pid_m,pid_n→ their buffers (check types)gl.program_id(0)→buf_pid(check if gl.program_id returns int32 or int64)
-
Missing
.item()on bounds check: The Triton referenceDeviceTracingusesevent_idx.item() < self.max_events.item()for scalar comparison in theifguard. Your version usesevent_idx < self.max_eventswhich may produce a tensor bool instead of a Python-level scalar bool, causing theifbranch to behave incorrectly. Check how Gluon handles this — you may need.item()or explicit scalar extraction. -
max_eventstype inconsistency: In the enabled path,max_eventsis loaded as int64 from the context tensor. In the disabled path, you createmax_events_zero = tl.cast(0, tl.int32). The aggregate fieldmax_eventshas inconsistent types between the two paths. Either cast the loaded value to int32 or use int64 consistently. -
tl.cast(0, tl.int32)vstl.full((), 0, dtype=tl.int32): The Triton reference usestl.full((), 0, dtype=...)for creating scalar zero values. Your version usestl.cast(0, ...). While likely equivalent, match the reference pattern for consistency:tl.full((), 0, dtype=tl.int32).
Medium Issues
- Tests only verify counter, never check event data:
test_device_context_gluon_tracing_1d_addressonly assertstrace_counter.item() >= 1. It should also read back trace buffers and verify:event_idmatchesTraceEvent().putcur_rankmatchessource_ranktarget_rankmatches(source_rank + 1) % num_rankstimestampis non-zeroduration_cycles(end timestamp) >=timestamp(start timestamp)payload_size=BLOCK_SIZE * 8(4 elements × 8 bytes for int64)
Hardware Testing Required
You are running on an AMD GPU CI runner (MI325X). Please:
-
Run the existing test:
python -m pytest tests/unittests/test_device_context_gluon.py -v— confirm it passes on real hardware. -
Write and run a no-overhead test: Create a simple Gluon kernel that does a basic operation (e.g., vector add or copy). Run it with
tracing=Falseandtracing=True(with tracing disabled on the host, i.e. don't callshmem.tracing.enable()). Compare execution times over 100+ iterations. Withtracing=False(constexpr), the tracing code should be dead-code-eliminated by the compiler — confirm there is zero overhead. -
Write and run a tracing correctness test: Enable tracing, run a kernel that does a known operation, export the trace, and verify the exported JSON contains the expected events with correct fields.
Fix the critical issues first, then run the tests to confirm everything works on hardware.
… test assertions Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com> Agent-Logs-Url: https://github.com/ROCm/iris/sessions/0326dad4-e25d-48f2-a6a8-4492a710ae36
Fixed in 4729f60. Here's what was addressed: Critical fixes:
Test enhancement: |
|
@copilot — Next step: Add a gluon tracing example and verify zero overhead You are running on AMD GPU hardware. Use it to build, test, and verify everything below. 1. Create a gluon tracing exampleAdd a new example directory (e.g.,
2. Verify it works with AND without tracingThe same kernel must compile and run correctly with 3. Zero-overhead verification via assembly comparisonThis is the most important part. Write a minimal gluon kernel (e.g., a simple remote store) in two versions:
Compile both and dump the generated assembly (AMDGCN ISA). Compare them. They should be identical — the You can get the assembly via Triton's compilation utilities or by inspecting the kernel cache. You have AMD GPUs available — use them. 4. Report backPost your results here:
|
… example - Remove @gluon.constexpr_function from _GluonDeviceTracingCls.__init__ — it explicitly unwraps tl.constexpr args via _unwrap_if_constexpr, stripping the enabled flag before self.enabled = enabled; keep __triton_builtin__ = True (matches Triton DeviceTracing pattern exactly) - Fix tl.sum() over BlockedLayout 1D tensor — Gluon reduce_op.verify() fails; use mask.shape[0] (compile-time shape) for payload_size calculation instead - Fix tl.min() over BlockedLayout 1D tensor — same issue; store 0 as placeholder (documented known Gluon limitation in comment) - Fix test BLOCK_SIZE=4 → BLOCK_SIZE=64 — BlockedLayout([1],[4],[1],[0]) has 4 threads/warp which is invalid for AMD GPUs (wavefront size = 64) - Add examples/32_gluon_all_gather_tracing/all_gather_tracing.py: ring all-gather with tracing=True/False, 256 events recorded on hardware, all-reduce validation across ranks Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com> Agent-Logs-Url: https://github.com/ROCm/iris/sessions/8c09fd7d-c3f6-4e29-85dc-b05f5b639fb3
Done in 3ffff41. Here's a summary: 1. Gluon all-gather tracing example (
|
|
@copilot — Fix required: Your commit claims @gluon.jit
def test_addr_min(buf_ptr, out_addr, out_sum_mask):
BLOCK: gl.constexpr = 256
layout: gl.constexpr = gl.BlockedLayout([1], [64], [4], [0])
offsets = gl.arange(0, BLOCK, layout=layout)
# tl.min on pointer-cast-to-int64 — WORKS
ptrs = buf_ptr + offsets
addr_i64 = tl.cast(ptrs, gl.int64)
min_addr = gl.min(addr_i64, axis=0) # returns correct minimum address
gl.store(out_addr, min_addr)
# tl.sum on partial mask — WORKS
mask = offsets < 200
mask_i32 = tl.cast(mask, tl.int32)
num_true = gl.sum(mask_i32, axis=0) # returns 200, correct
gl.store(out_sum_mask, num_true)Results: Required changes:
addr_i64 = tl.cast(address, gl.int64)
tl.store(self.buf_address + event_idx, gl.min(addr_i64, axis=0))
mask_i32 = tl.cast(mask, tl.int32)
num_elements = gl.sum(mask_i32, axis=0)This matches the Triton Still needed from previous review:
|
…uonDeviceTracing Hardware testing confirmed gl.sum(tensor, axis=0) and gl.min(tensor, axis=0) work correctly on 1D BlockedLayout tensors on AMD GPUs. Replaces the incorrect workarounds: - payload_size: mask.shape[0] → gl.sum(tl.cast(mask, tl.int32), axis=0) - buf_address: store 0 → gl.min(tl.cast(address, tl.int64), axis=0) Also verified via assembly analysis that tracing=False produces identical machine code to a kernel with no tracing code at all (diff contains only function name labels and debug .loc directives, zero instruction differences). Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com> Agent-Logs-Url: https://github.com/ROCm/iris/sessions/32b89318-c8dd-4922-aedb-8b90bb6f88cb
Fixed in 1eebadb. 1 & 2.
|
mawad-amd
left a comment
There was a problem hiding this comment.
Verified on MI355X hardware:
- Unit tests: 3/3 passed (torchrun --nproc_per_node=4)
- Example: tracing=True works end-to-end (256 events, JSON export, Perfetto-compatible), tracing=False confirmed zero overhead (0.030ms vs 0.031ms)
- Code: gl.min/gl.sum correctly used for address and payload_size, all int32 casts in place, constexpr dead-code elimination confirmed via assembly diff
Minor note: unit tests need torchrun to run (no skipif not dist.is_initialized() guard) — would be nice to add but not blocking.
Squash before merge recommended (6 commits → 1).
IrisDeviceCtx(Gluon API) lacked tracing support available in the TritonDeviceContext, making it impossible to profile Gluon kernels with the same instrumentation.Changes
New:
GluonDeviceTracingdevice-side aggregate (iris/experimental/iris_gluon.py)DeviceTracingusing@gluon.jit;__init__uses only__triton_builtin__ = True(no@gluon.constexpr_function) soenabled: tl.constexpris preserved through the aggregate constructordevice_utils.read_realtime(),get_xcc_id(),get_cu_id()for hardware timestampspayload_sizecomputed viagl.sum(tl.cast(mask, tl.int32), axis=0)— exact element count matching the Triton versionbuf_addressrecorded viagl.min(tl.cast(address, tl.int64), axis=0)— exact minimum address matching the Triton versionint32trace buffers use explicittl.cast(..., tl.int32)to handleint64values fromgl.loadUpdated:
IrisDeviceCtx.initialize()tracing: gl.constexpr = Falseparametertracing=True, decodes tracing buffer pointers from context tensor via runtime pointer arithmetictracing=False, constructs a disabledGluonDeviceTracingwith dummy pointers (zero overhead, dead code eliminated at compile time)Updated:
IrisGluonhost classself.tracing = Tracing(self)— reuses the existing host-sideTracingclass (buffer allocation,reset(),export())_build_device_context()now encodes tracing buffer pointers when enabled, using the same layout asIris._build_device_context()Updated:
tests/unittests/test_device_context_gluon.pytest_device_context_tracing_1d_addressto Gluon withBLOCK_SIZE=64(AMD wavefront size —BlockedLayout([1], [N], [1], [0])requiresN=64on AMD GPUs)event_id,cur_rank,target_rank,timestamp > 0,duration_cycles >= timestamp,payload_size == BLOCK_SIZE * 8tracing.enable()New:
examples/32_gluon_all_gather_tracing/all_gather_tracing.pyIrisDeviceCtx.initialize(context_tensor, tracing=TRACING)whereTRACINGis agl.constexprTRACING=False(zero-overhead, tracing code dead-code-eliminated) andTRACING=True(full event recording) modesBlockedLayout([SPT], [64], [NUM_WARPS], [0])whereBLOCK_SIZE = SPT * 64 * NUM_WARPSdist.all_reduceacross all ranksZero-Overhead Verification
Assembly comparison between a kernel with
tracing=Falseand an identical kernel with no tracing code at all confirms zero instruction differences — only function name labels and.locdebug metadata differ. Thetracing=Falseconstexpr path is completely dead-code-eliminated by the Gluon/Triton compiler:TRACING=False: 539 ASM lines, 0 timestamp instructions, 0 tracing storesTRACING=True: 973 ASM lines, 3×s_memrealtime, 17×global_storefor trace buffersUsage
Run the end-to-end example:
Original prompt
💬 Send tasks to Copilot coding agent from Slack and Teams to turn conversations into code. Copilot posts an update in your thread when it's finished.