get_device_context tensor goes stale if heap_bases change after init #467

@mawad-amd

Bug

get_device_context() builds a new torch.tensor from self.heap_bases.tolist() on every call (see #466). Once #466 is fixed by precomputing the tensor in __init__, the context tensor will hold a snapshot of heap_bases at construction time.

If heap_bases were to change after init (e.g., via refresh_peer_access() after a new shmem.allocate() or as_symmetric() call with a future allocator), the precomputed context tensor would contain stale base addresses. Kernels using DeviceContext would translate pointers using wrong bases, causing silent data corruption or hangs.
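The failure mode described above can be sketched without a GPU. This is only an illustration: the class name `SnapshotContext` is hypothetical, a stdlib `array` stands in for the torch tensor, and the first two context slots (`cur_rank`, `num_ranks`) are an assumed layout inferred from the bases starting at index 2.

```python
from array import array


class SnapshotContext:
    """Hypothetical stand-in for the object that owns heap_bases."""

    def __init__(self, cur_rank, heap_bases):
        self.num_ranks = len(heap_bases)
        self.heap_bases = array("Q", heap_bases)
        # Assumed layout: [cur_rank, num_ranks, base_0, ..., base_{n-1}]
        self._device_context = array("Q", [cur_rank, self.num_ranks, *heap_bases])

    def get_device_context(self):
        # Returns the context precomputed in __init__ (the post-#466 behavior).
        return self._device_context

    def refresh_peer_access(self, new_bases):
        # Updates heap_bases only; the precomputed context is left untouched.
        self.heap_bases = array("Q", new_bases)


ctx = SnapshotContext(cur_rank=0, heap_bases=[0x1000, 0x2000])
ctx.refresh_peer_access([0x1000, 0x9000])  # pretend peer 1 was remapped

stale = list(ctx.get_device_context()[2:])
current = list(ctx.heap_bases)
print("context bases:", stale, "actual bases:", current)
```

After the refresh, the context still carries the old base for peer 1, which is the value a kernel would use for pointer translation.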

Today this is not a bug: both the torch and vmem allocators produce stable heap_bases after the first refresh_peer_access(). It will become one if an allocator ever remaps peer VA ranges.

Fix

After precomputing self._device_context in __init__, add an in-place update in refresh_peer_access():

self._device_context[2:2+self.num_ranks] = self.heap_bases

No allocation, CUDAGraph safe, one line.
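A minimal sketch of the in-place update, under the same assumptions as the issue text: the class name `RefreshedContext` is hypothetical, a stdlib `array` stands in for the torch tensor, and slots 0 and 1 of the context are an assumed layout. With a real torch tensor, slice assignment likewise writes into the existing storage rather than allocating a new one.

```python
from array import array


class RefreshedContext:
    """Hypothetical stand-in; only the refresh logic mirrors the proposed fix."""

    def __init__(self, cur_rank, heap_bases):
        self.num_ranks = len(heap_bases)
        self.heap_bases = array("Q", heap_bases)
        # Assumed layout: [cur_rank, num_ranks, base_0, ..., base_{n-1}]
        self._device_context = array("Q", [cur_rank, self.num_ranks, *heap_bases])

    def get_device_context(self):
        return self._device_context

    def refresh_peer_access(self, new_bases):
        self.heap_bases = array("Q", new_bases)
        # The one-line fix: overwrite the base slots in place, same buffer object.
        self._device_context[2:2 + self.num_ranks] = self.heap_bases


ctx = RefreshedContext(cur_rank=0, heap_bases=[0x1000, 0x2000])
captured = ctx.get_device_context()  # reference a captured graph would keep
ctx.refresh_peer_access([0x1000, 0x9000])

same_object = captured is ctx.get_device_context()
bases_seen = list(captured[2:])
header = list(captured[:2])
print(same_object, bases_seen, header)
```

Because the object handed out before the refresh is the one that gets updated, anything holding the earlier reference sees the new bases without re-fetching the context.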

Component

iris/iris.py, iris/symmetric_heap.py

Labels: bug, iris
