Current implementation
Below is a snippet of code from src/optimizer/hardware/profiler:
start.record()
for _ in range(10):  # Python loop
    module.launch(...)  # CPU -> GPU submission
end.record()
# ...
elapsed_ms = start.elapsed_time(end) / 10
In this snippet, the elapsed time may include idle GPU time: start.record() and end.record() bracket the whole Python loop, so any delay in CPU-side submission (Python overhead, host load) shows up as gaps between kernels and inflates the average. Verify whether this has an effect, since it will impact the degree of parallelism that we can impose on the system later.
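As a CUDA-free sketch of the failure mode (purely illustrative, no torch required): the timing window spans the entire loop, so a host-side gap before each submission is counted in the per-kernel average. The KERNEL_MS and GAP_MS values below are made up, and time.sleep stands in for kernel execution and launch overhead.

```python
import time

KERNEL_MS = 2.0  # simulated GPU kernel duration (made-up value)
GAP_MS = 1.0     # simulated CPU-side submission delay (made-up value)

def timed_loop(gap_ms: float) -> float:
    """Time 10 'launches' the way the profiler does: one window around the loop."""
    start = time.perf_counter()
    for _ in range(10):
        time.sleep(gap_ms / 1000)     # stand-in for Python/launch overhead
        time.sleep(KERNEL_MS / 1000)  # stand-in for the kernel itself
    end = time.perf_counter()
    return (end - start) * 1000 / 10  # average "per-kernel" milliseconds

with_gaps = timed_loop(GAP_MS)  # overstates per-kernel time by roughly GAP_MS
no_gaps = timed_loop(0.0)
```

The measured average with gaps exceeds the gap-free one by roughly GAP_MS per iteration, which is exactly the contamination we need to rule out (or quantify) in the real profiler.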
Proposed Solution
Using CUDA Graphs may eliminate this: they allow us to record the sequence of 10 launches once and submit them as a single unit. Once submitted, the GPU executes the entire chain without further CPU involvement, making the profiling immune to CPU load.
# 1. Warmup (initializes the CUDA context for these specific args)
module.launch(*args, **kwargs)
torch.cuda.synchronize()

# 2. Capture the graph
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    for _ in range(10):
        module.launch(*args, **kwargs)

# 3. Measure execution of the graph
start.record()
g.replay()  # single CPU call triggers all 10 runs on the GPU
end.record()
torch.cuda.synchronize()
elapsed_ms = start.elapsed_time(end) / 10
Above is a new proposed form that may be better suited for our implementation.
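To make the capture/replay semantics concrete without needing a GPU, here is a toy stand-in (purely illustrative; ToyGraph is a made-up class, not part of torch): operations are recorded once, and a single replay() call re-executes the whole sequence, so per-call dispatch cost is paid only at capture time.

```python
class ToyGraph:
    """Minimal capture/replay stand-in for torch.cuda.CUDAGraph (illustrative only)."""

    def __init__(self):
        self.ops = []  # recorded (fn, args) pairs

    def capture(self, fn, *args):
        # Record the call instead of dispatching it immediately.
        self.ops.append((fn, args))

    def replay(self):
        # One call re-executes the entire recorded sequence in order.
        return [fn(*args) for fn, args in self.ops]

g = ToyGraph()
for i in range(10):
    g.capture(pow, 2, i)  # record 10 "launches" once

results = g.replay()  # single call executes all 10 recorded ops
```

This mirrors the structure of the proposal above: the expensive per-iteration host work happens during capture, and the measured region contains only a single submission.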