Current implementation
Below is a snippet of code from src/optimizer/hardware/profiler:
start.record()
for _ in range(10):  # Python loop
    module.launch(...)  # CPU -> GPU submission
end.record()
# ...
elapsed_ms = start.elapsed_time(end) / 10
In this snippet, the elapsed time may include idle GPU time: start.record() and end.record() bracket the whole Python loop, so any delay in CPU-side submission (Python overhead, host load) shows up as gaps between kernels and inflates the average. Verify whether this has an effect, since it will impact the degree of parallelism that we can impose on the system later.
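As a CUDA-free sketch of the failure mode (purely illustrative, no torch required): the timing window spans the entire loop, so a host-side gap before each submission is counted in the per-kernel average. The KERNEL_MS and GAP_MS values below are made up, and time.sleep stands in for kernel execution and launch overhead.

```python
import time

KERNEL_MS = 2.0  # simulated GPU kernel duration (made-up value)
GAP_MS = 1.0     # simulated CPU-side submission delay (made-up value)

def timed_loop(gap_ms: float) -> float:
    """Time 10 'launches' the way the profiler does: one window around the loop."""
    start = time.perf_counter()
    for _ in range(10):
        time.sleep(gap_ms / 1000)     # stand-in for Python/launch overhead
        time.sleep(KERNEL_MS / 1000)  # stand-in for the kernel itself
    end = time.perf_counter()
    return (end - start) * 1000 / 10  # average "per-kernel" milliseconds

with_gaps = timed_loop(GAP_MS)  # overstates per-kernel time by roughly GAP_MS
no_gaps = timed_loop(0.0)
```

The measured average with gaps exceeds the gap-free one by roughly GAP_MS per iteration, which is exactly the contamination we need to rule out (or quantify) in the real profiler.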
Proposed Solution
Using CUDA Graphs may eliminate this: they allow us to record the sequence of 10 launches once and submit them as a single unit. Once submitted, the GPU executes the entire chain without further CPU involvement, making the profiling immune to CPU load.
# 1. Warmup (initializes the CUDA context for these specific args)
module.launch(*args, **kwargs)
torch.cuda.synchronize()

# 2. Capture the graph
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    for _ in range(10):
        module.launch(*args, **kwargs)

# 3. Measure execution of the graph
start.record()
g.replay()  # single CPU call triggers all 10 runs on the GPU
end.record()
torch.cuda.synchronize()
elapsed_ms = start.elapsed_time(end) / 10
Above is a new proposed form that may be better suited for our implementation.
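To make the capture/replay semantics concrete without needing a GPU, here is a toy stand-in (purely illustrative; ToyGraph is a made-up class, not part of torch): operations are recorded once, and a single replay() call re-executes the whole sequence, so per-call dispatch cost is paid only at capture time.

```python
class ToyGraph:
    """Minimal capture/replay stand-in for torch.cuda.CUDAGraph (illustrative only)."""

    def __init__(self):
        self.ops = []  # recorded (fn, args) pairs

    def capture(self, fn, *args):
        # Record the call instead of dispatching it immediately.
        self.ops.append((fn, args))

    def replay(self):
        # One call re-executes the entire recorded sequence in order.
        return [fn(*args) for fn, args in self.ops]

g = ToyGraph()
for i in range(10):
    g.capture(pow, 2, i)  # record 10 "launches" once

results = g.replay()  # single call executes all 10 recorded ops
```

This mirrors the structure of the proposal above: the expensive per-iteration host work happens during capture, and the measured region contains only a single submission.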