-
Notifications
You must be signed in to change notification settings - Fork 268
Description
Summary
clFinish (and zeCommandListHostSynchronize via Level Zero) hangs permanently when total GPU execution time on a single command queue exceeds ~20 seconds. The GPU completes the work — strace shows DRM_IOCTL_I915_GEM_WAIT returning successfully — but the runtime never reports completion.
The threshold is exactly ~20s regardless of the number of dispatches, iteration count, or whether i915 hangcheck is disabled.
Environment
- GPU: Intel Arc A770 (DG2, PCI
0x56a0) - Driver:
intel-opencl-icd 26.05.37020.3-0/libze-intel-gpu1 26.05.37020.3-0 - IGC:
2.28.4 - Kernel:
6.11.0-29-generic - OS: Ubuntu 24.04
Reproducer
Single file, pure OpenCL. Enqueues N dispatches of a spin-loop kernel (~4.4s each) on one queue, then calls clFinish.
g++ -O2 repro.cpp -o repro -lOpenCL
./repro 4 # ~18s total # PASS
./repro 5 # ~22s total # HANGS
// NEO bug: clFinish hangs after ~20s of GPU work on a single queue.
// GPU completes (strace confirms GEM_WAIT returns), but NEO loses track.
// Tested: Arc A770, libze_intel_gpu.so.1.14.37020
//
// Build: g++ -O2 repro.cpp -o repro -lOpenCL
// Run: ./repro 4 # ~18s total GPU work -> PASS
// ./repro 5 # ~22s total GPU work -> HANGS
#include <CL/cl.h>
#include <cstdio>
#include <cstdlib>
#include <sys/time.h>
#define CHK(x) do { cl_int e=(x); if(e) { \
fprintf(stderr, "FAIL: %s returned %d at line %d\n", #x, e, __LINE__); \
exit(1); } } while(0)
static const char *kernel_src =
"__kernel void spin(__global long *out, long iters) {\n"
" long v = 0;\n"
" for (long i = 0; i < iters; i++)\n"
" v += i % 3;\n"
" out[0] = v;\n"
"}\n";
int main(int argc, char **argv) {
int n = argc > 1 ? atoi(argv[1]) : 5;
cl_long iters = 48000000; // ~4.4s per dispatch on Arc A770
printf("%d dispatches x %ld iters (est %.0fs, threshold ~20s)\n",
n, iters, n * 4.4);
// Platform + device
cl_platform_id plat;
cl_device_id dev;
cl_int err;
CHK(clGetPlatformIDs(1, &plat, nullptr));
CHK(clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, nullptr));
char name[256];
clGetDeviceInfo(dev, CL_DEVICE_NAME, sizeof(name), name, nullptr);
printf("Device: %s\n", name);
cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, &err);
CHK(err);
cl_command_queue queue = clCreateCommandQueue(ctx, dev, 0, &err);
CHK(err);
// Build kernel
cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src, nullptr, &err);
CHK(err);
err = clBuildProgram(prog, 1, &dev, nullptr, nullptr, nullptr);
if (err != CL_SUCCESS) {
char log[4096];
clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG, sizeof(log), log, nullptr);
fprintf(stderr, "Build failed:\n%s\n", log);
exit(1);
}
cl_kernel kern = clCreateKernel(prog, "spin", &err);
CHK(err);
cl_mem buf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(cl_long), nullptr, &err);
CHK(err);
CHK(clSetKernelArg(kern, 0, sizeof(buf), &buf));
CHK(clSetKernelArg(kern, 1, sizeof(iters), &iters));
// Enqueue N kernel dispatches
size_t global = 1, local = 1;
for (int i = 0; i < n; i++)
CHK(clEnqueueNDRangeKernel(queue, kern, 1, nullptr, &global, &local, 0, nullptr, nullptr));
// clFinish — hangs when total GPU time exceeds ~20s
struct timeval t0, t1;
gettimeofday(&t0, nullptr);
printf("clFinish...\n");
fflush(stdout);
err = clFinish(queue);
gettimeofday(&t1, nullptr);
long ms = (t1.tv_sec - t0.tv_sec) * 1000 + (t1.tv_usec - t0.tv_usec) / 1000;
printf("err=%d time=%ldms %s\n", err, ms, err == CL_SUCCESS ? "PASS" : "FAIL/HANG");
clReleaseMemObject(buf);
clReleaseKernel(kern);
clReleaseProgram(prog);
clReleaseCommandQueue(queue);
clReleaseContext(ctx);
return err != CL_SUCCESS;
}Observations
- Threshold is ~20.000s — not related to dispatch count. 4×4.4s=17.6s passes, 5×4.4s=22s hangs.
- GPU completes the work —
straceshowsDRM_IOCTL_I915_GEM_WAITreturns successfully, but the host-side sync never completes. - i915 hangcheck is not the cause — disabling all i915 timeouts (
heartbeat_interval_ms=0,preempt_timeout_ms=0,stop_timeout_ms=0) has no effect on the ~20s threshold. - Affects both OpenCL and Level Zero —
zeCommandListHostSynchronizeon an immediate command list exhibits the same behavior. With a finite timeout, it returnsZE_RESULT_NOT_READYpermanently after the ~20s threshold is crossed. - DirectSubmission controller appears to internally declare a GPU hang at the ~20s mark and enters an unrecoverable state, even though the GPU is still executing (or has completed) the work.
Impact
This blocks any workload that runs >20s of GPU work on a single hardware queue. On devices with numQueues=1 (like Arc A770), all command queues/streams serialize onto the same hardware queue, making it easy to hit this threshold with multiple short kernels.
Discovered while debugging chipStar#1191.