clFinish / zeCommandListHostSynchronize hangs after ~20s of GPU work

### Summary

`clFinish` (and `zeCommandListHostSynchronize` via Level Zero) hangs permanently when total GPU execution time on a single command queue exceeds ~20 seconds. The GPU completes the work — `strace` shows `DRM_IOCTL_I915_GEM_WAIT` returning successfully — but the runtime never reports completion.

The threshold is exactly ~20s regardless of the number of dispatches, iteration count, or whether i915 hangcheck is disabled.

### Environment

- **GPU**: Intel Arc A770 (DG2, PCI `0x56a0`)
- **Driver**: `intel-opencl-icd 26.05.37020.3-0` / `libze-intel-gpu1 26.05.37020.3-0`
- **IGC**: `2.28.4`
- **Kernel**: `6.11.0-29-generic`
- **OS**: Ubuntu 24.04

### Reproducer

Single file, pure OpenCL. Enqueues N dispatches of a spin-loop kernel (~4.4s each) on one queue, then calls `clFinish`.

```
g++ -O2 repro.cpp -o repro -lOpenCL
./repro 4   # ~18s total # PASS
./repro 5   # ~22s total # HANGS
```

```cpp
// NEO bug: clFinish hangs after ~20s of GPU work on a single queue.
// GPU completes (strace confirms GEM_WAIT returns), but NEO loses track.
// Tested: Arc A770, libze_intel_gpu.so.1.14.37020
//
// Build: g++ -O2 repro.cpp -o repro -lOpenCL
// Run:   ./repro 4  # ~18s total GPU work -> PASS
//        ./repro 5  # ~22s total GPU work -> HANGS

#include <CL/cl.h>
#include <cstdio>
#include <cstdlib>
#include <sys/time.h>

#define CHK(x) do { cl_int e=(x); if(e) { \
  fprintf(stderr, "FAIL: %s returned %d at line %d\n", #x, e, __LINE__); \
  exit(1); } } while(0)

static const char *kernel_src =
  "__kernel void spin(__global long *out, long iters) {\n"
  "  long v = 0;\n"
  "  for (long i = 0; i < iters; i++)\n"
  "    v += i % 3;\n"
  "  out[0] = v;\n"
  "}\n";

int main(int argc, char **argv) {
  int n = argc > 1 ? atoi(argv[1]) : 5;
  cl_long iters = 48000000; // ~4.4s per dispatch on Arc A770
  printf("%d dispatches x %ld iters (est %.0fs, threshold ~20s)\n",
         n, iters, n * 4.4);

  // Platform + device
  cl_platform_id plat;
  cl_device_id dev;
  cl_int err;
  CHK(clGetPlatformIDs(1, &plat, nullptr));
  CHK(clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, nullptr));

  char name[256];
  clGetDeviceInfo(dev, CL_DEVICE_NAME, sizeof(name), name, nullptr);
  printf("Device: %s\n", name);

  cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, &err);
  CHK(err);
  cl_command_queue queue = clCreateCommandQueue(ctx, dev, 0, &err);
  CHK(err);

  // Build kernel
  cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src, nullptr, &err);
  CHK(err);
  err = clBuildProgram(prog, 1, &dev, nullptr, nullptr, nullptr);
  if (err != CL_SUCCESS) {
    char log[4096];
    clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG, sizeof(log), log, nullptr);
    fprintf(stderr, "Build failed:\n%s\n", log);
    exit(1);
  }
  cl_kernel kern = clCreateKernel(prog, "spin", &err);
  CHK(err);

  cl_mem buf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(cl_long), nullptr, &err);
  CHK(err);
  CHK(clSetKernelArg(kern, 0, sizeof(buf), &buf));
  CHK(clSetKernelArg(kern, 1, sizeof(iters), &iters));

  // Enqueue N kernel dispatches
  size_t global = 1, local = 1;
  for (int i = 0; i < n; i++)
    CHK(clEnqueueNDRangeKernel(queue, kern, 1, nullptr, &global, &local, 0, nullptr, nullptr));

  // clFinish — hangs when total GPU time exceeds ~20s
  struct timeval t0, t1;
  gettimeofday(&t0, nullptr);
  printf("clFinish...\n");
  fflush(stdout);
  err = clFinish(queue);
  gettimeofday(&t1, nullptr);
  long ms = (t1.tv_sec - t0.tv_sec) * 1000 + (t1.tv_usec - t0.tv_usec) / 1000;
  printf("err=%d  time=%ldms  %s\n", err, ms, err == CL_SUCCESS ? "PASS" : "FAIL/HANG");

  clReleaseMemObject(buf);
  clReleaseKernel(kern);
  clReleaseProgram(prog);
  clReleaseCommandQueue(queue);
  clReleaseContext(ctx);
  return err != CL_SUCCESS;
}
```

### Observations

1. **Threshold is ~20.000s** — not related to dispatch count. 4×4.4s=17.6s passes, 5×4.4s=22s hangs.
2. **GPU completes the work** — `strace` shows `DRM_IOCTL_I915_GEM_WAIT` returns successfully, but the host-side sync never completes.
3. **i915 hangcheck is not the cause** — disabling all i915 timeouts (`heartbeat_interval_ms=0`, `preempt_timeout_ms=0`, `stop_timeout_ms=0`) has no effect on the ~20s threshold.
4. **Affects both OpenCL and Level Zero** — `zeCommandListHostSynchronize` on an immediate command list exhibits the same behavior. With a finite timeout, it returns `ZE_RESULT_NOT_READY` permanently after the ~20s threshold is crossed.
5. **DirectSubmission controller** appears to internally declare a GPU hang at the ~20s mark and enters an unrecoverable state, even though the GPU is still executing (or has completed) the work.

### Impact

This blocks any workload that runs >20s of GPU work on a single hardware queue. On devices with `numQueues=1` (like Arc A770), all command queues/streams serialize onto the same hardware queue, making it easy to hit this threshold with multiple short kernels.

Discovered while debugging [chipStar#1191](https://github.com/CHIP-SPV/chipStar/issues/1191).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

clFinish / zeCommandListHostSynchronize hangs after ~20s of GPU work #904

Summary

Environment

Reproducer

Observations

Impact

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

clFinish / zeCommandListHostSynchronize hangs after ~20s of GPU work #904

Description

Summary

Environment

Reproducer

Observations

Impact

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions