Skip to content

clFinish / zeCommandListHostSynchronize hangs after ~20s of GPU work #904

@pvelesko

Description

@pvelesko

Summary

clFinish (and zeCommandListHostSynchronize via Level Zero) hangs permanently when total GPU execution time on a single command queue exceeds ~20 seconds. The GPU completes the work — strace shows DRM_IOCTL_I915_GEM_WAIT returning successfully — but the runtime never reports completion.

The threshold is exactly ~20s regardless of the number of dispatches, iteration count, or whether i915 hangcheck is disabled.

Environment

  • GPU: Intel Arc A770 (DG2, PCI 0x56a0)
  • Driver: intel-opencl-icd 26.05.37020.3-0 / libze-intel-gpu1 26.05.37020.3-0
  • IGC: 2.28.4
  • Kernel: 6.11.0-29-generic
  • OS: Ubuntu 24.04

Reproducer

Single file, pure OpenCL. Enqueues N dispatches of a spin-loop kernel (~4.4s each) on one queue, then calls clFinish.

g++ -O2 repro.cpp -o repro -lOpenCL
./repro 4   # ~18s total # PASS
./repro 5   # ~22s total # HANGS
// NEO bug: clFinish hangs after ~20s of GPU work on a single queue.
// GPU completes (strace confirms GEM_WAIT returns), but NEO loses track.
// Tested: Arc A770, libze_intel_gpu.so.1.14.37020
//
// Build: g++ -O2 repro.cpp -o repro -lOpenCL
// Run:   ./repro 4  # ~18s total GPU work -> PASS
//        ./repro 5  # ~22s total GPU work -> HANGS

#include <CL/cl.h>
#include <cstdio>
#include <cstdlib>
#include <sys/time.h>

#define CHK(x) do { cl_int e=(x); if(e) { \
  fprintf(stderr, "FAIL: %s returned %d at line %d\n", #x, e, __LINE__); \
  exit(1); } } while(0)

static const char *kernel_src =
  "__kernel void spin(__global long *out, long iters) {\n"
  "  long v = 0;\n"
  "  for (long i = 0; i < iters; i++)\n"
  "    v += i % 3;\n"
  "  out[0] = v;\n"
  "}\n";

int main(int argc, char **argv) {
  int n = argc > 1 ? atoi(argv[1]) : 5;
  cl_long iters = 48000000; // ~4.4s per dispatch on Arc A770
  printf("%d dispatches x %ld iters (est %.0fs, threshold ~20s)\n",
         n, iters, n * 4.4);

  // Platform + device
  cl_platform_id plat;
  cl_device_id dev;
  cl_int err;
  CHK(clGetPlatformIDs(1, &plat, nullptr));
  CHK(clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, nullptr));

  char name[256];
  clGetDeviceInfo(dev, CL_DEVICE_NAME, sizeof(name), name, nullptr);
  printf("Device: %s\n", name);

  cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, &err);
  CHK(err);
  cl_command_queue queue = clCreateCommandQueue(ctx, dev, 0, &err);
  CHK(err);

  // Build kernel
  cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src, nullptr, &err);
  CHK(err);
  err = clBuildProgram(prog, 1, &dev, nullptr, nullptr, nullptr);
  if (err != CL_SUCCESS) {
    char log[4096];
    clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG, sizeof(log), log, nullptr);
    fprintf(stderr, "Build failed:\n%s\n", log);
    exit(1);
  }
  cl_kernel kern = clCreateKernel(prog, "spin", &err);
  CHK(err);

  cl_mem buf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(cl_long), nullptr, &err);
  CHK(err);
  CHK(clSetKernelArg(kern, 0, sizeof(buf), &buf));
  CHK(clSetKernelArg(kern, 1, sizeof(iters), &iters));

  // Enqueue N kernel dispatches
  size_t global = 1, local = 1;
  for (int i = 0; i < n; i++)
    CHK(clEnqueueNDRangeKernel(queue, kern, 1, nullptr, &global, &local, 0, nullptr, nullptr));

  // clFinish — hangs when total GPU time exceeds ~20s
  struct timeval t0, t1;
  gettimeofday(&t0, nullptr);
  printf("clFinish...\n");
  fflush(stdout);
  err = clFinish(queue);
  gettimeofday(&t1, nullptr);
  long ms = (t1.tv_sec - t0.tv_sec) * 1000 + (t1.tv_usec - t0.tv_usec) / 1000;
  printf("err=%d  time=%ldms  %s\n", err, ms, err == CL_SUCCESS ? "PASS" : "FAIL/HANG");

  clReleaseMemObject(buf);
  clReleaseKernel(kern);
  clReleaseProgram(prog);
  clReleaseCommandQueue(queue);
  clReleaseContext(ctx);
  return err != CL_SUCCESS;
}

Observations

  1. Threshold is ~20.000s — not related to dispatch count. 4×4.4s=17.6s passes, 5×4.4s=22s hangs.
  2. GPU completes the workstrace shows DRM_IOCTL_I915_GEM_WAIT returns successfully, but the host-side sync never completes.
  3. i915 hangcheck is not the cause — disabling all i915 timeouts (heartbeat_interval_ms=0, preempt_timeout_ms=0, stop_timeout_ms=0) has no effect on the ~20s threshold.
  4. Affects both OpenCL and Level ZerozeCommandListHostSynchronize on an immediate command list exhibits the same behavior. With a finite timeout, it returns ZE_RESULT_NOT_READY permanently after the ~20s threshold is crossed.
  5. DirectSubmission controller appears to internally declare a GPU hang at the ~20s mark and enters an unrecoverable state, even though the GPU is still executing (or has completed) the work.

Impact

This blocks any workload that runs >20s of GPU work on a single hardware queue. On devices with numQueues=1 (like Arc A770), all command queues/streams serialize onto the same hardware queue, making it easy to hit this threshold with multiple short kernels.

Discovered while debugging chipStar#1191.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions