
Commit d3298dc

[CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full
With pipeline parallelism, the CPU-side CUDA command buffer fills up during prompt processing and stalls the CPU. While the CPU is stalled, not enough work is submitted to the GPU, which leaves bubbles in the GPU timeline. Fix this by setting the CUDA environment variable CUDA_SCALE_LAUNCH_QUEUES to 4x, which increases the command buffer size. The variable is only set when the user has not already defined it.
1 parent: 091a46c

1 file changed: 16 additions & 0 deletions

ggml/src/ggml-cuda/ggml-cuda.cu
@@ -194,6 +194,22 @@ static int ggml_cuda_parse_id(char devName[]) {
 static ggml_cuda_device_info ggml_cuda_init() {
     ggml_cuda_device_info info = {};
 
+    // Set CUDA_SCALE_LAUNCH_QUEUES before any CUDA API call to improve multi-GPU pipeline parallelism performance
+    if (getenv("CUDA_SCALE_LAUNCH_QUEUES") == nullptr) {
+#ifdef _WIN32
+        _putenv_s("CUDA_SCALE_LAUNCH_QUEUES", "4x");
+#else
+        setenv("CUDA_SCALE_LAUNCH_QUEUES", "4x", 0); // don't overwrite if already set
+#endif
+
+        GGML_LOG_WARN("\n");
+        GGML_LOG_WARN("================================================================================\n");
+        GGML_LOG_WARN(" CUDA_SCALE_LAUNCH_QUEUES=4x has been enabled\n");
+        GGML_LOG_WARN(" This environment variable improves performance with multiple GPUs\n");
+        GGML_LOG_WARN("================================================================================\n");
+        GGML_LOG_WARN("\n");
+    }
+
     cudaError_t err = cudaGetDeviceCount(&info.device_count);
     if (err != cudaSuccess) {
         GGML_LOG_ERROR("%s: failed to initialize " GGML_CUDA_NAME ": %s\n", __func__, cudaGetErrorString(err));
