lakinsm/cuda-half-testing

Testing cuBLAS GEMM Speeds

Benchmarks of the cuBLAS SGEMM, HGEMM, and GemmEx algorithms on NVIDIA's 1080Ti and 2080Ti under Ubuntu 18.04 LTS, multiplying 20,000 x 20,000 square matrices (400,000,000 elements each). Examples of each are provided.

If you're here from Google, here's some information about NVIDIA's __half as defined in cuda_fp16.h:

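  • __half is an IEEE 754 binary16 value, so its finite range is +/- 65,504; anything larger overflows to infinity. Precision runs out well before that: with an 11-bit significand, consecutive integers are exactly representable only up to 2,048, and the spacing between adjacent values is already 8 at 8,192. I couldn't find this explicitly documented anywhere else.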
  • __half2 cannot be used in the cuBLAS GEMM algorithms and is largely defined for custom CUDA kernels, from what I can tell.
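The range and spacing of __half follow from the IEEE 754 binary16 bit layout (1 sign bit, 5 exponent bits, 10 fraction bits). The following is a minimal host-side sketch of that arithmetic in plain C++, so it does not need a GPU. The helper names `half_max_finite` and `half_spacing` are made up for illustration and are not part of cuda_fp16.h.

```cpp
#include <cmath>

// IEEE 754 binary16 layout: 1 sign bit, 5 exponent bits, 10 fraction bits.
constexpr int kFractionBits = 10;
constexpr int kMaxExponent = 15;  // largest unbiased exponent of a finite value

// Largest finite binary16 magnitude: (2 - 2^-10) * 2^15.
double half_max_finite() {
    return (2.0 - std::ldexp(1.0, -kFractionBits)) * std::ldexp(1.0, kMaxExponent);
}

// Gap between adjacent binary16 values around a positive magnitude x.
// Only meaningful for x in the normal range, 2^-14 <= x <= half_max_finite().
double half_spacing(double x) {
    const int exponent = std::ilogb(x);  // floor(log2(x))
    return std::ldexp(1.0, exponent - kFractionBits);
}
```

For example, `half_spacing(1.0)` is 2^-10, while `half_spacing(2048.0)` is 2, which is why odd integers above 2,048 cannot be stored exactly in a __half.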

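The following are the timings for a single iteration of each algorithm on each architecture. "CUDA 6.1" and "CUDA 7.5" in the output refer to the compute capability of the 1080Ti and 2080Ti, respectively: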

CUDA 6.1 32-bit SGEMM, 400000000 elements per array, 1600 MB memory per array, 3.25769 sec

CUDA 6.1 16-bit HGEMM, 400000000 elements per array, 800 MB memory per array, 74.9325 sec

CUDA 6.1 16-32-bit GemmEx, 400000000 elements per array, 800 and 1600 MB memory for 16-/32-bit arrays, 2.39189 sec

CUDA 7.5 32-bit SGEMM, 400000000 elements per array, 1600 MB memory per array, 3.40309 sec

CUDA 7.5 16-bit HGEMM, 400000000 elements per array, 800 MB memory per array, 1.45896 sec

CUDA 7.5 16-32-bit GemmEx, 400000000 elements per array, 800 and 1600 MB memory for 16-/32-bit arrays, 1.40492 sec
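The per-array memory figures in the listing can be reproduced from the matrix dimension and the element size. This is a small sketch assuming decimal megabytes (1 MB = 10^6 bytes), which is what makes the listed numbers come out exactly. The function names are hypothetical.

```cpp
#include <cstdint>

// Bytes needed for one dense n x n matrix with the given element size.
std::uint64_t matrix_bytes(std::uint64_t n, std::uint64_t bytes_per_element) {
    return n * n * bytes_per_element;
}

// Same figure in decimal megabytes (1 MB = 10^6 bytes), as in the listing.
std::uint64_t matrix_megabytes(std::uint64_t n, std::uint64_t bytes_per_element) {
    return matrix_bytes(n, bytes_per_element) / 1000000ULL;
}
```

With n = 20,000, `matrix_megabytes(20000, 4)` gives the 1600 MB of a 32-bit array and `matrix_megabytes(20000, 2)` gives the 800 MB of a 16-bit array. A GEMM holds three such matrices (A, B, and C) on the device at once.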

Note that "Fast 16-bit float" compute is not available at the CUDA 6 compute capability level, except for the P100 (compute capability 6.0), which I am not using here. As a result, on the 1080Ti the half-precision-only HGEMM is about 23x slower than 32-bit SGEMM and about 31x slower than mixed 16-/32-bit GemmEx. Also, even though the GemmEx mixed-precision algorithm accumulates into 32-bit data, the 2080Ti GemmEx slightly outperforms the HGEMM (1.40 sec vs. 1.46 sec), perhaps due to use of Tensor cores.
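The slowdown factors can be recomputed directly from the listed timings, and the timings can be converted into an effective throughput. The sketch below assumes the conventional 2 * n^3 floating-point operation count for an n x n GEMM, and assumes each timing covers only the GEMM call itself, which the listing does not state. Both helper names are hypothetical.

```cpp
// How many times slower `slow_seconds` is than `fast_seconds`.
double slowdown(double slow_seconds, double fast_seconds) {
    return slow_seconds / fast_seconds;
}

// Effective throughput in TFLOP/s for one n x n by n x n GEMM,
// counting one multiply and one add per inner-product term (2 * n^3).
double gemm_tflops(double n, double seconds) {
    return 2.0 * n * n * n / seconds / 1e12;
}
```

For instance, `slowdown(74.9325, 3.25769)` compares HGEMM against SGEMM on the 1080Ti, and `gemm_tflops(20000, 1.40492)` estimates the 2080Ti GemmEx throughput under the assumptions above.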
