Motivation

The current implementation of the NMS kernel still has room for improvement.

Optimization result: roughly 1.62x speedup.

Metrics improvement: 8x occupancy, 2x shared memory throughput, 3x global memory throughput.

Optimization process: