Gate caching in group_index_select kernel so it only takes place for fp16 if the batch size is small
#158
+14
−4