
Gates caching in group_index_select kernel to only take place for fp16 if batch size is small #158

Merged: aryaman-gupta merged 2 commits into abokovoi/group-index-sort-and-cache-opt from aryaman/gis-fp32-regression-fix-v2 on May 21, 2026
Conversation

@aryaman-gupta

No description provided.

Aryaman Gupta and others added 2 commits May 20, 2026 15:17
… + workload size

Compute enable_cache_and_contig_for_bwd in the forward and pass it through
saved_data to the backward. The flag is on for fp16/bf16 unconditionally
(software CAS-loop atomicAdd makes the cache+contig wins large), and for
fp32 only when num_total_indices >= the lower sort threshold (cache+contig
per-warp overhead is otherwise not offset by the savings).

The gating intentionally uses only the lower sort threshold, not the upper
one: when the workload is too large for sort to be worthwhile, cache+contig
still help because the kernel runtime dominates the per-warp overhead.
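
The gating rule described above can be sketched as a small host-side predicate. This is a minimal illustration, not the code from the PR: the `DType` enum, the function signature, and the value of `kLowerSortThreshold` are assumptions made for the example, since the commit message names only the flag `enable_cache_and_contig_for_bwd` and the quantity `num_total_indices`.

```cpp
#include <cstdint>

// Hypothetical element types for illustration; the real kernel dispatches
// on the tensor dtype.
enum class DType { Float32, Float16, BFloat16 };

// Placeholder value (assumption): the PR does not state the actual lower
// sort threshold.
constexpr int64_t kLowerSortThreshold = 4096;

// Computed once in the forward pass; the result would then be stored in
// saved_data so the backward pass can read it.
inline bool enable_cache_and_contig_for_bwd(
    DType dtype,
    int64_t num_total_indices,
    int64_t lower_sort_threshold = kLowerSortThreshold) {
  // fp16/bf16: always enabled, because atomicAdd falls back to a software
  // CAS loop and cache+contig saves a lot of that work.
  if (dtype == DType::Float16 || dtype == DType::BFloat16) {
    return true;
  }
  // fp32: enabled only once the workload is large enough to amortize the
  // per-warp overhead. There is deliberately no upper bound here.
  return num_total_indices >= lower_sort_threshold;
}
```

Note that the fp32 branch checks only the lower bound, so a workload too large for sorting to pay off still gets cache+contig, matching the reasoning in the commit message.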
@aryaman-gupta aryaman-gupta marked this pull request as ready for review May 20, 2026 16:09

@avbokovoy avbokovoy left a comment

LGTM

@aryaman-gupta aryaman-gupta merged commit 5844b46 into abokovoi/group-index-sort-and-cache-opt May 21, 2026
6 checks passed
2 participants