[Feature]: Integrate ultra-low-latency TopK into flashinfer-rocm for DeepSeek V3.2 #217

### Suggestion Description

We present a low-latency TopK kernel that uses the on-chip network facility:

https://github.com/yiakwy-xpu-ml-framework-team/flash-float-jit-kernels

This kernel addresses the latency problem in long-context decoding with the TopK indexer and NSA (DeepSeek V3.2). It reduces latency by half for low-batch-size workloads.
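
For reviewers who want a correctness baseline when integrating the kernel, here is a minimal pure-Python sketch of what a TopK index selection over per-token indexer scores computes. It is not the kernel from the linked repository: the function names `topk_indices` and `batched_topk`, the tie-breaking rule (lower index wins), and the sample scores are all assumptions made for illustration, and `k` is left as a parameter because the issue does not state the value used.

```python
import heapq


def topk_indices(scores, k):
    """Return the indices of the k largest scores, highest score first.

    Hypothetical reference helper; ties are broken in favour of the lower index.
    """
    k = min(k, len(scores))
    # Pair each score with its negated index so that, among equal scores,
    # the smaller index compares as larger and is picked first.
    best = heapq.nlargest(k, ((s, -i) for i, s in enumerate(scores)))
    return [-neg_i for _, neg_i in best]


def batched_topk(score_rows, k):
    """Apply topk_indices to each row (one row per decoding request)."""
    return [topk_indices(row, k) for row in score_rows]


# Made-up scores for a batch of two requests over five context positions.
scores = [
    [0.1, 0.9, 0.3, 0.9, 0.5],
    [2.0, -1.0, 0.0, 1.5, 1.5],
]
selected = batched_topk(scores, k=2)
print(selected)  # [[1, 3], [0, 3]]
```

A GPU kernel may return the selected indices in a different order or break ties differently, so comparing sorted index sets (or the gathered scores) is usually a safer check than comparing lists element by element.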

### Operating System

Ubuntu-24.04

### GPU

MI355

### ROCm Component

flashinfer-rocm
