[Triton] DSV4 GEMM changed to mxfp8 GEMM #859

Draft: k50112113 wants to merge 2 commits into main from shaoclee/alizaidy-dsv4-mxfp8_gemm

@k50112113
Contributor

This PR adds the `ATOM_FP8_BLOCKSCALE_USE_MXFP8` environment variable, which switches all the projection GEMMs to the mxfp8 quantization format.

This PR depends on: ROCm/aiter#3286
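
For readers unfamiliar with this kind of switch, here is a minimal sketch of how an environment-variable toggle might gate the GEMM path. Only the variable name `ATOM_FP8_BLOCKSCALE_USE_MXFP8` comes from this PR. The helper names, the accepted truthy values, and the `"fp8_blockscale"` label for the default path are assumptions made for illustration, not the actual ATOM or aiter code.

```python
import os

# Name taken from the PR description.
ENV_VAR = "ATOM_FP8_BLOCKSCALE_USE_MXFP8"


def use_mxfp8(environ=os.environ) -> bool:
    """Hypothetical helper: report whether the mxfp8 path is requested.

    The set of accepted truthy strings is an assumption; the PR does not
    say how the variable is parsed.
    """
    value = environ.get(ENV_VAR, "0")
    return value.strip().lower() in {"1", "true", "yes", "on"}


def select_projection_gemm(environ=os.environ) -> str:
    """Hypothetical dispatcher: return a label for the GEMM path to use.

    "fp8_blockscale" is an assumed name for the default (baseline) path.
    """
    return "mxfp8" if use_mxfp8(environ) else "fp8_blockscale"


if __name__ == "__main__":
    print(select_projection_gemm())
```

Under these assumptions, launching the server with `ATOM_FP8_BLOCKSCALE_USE_MXFP8=1` in the environment would select the mxfp8 path, and leaving the variable unset would keep the existing path.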

local-completions ({'model': '/data/deepseek-ai/DeepSeek-V4-Pro', 'base_url': 'http://localhost:8000/v1/completions', 'num_concurrent': 65, 'max_retries': 2, 'tokenized_requests': False}), gen_kwargs: ({}), limit: 100.0, num_fewshot: 3, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     3|exact_match|↑  | 0.95|±  |0.0219|
|     |       |strict-match    |     3|exact_match|↑  | 0.95|±  |0.0219|

This PR
============ Serving Benchmark Result ============
Successful requests:                     512       
Benchmark duration (s):                  258.17    
Total input tokens:                      524288    
Total generated tokens:                  524288    
Request throughput (req/s):              1.98      
Output token throughput (tok/s):         2030.81   
Total Token throughput (tok/s):          4061.63   
---------------Time to First Token----------------
Mean TTFT (ms):                          2058.78   
Median TTFT (ms):                        2190.25   
P99 TTFT (ms):                           4948.53   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          29.53     
Median TPOT (ms):                        29.38     
P99 TPOT (ms):                           31.06     
---------------Inter-token Latency----------------
Mean ITL (ms):                           29.50     
Median ITL (ms):                         28.69     
P99 ITL (ms):                            30.13     
==================================================

Baseline
============ Serving Benchmark Result ============
Successful requests:                     512       
Benchmark duration (s):                  259.28    
Total input tokens:                      524288    
Total generated tokens:                  524288    
Request throughput (req/s):              1.97      
Output token throughput (tok/s):         2022.13   
Total Token throughput (tok/s):          4044.26   
---------------Time to First Token----------------
Mean TTFT (ms):                          1970.23   
Median TTFT (ms):                        2201.64   
P99 TTFT (ms):                           3883.61   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          29.75     
Median TPOT (ms):                        29.76     
P99 TPOT (ms):                           31.30     
---------------Inter-token Latency----------------
Mean ITL (ms):                           29.72     
Median ITL (ms):                         28.91     
P99 ITL (ms):                            30.15     
==================================================
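
To make the two runs easier to compare, the sketch below computes the relative change of a few metrics copied from the tables above. The dictionary keys and the `pct_change` helper are names chosen here for illustration; the numbers are the ones reported in this PR.

```python
# Metrics copied from the "This PR" and "Baseline" benchmark results above.
this_pr = {
    "output_tok_s": 2030.81,
    "mean_ttft_ms": 2058.78,
    "p99_ttft_ms": 4948.53,
    "mean_tpot_ms": 29.53,
}
baseline = {
    "output_tok_s": 2022.13,
    "mean_ttft_ms": 1970.23,
    "p99_ttft_ms": 3883.61,
    "mean_tpot_ms": 29.75,
}


def pct_change(new: float, old: float) -> float:
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100.0


# Percent change per metric, rounded to two decimals.
deltas = {
    name: round(pct_change(this_pr[name], baseline[name]), 2)
    for name in this_pr
}

if __name__ == "__main__":
    for name, delta in deltas.items():
        print(f"{name}: {delta:+.2f}%")
```

On these numbers, output throughput and mean TPOT differ from the baseline by less than 1%, while mean TTFT is about 4.5% higher and P99 TTFT about 27% higher. Each configuration was reported from a single run, so these figures alone do not show how much of the TTFT difference is run-to-run noise.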

@k50112113 marked this pull request as draft on May 21, 2026, 22:18