conv1d gemm with cm #6587

Merged
nihui merged 8 commits into Tencent:master from futz12:conv1d-opt
May 9, 2026
Conversation

@futz12
Contributor

@futz12 futz12 commented Mar 8, 2026

No description provided.

@nihui nihui closed this Mar 8, 2026
@nihui nihui reopened this Mar 8, 2026
@codecov-commenter

codecov-commenter commented Mar 8, 2026

Codecov Report

❌ Patch coverage is 47.49263% with 178 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.52%. Comparing base (8c18666) to head (4e1fedc).
⚠️ Report is 1 commit behind head on master.

Files with missing lines Patch % Lines
src/layer/vulkan/convolution1d_vulkan.cpp 47.49% 178 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6587      +/-   ##
==========================================
- Coverage   93.59%   93.52%   -0.07%     
==========================================
  Files         933      933              
  Lines      299557   299971     +414     
==========================================
+ Hits       280360   280554     +194     
- Misses      19197    19417     +220     

☔ View full report in Codecov by Sentry.


@futz12 futz12 changed the title conv1d gemm [WIP]conv1d gemm Mar 8, 2026
@futz12
Contributor Author

futz12 commented Apr 2, 2026

@nihui run ci please ^_^

@tencent-adm
Member

tencent-adm commented Apr 2, 2026

CLA assistant check
Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ futz12
❌ nihui
You have signed the CLA already but the status is still pending? Let us recheck it.

@futz12 futz12 changed the title [WIP]conv1d gemm conv1d gemm with gemm & cm Apr 2, 2026
@futz12 futz12 changed the title conv1d gemm with gemm & cm conv1d gemm with cm Apr 2, 2026
@nihui
Member

nihui commented Apr 2, 2026

  ---
  PR Overview

  This PR adds GEMM-based and 1x1s1d1 optimized paths for Convolution1D on the Vulkan backend, with cooperative matrix (CM) support. It modifies convolution1d_vulkan.cpp/.h and adds 4 new shader files.

  ---
  Potential Logic Errors

  1. Missing assertion / no fallback when cooperative matrix query returns zeros

  In both the GEMM and 1x1s1d1 CM paths:
  // assert coopmat_M != 0 && coopmat_N != 0 && coopmat_K != 0
  This is only a comment, not an actual assertion or check. If get_optimal_cooperative_matrix_mnk returns zeros (e.g., no supported configuration), the code proceeds to divide by zero in UNROLL_SG_* calculations and creates invalid
  pipelines. This should be a real check with a fallback to the non-CM path.

  2. Weight repacking in the GEMM CM path uses weight_data directly when elempack != 4

  if (elempack == 4) {
      // repack from maxk-inch-outch to 4-maxk-inch/4-outch
      weight_data_r2.create(4 * maxk * (num_input / 4) * num_output);
      ...
  } else {
      weight_data_r2 = weight_data;  // flat, not reshaped yet
  }
  // from maxk-inch-outch to inch-maxk-outch
  Mat weight_data_r2_r = weight_data_r2.reshape(maxk, num_input, num_output);

  When elempack == 4, weight_data_r2 is created with dimensions 4 * maxk * (num_input / 4) * num_output which equals maxk * num_input * num_output -- the same total element count. But the memory layout has been changed to
  4-maxk-inch/4-outch, so the subsequent .reshape(maxk, num_input, num_output) interprets this rearranged data as if it were still in the original layout. The reshape doesn't re-interpret correctly after the repack -- the data is in
  [outch][inch/4][maxk][4] order but reshape treats it as [outch][inch][maxk]. This is likely a data layout mismatch bug that would produce incorrect weight values in the CM path when elempack == 4.

  3. Hardcoded size = 1024 for cooperative matrix tile calculation

  int size = 1024;  // used for UNROLL_SG_M and UNROLL_WG_M

  This is used as a proxy for the spatial dimension (outw) in the M-dimension tile calculations. However, the actual output width (outw) is only known at forward time, not create_pipeline time. This means:
  - If outw is much smaller than 1024, the unrolling factors will be too aggressive, causing many wasted threads processing out-of-bounds elements.
  - If outw is much larger than 1024, the tile factors might be suboptimal (though capped at 2, so less harmful).

  This is arguably a design trade-off, but the magic number should at minimum be documented.

  4. elempack calculated differently at create_pipeline vs. forward time

  In create_pipeline:
  int elempack = num_input % 4 == 0 ? 4 : 1;

  In forward:
  int elempack = bottom_blob.elempack;  // comes from the actual input blob

  Then num_input in forward is:
  const int num_input = bottom_blob_bordered.h * elempack;

  If the runtime elempack differs from the one assumed at pipeline creation, the weight layout and specialization constants will be inconsistent with the actual data layout, producing incorrect results. There's no guard checking that
  the runtime elempack matches expectations.

  5. Non-CM GEMM path: dispatcher dimensions don't match workgroup logic

  // Non-CM GEMM forward:
  VkMat dispatcher;
  dispatcher.w = (top_blob.w + 3) / 4;
  dispatcher.h = num_output_packed / 4;
  dispatcher.c = 1;

  But the local size was set as:
  pipeline_convolution1d_gemm->set_local_size_xyz(8, 8, 1);
  // or
  pipeline_convolution1d_gemm->set_local_size_xyz(16, std::min(4, num_output_packed / 4), 1);

  The dispatcher divides top_blob.w by 4 (rounding up), which implies each workgroup processes 4 output elements spatially. But the local size in x is 8 or 16 -- this means each thread covers a different fraction of the spatial
  dimension. Whether this matches the shader logic is unclear without the shader, but the asymmetry between the +3)/4 dispatch and local sizes of 8/16 is suspicious and may cause either redundant computation or missing coverage.

  6. Bindings inconsistency in non-CM paths

  // Non-CM GEMM forward:
  std::vector<VkMat> bindings(6);
  bindings[0] = bottom_blob_bordered;
  bindings[1] = top_blob;
  bindings[2] = bottom_blob_bordered;  // duplicate of bindings[0]
  bindings[3] = top_blob;              // duplicate of bindings[1]
  bindings[4] = weight_data_gpu;
  bindings[5] = bias_data_gpu;

  Bindings 0-1 and 2-3 are duplicated. This is a common pattern in ncnn for supporting different storage types (image vs. buffer), but it's worth verifying these 6 bindings match what the convolution1d_packed_gemm shader expects. If the shader only declares 4 descriptor bindings, this would be a mismatch.

  ---
  Suboptimal Implementation Issues

  7. Massive code duplication between GEMM and 1x1s1d1 CM paths

  The cooperative matrix pipeline creation code for GEMM and 1x1s1d1 is nearly identical (~100 lines each):
  - Same UNROLL calculation logic
  - Same weight packing loop structure (with minor dimension differences)
  - Same specialization constant setup
  - Same pipeline creation

  This should be refactored into a shared helper function with parameters for the dimension differences.

  8. Weight packing has O(n^4) nested loops with poor cache locality

  In the CM weight packing:
  for (int q = 0; q < num_output_packed; q += coopmat_N * UNROLL_SG_N * UNROLL_WG_N) {
      for (int p = 0; p < num_input_packed * maxk; p += coopmat_K * UNROLL_SG_K) {
          for (int n = 0; n < coopmat_N * UNROLL_SG_N * UNROLL_WG_N; n++) {
              for (int k = 0; k < coopmat_K * UNROLL_SG_K; k++) {
                  // random access into weight_data_r2_r via .channel().row()[]

  The innermost loop accesses weight_data_r2_r.channel(q + n).row(w)[kx] which involves channel + row pointer arithmetic on every iteration. This has poor spatial locality and could be significantly improved by hoisting the channel/row
  lookups.

  9. Redundant num_input_packed / num_output_packed calculations

  These are computed identically ((x + 3) / 4 * 4) in multiple branches. They should be computed once at the top of create_pipeline and reused.

  10. No UNROLL_WG_K parameter

  The unrolling scheme defines UNROLL_WG_M and UNROLL_WG_N for the workgroup level but only UNROLL_SG_K for the K dimension. There's no UNROLL_WG_K. This is asymmetric -- on hardware where the K dimension is large (e.g., large num_input * maxk), having a workgroup-level K unroll could improve occupancy. This may be intentional but limits performance for K-dominant configs.

  11. destroy_pipeline resets cooperative matrix state unnecessarily

  use_cooperative_matrix = false;
  coopmat_M = 0; coopmat_N = 0; coopmat_K = 0;
  ...
  UNROLL_SG_M = 1; UNROLL_SG_N = 1; ...

  These member variables are always fully set in create_pipeline. Resetting them in destroy_pipeline is harmless but unnecessary and adds maintenance burden.

  12. Missing elempack == 8 support

  The code only handles elempack of 1 or 4. Many modern Vulkan implementations support 8-wide packing. This limits performance on hardware that could benefit from wider packing.

  ---
  Summary

  ┌──────┬──────────┬──────────────────────────────────────────────────────────────────────────────────────────┐
  │      │ Severity │ Issue                                                                                    │
  ├──────┼──────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
  │ 1    │ High     │ No real check when coopmat MNK query returns 0 -- division by zero risk                  │
  ├──────┼──────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
  │ 2    │ High     │ Weight repack for elempack==4 in GEMM CM path likely produces wrong layout after reshape │
  ├──────┼──────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
  │ 3    │ Medium   │ Hardcoded size=1024 may cause suboptimal tiling for very small/large spatial dims        │
  ├──────┼──────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
  │ 4    │ Medium   │ elempack mismatch between create_pipeline and forward has no runtime guard               │
  ├──────┼──────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
  │ 5    │ Medium   │ Dispatcher vs. local_size arithmetic mismatch in non-CM GEMM path                        │
  ├──────┼──────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
  │ 6    │ Low      │ Duplicate bindings should be verified against shader descriptor layout                   │
  ├──────┼──────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
  │ 7    │ Low      │ Significant code duplication between GEMM and 1x1s1d1 CM paths                           │
  ├──────┼──────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
  │ 8    │ Low      │ Poor cache locality in CM weight packing inner loops                                     │
  ├──────┼──────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
  │ 9-12 │ Low      │ Minor: redundant calculations, missing UNROLL_WG_K, unnecessary resets, no elempack=8    │
  └──────┴──────────┴──────────────────────────────────────────────────────────────────────────────────────────┘

  Issues #1 and #2 are likely correctness bugs that should be addressed before merge.

@futz12
Contributor Author

futz12 commented Apr 13, 2026


In fact, a lot of the advice is unnecessary or even wrong.

@nihui nihui requested a review from Copilot May 8, 2026 04:37
@nihui
Member

nihui commented May 8, 2026

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 92eb255d10

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

if (bias_term == 1)
{
sum0 = buffer_ld4(bias_data, gy2.x);
sum2 = buffer_ld4(bias_data, gy2.y);


P1: Guard second output-pack reads in 1x1 shader

When gy points to the last output-pack row (gy + 1 >= psc(outh)), this path still reads bias_data[gy2.y] and later weight_data via wbase1 for a non-existent second pack. That causes out-of-bounds buffer reads for models where ceil(num_output / 4) is odd (for example num_output=4 or 12), which can produce undefined results or GPU faults.


Comment on lines +63 to +66
#if !NCNN_shader_local_memory
if (gx >= outsize || gy >= psc(outh))
return;
#endif


P1: Add bounds guard before local-memory GEMM loads

In the NCNN_shader_local_memory build, the early return is disabled, but the shader still immediately reads bias_data[gy] and then input/weight tiles before any bounds check. Because dispatch uses ceil(grouping), extra invocations with gy >= psc(outh) or gx >= outsize are expected whenever output sizes are not multiples of the local size, so these threads perform out-of-bounds reads.


Contributor

Copilot AI left a comment


Pull request overview

This PR adds new Vulkan Convolution1D fast paths based on packed GEMM and cooperative-matrix GEMM, and wires in a new perf benchmark to measure Convolution1D performance.

Changes:

  • Added a Convolution1D perf benchmark and registered it in the perf test CMake.
  • Implemented/added Vulkan compute shaders for Convolution1D packed GEMM and packed 1x1, plus cooperative-matrix (CM) GEMM and CM 1x1 variants.
  • Updated Convolution1D_vulkan pipeline creation, weight packing, and forward dispatch logic to select between standard, GEMM, 1x1, and CM variants based on runtime options/device support.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.

Show a summary per file
File Description
tests/perf/perf_convolution1d.cpp New perf driver that benchmarks a set of Convolution1D shapes/params via perf_layer.
tests/perf/CMakeLists.txt Registers the Convolution1D perf executable behind WITH_LAYER_convolution1d.
src/layer/vulkan/convolution1d_vulkan.cpp Adds GEMM + CM path selection, weight packing for the new layouts, and dispatch logic for new pipelines.
src/layer/vulkan/convolution1d_vulkan.h Adds new pipeline members and cooperative-matrix tuning/state fields used by the Vulkan implementation.
src/layer/vulkan/shader/convolution1d_packed_gemm.comp New packed GEMM kernel for general Conv1D (non-1x1), including optional local-memory path.
src/layer/vulkan/shader/convolution1d_packed_1x1s1d1.comp New packed 1x1s1d1 kernel specialized for Conv1D pointwise cases.
src/layer/vulkan/shader/convolution1d_gemm_cm.comp New cooperative-matrix GEMM kernel for Conv1D.
src/layer/vulkan/shader/convolution1d_1x1s1d1_cm.comp New cooperative-matrix 1x1s1d1 kernel for Conv1D pointwise cases.



@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4e1fedca83


Comment on lines +86 to +87
v0 = buffer_ld4(bottom_blob_data_4, gx2.x + y * psc(w));
v1 = buffer_ld4(bottom_blob_data_4, gx2.y + y * psc(w));


P1: Guard second x-lane loads in 1x1 packed shader

When outw is odd, the last invocation runs with gx == outw - 1, so gx2.y becomes outw. This branch still unconditionally reads bottom_blob_data_4 at gx2.y (and the pack1 branch does the same pattern), which reads one element past the valid width before the later guarded stores. This can trigger out-of-bounds reads on tail tiles (for example kernel=1,stride=1,dilation=1 where outw == w).


Comment on lines +247 to +249
v1 = buffer_ld4(bottom_blob_data_4, v_offset.g);
v2 = buffer_ld4(bottom_blob_data_4, v_offset.b);
v3 = buffer_ld4(bottom_blob_data_4, v_offset.a);


P1: Avoid tail-lane input reads beyond output width in GEMM shader

This kernel computes four x positions per invocation, but only checks gx < outsize before entering the loop. On the tail tile (outw % 4 != 0), v_offset.g/b/a correspond to invalid output positions and are still read unconditionally, so input reads can go out of bounds before store-side guards run. A concrete case is w=5,kernel_w=3,stride=1,dilation=1 (outw=3), where the lane for gx+3 dereferences past the row end.


@nihui nihui merged commit a53d0d4 into Tencent:master May 9, 2026
103 of 107 checks passed
@nihui
Member

nihui commented May 9, 2026

Thanks for your contribution !

5 participants