conv1d gemm with cm #6587

Merged
nihui merged 8 commits into Tencent:master from futz12:conv1d-opt
May 9, 2026
Conversation

@futz12
Contributor

@futz12 futz12 commented Mar 8, 2026

No description provided.

@nihui nihui closed this Mar 8, 2026
@nihui nihui reopened this Mar 8, 2026
@codecov-commenter

codecov-commenter commented Mar 8, 2026

Codecov Report

❌ Patch coverage is 47.49263% with 178 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.52%. Comparing base (8c18666) to head (4e1fedc).
⚠️ Report is 1 commit behind head on master.

Files with missing lines Patch % Lines
src/layer/vulkan/convolution1d_vulkan.cpp 47.49% 178 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6587      +/-   ##
==========================================
- Coverage   93.59%   93.52%   -0.07%     
==========================================
  Files         933      933              
  Lines      299557   299971     +414     
==========================================
+ Hits       280360   280554     +194     
- Misses      19197    19417     +220     

☔ View full report in Codecov by Sentry.


@futz12 futz12 changed the title conv1d gemm [WIP]conv1d gemm Mar 8, 2026
@futz12
Contributor Author

futz12 commented Apr 2, 2026

@nihui run ci please ^_^

@tencent-adm
Member

tencent-adm commented Apr 2, 2026

CLA assistant check
Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ futz12
❌ nihui
You have signed the CLA already but the status is still pending? Let us recheck it.

@futz12 futz12 changed the title [WIP]conv1d gemm conv1d gemm with gemm & cm Apr 2, 2026
@futz12 futz12 changed the title conv1d gemm with gemm & cm conv1d gemm with cm Apr 2, 2026
@nihui
Member

nihui commented Apr 2, 2026

  ---
  PR Overview

  This PR adds GEMM-based and 1x1s1d1 optimized paths for Convolution1D on the Vulkan backend, with cooperative matrix (CM) support. It modifies convolution1d_vulkan.cpp/.h and adds 4 new shader files.

  ---
  Potential Logic Errors

  1. Missing assertion / no fallback when cooperative matrix query returns zeros

  In both the GEMM and 1x1s1d1 CM paths:
  // assert coopmat_M != 0 && coopmat_N != 0 && coopmat_K != 0
  This is only a comment, not an actual assertion or check. If get_optimal_cooperative_matrix_mnk returns zeros (e.g., no supported configuration), the code proceeds to divide by zero in UNROLL_SG_* calculations and creates invalid
  pipelines. This should be a real check with a fallback to the non-CM path.

  2. Weight repacking in the GEMM CM path uses weight_data directly when elempack != 4

  if (elempack == 4) {
      // repack from maxk-inch-outch to 4-maxk-inch/4-outch
      weight_data_r2.create(4 * maxk * (num_input / 4) * num_output);
      ...
  } else {
      weight_data_r2 = weight_data;  // flat, not reshaped yet
  }
  // from maxk-inch-outch to inch-maxk-outch
  Mat weight_data_r2_r = weight_data_r2.reshape(maxk, num_input, num_output);

  When elempack == 4, weight_data_r2 is created with dimensions 4 * maxk * (num_input / 4) * num_output which equals maxk * num_input * num_output -- the same total element count. But the memory layout has been changed to
  4-maxk-inch/4-outch, so the subsequent .reshape(maxk, num_input, num_output) interprets this rearranged data as if it were still in the original layout. The reshape doesn't re-interpret correctly after the repack -- the data is in
  [outch][inch/4][maxk][4] order but reshape treats it as [outch][inch][maxk]. This is likely a data layout mismatch bug that would produce incorrect weight values in the CM path when elempack == 4.

  3. Hardcoded size = 1024 for cooperative matrix tile calculation

  int size = 1024;  // used for UNROLL_SG_M and UNROLL_WG_M

  This is used as a proxy for the spatial dimension (outw) in the M-dimension tile calculations. However, the actual output width (outw) is only known at forward time, not create_pipeline time. This means:
  - If outw is much smaller than 1024, the unrolling factors will be too aggressive, causing many wasted threads processing out-of-bounds elements.
  - If outw is much larger than 1024, the tile factors might be suboptimal (though capped at 2, so less harmful).

  This is arguably a design trade-off, but the magic number should at minimum be documented.

  4. elempack calculated differently at create_pipeline vs. forward time

  In create_pipeline:
  int elempack = num_input % 4 == 0 ? 4 : 1;

  In forward:
  int elempack = bottom_blob.elempack;  // comes from the actual input blob

  Then num_input in forward is:
  const int num_input = bottom_blob_bordered.h * elempack;

  If the runtime elempack differs from the one assumed at pipeline creation, the weight layout and specialization constants will be inconsistent with the actual data layout, producing incorrect results. There's no guard checking that
  the runtime elempack matches expectations.

  5. Non-CM GEMM path: dispatcher dimensions don't match workgroup logic

  // Non-CM GEMM forward:
  VkMat dispatcher;
  dispatcher.w = (top_blob.w + 3) / 4;
  dispatcher.h = num_output_packed / 4;
  dispatcher.c = 1;

  But the local size was set as:
  pipeline_convolution1d_gemm->set_local_size_xyz(8, 8, 1);
  // or
  pipeline_convolution1d_gemm->set_local_size_xyz(16, std::min(4, num_output_packed / 4), 1);

  The dispatcher divides top_blob.w by 4 (rounding up), which implies each workgroup processes 4 output elements spatially. But the local size in x is 8 or 16 -- this means each thread covers a different fraction of the spatial
  dimension. Whether this matches the shader logic is unclear without the shader, but the asymmetry between the +3)/4 dispatch and local sizes of 8/16 is suspicious and may cause either redundant computation or missing coverage.

  6. Bindings inconsistency in non-CM paths

  // Non-CM GEMM forward:
  std::vector<VkMat> bindings(6);
  bindings[0] = bottom_blob_bordered;
  bindings[1] = top_blob;
  bindings[2] = bottom_blob_bordered;  // duplicate of bindings[0]
  bindings[3] = top_blob;              // duplicate of bindings[1]
  bindings[4] = weight_data_gpu;
  bindings[5] = bias_data_gpu;

  Bindings 0-1 and 2-3 are duplicated. This is a common pattern in ncnn for supporting different storage types (image vs. buffer), but it's worth verifying these 6 bindings match what the convolution1d_packed_gemm shader expects. If the shader only declares 4 descriptor bindings, this would be a mismatch.

  ---
  Suboptimal Implementation Issues

  7. Massive code duplication between GEMM and 1x1s1d1 CM paths

  The cooperative matrix pipeline creation code for GEMM and 1x1s1d1 is nearly identical (~100 lines each):
  - Same UNROLL calculation logic
  - Same weight packing loop structure (with minor dimension differences)
  - Same specialization constant setup
  - Same pipeline creation

  This should be refactored into a shared helper function with parameters for the dimension differences.

  8. Weight packing has O(n^4) nested loops with poor cache locality

  In the CM weight packing:
  for (int q = 0; q < num_output_packed; q += coopmat_N * UNROLL_SG_N * UNROLL_WG_N) {
      for (int p = 0; p < num_input_packed * maxk; p += coopmat_K * UNROLL_SG_K) {
          for (int n = 0; n < coopmat_N * UNROLL_SG_N * UNROLL_WG_N; n++) {
              for (int k = 0; k < coopmat_K * UNROLL_SG_K; k++) {
                  // random access into weight_data_r2_r via .channel().row()[]

  The innermost loop accesses weight_data_r2_r.channel(q + n).row(w)[kx] which involves channel + row pointer arithmetic on every iteration. This has poor spatial locality and could be significantly improved by hoisting the channel/row
  lookups.

  9. Redundant num_input_packed / num_output_packed calculations

  These are computed identically ((x + 3) / 4 * 4) in multiple branches. They should be computed once at the top of create_pipeline and reused.

  10. No UNROLL_WG_K parameter

  The unrolling scheme defines UNROLL_WG_M and UNROLL_WG_N for the workgroup level but only UNROLL_SG_K for the K dimension. There's no UNROLL_WG_K. This is asymmetric -- on hardware where the K dimension is large (e.g., large num_input * maxk), having a workgroup-level K unroll could improve occupancy. This may be intentional but limits performance for K-dominant configs.

  11. destroy_pipeline resets cooperative matrix state unnecessarily

  use_cooperative_matrix = false;
  coopmat_M = 0; coopmat_N = 0; coopmat_K = 0;
  ...
  UNROLL_SG_M = 1; UNROLL_SG_N = 1; ...

  These member variables are always fully set in create_pipeline. Resetting them in destroy_pipeline is harmless but unnecessary and adds maintenance burden.

  12. Missing elempack == 8 support

  The code only handles elempack of 1 or 4. Many modern Vulkan implementations support 8-wide packing. This limits performance on hardware that could benefit from wider packing.

  ---
  Summary

  ┌──────┬──────────┬──────────────────────────────────────────────────────────────────────────────────────────┐
  │      │ Severity │ Issue                                                                                    │
  ├──────┼──────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
  │ 1    │ High     │ No real check when coopmat MNK query returns 0 -- division by zero risk                  │
  ├──────┼──────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
  │ 2    │ High     │ Weight repack for elempack==4 in GEMM CM path likely produces wrong layout after reshape │
  ├──────┼──────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
  │ 3    │ Medium   │ Hardcoded size=1024 may cause suboptimal tiling for very small/large spatial dims        │
  ├──────┼──────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
  │ 4    │ Medium   │ elempack mismatch between create_pipeline and forward has no runtime guard               │
  ├──────┼──────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
  │ 5    │ Medium   │ Dispatcher vs. local_size arithmetic mismatch in non-CM GEMM path                        │
  ├──────┼──────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
  │ 6    │ Low      │ Duplicate bindings should be verified against shader descriptor layout                   │
  ├──────┼──────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
  │ 7    │ Low      │ Significant code duplication between GEMM and 1x1s1d1 CM paths                           │
  ├──────┼──────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
  │ 8    │ Low      │ Poor cache locality in CM weight packing inner loops                                     │
  ├──────┼──────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
  │ 9-12 │ Low      │ Minor: redundant calculations, missing UNROLL_WG_K, unnecessary resets, no elempack=8    │
  └──────┴──────────┴──────────────────────────────────────────────────────────────────────────────────────────┘

  Issues #1 and #2 are likely correctness bugs that should be addressed before merge.

@futz12
Contributor Author

futz12 commented Apr 13, 2026


In fact, a lot of the advice is unnecessary or even wrong.

@nihui nihui requested a review from Copilot May 8, 2026 04:37
@nihui
Member

nihui commented May 8, 2026

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 92eb255d10

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

if (bias_term == 1)
{
sum0 = buffer_ld4(bias_data, gy2.x);
sum2 = buffer_ld4(bias_data, gy2.y);


P1: Guard second output-pack reads in 1x1 shader

When gy points to the last output-pack row (gy + 1 >= psc(outh)), this path still reads bias_data[gy2.y] and later weight_data via wbase1 for a non-existent second pack. That causes out-of-bounds buffer reads for models where ceil(num_output / 4) is odd (for example num_output=4 or 12), which can produce undefined results or GPU faults.


Comment on lines +63 to +66
#if !NCNN_shader_local_memory
if (gx >= outsize || gy >= psc(outh))
return;
#endif


P1: Add bounds guard before local-memory GEMM loads

In the NCNN_shader_local_memory build, the early return is disabled, but the shader still immediately reads bias_data[gy] and then input/weight tiles before any bounds check. Because dispatch uses ceil(grouping), extra invocations with gy >= psc(outh) or gx >= outsize are expected whenever output sizes are not multiples of the local size, so these threads perform out-of-bounds reads.


Contributor

Copilot AI left a comment


Pull request overview

This PR adds new Vulkan Convolution1D fast paths based on packed GEMM and cooperative-matrix GEMM, and wires in a new perf benchmark to measure Convolution1D performance.

Changes:

  • Added a Convolution1D perf benchmark and registered it in the perf test CMake.
  • Implemented/added Vulkan compute shaders for Convolution1D packed GEMM and packed 1x1, plus cooperative-matrix (CM) GEMM and CM 1x1 variants.
  • Updated Convolution1D_vulkan pipeline creation, weight packing, and forward dispatch logic to select between standard, GEMM, 1x1, and CM variants based on runtime options/device support.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.

Show a summary per file
File Description
tests/perf/perf_convolution1d.cpp New perf driver that benchmarks a set of Convolution1D shapes/params via perf_layer.
tests/perf/CMakeLists.txt Registers the Convolution1D perf executable behind WITH_LAYER_convolution1d.
src/layer/vulkan/convolution1d_vulkan.cpp Adds GEMM + CM path selection, weight packing for the new layouts, and dispatch logic for new pipelines.
src/layer/vulkan/convolution1d_vulkan.h Adds new pipeline members and cooperative-matrix tuning/state fields used by the Vulkan implementation.
src/layer/vulkan/shader/convolution1d_packed_gemm.comp New packed GEMM kernel for general Conv1D (non-1x1), including optional local-memory path.
src/layer/vulkan/shader/convolution1d_packed_1x1s1d1.comp New packed 1x1s1d1 kernel specialized for Conv1D pointwise cases.
src/layer/vulkan/shader/convolution1d_gemm_cm.comp New cooperative-matrix GEMM kernel for Conv1D.
src/layer/vulkan/shader/convolution1d_1x1s1d1_cm.comp New cooperative-matrix 1x1s1d1 kernel for Conv1D pointwise cases.



@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4e1fedca83


Comment on lines +86 to +87
v0 = buffer_ld4(bottom_blob_data_4, gx2.x + y * psc(w));
v1 = buffer_ld4(bottom_blob_data_4, gx2.y + y * psc(w));


P1: Guard second x-lane loads in 1x1 packed shader

When outw is odd, the last invocation runs with gx == outw - 1, so gx2.y becomes outw. This branch still unconditionally reads bottom_blob_data_4 at gx2.y (and the pack1 branch does the same pattern), which reads one element past the valid width before the later guarded stores. This can trigger out-of-bounds reads on tail tiles (for example kernel=1,stride=1,dilation=1 where outw == w).


Comment on lines +247 to +249
v1 = buffer_ld4(bottom_blob_data_4, v_offset.g);
v2 = buffer_ld4(bottom_blob_data_4, v_offset.b);
v3 = buffer_ld4(bottom_blob_data_4, v_offset.a);


P1: Avoid tail-lane input reads beyond output width in GEMM shader

This kernel computes four x positions per invocation, but only checks gx < outsize before entering the loop. On the tail tile (outw % 4 != 0), v_offset.g/b/a correspond to invalid output positions and are still read unconditionally, so input reads can go out of bounds before store-side guards run. A concrete case is w=5,kernel_w=3,stride=1,dilation=1 (outw=3), where the lane for gx+3 dereferences past the row end.


@nihui nihui merged commit a53d0d4 into Tencent:master May 9, 2026
103 of 107 checks passed
@nihui
Member

nihui commented May 9, 2026

Thanks for your contribution !

5 participants