massive mips and loongarch optimization #6662
Conversation
Codecov Report
❌ Patch coverage is …

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6662      +/-   ##
==========================================
+ Coverage   93.92%   94.50%   +0.57%
==========================================
  Files         933      966      +33
  Lines      300879   388816   +87937
==========================================
+ Hits       282599   367437   +84838
- Misses      18280    21379    +3099

☔ View full report in Codecov by Sentry.
Add jj+=12 loop unrolling to pack_B_tile, transpose_pack_B_tile, transpose_unpack_output_tile, and gemm_transB_packed_tile for all ii sections (8, 4, 2, 1). MIPS MSA has 32 SIMD registers, so jj+=12 fits well (24 registers for ii+=8, 12 for ii+=4). Update get_optimal_tile_mnk to align TILE_N to multiples of 12 for better utilization of the new kernel.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
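For illustration, a minimal sketch of the TILE_N alignment idea, assuming a standalone rounding helper rather than the actual get_optimal_tile_mnk() logic (which also balances tile sizes against cache size and thread count):

// Hypothetical helper, not the real get_optimal_tile_mnk() code: round the N
// tile up to a multiple of 12 so the jj+=12 unrolled kernel runs full-width,
// while never exceeding the actual problem size.
static int align_tile_n_to_12(int tile_n, int N)
{
    tile_n = (tile_n + 11) / 12 * 12; // round up to a multiple of 12
    if (tile_n > N)
        tile_n = N;
    return tile_n;
}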
…ngArch
Integrate bf16 storage support into multiple operators:
- MIPS: batchnorm, clip, dropout, selu, erf
- LoongArch: batchnorm, clip, dropout
Each operator now declares forward_inplace_bf16s in its header, sets support_bf16_storage=true in the constructor, dispatches bf16 inputs from forward_inplace, and implements the bf16s path using the existing bf16s helper headers. A sketch of the dispatch pattern follows.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
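A rough sketch of that dispatch (forward_inplace_fp32 / forward_inplace_bf16s are stand-ins for the operators' real methods; the elembits()/use_bf16_storage check follows ncnn's usual bf16 convention):

#include "mat.h"
#include "option.h"

// Sketch only: route a bf16 blob to the bf16s path when bf16 storage is enabled,
// otherwise fall through to the fp32 path.
static int forward_inplace_dispatch(ncnn::Mat& bottom_top_blob, const ncnn::Option& opt)
{
    const int elembits = bottom_top_blob.elembits();

    if (opt.use_bf16_storage && elembits == 16)
        return forward_inplace_bf16s(bottom_top_blob, opt); // data is unsigned short bf16

    return forward_inplace_fp32(bottom_top_blob, opt); // default fp32 path
}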
- Add support_bf16_storage = true in constructors for both architectures
- Add crop_pack4_bf16s_msa() for MIPS MSA using int64_t copies (8 bytes)
- Add crop_pack4_bf16s_lsx() for LoongArch LSX using int64_t copies
- Add crop_pack8_lasx() for LoongArch LASX float pack8 (256-bit)
- Add crop_pack8_bf16s_lsx() for LoongArch LASX bf16 pack8 (128-bit)
- Dispatch to bf16 variants when elemsize matches bf16 packing
- Remove debug fprintf statements from MIPS deconvolution_packed_bf16s.h
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
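The int64_t copy trick mentioned above, shown as a scalar sketch (the helper name and loop are illustrative; the real crop_pack4_bf16s_msa()/_lsx() also handle offsets and strides):

#include <stdint.h>

// A pack4 bf16 element is 4 x 2 bytes = 8 bytes, so one int64_t move copies a
// whole packed element per iteration.
static void crop_pack4_bf16s_row(const unsigned short* src, unsigned short* dst, int w)
{
    const int64_t* p = (const int64_t*)src;
    int64_t* q = (int64_t*)dst;
    for (int x = 0; x < w; x++)
    {
        q[x] = p[x];
    }
}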
Add interp_bilinear_pack8.h and interp_bicubic_pack8.h implementing 256-bit SIMD (8 floats) resize operations using LASX intrinsics.
Update interp_loongarch.cpp to:
- Include lasxintrin.h and the new pack8 headers under __loongarch_asx
- Add elempack == 8 paths for dims 1, 2, and 3 (nearest, bilinear, bicubic)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
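A scalar illustration of what the pack8 bilinear kernel vectorizes; the real interp_bilinear_pack8.h performs the inner 8-lane loop with 256-bit LASX operations, and the row/weight setup here is simplified:

// Vertical blend of two precomputed rows, 8 channels per packed element.
static void bilinear_vblend_pack8(const float* rows0p, const float* rows1p,
                                  float beta0, float beta1, float* outp, int w)
{
    for (int x = 0; x < w; x++)
    {
        for (int c = 0; c < 8; c++) // replaced by a single 256-bit multiply-add in the LASX version
        {
            outp[x * 8 + c] = rows0p[x * 8 + c] * beta0 + rows1p[x * 8 + c] * beta1;
        }
    }
}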
… approach
- Replace hand-written kernel packing and convolution loops with convolution1d_transform_kernel_packed() and convolution1d_packed() from convolution1d_packed.h
- Rename weight_data_packed to weight_data_tm to match x86 pattern
- Add LASX (256-bit) support with pack8 out_elempack
- Add NCNN_BF16 support using cast-based approach (bf16 -> fp32 -> conv -> bf16)
- Add bf16 weight/bias cast in dynamic weight forward path
- Include cpu.h, lasxintrin.h headers for new functionality
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
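A hedged sketch of the cast-based bf16 path, assuming ncnn's cast_bfloat16_to_float32() / cast_float32_to_bfloat16() helpers and a hypothetical run_fp32_conv1d() standing in for the actual convolution1d_packed() call:

#include "mat.h"
#include "option.h"

// bf16 -> fp32 -> conv -> bf16, as described in the commit message.
static int forward_bf16s_via_cast(const ncnn::Mat& bottom_blob, ncnn::Mat& top_blob, const ncnn::Option& opt)
{
    ncnn::Mat bottom_blob_fp32;
    ncnn::cast_bfloat16_to_float32(bottom_blob, bottom_blob_fp32, opt);

    ncnn::Mat top_blob_fp32;
    int ret = run_fp32_conv1d(bottom_blob_fp32, top_blob_fp32, opt); // hypothetical fp32 helper
    if (ret != 0)
        return ret;

    ncnn::cast_float32_to_bfloat16(top_blob_fp32, top_blob, opt);
    return 0;
}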
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1144d5950f
if (prefer_winograd43 && (!opt.use_winograd43_convolution || weight_winograd43_data.empty()))
{
    prefer_winograd43 = false;
    if (opt.use_winograd63_convolution && !weight_winograd63_data.empty())
    {
        prefer_winograd63 = true;
    }
    else
    {
        prefer_winograd23 = true;
    }
}
...
else if (opt.use_sgemm_convolution)
{
    convolution_im2col_sgemm_pack4_lsx(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, opt);

    if (activation)
    {
        activation->forward_inplace(top_blob, opt);
    }
}
Re-check Winograd23 availability after 43 fallback
When the preferred Winograd43 path is unavailable (weight_winograd43_data.empty()), this branch unconditionally sets prefer_winograd23 = true without re-validating opt.use_winograd23_convolution or weight_winograd23_data. In rebuild/reconfigure flows (same layer recreated with different Winograd flags or transformed-weight state), this can route execution into Winograd23 with disabled or stale/empty kernels, producing incorrect results or runtime failures instead of falling back to non-Winograd convolution.
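One possible shape for the guarded fallback this finding asks for (a sketch, not the PR's code): only pick Winograd23 when it is enabled and its transformed weights exist, otherwise leave all prefer_winograd* flags unset so a non-Winograd path is chosen.

if (opt.use_winograd63_convolution && !weight_winograd63_data.empty())
{
    prefer_winograd63 = true;
}
else if (opt.use_winograd23_convolution && !weight_winograd23_data.empty())
{
    prefer_winograd23 = true;
}
// else: fall back to im2col/sgemm or the direct kernel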
if (prefer_winograd43 && (!opt.use_winograd43_convolution || weight_winograd43_data.empty()))
{
    prefer_winograd43 = false;
    if (opt.use_winograd63_convolution && !weight_winograd63_data.empty())
    {
        prefer_winograd63 = true;
    }
    else
    {
        prefer_winograd23 = true;
    }
}
...
else if (opt.use_sgemm_convolution)
{
    convolution_im2col_sgemm_pack4_msa(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, opt);

    if (activation)
    {
        activation->forward_inplace(top_blob, opt);
    }
}
Guard Winograd23 fallback with option and kernel checks
This fallback from Winograd43 to Winograd23 also forces prefer_winograd23 = true without confirming that Winograd23 is enabled or that transformed 23-kernel data is valid. If Winograd43 data is missing (for example after pipeline rebuilds with different mode selection), the forward path can execute an unsupported/stale Winograd23 configuration and return wrong outputs rather than safely choosing a supported non-Winograd kernel.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 52cb17fd87
if (opt.use_winograd63_convolution && !weight_winograd63_data.empty())
    prefer_winograd63 = true;
else
    prefer_winograd23 = true;
Revalidate Winograd23 fallback in LoongArch bf16 forward
When Winograd43 is unavailable, this branch unconditionally sets prefer_winograd23 = true without checking opt.use_winograd23_convolution or weight_winograd23_data. In configurations/rebuild flows where 23 is disabled or its transformed weights are stale/empty, forward can dispatch conv3x3s1_winograd23_bf16s with invalid state instead of falling back to a supported non-Winograd path, leading to incorrect algorithm selection or runtime failure.
if (opt.use_winograd63_convolution && !weight_winograd63_data.empty())
    prefer_winograd63 = true;
else
    prefer_winograd23 = true;
Revalidate Winograd23 fallback in MIPS bf16 forward
This fallback also forces prefer_winograd23 = true whenever Winograd43 is missing, but it does not confirm that Winograd23 is enabled or that weight_winograd23_data is populated. If a layer is rebuilt with different winograd flags (or only 63 is viable), bf16 forward may enter conv3x3s1_winograd23_bf16s with unsupported/empty kernel data rather than safely choosing a non-Winograd kernel.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1065cdac90
static NCNN_FORCEINLINE v4i32 __msa_fill_d_ptr(const void* ptr)
{
#if __mips64
    return (v4i32)__msa_fill_d(*(const int64_t*)ptr);
Use unaligned-safe loads in __msa_fill_d_ptr
__msa_fill_d_ptr() directly dereferences ptr as int64_t*/int*, but several new kernels call it with non-guaranteed alignment (for example pA + 4 on short* tails in convolution_3x3_winograd_int8.h). On MIPS this can raise address-error exceptions (SIGBUS) and is also undefined behavior in C++ for misaligned typed dereferences, so valid model shapes can crash in optimized MSA paths; this helper should use memcpy/unaligned-safe loads like __msa_loadl_d instead of typed pointer loads.
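A sketch of the memcpy-based rewrite the finding suggests, keeping the original helper's shape (assumes <msa.h> and ncnn's NCNN_FORCEINLINE macro as in the existing code; the 32-bit branch rebuilding the 64-bit pair from two word loads is illustrative):

#include <msa.h>
#include <stdint.h>
#include <string.h>

static NCNN_FORCEINLINE v4i32 __msa_fill_d_ptr_safe(const void* ptr)
{
#if __mips64
    int64_t v;
    memcpy(&v, ptr, sizeof(v)); // unaligned-tolerant load instead of a typed dereference
    return (v4i32)__msa_fill_d(v);
#else
    int lo;
    int hi;
    memcpy(&lo, ptr, sizeof(lo));
    memcpy(&hi, (const unsigned char*)ptr + 4, sizeof(hi));
    v4i32 x = __msa_fill_w(lo);   // broadcast the low word to all lanes
    x = __msa_insert_w(x, 1, hi); // restore the 64-bit pair in lanes 0..1
    x = __msa_insert_w(x, 3, hi); // and duplicate it in lanes 2..3
    return x;
#endif
}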
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3df1f00085
if(NCNN_RUNTIME_CPU AND NCNN_MMI)
    ncnn_add_arch_opt_source(${class} mmi "-mloongson-mmi")
Compile MMI sources when runtime dispatch is disabled
The new guard if(NCNN_RUNTIME_CPU AND NCNN_MMI) drops all -mloongson-mmi optimized sources in static MIPS builds where NCNN_RUNTIME_CPU=OFF but NCNN_MMI=ON, so those targets silently lose MMI kernels and fall back to generic implementations. This is a regression in non-runtime-CPU configurations because MMI codegen should still be compiled whenever NCNN_MMI is enabled, regardless of runtime dispatch.
3a4000 loongnix-20.rc2
3a6000 loongnix-20