massive mips and loongarch optimization #6662

Open
nihui wants to merge 158 commits into Tencent:master from nihui:mips-opt3

Conversation

@nihui
Member

@nihui nihui commented Apr 9, 2026

No description provided.

@tencent-adm
Member

CLA assistant check
Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@codecov-commenter

codecov-commenter commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 98.04975% with 127 lines in your changes missing coverage. Please review.
✅ Project coverage is 94.50%. Comparing base (10cee2a) to head (a809ad8).
⚠️ Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
src/layer/loongarch/convolution_loongarch.cpp 75.57% 107 Missing ⚠️
src/layer/loongarch/binaryop_loongarch.cpp 97.90% 11 Missing ⚠️
src/layer/loongarch/convolution_packed_bf16s.h 99.75% 3 Missing ⚠️
src/layer/loongarch/convolution_packed_int8.h 98.88% 3 Missing ⚠️
src/layer/loongarch/convolution1d_loongarch.cpp 95.00% 2 Missing ⚠️
src/layer/loongarch/convolution_packed.h 99.88% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6662      +/-   ##
==========================================
+ Coverage   93.92%   94.50%   +0.57%     
==========================================
  Files         933      966      +33     
  Lines      300879   388816   +87937     
==========================================
+ Hits       282599   367437   +84838     
- Misses      18280    21379    +3099     

☔ View full report in Codecov by Sentry.

nihui and others added 11 commits April 10, 2026 07:10
Add jj+=12 loop unrolling to pack_B_tile, transpose_pack_B_tile,
transpose_unpack_output_tile, and gemm_transB_packed_tile for all
ii sections (8, 4, 2, 1). MIPS MSA has 32 SIMD registers so
jj+=12 fits well (24 registers for ii+=8, 12 for ii+=4).

Update get_optimal_tile_mnk to align TILE_N to multiples of 12
for better utilization of the new kernel.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
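A minimal sketch of the TILE_N alignment this commit describes, with a hypothetical helper name (ncnn's real logic lives in get_optimal_tile_mnk):

// Illustrative only: round the column tile up to a multiple of 12 so the
// new jj+=12 kernels are fully utilized; the real get_optimal_tile_mnk
// also balances tile sizes against the cache budget.
static inline int align_tile_n_to_12(int TILE_N)
{
    return (TILE_N + 11) / 12 * 12;
}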
…ngArch

Integrate bf16 storage support into multiple operators:

MIPS: batchnorm, clip, dropout, selu, erf
LoongArch: batchnorm, clip, dropout

Each operator now declares forward_inplace_bf16s in its header,
sets support_bf16_storage=true in the constructor, dispatches bf16
inputs from forward_inplace, and implements the bf16s path using
the existing bf16s helper headers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
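A hedged sketch of that dispatch pattern, assuming ncnn's usual Mat/Option types (forward_inplace_fp32 is a placeholder for the existing float path):

// Sketch only: how an operator routes bf16 input to its bf16s path.
int forward_inplace(ncnn::Mat& bottom_top_blob, const ncnn::Option& opt) const
{
    int elembits = bottom_top_blob.elembits();

#if NCNN_BF16
    if (opt.use_bf16_storage && elembits == 16)
        return forward_inplace_bf16s(bottom_top_blob, opt);
#endif

    return forward_inplace_fp32(bottom_top_blob, opt); // existing fp32 path
}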
- Add support_bf16_storage = true in constructors for both architectures
- Add crop_pack4_bf16s_msa() for MIPS MSA using int64_t copies (8 bytes)
- Add crop_pack4_bf16s_lsx() for LoongArch LSX using int64_t copies
- Add crop_pack8_lasx() for LoongArch LASX float pack8 (256-bit)
- Add crop_pack8_bf16s_lsx() for LoongArch bf16 pack8 (8 x 16-bit = 128-bit, so LSX suffices)
- Dispatch to bf16 variants when elemsize matches bf16 packing
- Remove debug fprintf statements from MIPS deconvolution_packed_bf16s.h

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
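The int64_t trick works because one pack4 bf16 element is 4 x 16 bits = 64 bits; a hedged sketch of the inner copy loop (the real functions are crop_pack4_bf16s_msa/_lsx, this body is illustrative):

#include <stdint.h>

// Copy w pack4 bf16 elements, 8 bytes at a time.
static void crop_pack4_bf16s_row(const unsigned short* src, unsigned short* dst, int w)
{
    const int64_t* p = (const int64_t*)src;
    int64_t* q = (int64_t*)dst;
    for (int x = 0; x < w; x++)
    {
        *q++ = *p++; // 4 bf16 values per copy
    }
}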
Add interp_bilinear_pack8.h and interp_bicubic_pack8.h implementing
256-bit SIMD (8 floats) resize operations using LASX intrinsics.

Update interp_loongarch.cpp to:
- Include lasxintrin.h and the new pack8 headers under __loongarch_asx
- Add elempack == 8 paths for dims 1, 2, and 3 (nearest, bilinear, bicubic)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
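For reference, the pack8 kernels reduce to 8-lane fused lerps; a hedged sketch with standard LASX intrinsics (the gather and weight setup around it are omitted):

#include <lasxintrin.h>

// v = a + (b - a) * t on 8 floats (one pack8 element).
static inline void bilinear_lerp_pack8(const float* p0, const float* p1, float tx, float* out)
{
    __m256 _a = (__m256)__lasx_xvld(p0, 0);
    __m256 _b = (__m256)__lasx_xvld(p1, 0);
    __m256 _t = (__m256)__lasx_xvldrepl_w(&tx, 0); // broadcast weight to all lanes
    __m256 _v = __lasx_xvfmadd_s(__lasx_xvfsub_s(_b, _a), _t, _a);
    __lasx_xvst((__m256i)_v, out, 0);
}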
… approach

- Replace hand-written kernel packing and convolution loops with
  convolution1d_transform_kernel_packed() and convolution1d_packed()
  from convolution1d_packed.h
- Rename weight_data_packed to weight_data_tm to match x86 pattern
- Add LASX (256-bit) support with pack8 out_elempack
- Add NCNN_BF16 support using cast-based approach (bf16->fp32->conv->bf16)
- Add bf16 weight/bias cast in dynamic weight forward path
- Include cpu.h, lasxintrin.h headers for new functionality

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
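A hedged sketch of the cast-based bf16 path, using ncnn's cast_bfloat16_to_float32 / cast_float32_to_bfloat16 helpers (the real call sites pass the full convolution parameter set):

// bf16 -> fp32 -> convolution -> bf16 fallback, error handling omitted.
ncnn::Mat bottom_blob_fp32;
ncnn::cast_bfloat16_to_float32(bottom_blob, bottom_blob_fp32, opt);

ncnn::Mat top_blob_fp32;
convolution1d_packed(bottom_blob_fp32, top_blob_fp32, weight_data_tm, bias_data,
                     kernel_w, dilation_w, stride_w, activation_type, activation_params, opt);

ncnn::cast_float32_to_bfloat16(top_blob_fp32, top_blob, opt);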
@nihui nihui closed this May 4, 2026
@nihui nihui reopened this May 4, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1144d5950f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +569 to 579
if (prefer_winograd43 && (!opt.use_winograd43_convolution || weight_winograd43_data.empty()))
{
    prefer_winograd43 = false;
    if (opt.use_winograd63_convolution && !weight_winograd63_data.empty())
        prefer_winograd63 = true;
    else
        prefer_winograd23 = true;
}

P2: Re-check Winograd23 availability after 43 fallback

When the preferred Winograd43 path is unavailable (weight_winograd43_data.empty()), this branch unconditionally sets prefer_winograd23 = true without re-validating opt.use_winograd23_convolution or weight_winograd23_data. In rebuild/reconfigure flows (same layer recreated with different Winograd flags or transformed-weight state), this can route execution into Winograd23 with disabled or stale/empty kernels, producing incorrect results or runtime failures instead of falling back to non-Winograd convolution.
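A hedged sketch of the guarded fallback the review is asking for, applicable to both the LSX and MSA variants (illustrative, mirroring the surrounding ncnn naming):

if (prefer_winograd43 && (!opt.use_winograd43_convolution || weight_winograd43_data.empty()))
{
    prefer_winograd43 = false;
    if (opt.use_winograd63_convolution && !weight_winograd63_data.empty())
        prefer_winograd63 = true;
    else if (opt.use_winograd23_convolution && !weight_winograd23_data.empty())
        prefer_winograd23 = true;
    // else: leave every prefer_* flag false and take the non-winograd kernel
}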


Comment on lines +503 to 513
if (prefer_winograd43 && (!opt.use_winograd43_convolution || weight_winograd43_data.empty()))
{
    prefer_winograd43 = false;
    if (opt.use_winograd63_convolution && !weight_winograd63_data.empty())
        prefer_winograd63 = true;
    else
        prefer_winograd23 = true;
}

P2: Guard Winograd23 fallback with option and kernel checks

This fallback from Winograd43 to Winograd23 also forces prefer_winograd23 = true without confirming that Winograd23 is enabled or that transformed 23-kernel data is valid. If Winograd43 data is missing (for example after pipeline rebuilds with different mode selection), the forward path can execute an unsupported/stale Winograd23 configuration and return wrong outputs rather than safely choosing a supported non-Winograd kernel.


@nihui nihui closed this May 4, 2026
@nihui nihui reopened this May 4, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 52cb17fd87


if (opt.use_winograd63_convolution && !weight_winograd63_data.empty())
    prefer_winograd63 = true;
else
    prefer_winograd23 = true;

P2: Revalidate Winograd23 fallback in LoongArch bf16 forward

When Winograd43 is unavailable, this branch unconditionally sets prefer_winograd23 = true without checking opt.use_winograd23_convolution or weight_winograd23_data. In configurations/rebuild flows where 23 is disabled or its transformed weights are stale/empty, forward can dispatch conv3x3s1_winograd23_bf16s with invalid state instead of falling back to a supported non-Winograd path, leading to incorrect algorithm selection or runtime failure.


if (opt.use_winograd63_convolution && !weight_winograd63_data.empty())
    prefer_winograd63 = true;
else
    prefer_winograd23 = true;

P2: Revalidate Winograd23 fallback in MIPS bf16 forward

This fallback also forces prefer_winograd23 = true whenever Winograd43 is missing, but it does not confirm that Winograd23 is enabled or that weight_winograd23_data is populated. If a layer is rebuilt with different winograd flags (or only 63 is viable), bf16 forward may enter conv3x3s1_winograd23_bf16s with unsupported/empty kernel data rather than safely choosing a non-Winograd kernel.



@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1065cdac90


static NCNN_FORCEINLINE v4i32 __msa_fill_d_ptr(const void* ptr)
{
#if __mips64
    return (v4i32)__msa_fill_d(*(const int64_t*)ptr);

P1: Use unaligned-safe loads in __msa_fill_d_ptr

__msa_fill_d_ptr() directly dereferences ptr as int64_t*/int*, but several new kernels call it with non-guaranteed alignment (for example pA + 4 on short* tails in convolution_3x3_winograd_int8.h). On MIPS this can raise address-error exceptions (SIGBUS) and is also undefined behavior in C++ for misaligned typed dereferences, so valid model shapes can crash in optimized MSA paths; this helper should use memcpy/unaligned-safe loads like __msa_loadl_d instead of typed pointer loads.
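A hedged sketch of the memcpy-based fix the review suggests (hypothetical standalone name; the 32-bit branch of the real helper would need the same treatment):

#include <stdint.h>
#include <string.h>
#include <msa.h>

// memcpy into a local lets the compiler emit an unaligned-safe load
// instead of a typed (alignment-assuming) int64_t dereference.
static inline v4i32 msa_fill_d_ptr_safe(const void* ptr)
{
    int64_t v;
    memcpy(&v, ptr, sizeof(v));
    return (v4i32)__msa_fill_d(v);
}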


@nihui nihui closed this May 5, 2026
@nihui nihui reopened this May 5, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3df1f00085


Comment on lines +404 to 405
if(NCNN_RUNTIME_CPU AND NCNN_MMI)
    ncnn_add_arch_opt_source(${class} mmi "-mloongson-mmi")

P2: Compile MMI sources when runtime dispatch is disabled

The new guard if(NCNN_RUNTIME_CPU AND NCNN_MMI) drops all -mloongson-mmi optimized sources in static MIPS builds where NCNN_RUNTIME_CPU=OFF but NCNN_MMI=ON, so those targets silently lose MMI kernels and fall back to generic implementations. This is a regression in non-runtime-CPU configurations because MMI codegen should still be compiled whenever NCNN_MMI is enabled, regardless of runtime dispatch.
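A hedged CMake sketch of the shape of the fix (illustrative; the real wiring lives in ncnn's cmake helpers, and add_compile_options stands in for however the static build applies arch flags):

if(NCNN_RUNTIME_CPU AND NCNN_MMI)
    ncnn_add_arch_opt_source(${class} mmi "-mloongson-mmi")
elseif(NCNN_MMI)
    # Static build without runtime dispatch: still compile with MMI enabled
    # so the optimized kernels are emitted into the base sources.
    add_compile_options(-mloongson-mmi)
endif()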


@nihui
Member Author

nihui commented May 5, 2026

3a4000 loongnix-20.rc2
4.19.0-12-loongson-3
gcc 8.3.0

1 thread (times in ms) baseline pr6662 pr6662-bf16s
squeezenet 50.39 43.98 47.52
squeezenet_int8 72.41 31.37 32.14
mobilenet 88.65 75.43 88.75
mobilenet_int8 167.50 94.52 94.60
mobilenet_v2 56.49 54.21 63.59
mobilenet_v3 49.06 44.91 49.64
shufflenet 32.81 28.13 36.03
shufflenet_v2 29.72 26.85 50.14
mnasnet 60.99 57.07 61.67
proxylessnasnet 73.91 70.10 67.42
efficientnet_b0 114.01 108.94 106.67
efficientnetv2_b0 121.99 110.28 118.51
regnety_400m 80.49 71.92 70.59
blazeface 11.11 8.19 11.52
googlenet 225.69 155.35 160.96
googlenet_int8 294.52 109.17 108.04
resnet18 148.09 127.78 136.44
resnet18_int8 211.71 86.76 87.54
alexnet 190.86 90.68 90.29
vgg16 789.75 629.14 619.40
vgg16_int8 1079.66 507.15 501.72
resnet50 425.48 349.52 382.98
resnet50_int8 581.01 223.40 224.05
squeezenet_ssd 130.99 101.96 109.44
squeezenet_ssd_int8 156.03 77.19 80.61
mobilenet_ssd 180.59 152.79 178.19
mobilenet_ssd_int8 324.44 177.15 178.70
mobilenet_yolo 439.16 350.95 429.48
mobilenetv2_yolov3 205.10 188.08 215.03
yolov4-tiny 264.84 227.00 238.70
nanodet_m 68.40 63.08 114.33
yolo-fastest-1.1 27.54 30.24 39.48
yolo-fastestv2 26.70 34.34 37.26
vision_transformer 15111.31 3310.77 3110.80
FastestDet 30.63 38.17 40.04
4 threads (times in ms) baseline pr6662 pr6662-bf16s
squeezenet 14.87 13.73 13.84
squeezenet_int8 21.31 12.56 12.75
mobilenet 25.42 21.07 22.92
mobilenet_int8 42.85 26.43 26.80
mobilenet_v2 17.03 16.58 17.68
mobilenet_v3 14.99 14.21 15.89
shufflenet 11.82 10.15 12.98
shufflenet_v2 10.70 9.17 17.12
mnasnet 17.39 17.18 18.06
proxylessnasnet 20.82 20.04 18.94
efficientnet_b0 32.34 31.45 29.96
efficientnetv2_b0 35.44 33.65 34.72
regnety_400m 36.17 27.22 31.86
blazeface 3.94 2.80 3.79
googlenet 65.40 47.53 46.28
googlenet_int8 79.55 36.92 36.94
resnet18 44.34 40.81 41.21
resnet18_int8 56.32 27.14 27.66
alexnet 53.63 27.65 31.35
vgg16 258.47 217.32 213.38
vgg16_int8 293.93 168.97 167.02
resnet50 124.35 103.89 106.14
resnet50_int8 154.59 70.36 70.74
squeezenet_ssd 47.87 35.84 36.03
squeezenet_ssd_int8 49.50 31.48 31.70
mobilenet_ssd 53.55 44.46 46.67
mobilenet_ssd_int8 83.69 49.60 50.67
mobilenet_yolo 159.68 110.33 140.33
mobilenetv2_yolov3 66.82 61.07 62.33
yolov4-tiny 94.52 81.87 79.42
nanodet_m 23.56 21.38 38.06
yolo-fastest-1.1 10.15 12.44 17.80
yolo-fastestv2 10.65 15.58 16.03
vision_transformer 3950.69 930.14 936.21
FastestDet 11.37 16.08 16.13

@nihui
Member Author

nihui commented May 5, 2026

3a6000 loongnix-20
4.19.0-19-loongson-3
gcc 8.3.0

1 thread (times in ms) baseline pr6662 pr6662-bf16s
squeezenet 21.25 13.95 13.04
squeezenet_int8 35.65 14.07 13.57
mobilenet 37.77 23.09 27.14
mobilenet_int8 75.81 25.75 26.37
mobilenet_v2 25.06 17.41 18.97
mobilenet_v3 19.97 13.61 16.68
shufflenet 12.67 9.11 10.26
shufflenet_v2 12.24 9.90 14.48
mnasnet 25.07 17.13 19.59
proxylessnasnet 30.95 20.29 22.43
efficientnet_b0 49.33 33.70 36.41
efficientnetv2_b0 55.41 35.45 38.92
regnety_400m 33.78 21.25 22.97
blazeface 5.34 3.06 3.05
googlenet 87.11 51.60 48.50
googlenet_int8 133.64 49.07 48.45
resnet18 68.85 46.54 41.45
resnet18_int8 114.22 40.68 40.48
alexnet 96.10 28.97 30.13
vgg16 360.85 202.68 188.24
vgg16_int8 631.80 215.56 215.99
resnet50 187.97 113.83 117.90
resnet50_int8 295.43 97.92 98.30
squeezenet_ssd 62.01 37.85 37.05
squeezenet_ssd_int8 81.80 37.15 37.27
mobilenet_ssd 75.75 48.12 56.19
mobilenet_ssd_int8 147.48 50.52 52.51
mobilenet_yolo 197.19 109.40 136.56
mobilenetv2_yolov3 86.50 60.73 66.07
yolov4-tiny 117.88 76.88 71.40
nanodet_m 28.51 22.75 32.76
yolo-fastest-1.1 11.88 9.27 13.16
yolo-fastestv2 10.55 11.84 10.32
vision_transformer 4582.75 1131.90 1044.43
FastestDet 11.81 13.16 11.67
4 threads (times in ms) baseline pr6662 pr6662-bf16s
squeezenet 7.81 5.30 4.51
squeezenet_int8 10.22 5.90 5.24
mobilenet 12.95 7.45 7.22
mobilenet_int8 19.32 7.81 7.87
mobilenet_v2 8.57 6.65 5.61
mobilenet_v3 6.71 5.57 5.79
shufflenet 4.62 4.25 4.37
shufflenet_v2 4.40 4.56 5.67
mnasnet 7.72 5.73 6.00
proxylessnasnet 9.19 6.42 6.58
efficientnet_b0 14.82 11.66 10.63
efficientnetv2_b0 18.47 12.99 12.73
regnety_400m 14.93 11.38 12.31
blazeface 1.84 1.19 1.16
googlenet 28.37 20.03 17.86
googlenet_int8 36.93 18.92 17.77
resnet18 25.74 20.70 19.03
resnet18_int8 32.41 15.51 15.19
alexnet 31.40 14.62 14.01
vgg16 154.29 116.29 101.91
vgg16_int8 189.38 88.30 84.73
resnet50 66.72 43.77 40.29
resnet50_int8 78.62 37.57 33.78
squeezenet_ssd 28.95 19.96 17.67
squeezenet_ssd_int8 28.73 19.23 16.18
mobilenet_ssd 28.96 17.07 15.94
mobilenet_ssd_int8 38.48 16.18 17.90
mobilenet_yolo 96.66 39.68 49.54
mobilenetv2_yolov3 35.14 25.44 21.96
yolov4-tiny 56.07 37.11 32.84
nanodet_m 10.74 10.29 12.38
yolo-fastest-1.1 4.57 4.72 6.42
yolo-fastestv2 4.33 6.10 5.43
vision_transformer 1217.27 369.71 295.17
FastestDet 4.73 6.19 5.50
