massive mips and loongarch optimization #6662

Open
nihui wants to merge 158 commits into Tencent:master from nihui:mips-opt3

Conversation

@nihui
Member

@nihui nihui commented Apr 9, 2026

No description provided.

@tencent-adm
Member

CLA assistant check
Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@codecov-commenter

codecov-commenter commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 98.04975% with 127 lines in your changes missing coverage. Please review.
✅ Project coverage is 94.50%. Comparing base (10cee2a) to head (a809ad8).
⚠️ Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
src/layer/loongarch/convolution_loongarch.cpp 75.57% 107 Missing ⚠️
src/layer/loongarch/binaryop_loongarch.cpp 97.90% 11 Missing ⚠️
src/layer/loongarch/convolution_packed_bf16s.h 99.75% 3 Missing ⚠️
src/layer/loongarch/convolution_packed_int8.h 98.88% 3 Missing ⚠️
src/layer/loongarch/convolution1d_loongarch.cpp 95.00% 2 Missing ⚠️
src/layer/loongarch/convolution_packed.h 99.88% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6662      +/-   ##
==========================================
+ Coverage   93.92%   94.50%   +0.57%     
==========================================
  Files         933      966      +33     
  Lines      300879   388816   +87937     
==========================================
+ Hits       282599   367437   +84838     
- Misses      18280    21379    +3099     

☔ View full report in Codecov by Sentry.

nihui and others added 11 commits April 10, 2026 07:10
Add jj+=12 loop unrolling to pack_B_tile, transpose_pack_B_tile,
transpose_unpack_output_tile, and gemm_transB_packed_tile for all
ii sections (8, 4, 2, 1). MIPS MSA has 32 SIMD registers so
jj+=12 fits well (24 registers for ii+=8, 12 for ii+=4).

Update get_optimal_tile_mnk to align TILE_N to multiples of 12
for better utilization of the new kernel.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
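A minimal sketch of the TILE_N alignment this commit describes, with a hypothetical helper name (ncnn's real logic lives in get_optimal_tile_mnk):

// Illustrative only: round the column tile up to a multiple of 12 so the
// new jj+=12 kernels are fully utilized; the real get_optimal_tile_mnk
// also balances tile sizes against the cache budget.
static inline int align_tile_n_to_12(int TILE_N)
{
    return (TILE_N + 11) / 12 * 12;
}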
…ngArch

Integrate bf16 storage support into multiple operators:

MIPS: batchnorm, clip, dropout, selu, erf
LoongArch: batchnorm, clip, dropout

Each operator now declares forward_inplace_bf16s in its header,
sets support_bf16_storage=true in the constructor, dispatches bf16
inputs from forward_inplace, and implements the bf16s path using
the existing bf16s helper headers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
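A hedged sketch of that dispatch pattern, assuming ncnn's usual Mat/Option types (forward_inplace_fp32 is a placeholder for the existing float path):

// Sketch only: how an operator routes bf16 input to its bf16s path.
int forward_inplace(ncnn::Mat& bottom_top_blob, const ncnn::Option& opt) const
{
    int elembits = bottom_top_blob.elembits();

#if NCNN_BF16
    if (opt.use_bf16_storage && elembits == 16)
        return forward_inplace_bf16s(bottom_top_blob, opt);
#endif

    return forward_inplace_fp32(bottom_top_blob, opt); // existing fp32 path
}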
- Add support_bf16_storage = true in constructors for both architectures
- Add crop_pack4_bf16s_msa() for MIPS MSA using int64_t copies (8 bytes)
- Add crop_pack4_bf16s_lsx() for LoongArch LSX using int64_t copies
- Add crop_pack8_lasx() for LoongArch LASX float pack8 (256-bit)
- Add crop_pack8_bf16s_lsx() for LoongArch bf16 pack8 (8 x 16-bit = 128-bit, so LSX suffices)
- Dispatch to bf16 variants when elemsize matches bf16 packing
- Remove debug fprintf statements from MIPS deconvolution_packed_bf16s.h

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
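The int64_t trick works because one pack4 bf16 element is 4 x 16 bits = 64 bits; a hedged sketch of the inner copy loop (the real functions are crop_pack4_bf16s_msa/_lsx, this body is illustrative):

#include <stdint.h>

// Copy w pack4 bf16 elements, 8 bytes at a time.
static void crop_pack4_bf16s_row(const unsigned short* src, unsigned short* dst, int w)
{
    const int64_t* p = (const int64_t*)src;
    int64_t* q = (int64_t*)dst;
    for (int x = 0; x < w; x++)
    {
        *q++ = *p++; // 4 bf16 values per copy
    }
}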
Add interp_bilinear_pack8.h and interp_bicubic_pack8.h implementing
256-bit SIMD (8 floats) resize operations using LASX intrinsics.

Update interp_loongarch.cpp to:
- Include lasxintrin.h and the new pack8 headers under __loongarch_asx
- Add elempack == 8 paths for dims 1, 2, and 3 (nearest, bilinear, bicubic)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
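For reference, the pack8 kernels reduce to 8-lane fused lerps; a hedged sketch with standard LASX intrinsics (the gather and weight setup around it are omitted):

#include <lasxintrin.h>

// v = a + (b - a) * t on 8 floats (one pack8 element).
static inline void bilinear_lerp_pack8(const float* p0, const float* p1, float tx, float* out)
{
    __m256 _a = (__m256)__lasx_xvld(p0, 0);
    __m256 _b = (__m256)__lasx_xvld(p1, 0);
    __m256 _t = (__m256)__lasx_xvldrepl_w(&tx, 0); // broadcast weight to all lanes
    __m256 _v = __lasx_xvfmadd_s(__lasx_xvfsub_s(_b, _a), _t, _a);
    __lasx_xvst((__m256i)_v, out, 0);
}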
… approach

- Replace hand-written kernel packing and convolution loops with
  convolution1d_transform_kernel_packed() and convolution1d_packed()
  from convolution1d_packed.h
- Rename weight_data_packed to weight_data_tm to match x86 pattern
- Add LASX (256-bit) support with pack8 out_elempack
- Add NCNN_BF16 support using cast-based approach (bf16->fp32->conv->bf16)
- Add bf16 weight/bias cast in dynamic weight forward path
- Include cpu.h, lasxintrin.h headers for new functionality

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
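A hedged sketch of the cast-based bf16 path, using ncnn's cast_bfloat16_to_float32 / cast_float32_to_bfloat16 helpers (the real call sites pass the full convolution parameter set):

// bf16 -> fp32 -> convolution -> bf16 fallback, error handling omitted.
ncnn::Mat bottom_blob_fp32;
ncnn::cast_bfloat16_to_float32(bottom_blob, bottom_blob_fp32, opt);

ncnn::Mat top_blob_fp32;
convolution1d_packed(bottom_blob_fp32, top_blob_fp32, weight_data_tm, bias_data,
                     kernel_w, dilation_w, stride_w, activation_type, activation_params, opt);

ncnn::cast_float32_to_bfloat16(top_blob_fp32, top_blob, opt);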
@nihui nihui closed this May 4, 2026
@nihui nihui reopened this May 4, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1144d5950f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +569 to 579
if (prefer_winograd43 && (!opt.use_winograd43_convolution || weight_winograd43_data.empty()))
{
    prefer_winograd43 = false;
    if (opt.use_winograd63_convolution && !weight_winograd63_data.empty())
        prefer_winograd63 = true;
    else
        prefer_winograd23 = true;
}

P2: Re-check Winograd23 availability after 43 fallback

When the preferred Winograd43 path is unavailable (weight_winograd43_data.empty()), this branch unconditionally sets prefer_winograd23 = true without re-validating opt.use_winograd23_convolution or weight_winograd23_data. In rebuild/reconfigure flows (same layer recreated with different Winograd flags or transformed-weight state), this can route execution into Winograd23 with disabled or stale/empty kernels, producing incorrect results or runtime failures instead of falling back to non-Winograd convolution.
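A hedged sketch of the guarded fallback the review is asking for, applicable to both the LSX and MSA variants (illustrative, mirroring the surrounding ncnn naming):

if (prefer_winograd43 && (!opt.use_winograd43_convolution || weight_winograd43_data.empty()))
{
    prefer_winograd43 = false;
    if (opt.use_winograd63_convolution && !weight_winograd63_data.empty())
        prefer_winograd63 = true;
    else if (opt.use_winograd23_convolution && !weight_winograd23_data.empty())
        prefer_winograd23 = true;
    // else: leave every prefer_* flag false and take the non-winograd kernel
}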


Comment on lines +503 to 513
if (prefer_winograd43 && (!opt.use_winograd43_convolution || weight_winograd43_data.empty()))
{
    prefer_winograd43 = false;
    if (opt.use_winograd63_convolution && !weight_winograd63_data.empty())
        prefer_winograd63 = true;
    else
        prefer_winograd23 = true;
}

P2: Guard Winograd23 fallback with option and kernel checks

This fallback from Winograd43 to Winograd23 also forces prefer_winograd23 = true without confirming that Winograd23 is enabled or that transformed 23-kernel data is valid. If Winograd43 data is missing (for example after pipeline rebuilds with different mode selection), the forward path can execute an unsupported/stale Winograd23 configuration and return wrong outputs rather than safely choosing a supported non-Winograd kernel.


@nihui nihui closed this May 4, 2026
@nihui nihui reopened this May 4, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 52cb17fd87


if (opt.use_winograd63_convolution && !weight_winograd63_data.empty())
    prefer_winograd63 = true;
else
    prefer_winograd23 = true;

P2: Revalidate Winograd23 fallback in LoongArch bf16 forward

When Winograd43 is unavailable, this branch unconditionally sets prefer_winograd23 = true without checking opt.use_winograd23_convolution or weight_winograd23_data. In configurations/rebuild flows where 23 is disabled or its transformed weights are stale/empty, forward can dispatch conv3x3s1_winograd23_bf16s with invalid state instead of falling back to a supported non-Winograd path, leading to incorrect algorithm selection or runtime failure.


if (opt.use_winograd63_convolution && !weight_winograd63_data.empty())
    prefer_winograd63 = true;
else
    prefer_winograd23 = true;

P2: Revalidate Winograd23 fallback in MIPS bf16 forward

This fallback also forces prefer_winograd23 = true whenever Winograd43 is missing, but it does not confirm that Winograd23 is enabled or that weight_winograd23_data is populated. If a layer is rebuilt with different winograd flags (or only 63 is viable), bf16 forward may enter conv3x3s1_winograd23_bf16s with unsupported/empty kernel data rather than safely choosing a non-Winograd kernel.



@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1065cdac90


static NCNN_FORCEINLINE v4i32 __msa_fill_d_ptr(const void* ptr)
{
#if __mips64
    return (v4i32)__msa_fill_d(*(const int64_t*)ptr);

P1: Use unaligned-safe loads in __msa_fill_d_ptr

__msa_fill_d_ptr() directly dereferences ptr as int64_t*/int*, but several new kernels call it with non-guaranteed alignment (for example pA + 4 on short* tails in convolution_3x3_winograd_int8.h). On MIPS this can raise address-error exceptions (SIGBUS) and is also undefined behavior in C++ for misaligned typed dereferences, so valid model shapes can crash in optimized MSA paths; this helper should use memcpy/unaligned-safe loads like __msa_loadl_d instead of typed pointer loads.
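A hedged sketch of the memcpy-based fix the review suggests (hypothetical standalone name; the 32-bit branch of the real helper would need the same treatment):

#include <stdint.h>
#include <string.h>
#include <msa.h>

// memcpy into a local lets the compiler emit an unaligned-safe load
// instead of a typed (alignment-assuming) int64_t dereference.
static inline v4i32 msa_fill_d_ptr_safe(const void* ptr)
{
    int64_t v;
    memcpy(&v, ptr, sizeof(v));
    return (v4i32)__msa_fill_d(v);
}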


@nihui nihui closed this May 5, 2026
@nihui nihui reopened this May 5, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3df1f00085


Comment on lines +404 to 405
if(NCNN_RUNTIME_CPU AND NCNN_MMI)
    ncnn_add_arch_opt_source(${class} mmi "-mloongson-mmi")

P2: Compile MMI sources when runtime dispatch is disabled

The new guard if(NCNN_RUNTIME_CPU AND NCNN_MMI) drops all -mloongson-mmi optimized sources in static MIPS builds where NCNN_RUNTIME_CPU=OFF but NCNN_MMI=ON, so those targets silently lose MMI kernels and fall back to generic implementations. This is a regression in non-runtime-CPU configurations because MMI codegen should still be compiled whenever NCNN_MMI is enabled, regardless of runtime dispatch.
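A hedged CMake sketch of the shape of the fix (illustrative; the real wiring lives in ncnn's cmake helpers, and add_compile_options stands in for however the static build applies arch flags):

if(NCNN_RUNTIME_CPU AND NCNN_MMI)
    ncnn_add_arch_opt_source(${class} mmi "-mloongson-mmi")
elseif(NCNN_MMI)
    # Static build without runtime dispatch: still compile with MMI enabled
    # so the optimized kernels are emitted into the base sources.
    add_compile_options(-mloongson-mmi)
endif()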


@nihui
Member Author

nihui commented May 5, 2026

3a4000 loongnix-20.rc2
4.19.0-12-loongson-3
gcc 8.3.0

1 thread (times in ms) baseline pr6662 pr6662-bf16s
squeezenet 50.39 43.98 47.52
squeezenet_int8 72.41 31.37 32.14
mobilenet 88.65 75.43 88.75
mobilenet_int8 167.50 94.52 94.60
mobilenet_v2 56.49 54.21 63.59
mobilenet_v3 49.06 44.91 49.64
shufflenet 32.81 28.13 36.03
shufflenet_v2 29.72 26.85 50.14
mnasnet 60.99 57.07 61.67
proxylessnasnet 73.91 70.10 67.42
efficientnet_b0 114.01 108.94 106.67
efficientnetv2_b0 121.99 110.28 118.51
regnety_400m 80.49 71.92 70.59
blazeface 11.11 8.19 11.52
googlenet 225.69 155.35 160.96
googlenet_int8 294.52 109.17 108.04
resnet18 148.09 127.78 136.44
resnet18_int8 211.71 86.76 87.54
alexnet 190.86 90.68 90.29
vgg16 789.75 629.14 619.40
vgg16_int8 1079.66 507.15 501.72
resnet50 425.48 349.52 382.98
resnet50_int8 581.01 223.40 224.05
squeezenet_ssd 130.99 101.96 109.44
squeezenet_ssd_int8 156.03 77.19 80.61
mobilenet_ssd 180.59 152.79 178.19
mobilenet_ssd_int8 324.44 177.15 178.70
mobilenet_yolo 439.16 350.95 429.48
mobilenetv2_yolov3 205.10 188.08 215.03
yolov4-tiny 264.84 227.00 238.70
nanodet_m 68.40 63.08 114.33
yolo-fastest-1.1 27.54 30.24 39.48
yolo-fastestv2 26.70 34.34 37.26
vision_transformer 15111.31 3310.77 3110.80
FastestDet 30.63 38.17 40.04
4 threads (times in ms) baseline pr6662 pr6662-bf16s
squeezenet 14.87 13.73 13.84
squeezenet_int8 21.31 12.56 12.75
mobilenet 25.42 21.07 22.92
mobilenet_int8 42.85 26.43 26.80
mobilenet_v2 17.03 16.58 17.68
mobilenet_v3 14.99 14.21 15.89
shufflenet 11.82 10.15 12.98
shufflenet_v2 10.70 9.17 17.12
mnasnet 17.39 17.18 18.06
proxylessnasnet 20.82 20.04 18.94
efficientnet_b0 32.34 31.45 29.96
efficientnetv2_b0 35.44 33.65 34.72
regnety_400m 36.17 27.22 31.86
blazeface 3.94 2.80 3.79
googlenet 65.40 47.53 46.28
googlenet_int8 79.55 36.92 36.94
resnet18 44.34 40.81 41.21
resnet18_int8 56.32 27.14 27.66
alexnet 53.63 27.65 31.35
vgg16 258.47 217.32 213.38
vgg16_int8 293.93 168.97 167.02
resnet50 124.35 103.89 106.14
resnet50_int8 154.59 70.36 70.74
squeezenet_ssd 47.87 35.84 36.03
squeezenet_ssd_int8 49.50 31.48 31.70
mobilenet_ssd 53.55 44.46 46.67
mobilenet_ssd_int8 83.69 49.60 50.67
mobilenet_yolo 159.68 110.33 140.33
mobilenetv2_yolov3 66.82 61.07 62.33
yolov4-tiny 94.52 81.87 79.42
nanodet_m 23.56 21.38 38.06
yolo-fastest-1.1 10.15 12.44 17.80
yolo-fastestv2 10.65 15.58 16.03
vision_transformer 3950.69 930.14 936.21
FastestDet 11.37 16.08 16.13

@nihui
Member Author

nihui commented May 5, 2026

3a6000 loongnix-20
4.19.0-19-loongson-3
gcc 8.3.0

1 thread (times in ms) baseline pr6662 pr6662-bf16s
squeezenet 21.25 13.95 13.04
squeezenet_int8 35.65 14.07 13.57
mobilenet 37.77 23.09 27.14
mobilenet_int8 75.81 25.75 26.37
mobilenet_v2 25.06 17.41 18.97
mobilenet_v3 19.97 13.61 16.68
shufflenet 12.67 9.11 10.26
shufflenet_v2 12.24 9.90 14.48
mnasnet 25.07 17.13 19.59
proxylessnasnet 30.95 20.29 22.43
efficientnet_b0 49.33 33.70 36.41
efficientnetv2_b0 55.41 35.45 38.92
regnety_400m 33.78 21.25 22.97
blazeface 5.34 3.06 3.05
googlenet 87.11 51.60 48.50
googlenet_int8 133.64 49.07 48.45
resnet18 68.85 46.54 41.45
resnet18_int8 114.22 40.68 40.48
alexnet 96.10 28.97 30.13
vgg16 360.85 202.68 188.24
vgg16_int8 631.80 215.56 215.99
resnet50 187.97 113.83 117.90
resnet50_int8 295.43 97.92 98.30
squeezenet_ssd 62.01 37.85 37.05
squeezenet_ssd_int8 81.80 37.15 37.27
mobilenet_ssd 75.75 48.12 56.19
mobilenet_ssd_int8 147.48 50.52 52.51
mobilenet_yolo 197.19 109.40 136.56
mobilenetv2_yolov3 86.50 60.73 66.07
yolov4-tiny 117.88 76.88 71.40
nanodet_m 28.51 22.75 32.76
yolo-fastest-1.1 11.88 9.27 13.16
yolo-fastestv2 10.55 11.84 10.32
vision_transformer 4582.75 1131.90 1044.43
FastestDet 11.81 13.16 11.67
4 threads (times in ms) baseline pr6662 pr6662-bf16s
squeezenet 7.81 5.30 4.51
squeezenet_int8 10.22 5.90 5.24
mobilenet 12.95 7.45 7.22
mobilenet_int8 19.32 7.81 7.87
mobilenet_v2 8.57 6.65 5.61
mobilenet_v3 6.71 5.57 5.79
shufflenet 4.62 4.25 4.37
shufflenet_v2 4.40 4.56 5.67
mnasnet 7.72 5.73 6.00
proxylessnasnet 9.19 6.42 6.58
efficientnet_b0 14.82 11.66 10.63
efficientnetv2_b0 18.47 12.99 12.73
regnety_400m 14.93 11.38 12.31
blazeface 1.84 1.19 1.16
googlenet 28.37 20.03 17.86
googlenet_int8 36.93 18.92 17.77
resnet18 25.74 20.70 19.03
resnet18_int8 32.41 15.51 15.19
alexnet 31.40 14.62 14.01
vgg16 154.29 116.29 101.91
vgg16_int8 189.38 88.30 84.73
resnet50 66.72 43.77 40.29
resnet50_int8 78.62 37.57 33.78
squeezenet_ssd 28.95 19.96 17.67
squeezenet_ssd_int8 28.73 19.23 16.18
mobilenet_ssd 28.96 17.07 15.94
mobilenet_ssd_int8 38.48 16.18 17.90
mobilenet_yolo 96.66 39.68 49.54
mobilenetv2_yolov3 35.14 25.44 21.96
yolov4-tiny 56.07 37.11 32.84
nanodet_m 10.74 10.29 12.38
yolo-fastest-1.1 4.57 4.72 6.42
yolo-fastestv2 4.33 6.10 5.43
vision_transformer 1217.27 369.71 295.17
FastestDet 4.73 6.19 5.50
