
bpf, arm64: Instruction selection optimizations for the BPF JIT#11322

Closed
puranjaymohan wants to merge 9 commits into kernel-patches:bpf-next_base from puranjaymohan:arm64_jit_imps

Conversation

@puranjaymohan
Collaborator

This series improves the ARM64 BPF JIT's instruction selection to emit
fewer instructions and use encodings that are more microarchitecture-friendly
on Neoverse V1/V2/N2 cores. All changes are guided by the Arm Neoverse
Software Optimization Guides.

The BPF JIT translates BPF bytecode into native ARM64 instructions at
runtime. In several places it was emitting multi-instruction sequences
where a single instruction would do, or using instruction forms that
create unnecessary pipeline serialization. This series addresses six
such cases across the branch, load/store, and comparison paths.

Branches (patches 1-3): When BPF tests a single bit (JSET with a
power-of-2 immediate), the JIT emitted TST + B.cond (2 uOPs, F+B
pipelines). This is replaced with a single TBNZ (1 uOP, B pipeline
only). Similarly, comparisons against zero (JEQ/JNE with imm=0) used
CMP + B.cond and are now a single CBZ or CBNZ. These are common
patterns in BPF programs (NULL checks, flag tests) and each saves one
instruction while freeing the flag-setting pipeline.

Tail call path (patches 4-5): Struct field accesses in the tail call
hot path used MOV + LDR (register offset) to load fields like
map.max_entries and bpf_func. Since these offsets are small constants
that fit in the scaled 12-bit immediate, they are now single LDR
(immediate offset) instructions. The tail call counter comparison
(against MAX_TAIL_CALL_CNT = 33) similarly used MOV + CMP and is now a
single CMP immediate, which additionally qualifies for CMP+B.cond
fusion on all three Neoverse cores.

Prologue/epilogue (patches 6-7): Callee-saved register saves used
pre-index STP (push), which creates a serial dependency chain through
SP — each STP must wait for the previous one to write back SP. This is
replaced with a single SUB SP followed by signed-offset STP instructions
that don't modify the base register, allowing all stores to dispatch in
parallel. The signed-offset form also uses fewer pipeline types (L01+D
vs L01+D+I), freeing an I-pipeline slot per pair. The Neoverse N2 SWOG
explicitly recommends this: "Use non-writeback forms of LDP and STP."

Stack accesses (patches 8-9): BPF stack slots use negative offsets
from the frame pointer (FP-8, FP-16, ..., FP-256). Since LDR/STR
unsigned-offset encoding cannot represent negative values, the JIT
fell back to a 2-instruction MOVN + LDR sequence. LDUR/STUR support
signed 9-bit immediates (-256 to +255) with identical latency and
throughput, so these become single-instruction accesses. In practice,
however, this turns out to be a no-op: the JIT already avoids the
fallback by addressing these slots through SP with positive offsets,
so these two patches can be dropped.

Add aarch64_insn_gen_test_branch_imm() to generate TBZ and TBNZ
instructions. The encoding uses a signed 14-bit immediate (±32 KB
range) and sets the SF bit when the tested bit number is >= 32.

This follows the same pattern as the existing CBZ/CBNZ generation
function aarch64_insn_gen_comp_branch_imm().

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
When BPF_JSET tests a power-of-2 immediate, use a single TBNZ instead
of the TST+B.ne pair. TBZ/TBNZ test-and-branch in one instruction,
saving an instruction and reducing code size for single-bit tests.

Ref: Neoverse V2 SWOG §3.3 (TBZ/TBNZ: 1 uOP, B-pipe only), §4.11 (fusion)

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
When BPF_JEQ or BPF_JNE compares against immediate 0, use a single
CBZ or CBNZ instead of the CMP+B.cond pair. Comparison against zero
is common in BPF programs (NULL checks, boolean tests) and this saves
one instruction per such branch.

Ref: Neoverse V2 SWOG §3.3 (CBZ/CBNZ: 1 uOP, B-pipe only)

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
The struct field offsets accessed in emit_bpf_tail_call()
(map.max_entries, ptrs, bpf_func) are small positive constants that
fit in the scaled imm12 encoding. Use LDR (unsigned immediate) or
ADD_I directly instead of the MOV+LDR (register offset) sequence,
saving one instruction per field access in the tail call hot path.

Ref: Neoverse V2 SWOG §3.8 (LDR imm same latency/throughput as register-offset)

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Add LOAD_PAIR_OFFSET and STORE_PAIR_OFFSET types to
aarch64_insn_gen_load_store_pair() for the signed-offset (no writeback)
addressing mode. The existing insn layer only supports pre-index and
post-index pair modes, both of which modify the base register.

The new cases reuse aarch64_insn_get_stp_value() and
aarch64_insn_get_ldp_value() already defined via __AARCH64_INSN_FUNCS.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Replace pre-index STP (A64_PUSH) and post-index LDP (A64_POP) with a
single SUB/ADD SP adjustment followed by signed-offset STP/LDP that
do not write back the base register.

Pre-index STP serializes through SP: each STP must wait for the
previous one to update SP before computing its store address. With a
single SUB SP up front, all subsequent STP instructions use independent
offsets and can dispatch in parallel. The same applies in reverse for
LDP in the epilogue.

Ref: Neoverse V2 SWOG §3.9 (STP offset: L01,D vs pre-index: L01,D,I); N2 SWOG §4.4

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
MAX_TAIL_CALL_CNT is 33 which fits in a 12-bit immediate. Use CMP_I
directly instead of the MOVZ+CMP (register) sequence, saving one
instruction in the tail call path.

Ref: Neoverse V2 SWOG §4.11 (CMP imm + B.cond fuses into single operation)

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Add aarch64_insn_gen_load_store_imm_unscaled() to encode LDUR/STUR
instructions with signed 9-bit immediate offsets (-256 to +255).

BPF stack accesses use negative offsets from the frame pointer (e.g.,
[FP, #-8]) which cannot be encoded as unsigned scaled immediates. With
LDUR/STUR, these can be encoded directly instead of requiring a
two-instruction MOV+LDR sequence. All four data sizes (8/16/32/64-bit)
are supported.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
BPF stack accesses commonly use negative offsets from the frame pointer
(e.g., [FP, #-8], [FP, #-16]). Currently any negative offset fails the
is_lsi_offset() check and falls back to a two-instruction sequence:

  MOVN tmp, #~off
  LDR  dst, [src, tmp]

For offsets in [-256, -1], use a single LDUR or STUR instead:

  LDUR dst, [src, #off]

Add is_ldur_offset() and insert LDUR/STUR paths into all BPF_LDX,
BPF_STX, and BPF_ST handlers, after the is_lsi_offset() fast path and
before the MOV+register-offset fallback.

BPF_MEMSX (sign-extending loads) are not handled because
LDURSW/LDURSH/LDURSB have different encodings not covered by the
current infrastructure. These continue to use the MOV+LDRSW path.

Ref: Neoverse V2 SWOG §3.8 (LDUR same latency/throughput as LDR unsigned-offset)

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
@kernel-patches-daemon-bpf (bot) force-pushed the bpf-next_base branch 7 times, most recently from 4b0d910 to 15b24d7 on March 11, 2026 18:16
@kernel-patches-daemon-bpf

Automatically cleaning up stale PR; feel free to reopen if needed
