
bpf, arm64: Instruction selection optimizations for the BPF JIT#11322

Closed
puranjaymohan wants to merge 9 commits into kernel-patches:bpf-next_base from puranjaymohan:arm64_jit_imps

Conversation

@puranjaymohan
Collaborator

This series improves the ARM64 BPF JIT's instruction selection to emit
fewer instructions and use encodings that are more microarchitecture-friendly
on Neoverse V1/V2/N2 cores. All changes are guided by the Arm Neoverse
Software Optimization Guides.

The BPF JIT translates BPF bytecode into native ARM64 instructions at
runtime. In several places it was emitting multi-instruction sequences
where a single instruction would do, or using instruction forms that
create unnecessary pipeline serialization. This series addresses six
such cases across the branch, load/store, and comparison paths.

Branches (patches 1-3): When BPF tests a single bit (JSET with a
power-of-2 immediate), the JIT emitted TST + B.cond (2 uOPs, F+B
pipelines). This is replaced with a single TBNZ (1 uOP, B pipeline
only). Similarly, comparisons against zero (JEQ/JNE with imm=0) used
CMP + B.cond and are now a single CBZ or CBNZ. These are common
patterns in BPF programs (NULL checks, flag tests) and each saves one
instruction while freeing the flag-setting pipeline.

Tail call path (patches 4-5): Struct field accesses in the tail call
hot path used MOV + LDR (register offset) to load fields like
map.max_entries and bpf_func. Since these offsets are small constants
that fit in the scaled 12-bit immediate, they are now single LDR
(immediate offset) instructions. The tail call counter comparison
(against MAX_TAIL_CALL_CNT = 33) similarly used MOV + CMP and is now a
single CMP immediate, which additionally qualifies for CMP+B.cond
fusion on all three Neoverse cores.

Prologue/epilogue (patches 6-7): Callee-saved register saves used
pre-index STP (push), which creates a serial dependency chain through
SP — each STP must wait for the previous one to write back SP. This is
replaced with a single SUB SP followed by signed-offset STP instructions
that don't modify the base register, allowing all stores to dispatch in
parallel. The signed-offset form also uses fewer pipeline types (L01+D
vs L01+D+I), freeing an I-pipeline slot per pair. The Neoverse N2 SWOG
explicitly recommends this: "Use non-writeback forms of LDP and STP."

Stack accesses (patches 8-9): BPF stack slots use negative offsets
from the frame pointer (FP-8, FP-16, ..., FP-256). Since LDR/STR
unsigned-offset encoding cannot represent negative values, the JIT
fell back to a 2-instruction MOVN + LDR sequence. LDUR/STUR support
signed 9-bit immediates (-256 to +255) with identical latency and
throughput, so these become single-instruction accesses. In practice,
however, this turns out to be a no-op: the JIT already avoids the
fallback by addressing these slots through SP with positive offsets,
so these two patches can be dropped.

Add aarch64_insn_gen_test_branch_imm() to generate TBZ and TBNZ
instructions. The encoding uses a signed 14-bit immediate (±32 KB
range) and sets the SF bit when the tested bit number is >= 32.

This follows the same pattern as the existing CBZ/CBNZ generation
function aarch64_insn_gen_comp_branch_imm().

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
When BPF_JSET tests a power-of-2 immediate, use a single TBNZ instead
of the TST+B.ne pair. TBZ/TBNZ test-and-branch in one instruction,
saving an instruction and reducing code size for single-bit tests.

Ref: Neoverse V2 SWOG §3.3 (TBZ/TBNZ: 1 uOP, B-pipe only), §4.11 (fusion)

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
When BPF_JEQ or BPF_JNE compares against immediate 0, use a single
CBZ or CBNZ instead of the CMP+B.cond pair. Comparison against zero
is common in BPF programs (NULL checks, boolean tests) and this saves
one instruction per such branch.

Ref: Neoverse V2 SWOG §3.3 (CBZ/CBNZ: 1 uOP, B-pipe only)

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
The struct field offsets accessed in emit_bpf_tail_call()
(map.max_entries, ptrs, bpf_func) are small positive constants that
fit in the scaled imm12 encoding. Use LDR (unsigned immediate) or
ADD_I directly instead of the MOV+LDR (register offset) sequence,
saving one instruction per field access in the tail call hot path.

Ref: Neoverse V2 SWOG §3.8 (LDR imm same latency/throughput as register-offset)

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Add LOAD_PAIR_OFFSET and STORE_PAIR_OFFSET types to
aarch64_insn_gen_load_store_pair() for the signed-offset (no writeback)
addressing mode. The existing insn layer only supports pre-index and
post-index pair modes, both of which modify the base register.

The new cases reuse aarch64_insn_get_stp_value() and
aarch64_insn_get_ldp_value() already defined via __AARCH64_INSN_FUNCS.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Replace pre-index STP (A64_PUSH) and post-index LDP (A64_POP) with a
single SUB/ADD SP adjustment followed by signed-offset STP/LDP that
do not write back the base register.

Pre-index STP serializes through SP: each STP must wait for the
previous one to update SP before computing its store address. With a
single SUB SP up front, all subsequent STP instructions use independent
offsets and can dispatch in parallel. The same applies in reverse for
LDP in the epilogue.

Ref: Neoverse V2 SWOG §3.9 (STP offset: L01,D vs pre-index: L01,D,I); N2 SWOG §4.4

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
MAX_TAIL_CALL_CNT is 33 which fits in a 12-bit immediate. Use CMP_I
directly instead of the MOVZ+CMP (register) sequence, saving one
instruction in the tail call path.

Ref: Neoverse V2 SWOG §4.11 (CMP imm + B.cond fuses into single operation)

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Add aarch64_insn_gen_load_store_imm_unscaled() to encode LDUR/STUR
instructions with signed 9-bit immediate offsets (-256 to +255).

BPF stack accesses use negative offsets from the frame pointer (e.g.,
[FP, #-8]) which cannot be encoded as unsigned scaled immediates. With
LDUR/STUR, these can be encoded directly instead of requiring a
two-instruction MOV+LDR sequence. All four data sizes (8/16/32/64-bit)
are supported.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
BPF stack accesses commonly use negative offsets from the frame pointer
(e.g., [FP, #-8], [FP, #-16]). Currently any negative offset fails the
is_lsi_offset() check and falls back to a two-instruction sequence:

  MOVN tmp, #~off
  LDR  dst, [src, tmp]

For offsets in [-256, -1], use a single LDUR or STUR instead:

  LDUR dst, [src, #off]

Add is_ldur_offset() and insert LDUR/STUR paths into all BPF_LDX,
BPF_STX, and BPF_ST handlers, after the is_lsi_offset() fast path and
before the MOV+register-offset fallback.

BPF_MEMSX (sign-extending loads) are not handled because
LDURSW/LDURSH/LDURSB have different encodings not covered by the
current infrastructure. These continue to use the MOV+LDRSW path.

Ref: Neoverse V2 SWOG §3.8 (LDUR same latency/throughput as LDR unsigned-offset)

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
@kernel-patches-daemon-bpf (bot) force-pushed the bpf-next_base branch 7 times, most recently from 4b0d910 to 15b24d7 on March 11, 2026 18:16
@kernel-patches-daemon-bpf

Automatically cleaning up stale PR; feel free to reopen if needed
