bpf, arm64: Instruction selection optimizations for the BPF JIT #11322
Closed
puranjaymohan wants to merge 9 commits into kernel-patches:bpf-next_base from
Conversation
Add aarch64_insn_gen_test_branch_imm() to generate TBZ and TBNZ instructions. The encoding uses a signed 14-bit immediate (±32 KB range) and sets the SF bit when the tested bit number is >= 32. This follows the same pattern as the existing CBZ/CBNZ generation function aarch64_insn_gen_comp_branch_imm().

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
When BPF_JSET tests a power-of-2 immediate, use a single TBNZ instead of the TST + B.ne pair. TBZ/TBNZ test and branch in one instruction, saving an instruction and reducing code size for single-bit tests.

Ref: Neoverse V2 SWOG §3.3 (TBZ/TBNZ: 1 uOP, B-pipe only), §4.11 (fusion)

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
When BPF_JEQ or BPF_JNE compares against immediate 0, use a single CBZ or CBNZ instead of the CMP+B.cond pair. Comparison against zero is common in BPF programs (NULL checks, boolean tests) and this saves one instruction per such branch.

Ref: Neoverse V2 SWOG §3.3 (CBZ/CBNZ: 1 uOP, B-pipe only)

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
The struct field offsets accessed in emit_bpf_tail_call() (map.max_entries, ptrs, bpf_func) are small positive constants that fit in the scaled imm12 encoding. Use LDR (unsigned immediate) or ADD_I directly instead of the MOV+LDR (register offset) sequence, saving one instruction per field access in the tail call hot path.

Ref: Neoverse V2 SWOG §3.8 (LDR imm same latency/throughput as register-offset)

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Add LOAD_PAIR_OFFSET and STORE_PAIR_OFFSET types to aarch64_insn_gen_load_store_pair() for the signed-offset (no writeback) addressing mode. The existing insn layer only supports pre-index and post-index pair modes, both of which modify the base register. The new cases reuse aarch64_insn_get_stp_value() and aarch64_insn_get_ldp_value() already defined via __AARCH64_INSN_FUNCS.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Replace pre-index STP (A64_PUSH) and post-index LDP (A64_POP) with a single SUB/ADD SP adjustment followed by signed-offset STP/LDP that do not write back the base register. Pre-index STP serializes through SP: each STP must wait for the previous one to update SP before computing its store address. With a single SUB SP up front, all subsequent STP instructions use independent offsets and can dispatch in parallel. The same applies in reverse for LDP in the epilogue.

Ref: Neoverse V2 SWOG §3.9 (STP offset: L01,D vs pre-index: L01,D,I); N2 SWOG §4.4

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
MAX_TAIL_CALL_CNT is 33 which fits in a 12-bit immediate. Use CMP_I directly instead of the MOVZ+CMP (register) sequence, saving one instruction in the tail call path.

Ref: Neoverse V2 SWOG §4.11 (CMP imm + B.cond fuses into single operation)

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Add aarch64_insn_gen_load_store_imm_unscaled() to encode LDUR/STUR instructions with signed 9-bit immediate offsets (-256 to +255). BPF stack accesses use negative offsets from the frame pointer (e.g., [FP, #-8]) which cannot be encoded as unsigned scaled immediates. With LDUR/STUR, these can be encoded directly instead of requiring a two-instruction MOV+LDR sequence. All four data sizes (8/16/32/64-bit) are supported.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
BPF stack accesses commonly use negative offsets from the frame pointer (e.g., [FP, #-8], [FP, #-16]). Currently any negative offset fails the is_lsi_offset() check and requires a two-instruction sequence:

  MOVN tmp, #~off
  LDR  dst, [src, tmp]

For offsets in [-256, -1], use a single LDUR or STUR instead:

  LDUR dst, [src, #off]

Add is_ldur_offset() and insert LDUR/STUR paths into all BPF_LDX, BPF_STX, and BPF_ST handlers, after the is_lsi_offset() fast path and before the MOV+register-offset fallback. BPF_MEMSX (sign-extending loads) are not handled because LDURSW/LDURSH/LDURSB have different encodings not covered by the current infrastructure; these continue to use the MOV+LDRSW path.

Ref: Neoverse V2 SWOG §3.8 (LDUR same latency/throughput as LDR unsigned-offset)

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Force-pushed from 4b0d910 to 15b24d7.
Automatically cleaning up stale PR; feel free to reopen if needed.
This series improves the ARM64 BPF JIT's instruction selection to emit
fewer instructions and use encodings that are more microarchitecture-friendly
on Neoverse V1/V2/N2 cores. All changes are guided by the Arm Neoverse
Software Optimization Guides.
The BPF JIT translates BPF bytecode into native ARM64 instructions at
runtime. In several places it was emitting multi-instruction sequences
where a single instruction would do, or using instruction forms that
create unnecessary pipeline serialization. This series addresses six
such cases across the branch, load/store, and comparison paths.
Branches (patches 1-3): When BPF tests a single bit (JSET with a
power-of-2 immediate), the JIT emitted TST + B.cond (2 uOPs, F+B
pipelines). This is replaced with a single TBNZ (1 uOP, B pipeline
only). Similarly, comparisons against zero (JEQ/JNE with imm=0) used
CMP + B.cond and are now a single CBZ or CBNZ. These are common
patterns in BPF programs (NULL checks, flag tests) and each saves one
instruction while freeing the flag-setting pipeline.
Tail call path (patches 4-5): Struct field accesses in the tail call
hot path used MOV + LDR (register offset) to load fields like
map.max_entries and bpf_func. Since these offsets are small constants
that fit in the scaled 12-bit immediate, they are now single LDR
(immediate offset) instructions. The tail call counter comparison
(against MAX_TAIL_CALL_CNT = 33) similarly used MOV + CMP and is now a
single CMP immediate, which additionally qualifies for CMP+B.cond
fusion on all three Neoverse cores.
Prologue/epilogue (patches 6-7): Callee-saved register saves used
pre-index STP (push), which creates a serial dependency chain through
SP — each STP must wait for the previous one to write back SP. This is
replaced with a single SUB SP followed by signed-offset STP instructions
that don't modify the base register, allowing all stores to dispatch in
parallel. The signed-offset form also uses fewer pipeline types (L01+D
vs L01+D+I), freeing an I-pipeline slot per pair. The Neoverse N2 SWOG
explicitly recommends this: "Use non-writeback forms of LDP and STP."
Stack accesses (patches 8-9): BPF stack slots use negative offsets
from the frame pointer (FP-8, FP-16, ..., FP-256). Since LDR/STR
unsigned-offset encoding cannot represent negative values, the JIT
fell back to a 2-instruction MOVN + LDR sequence. LDUR/STUR support
signed 9-bit immediates (-256 to +255) with identical latency and
throughput, so these become single-instruction accesses. In practice, however,
this turns out to be a no-op: the JIT already avoids the fallback by addressing
these slots through SP with a positive offset, so these patches can be dropped.