B4/rhash#11300

Closed
mykyta5 wants to merge 13 commits into kernel-patches:bpf-next_base from mykyta5:b4/rhash
Conversation

Collaborator

@mykyta5 mykyta5 commented Mar 5, 2026

No description provided.

mykyta5 added 13 commits March 3, 2026 15:33
This patch series introduces BPF_MAP_TYPE_RHASH, a new hash map type that
leverages the kernel's rhashtable to provide a resizable hash map for BPF.

The existing BPF_MAP_TYPE_HASH uses a fixed number of buckets determined at
map creation time. While this works well for many use cases, it presents
challenges when:

1. The number of elements is unknown at creation time
2. The element count varies significantly during runtime
3. Memory efficiency is important (over-provisioning wastes memory,
 under-provisioning hurts performance)

BPF_MAP_TYPE_RHASH addresses these issues by using rhashtable, which
automatically grows and shrinks based on load factor.

The implementation wraps the kernel's rhashtable with BPF map operations:

- Uses bpf_mem_alloc for RCU-safe memory management
- Supports all standard map operations (lookup, update, delete, get_next_key)
- Supports batch operations (lookup_batch, lookup_and_delete_batch)
- Supports BPF iterators for traversal
- Supports BPF_F_LOCK for spin locks in values
- Requires BPF_F_NO_PREALLOC flag (elements allocated on demand)
- max_entries serves as a hard limit, not bucket count
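
As a declarative sketch, a map of this type might be defined from a BPF program as follows. This is hypothetical: the BPF_MAP_TYPE_RHASH enum value and the BPF_F_NO_PREALLOC requirement are taken from this cover letter, not from a released UAPI.

```c
/* Hypothetical map definition; BPF_MAP_TYPE_RHASH is the enum value this
 * series proposes and is not part of any released kernel UAPI. */
struct {
	__uint(type, BPF_MAP_TYPE_RHASH);
	__uint(map_flags, BPF_F_NO_PREALLOC); /* required: elements allocated on demand */
	__uint(max_entries, 10000);           /* hard limit, not a bucket count */
	__type(key, __u32);
	__type(value, __u64);
} rhash_map SEC(".maps");
```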

The series includes comprehensive tests:
- Basic operations in test_maps (lookup, update, delete, get_next_key)
- BPF program tests for lookup/update/delete semantics
- BPF_F_LOCK tests with concurrent access
- Stress tests for get_next_key during concurrent resize operations
- Seq file tests

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>

---
The current implementation of BPF_MAP_TYPE_RHASH does not provide the
same strong value-consistency guarantees under concurrent reads/writes
as BPF_MAP_TYPE_HASH.
BPF_MAP_TYPE_HASH allocates a new element and atomically swaps the
pointer, so RCU readers always see a complete value. BPF_MAP_TYPE_RHASH
does a memcpy in place with no lock held.
rhash trades consistency for speed (a 5x improvement in the update
benchmark): concurrent readers can observe partially updated data, and
two concurrent writers to the same key can interleave, producing mixed
values.
As a solution, users may set BPF_F_LOCK to guarantee consistent reads
and serialized writes.
Summary of the read consistency guarantees:
  map type     |  write mechanism |  read consistency
  -------------+------------------+--------------------------
  htab         |  alloc, swap ptr |  always consistent (RCU)
  htab  F_LOCK |  in-place + lock |  consistent if reader locks
  -------------+------------------+--------------------------
  rhtab        |  in-place memcpy |  torn reads
  rhtab F_LOCK |  in-place + lock |  consistent if reader locks

Changes in v2:
- Added benchmarks
- Link to v1: https://lore.kernel.org/r/20260205-rhash-v1-0-30dd6d63c462@meta.com

--- b4-submit-tracking ---
{
  "series": {
    "revision": 2,
    "change-id": "20251103-rhash-7b70069923d8",
    "prefixes": [
      "RFC bpf-next"
    ],
    "history": {
      "v1": [
        "20260205-rhash-v1-0-30dd6d63c462@meta.com"
      ]
    }
  }
}
Add resizable hash map into enums where it is needed.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Introduce basic operations for BPF_MAP_TYPE_RHASH, a new hash map type
built on top of the kernel's rhashtable.

Key implementation details:
- Uses rhashtable for automatic resizing with RCU-safe operations
- Elements allocated via bpf_mem_alloc for lock-free allocation
- Supports BPF_F_LOCK for spin_lock protected values
- Requires BPF_F_NO_PREALLOC

Implemented map operations:
 * map_alloc/map_free: Initialize and destroy the rhashtable
 * map_lookup_elem: RCU-protected lookup via rhashtable_lookup
 * map_update_elem: Insert or update with BPF_NOEXIST/EXIST/ANY
 * map_delete_elem: Remove element with RCU-deferred freeing
 * map_get_next_key: Returns the next key in the table
 * map_release_uref: Free internal structs (timers, workqueues)

Other operations (batch, seq file) are implemented in the next patch.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Add batch operations and BPF iterator support for BPF_MAP_TYPE_RHASH.

Batch operations:
 * rhtab_map_lookup_batch: Bulk lookup of elements by bucket
 * rhtab_map_lookup_and_delete_batch: Atomic bulk lookup and delete

The batch implementation iterates through buckets under RCU protection,
copying keys and values to userspace buffers. When the buffer fills
mid-bucket, it rolls back to the bucket boundary so the next call can
retry that bucket completely.

BPF iterator:
 * Uses rhashtable_walk_* API for safe iteration
 * Handles -EAGAIN during table resize transparently
 * Tracks skip_elems to resume iteration across read() calls

Also implements rhtab_map_mem_usage() to report memory consumption.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Test basic map operations (lookup, update, delete) for
BPF_MAP_TYPE_RHASH including boundary conditions like duplicate
key insertion and deletion of nonexistent keys.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Add tests validating that the resizable hash map handles the BPF_F_LOCK
flag as expected.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Test get_next_key behavior under concurrent modification:
 * Resize test: verify all elements visited after resize trigger
 * Stress test: concurrent iterators and modifiers to detect races

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Test BPF iterator functionality for BPF_MAP_TYPE_RHASH:
 * Basic iteration verifying all elements are visited
 * Overflow test triggering seq_file restart, validating correct
resume behavior via skip_elems tracking

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Make bpftool documentation aware of the resizable hash map.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Support resizable hashmap in BPF map benchmarks.

Results:
$ sudo ./bench -w3 -d10 -a bpf-rhashmap-full-update
0:hash_map_full_perf 21641414 events per sec

$ sudo ./bench -w3 -d10 -a bpf-hashmap-full-update
0:hash_map_full_perf 4392758 events per sec

$ sudo ./bench -w3 -d10 -a -p8 htab-mem --use-case overwrite --value-size 8
Iter   0 (302.834us): per-prod-op   62.85k/s, memory usage    2.70MiB
Iter   1 (-44.810us): per-prod-op   62.81k/s, memory usage    2.70MiB
Iter   2 (-45.821us): per-prod-op   62.81k/s, memory usage    2.70MiB
Iter   3 (-63.658us): per-prod-op   62.92k/s, memory usage    2.70MiB
Iter   4 ( 32.887us): per-prod-op   62.85k/s, memory usage    2.70MiB
Iter   5 (-76.948us): per-prod-op   62.75k/s, memory usage    2.70MiB
Iter   6 (157.235us): per-prod-op   63.01k/s, memory usage    2.70MiB
Iter   7 (-118.761us): per-prod-op   62.85k/s, memory usage    2.70MiB
Iter   8 (127.139us): per-prod-op   62.92k/s, memory usage    2.70MiB
Iter   9 (-169.908us): per-prod-op   62.99k/s, memory usage    2.70MiB
Iter  10 (101.962us): per-prod-op   62.97k/s, memory usage    2.70MiB
Iter  11 (-64.330us): per-prod-op   63.05k/s, memory usage    2.70MiB
Iter  12 (-20.543us): per-prod-op   62.86k/s, memory usage    2.70MiB
Iter  13 ( 55.382us): per-prod-op   62.95k/s, memory usage    2.70MiB
Summary: per-prod-op   62.92 ±    0.09k/s, memory usage    2.70 ±    0.00MiB, peak memory usage    2.96MiB

$ sudo ./bench -w3 -d10 -a -p8 rhtab-mem --use-case overwrite --value-size 8
Iter   0 (316.805us): per-prod-op   96.40k/s, memory usage    2.71MiB
Iter   1 (-35.225us): per-prod-op   96.54k/s, memory usage    2.71MiB
Iter   2 (-12.431us): per-prod-op   96.54k/s, memory usage    2.71MiB
Iter   3 (-56.537us): per-prod-op   96.58k/s, memory usage    2.71MiB
Iter   4 ( 27.108us): per-prod-op   96.62k/s, memory usage    2.71MiB
Iter   5 (-52.491us): per-prod-op   96.57k/s, memory usage    2.71MiB
Iter   6 ( -2.777us): per-prod-op   96.52k/s, memory usage    2.71MiB
Iter   7 (108.963us): per-prod-op   96.45k/s, memory usage    2.71MiB
Iter   8 (-61.575us): per-prod-op   96.48k/s, memory usage    2.71MiB
Iter   9 (-21.595us): per-prod-op   96.14k/s, memory usage    2.71MiB
Iter  10 (  3.243us): per-prod-op   96.36k/s, memory usage    2.71MiB
Iter  11 (  3.102us): per-prod-op   94.70k/s, memory usage    2.71MiB
Iter  12 (109.102us): per-prod-op   95.77k/s, memory usage    2.71MiB
Iter  13 ( 16.153us): per-prod-op   95.91k/s, memory usage    2.71MiB
Summary: per-prod-op   96.19 ±    0.57k/s, memory usage    2.71 ±    0.00MiB, peak memory usage    2.71MiB

$ sudo ./bench -w3 -d10 -a bpf-hashmap-lookup --key_size 4\
  --max_entries 1000 --nr_entries 500 --nr_loops 1000000
cpu00: lookup 28.603M ± 0.536M events/sec (approximated from 32 samples of ~34ms)

$ sudo ./bench -w3 -d10 -a bpf-rhashmap-lookup --key_size 4\
  --max_entries 1000 --nr_entries 500 --nr_loops 1000000
cpu00: lookup 27.340M ± 0.864M events/sec (approximated from 32 samples of ~36ms)

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
The else-if and else branches in rht_key_get_hash() both compute a hash
using either params.hashfn or jhash, differing only in the source of
key_len (params.key_len vs ht->p.key_len). Merge the two branches into
one by using the ternary `params.key_len ?: ht->p.key_len` to select
the key length, removing the duplicated logic.

This also improves the performance of the else branch which previously
always used jhash and never fell through to jhash2. This branch is going
to be used by BPF resizable hashmap, which wraps rhashtable:
https://lore.kernel.org/bpf/20260205-rhash-v1-0-30dd6d63c462@meta.com/

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
@kernel-patches-daemon-bpf kernel-patches-daemon-bpf Bot force-pushed the bpf-next_base branch 7 times, most recently from b72a510 to ebefa82 Compare March 11, 2026 01:10
@kernel-patches-daemon-bpf

Automatically cleaning up stale PR; feel free to reopen if needed
