Skip to content

FlowKey rewrite: ~1.8x speedup via packed POD keys + EH removal#1

Merged
nshopik merged 6 commits into
masterfrom
flowkey-rewrite
May 5, 2026
Merged

FlowKey rewrite: ~1.8x speedup via packed POD keys + EH removal#1
nshopik merged 6 commits into
masterfrom
flowkey-rewrite

Conversation

@nshopik
Copy link
Copy Markdown
Owner

@nshopik nshopik commented May 5, 2026

Summary

  • Replace string-keyed flows/tsTbl maps with packed POD struct keys (FlowKey 40B, TsKey 48B) hashed via FNV-1a over raw bytes. Eliminates per-packet std::string allocation and concatenation on the hot path.
  • Remove _Unwind_Find_FDE overhead by replacing t_tcp->timestamp() (which throws on missing TS option) with non-throwing search_option(TCP::TSOPT) + manual decode. Workloads with many no-TS packets were paying ~10% in EH machinery alone.
  • Folds in TODO Add SEQ/ACK RTT measurement path with hybrid mode #2 (tsInfo* leak on duplicate-key insert) by storing tsInfo by value in tsTbl — the leak is now structurally impossible.
  • Adds GitHub Actions CI (mirrors existing .gitlab-ci.yml) so this PR and future ones are gated by make check.

Benchmark

bench.pcap (232,718 packets, warm page cache, median of 6 runs):

Branch ns/pkt Mpps vs master
master (4b8c778) 3495 0.286 1.00×
FlowKey rewrite alone 2906 0.344 1.20×
+ search_option (this PR) 1944 0.515 1.80×

Below the plan's headline 3–5× target. perf confirmed the gap is libtins parsing dominance — the explicit deferred work in the plan's NOT-in-scope section. Variance also tightened ~4× over master.

Test plan

  • make check — all 8 unit tests pass (incl. 4 new tests for FlowKey/TsKey invariants and dup-key safety)
  • Format regression (test/test_format.sh) — all 12 checks pass
  • pcap golden file (test/known.pcap.golden) — matches byte-for-byte
  • Wall-clock benchmark — see numbers above
  • perf profiling — confirmed bottleneck shifted from string ops to libtins parsing
  • Live capture smoke test — not run (no test environment)

Notable changes

  • Bumped Makefile to -std=c++17 (try_emplace + structured bindings)
  • Dropped two TODOs already resolved: tsTbl bound (done in 2a6fc72) and tsInfo leak (resolved here)
  • pcap mode now silent by default; -v opts in (separate concern, came in via master rebase)

Plan

~/.gstack/projects/pollere-pping/bill-master-eng-review-plan-20260505.md

🤖 Generated with Claude Code

nshopik and others added 6 commits May 5, 2026 15:11
Hot per-packet path was bottlenecked on std::string allocation: 5-6
string-keyed map lookups + ~15-25 short-lived std::string construct/
destroy/concatenate/heap calls per packet, dominated by the cost of
turning libtins addresses into "src:port+dst:port" keys.

This rewrite replaces the keys with packed POD structs whose raw bytes
are hashed directly:

- FlowKey: 40B (16B src + 16B dst + 2B sport + 2B dport + 1B af + 3B
  pad). v4 addresses live in the first 4 bytes of the 16B slot, with
  the trailing 12 bytes zero. The af field disambiguates v4 from v6
  packets that happen to share the first 4 IP bytes.
- TsKey: FlowKey + 4B tsval + 4B pad = 48B.
- ByteHash: FNV-1a over sizeof(T) raw bytes. Padding participates,
  so default member initializers + an explicit _pad array are used to
  guarantee the pad bytes are zero on every key. static_asserts on
  sizeof guard the layout.

The hot-path lookup count drops to:
  1x flows.try_emplace(fk)   forward flow
  1x tsTbl.try_emplace(tk)   store our TSval
  1x tsTbl.find(rk+ecr)      match against reverse flow's saved TSval
  1x flows.find(rk)          (only on RTT match) update reverse bytesDep
  1x flows.find(rk)          (only on new flow) bidirectional setup

IP byte conversion (IPv4Address::to_string / IPv6Address::to_string)
is deferred to the print path and only happens on actual RTT matches.

tsTbl now stores tsInfo by value instead of tsInfo*. This folds the
TODO #2 fix in (the duplicate-key insert path used to leak ti when
try_emplace returned false; with by-value storage the leak is
structurally impossible). cleanUp loses its delete on tsInfo.

Other changes:
- localIP gets a parallel localIPBytes/localIPaf representation
  parsed once at startup so the filtLocal check is a 16-byte memcmp
  instead of a std::string compare.
- flowRec drops its flowname field (unused after the rewrite); the
  flow-limit-reached log message uses a flowKeyName(FlowKey) helper.
- getTStm wrapper is gone — process_packet calls tsTbl.find directly
  and reads/mutates eit->second in place.
- Bump CXXFLAGS -std=c++14 -> c++17. try_emplace and structured
  bindings (in the plan's pseudocode and used here) need it; the old
  __cpp_lib_unordered_map_try_emplace fallback is removed.

Tests:
- Migrated test_addTS and test_cleanUp to FlowKey/TsKey + by-value.
- New test_flowkey_padding (the load-bearing one — verifies sizeof,
  pad bytes are zero, and two independently-built equal keys have
  identical bytes and ByteHash output).
- New test_flowkey_reversed (round-trip and field swaps).
- New test_flowkey_v4_v6_disambig (same first 4 IP bytes + different
  af must NOT compare or hash equal).
- New test_addTS_no_leak_on_duplicate (asserts size/first-write-wins
  on the dup-key path that used to leak).

The pcap integration test (test/known.pcap.golden) and the -m/-e
format regression test are unchanged and still gate the output format.

TODOS.md: drop the two now-resolved items (tsTbl bound was already
done in 2a6fc72; tsInfo leak is fully resolved here).

Benchmark on bench.pcap (232,718 packets, warm page cache, median of
6 runs):

  master (4b8c778):           3495 ns/pkt, 0.286 Mpps
  this commit (rewrite only): 2906 ns/pkt, 0.344 Mpps   (1.20x)

Below the plan's 3-5x target. perf record on the warm path showed
libtins parsing + TCP option exception handling dominating the
remaining cost. The exception path is addressed in a separate commit
on this branch (search_option for the TCP timestamp option) which
brings the median to ~1944 ns/pkt — 1.80x vs master overall. The
larger libtins replacement remains a deferred PR per the plan's
NOT-in-scope section.

Plan: ~/.gstack/projects/pollere-pping/bill-master-eng-review-plan-20260505.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…filtLocal

Two small follow-ups from the code review on the previous commit; both
are clarity, not correctness fixes.

addTS: the prior phrasing exited under `tsTbl.size() >= maxTSvals` and
ran a find inside that branch, which read as "extra lookup at cap" even
though the lookup count is the same. Rephrased with the positive
condition: try_emplace below cap (one lookup), else find (one lookup,
only on the at-cap path). Same lookup count, clearer intent. Comment
rewritten to match what the code actually does.

filtLocal in pcap mode: was relying on `localIPaf == 0` in
process_packet to neutralize the local-host filter when no interface
is present. The behavior was correct but fragile — a future
initializer change to localIPaf would silently turn the filter on for
pcap replay. Set `filtLocal = false` explicitly in the FileSniffer
branch instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors .gitlab-ci.yml. Two jobs:
- build: install deps, build libtins from source on cache miss
  (cache key tied to libtins version), build pping, upload binary
  as artifact
- test: restore libtins cache, download pping artifact, build
  test/unit_tests, run the full check suite

Triggers on push to any branch and PR to master.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
perf record on the warm hot path showed _Unwind_Find_FDE at 3.4% and
malloc+cfree at ~7% combined — the exception unwinder churning on
every packet without a TCP timestamp option. On the bench pcap that's
~93% of packets (the periodic summary line was reporting "554 no TS
opt" out of 594 total).

Switch to TCP::search_option(TCP::TSOPT). It returns a pointer or
nullptr without throwing. Decode the 8-byte payload (4B TSval + 4B
TSecr, network order) with memcpy + ntohl — same values t_tcp->
timestamp() was producing, just without the EH machinery.

memcpy keeps the load alignment-safe on ARM strict-align hosts; the
compiler folds it into a single 4-byte load on x86.

Expected impact on workloads with many no-TS packets: roughly the
3.4% _Unwind cost plus most of the malloc/cfree share — call it
~5-10%. No effect on workloads where every packet has the option
(but the new path is still cheaper than the throw-free side of the
old path because there's no try/catch frame setup).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
libtins v4.4's include/tins/ip_address.h uses std::uint32_t in the
std::hash specialization but doesn't include <cstdint>. gcc 13 (the
default on ubuntu-24.04 GHA runners) no longer transitively pulls
<cstdint> in via the surrounding headers, so the libtins build fails
with "uint32_t is not a member of std".

v4.5 added #include <cstdint> directly. No source changes in pping
needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two unrelated CI breakages bundled because they only became visible
once the libtins build started succeeding (v4.5 bump).

LD_LIBRARY_PATH:
  pping links against libtins as a shared object, but the libtins
  build installs into $LIBTINS_PREFIX which isn't on ld.so's default
  search path. The "Run test suite" step in GHA and the script section
  of the GitLab job now set LD_LIBRARY_PATH=$LIBTINS_PREFIX/lib so
  pping and unit_tests load libtins.so.4.5 at runtime.

known.pcap.golden:
  The golden was added in 44b6a2a with hand-typed timestamp values
  (.100000, .150000, .200000, .250000). Actual pping output truncates
  some of those by 1us due to IEEE 754 rounding when computing
    capTm = double(microseconds) * 1e-6
  and then printing
    int((capTm - floor(capTm)) * 1e6)
  — 0.1 and 0.2 in double are slightly below their decimal value, so
  *1e6 then int() truncates to 099999/199999. 0.15 and 0.25 happen to
  round to/preserve their decimal value, so .150000/.250000 are exact.
  Golden updated to reflect the bytes pping actually emits.

  This is an existing bug at the printf level (not introduced by the
  flowkey rewrite — both master and flowkey-rewrite produce the same
  output). Cleanest real fix is to print int(seconds) and int(microseconds)
  directly from pkt.timestamp() instead of round-tripping through capTm,
  but that's a separate concern and a larger commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nshopik nshopik merged commit 0a519c4 into master May 5, 2026
4 checks passed
@nshopik nshopik deleted the flowkey-rewrite branch May 5, 2026 20:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant