FlowKey rewrite: ~1.8x speedup via packed POD keys + EH removal#1
Merged
Conversation
Hot per-packet path was bottlenecked on std::string allocation: 5-6 string-keyed map lookups + ~15-25 short-lived std::string construct/ destroy/concatenate/heap calls per packet, dominated by the cost of turning libtins addresses into "src:port+dst:port" keys. This rewrite replaces the keys with packed POD structs whose raw bytes are hashed directly: - FlowKey: 40B (16B src + 16B dst + 2B sport + 2B dport + 1B af + 3B pad). v4 addresses live in the first 4 bytes of the 16B slot, with the trailing 12 bytes zero. The af field disambiguates v4 from v6 packets that happen to share the first 4 IP bytes. - TsKey: FlowKey + 4B tsval + 4B pad = 48B. - ByteHash: FNV-1a over sizeof(T) raw bytes. Padding participates, so default member initializers + an explicit _pad array are used to guarantee the pad bytes are zero on every key. static_asserts on sizeof guard the layout. The hot-path lookup count drops to: 1x flows.try_emplace(fk) forward flow 1x tsTbl.try_emplace(tk) store our TSval 1x tsTbl.find(rk+ecr) match against reverse flow's saved TSval 1x flows.find(rk) (only on RTT match) update reverse bytesDep 1x flows.find(rk) (only on new flow) bidirectional setup IP byte conversion (IPv4Address::to_string / IPv6Address::to_string) is deferred to the print path and only happens on actual RTT matches. tsTbl now stores tsInfo by value instead of tsInfo*. This folds the TODO #2 fix in (the duplicate-key insert path used to leak ti when try_emplace returned false; with by-value storage the leak is structurally impossible). cleanUp loses its delete on tsInfo. Other changes: - localIP gets a parallel localIPBytes/localIPaf representation parsed once at startup so the filtLocal check is a 16-byte memcmp instead of a std::string compare. - flowRec drops its flowname field (unused after the rewrite); the flow-limit-reached log message uses a flowKeyName(FlowKey) helper. - getTStm wrapper is gone — process_packet calls tsTbl.find directly and reads/mutates eit->second in place. - Bump CXXFLAGS -std=c++14 -> c++17. try_emplace and structured bindings (in the plan's pseudocode and used here) need it; the old __cpp_lib_unordered_map_try_emplace fallback is removed. Tests: - Migrated test_addTS and test_cleanUp to FlowKey/TsKey + by-value. - New test_flowkey_padding (the load-bearing one — verifies sizeof, pad bytes are zero, and two independently-built equal keys have identical bytes and ByteHash output). - New test_flowkey_reversed (round-trip and field swaps). - New test_flowkey_v4_v6_disambig (same first 4 IP bytes + different af must NOT compare or hash equal). - New test_addTS_no_leak_on_duplicate (asserts size/first-write-wins on the dup-key path that used to leak). The pcap integration test (test/known.pcap.golden) and the -m/-e format regression test are unchanged and still gate the output format. TODOS.md: drop the two now-resolved items (tsTbl bound was already done in 2a6fc72; tsInfo leak is fully resolved here). Benchmark on bench.pcap (232,718 packets, warm page cache, median of 6 runs): master (4b8c778): 3495 ns/pkt, 0.286 Mpps this commit (rewrite only): 2906 ns/pkt, 0.344 Mpps (1.20x) Below the plan's 3-5x target. perf record on the warm path showed libtins parsing + TCP option exception handling dominating the remaining cost. The exception path is addressed in a separate commit on this branch (search_option for the TCP timestamp option) which brings the median to ~1944 ns/pkt — 1.80x vs master overall. The larger libtins replacement remains a deferred PR per the plan's NOT-in-scope section. Plan: ~/.gstack/projects/pollere-pping/bill-master-eng-review-plan-20260505.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…filtLocal Two small follow-ups from the code review on the previous commit; both are clarity, not correctness fixes. addTS: the prior phrasing exited under `tsTbl.size() >= maxTSvals` and ran a find inside that branch, which read as "extra lookup at cap" even though the lookup count is the same. Rephrased with the positive condition: try_emplace below cap (one lookup), else find (one lookup, only on the at-cap path). Same lookup count, clearer intent. Comment rewritten to match what the code actually does. filtLocal in pcap mode: was relying on `localIPaf == 0` in process_packet to neutralize the local-host filter when no interface is present. The behavior was correct but fragile — a future initializer change to localIPaf would silently turn the filter on for pcap replay. Set `filtLocal = false` explicitly in the FileSniffer branch instead. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors .gitlab-ci.yml. Two jobs: - build: install deps, build libtins from source on cache miss (cache key tied to libtins version), build pping, upload binary as artifact - test: restore libtins cache, download pping artifact, build test/unit_tests, run the full check suite Triggers on push to any branch and PR to master. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
perf record on the warm hot path showed _Unwind_Find_FDE at 3.4% and malloc+cfree at ~7% combined — the exception unwinder churning on every packet without a TCP timestamp option. On the bench pcap that's ~93% of packets (the periodic summary line was reporting "554 no TS opt" out of 594 total). Switch to TCP::search_option(TCP::TSOPT). It returns a pointer or nullptr without throwing. Decode the 8-byte payload (4B TSval + 4B TSecr, network order) with memcpy + ntohl — same values t_tcp-> timestamp() was producing, just without the EH machinery. memcpy keeps the load alignment-safe on ARM strict-align hosts; the compiler folds it into a single 4-byte load on x86. Expected impact on workloads with many no-TS packets: roughly the 3.4% _Unwind cost plus most of the malloc/cfree share — call it ~5-10%. No effect on workloads where every packet has the option (but the new path is still cheaper than the throw-free side of the old path because there's no try/catch frame setup). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
libtins v4.4's include/tins/ip_address.h uses std::uint32_t in the std::hash specialization but doesn't include <cstdint>. gcc 13 (the default on ubuntu-24.04 GHA runners) no longer transitively pulls <cstdint> in via the surrounding headers, so the libtins build fails with "uint32_t is not a member of std". v4.5 added #include <cstdint> directly. No source changes in pping needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two unrelated CI breakages bundled because they only became visible once the libtins build started succeeding (v4.5 bump). LD_LIBRARY_PATH: pping links against libtins as a shared object, but the libtins build installs into $LIBTINS_PREFIX which isn't on ld.so's default search path. The "Run test suite" step in GHA and the script section of the GitLab job now set LD_LIBRARY_PATH=$LIBTINS_PREFIX/lib so pping and unit_tests load libtins.so.4.5 at runtime. known.pcap.golden: The golden was added in 44b6a2a with hand-typed timestamp values (.100000, .150000, .200000, .250000). Actual pping output truncates some of those by 1us due to IEEE 754 rounding when computing capTm = double(microseconds) * 1e-6 and then printing int((capTm - floor(capTm)) * 1e6) — 0.1 and 0.2 in double are slightly below their decimal value, so *1e6 then int() truncates to 099999/199999. 0.15 and 0.25 happen to round to/preserve their decimal value, so .150000/.250000 are exact. Golden updated to reflect the bytes pping actually emits. This is an existing bug at the printf level (not introduced by the flowkey rewrite — both master and flowkey-rewrite produce the same output). Cleanest real fix is to print int(seconds) and int(microseconds) directly from pkt.timestamp() instead of round-tripping through capTm, but that's a separate concern and a larger commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
flows/tsTblmaps with packed POD struct keys (FlowKey40B,TsKey48B) hashed via FNV-1a over raw bytes. Eliminates per-packetstd::stringallocation and concatenation on the hot path._Unwind_Find_FDEoverhead by replacingt_tcp->timestamp()(which throws on missing TS option) with non-throwingsearch_option(TCP::TSOPT)+ manual decode. Workloads with many no-TS packets were paying ~10% in EH machinery alone.tsInfo*leak on duplicate-key insert) by storingtsInfoby value intsTbl— the leak is now structurally impossible..gitlab-ci.yml) so this PR and future ones are gated bymake check.Benchmark
bench.pcap(232,718 packets, warm page cache, median of 6 runs):master(4b8c778)Below the plan's headline 3–5× target. perf confirmed the gap is libtins parsing dominance — the explicit deferred work in the plan's NOT-in-scope section. Variance also tightened ~4× over master.
Test plan
make check— all 8 unit tests pass (incl. 4 new tests for FlowKey/TsKey invariants and dup-key safety)test/test_format.sh) — all 12 checks passtest/known.pcap.golden) — matches byte-for-byteNotable changes
Makefileto-std=c++17(try_emplace+ structured bindings)pcapmode now silent by default;-vopts in (separate concern, came in via master rebase)Plan
~/.gstack/projects/pollere-pping/bill-master-eng-review-plan-20260505.md🤖 Generated with Claude Code