zero copy read impl #127
Open
victorstewart wants to merge 9 commits into fraillt:master from
Restore include/bitsery/serializer.h to the master implementation and move
offset-table recording into a dedicated OffsetTableWriteSerializer used
only by serializeWithOffsetTable.

This redesign means callers that are not serializing for zero-copy read
stay on the plain Serializer path and pay 0 added runtime cost from the
offset-table feature.

It also improves the zero-copy serialization path with cheaper recorded
entries plus static-root, registry-metadata, and last-dynamic-cache-hit
lookups on the offset-table side.

Measured on build-release/tests/bitsery.test.offset_table_perf:
- Sample (50k iters): quickSerialization=0.47-0.49ms, offsetTableEnabled=0.57-0.61ms
- KitchenSink (5k iters): quickSerialization=0.26-0.27ms, offsetTableEnabled=0.91-0.94ms

Also fix the GCC -Wmaybe-uninitialized std::variant deserialize path by
using variant::emplace instead of temporary-variant assignment.
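The last paragraph of this commit message contrasts temporary-variant assignment with `variant::emplace`. A minimal sketch of the two patterns follows; the `Value` alias and both helper functions are hypothetical stand-ins for illustration, not the actual bitsery std::variant extension code.

```cpp
#include <cstddef>
#include <string>
#include <variant>

using Value = std::variant<int, std::string>;

// Pattern the commit moves away from: build a temporary variant, then assign it.
// (Hypothetical stand-in for a deserialize path; not bitsery's actual code.)
inline void loadByAssignment(Value& out, std::size_t index) {
    if (index == 0) {
        out = Value{std::in_place_index<0>, 7};
    } else {
        out = Value{std::in_place_index<1>, "seven"};
    }
}

// Pattern the commit moves to: construct the selected alternative in place.
inline void loadByEmplace(Value& out, std::size_t index) {
    if (index == 0) {
        out.emplace<0>(7);
    } else {
        out.emplace<1>("seven");
    }
}
```

Both helpers leave the variant holding the same alternative and value; the emplace form simply avoids the intermediate temporary.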
Remove the unused tableOffsets/payloadSize parameters from emitTableFromCapture.

Clang and AppleClang treat those dead parameters as -Wunused-parameter errors
in the PR CI matrix, so the offset-table targets failed to build even though
GCC was green. The helper only serves the flat capture fast path and does not
need either value.
Reduce the zero-copy serialization code surface while preserving the specialized offset-table fast path.

This keeps the feature-off serialization path clean, collapses the manual offset serializer onto the specialized writer, reuses the normal table/trailer writer for capture, and shares only the shorthand/archive surface between the plain and offset serializers.

Compared with the baseline for this reduction pass, the final warm perf runs improved to:
- Sample (50k): quickSerialization 0.47ms, offsetTableEnabled 0.56-0.57ms (was 0.50-0.53ms and 0.61-0.65ms)
- KitchenSink (5k): quickSerialization 0.27-0.30ms, offsetTableEnabled 0.94-0.97ms (was 0.28-0.30ms and 1.00-1.07ms)

This pass also reduced the branch code surface: the net delta versus master dropped from +4132 lines to +3657 lines, and this commit itself is 309 insertions vs 784 deletions in include/bitsery.
Rewrite the zero-copy-read reflection path around GCC16 C++26 reflection-generated schemas, payload traversal, and table finalization.

The generated path now bypasses the generic offset-table serializer for serializer-free reflected aggregates while preserving the normal serializer path for custom bitsery wire policy.

Final GCC16 measurements from tests/offset_table_perf.cpp show the reflected zero-copy-read generated dynamic path at 0.270ms for 5,000 serializations, about 54ns each. The generic bitsery serializer plus offset-table path measured 0.635ms for the same 5,000 serializations, about 127ns each, making the reflected path about 2.35x faster for that benchmark.

For small fixed-layout data, plain quickSerialization measured 0.510ms for 50,000 serializations, about 10.2ns each, while reflected direct zero-copy-read with offset-table metadata measured 0.550ms for 50,000 serializations, about 11.0ns each. In that shape, zero-copy-read driven serialization is effectively free: it emits the metadata needed for borrowed reads while staying within about 0.8ns per object of regular bitsery serialization.

This branch is intentionally GCC16-only for reflection experiments because the implementation depends on C++26 reflection support and -freflection. Clang/MSVC should stay on the non-reflection fallback until upstream Clang grows compatible reflection support.

Verification: GCC16 reflection CTest 389/389 enabled tests passed; GCC16 no-reflection offset-table subset 26/26 plus check_includes passed; disabled perf tests were run separately with pinned CPU measurements; git diff --check passed.
victorstewart (Contributor, Author) commented:

With reflection in GCC16 we can now get zero-copy-read-enabled serialization for free!
Shrink the offset-table trailer from 16 bytes to 14 bytes by removing the reserved field and keeping rootTableOff at offset 8 for cheap parsing. Preserve the 16-byte Entry stride after A/B testing showed 15-byte packed entries regressed zero-copy view reads.

A/B notes against HEAD on GCC 15.2.1 with taskset -c 3:
- sample payload size 108 -> 106 bytes, kitchen payload size 554 -> 552 bytes
- sample_view stayed neutral in the final compact-trailer run at 64.849ms vs baseline 65.163/65.545ms
- kitchen_verify at 234.219ms vs 231.572/234.975ms
- kitchen_offset at 657.397ms vs 657.950/655.664ms
- sample_offset at 661.292ms vs warmed baseline 667.748ms

Verification: cmake --build build-release -j48; ctest --test-dir build-release --output-on-failure -j48 passed 387/387, with the two disabled offset-table perf tests not run by CTest.
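To make the layout numbers concrete, here is a rough sketch of locating the root table from a 14-byte trailer. Only the three sizes come from the commit message. Everything else is an assumption for illustration: that the trailer occupies the last 14 bytes of the buffer, that rootTableOff is a 32-bit little-endian value, and the helper names `readRootTableOff` and `entryPos`. The branch's real parser may differ on all of these.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

constexpr std::size_t kTrailerSize = 14;     // from the commit message
constexpr std::size_t kRootTableOffPos = 8;  // from the commit message
constexpr std::size_t kEntryStride = 16;     // from the commit message

// Assumption: the trailer is the last 14 bytes and rootTableOff is a
// 32-bit little-endian value. The real field width is not stated in the PR.
inline std::optional<std::uint32_t> readRootTableOff(const std::vector<std::uint8_t>& buf) {
    if (buf.size() < kTrailerSize) {
        return std::nullopt;  // too short to hold a trailer
    }
    const std::size_t trailerStart = buf.size() - kTrailerSize;
    const std::uint8_t* p = buf.data() + trailerStart + kRootTableOffPos;
    const std::uint32_t off = std::uint32_t{p[0]} | (std::uint32_t{p[1]} << 8) |
                              (std::uint32_t{p[2]} << 16) | (std::uint32_t{p[3]} << 24);
    if (off > trailerStart) {
        return std::nullopt;  // table cannot start inside or after the trailer
    }
    return off;
}

// With a fixed 16-byte stride, entry i sits at a position computed by one multiply-add.
inline std::size_t entryPos(std::uint32_t rootTableOff, std::size_t index) {
    return rootTableOff + index * kEntryStride;
}
```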
Rewrite the zero-copy-read branch around schema-known reflected C++ types instead of a generic self-describing offset table.

The typed wire path now emits the same payload bytes as normal bitsery, with no table, header, or trailer. A first member named version remains a normal payload field and is checked on read; unversioned types use append-only compatibility; dynamic fields keep bitsery's inline size prefixes.

Delete the old offset-table adapter, registry, details implementation, inspect/view headers, reader/serializer tests, and offset-table perf harness. Add typed-wire tests for exact normal-bitsery byte parity, fixed no-header payloads, version mismatch rejection, append-only reads, dynamic inline lengths, nested reflected objects, and the no-reflection fallback.

Verification: GCC16 full suite passed 368/368 enabled tests; GCC16 debug typed-wire passed 7/7 plus check_includes; no-reflection fallback passed 1/1 plus check_includes; git diff --cached --check was clean.

Perf medians over 5 release runs versus quickSerialization:
- application-state: 1.05ms vs 1.05ms for 100k writes, 1 byte
- activity: 0.54ms vs 0.59ms for 50k writes, 24 bytes
- versioned-state: 0.58ms vs 0.53ms for 50k writes, 8 bytes
- static-sample: 0.58ms vs 0.58ms for 50k writes, 40 bytes
- dynamic-sample: 0.15ms vs 0.15ms for 5k writes, 236 bytes
- kitchen-sink: 0.31ms vs 0.31ms for 5k writes, 418 bytes
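The version rule described in this commit message (a first member named version is an ordinary payload field that the reader checks) can be sketched without reflection. The `VersionedState` type, the `ReadStatus` enum, and `readVersionedState` are hypothetical names for this sketch, and it assumes fixed-width fields stored back to back in host byte order; the branch's typed_wire.h derives the equivalent logic from reflected type information instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

enum class ReadStatus { Ok, TooShort, VersionMismatch };  // hypothetical names

struct VersionedState {         // hypothetical aggregate for illustration
    std::uint32_t version = 2;  // first member named `version`
    std::uint32_t counter = 0;
};

// Assumption: fields are fixed-width and stored back to back in host byte order.
inline ReadStatus readVersionedState(const std::vector<std::uint8_t>& buf, VersionedState& out) {
    // The expected version is whatever the type default-initializes it to.
    constexpr std::uint32_t expected = VersionedState{}.version;
    if (buf.size() < 2 * sizeof(std::uint32_t)) {
        return ReadStatus::TooShort;
    }
    std::uint32_t version = 0;
    std::memcpy(&version, buf.data(), sizeof version);
    if (version != expected) {
        return ReadStatus::VersionMismatch;  // reject before touching other fields
    }
    out.version = version;
    std::memcpy(&out.counter, buf.data() + sizeof version, sizeof out.counter);
    return ReadStatus::Ok;
}
```

Because the version is just the first payload field, the bytes on the wire need no header of their own in this sketch.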
victorstewart (Contributor, Author) commented:

Used reflection to drop the offset table entirely. Zero-copy-read buffers are now byte-identical to regular bitsery buffers and cost the same to produce, and reflection handles the reading by leveraging the C++ type information. So there's no reason not to integrate this work now!
Now that typed wire buffers are just normal bitsery bytes, remove the old serializer split and buffer-view alias from the branch. Keep regular bitsery as the only writer and make typed_wire.h a schema-known reflected read/view layer over those bytes.

The remaining branch delta against master is intentionally small: typed_wire.h, typed_wire tests/perf, and the existing GCC16 std_optional/std_variant warning fixes needed to keep the full reflection build green.

Verification: CMake regenerated for GCC16 release/debug/no-reflection; GCC16 release build passed; full CTest passed 368/368 enabled tests; debug typed-wire passed 7/7; no-reflection fallback passed 1/1; check_includes passed in all three build trees; staged whitespace check passed with cr-at-eol allowed for restoring master CRLF serializer.h.

Read perf medians over five release runs versus quickDeserialization:
- application-state: 0.08ms vs 0.08ms for 100k reads
- activity: 0.04ms vs 0.04ms for 50k
- versioned-state: 0.04ms vs 0.05ms for 50k
- static-sample: 0.04ms vs 0.04ms for 50k
- dynamic-sample: 0.10ms vs 0.01ms for 5k
- kitchen-sink: 0.26ms vs 0.11ms for 5k
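The idea of a read/view layer over ordinary serialized bytes can be illustrated with a hand-written borrowed view. This is only a sketch: `MessageView` is a hypothetical type, the payload shape is invented, and the one-byte length prefix is a simplification (bitsery's real size encoding may differ). In the branch, the equivalent accessors are meant to be generated from reflected type information rather than written by hand.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical borrowed view over a payload shaped like:
//   [u8 id][u8 nameLen][nameLen bytes of name]
// The one-byte length prefix is an assumption made for this sketch.
class MessageView {
public:
    // Validate bounds once, then hand out views into the caller's buffer.
    static std::optional<MessageView> make(const std::uint8_t* data, std::size_t size) {
        if (size < 2) {
            return std::nullopt;
        }
        const std::size_t nameLen = data[1];
        if (size - 2 < nameLen) {
            return std::nullopt;
        }
        return MessageView{data, nameLen};
    }

    std::uint8_t id() const { return data_[0]; }

    // Borrowed: points into the original buffer, so no bytes are copied.
    std::string_view name() const {
        return {reinterpret_cast<const char*>(data_ + 2), nameLen_};
    }

private:
    MessageView(const std::uint8_t* data, std::size_t nameLen)
        : data_(data), nameLen_(nameLen) {}

    const std::uint8_t* data_;
    std::size_t nameLen_;
};
```

The view is only valid while the underlying buffer is alive, which is the usual trade-off of borrowed reads.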
victorstewart (Contributor, Author) commented:

Now this PR is nothing more than a reflection-driven zero-copy reader that leverages the C++ type information to read generic bitsery-serialized buffers.
CI jobs without C++26 reflection still compile the disabled typed_wire_perf helper, which names makeTypedWireView at template definition time. The no-reflection branch only declared TypedWireView, so GCC/Clang/MSVC builds failed before tests. Add no-reflection factory overloads that return Status::NoReflection views and cover both overloads in the fallback test.
This PR adds the option to build an offset table during serialization, which enables zero-copy reads out of a Bitsery-serialized buffer.
Several years ago I mentioned wanting this zero-copy functionality (similar to FlatBuffers), and I think I finally thought of the perfect way to do it. (Once we get reflection in C++26, we can drop the FieldRegistry code completely and it can all be automatic.) I ran some perf tests, and on my side this appears to be zero-cost: you don't pay for it at all unless you enable it.
I'm sure you'll have some thoughts, so let's discuss if you want any redesigns.
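For readers unfamiliar with the idea in the description, the core of an offset table can be sketched in a few lines: record where each field starts while writing, then use those offsets to borrow a field directly without decoding the ones before it. `OffsetTableWriter` and `fieldView` are hypothetical names for this sketch and do not reflect the PR's actual serializer, table format, or FieldRegistry.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical writer that records the start offset of every field it emits.
struct OffsetTableWriter {
    std::vector<std::uint8_t> payload;
    std::vector<std::uint32_t> offsets;  // one entry per field, in write order

    void writeField(std::string_view bytes) {
        offsets.push_back(static_cast<std::uint32_t>(payload.size()));
        payload.insert(payload.end(), bytes.begin(), bytes.end());
    }
};

// Borrow field `index` straight out of the payload. A field ends where the
// next one begins, or at the end of the payload for the last field.
inline std::string_view fieldView(const OffsetTableWriter& w, std::size_t index) {
    const std::size_t begin = w.offsets[index];
    const std::size_t end =
        index + 1 < w.offsets.size() ? w.offsets[index + 1] : w.payload.size();
    return {reinterpret_cast<const char*>(w.payload.data()) + begin, end - begin};
}
```

Recording an offset is a single push per field, which is why the feature can be kept off the plain serialization path and only paid for when enabled.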