
zero copy read impl #127

Open
victorstewart wants to merge 9 commits into fraillt:master from victorstewart:zero-copy-read
@victorstewart (Contributor) commented Nov 24, 2025

this PR adds the option to build an offset table during serialization, which enables zero-copy reads out of a Bitsery-serialized buffer.

several years ago i mentioned wanting this zero-copy functionality (similar to FlatBuffers), and i think i've finally found the right way to do it. (once we get reflection in C++26, we can drop the FieldRegistry code completely and it can all be automatic.) i ran some perf tests, and on my side the feature appears to be zero-cost: you don't pay for it at all unless you enable it.

i'm sure you'll have some thoughts, so let's discuss if you want any redesigns.
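For readers unfamiliar with the offset-table idea described above, here is a minimal sketch. It is not the code in this PR: `SketchWriter`, `SketchView`, the host-endian 32-bit encoding, and the 8-byte trailer are all assumptions made for illustration. It only shows the general technique of recording where each field starts while writing, so that a reader can later jump straight to one field without deserializing the rest.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>
#include <vector>

// Hypothetical writer (not bitsery's API): appends fields to a byte buffer
// and remembers the offset at which each field starts.
struct SketchWriter {
    std::vector<std::uint8_t> bytes;
    std::vector<std::uint32_t> offsets;  // one entry per field, in write order

    void appendRaw(const void* p, std::size_t n) {
        const auto* b = static_cast<const std::uint8_t*>(p);
        bytes.insert(bytes.end(), b, b + n);
    }
    void writeU32(std::uint32_t v) {
        offsets.push_back(static_cast<std::uint32_t>(bytes.size()));
        appendRaw(&v, sizeof v);
    }
    void writeString(std::string_view s) {
        offsets.push_back(static_cast<std::uint32_t>(bytes.size()));
        const auto len = static_cast<std::uint32_t>(s.size());
        appendRaw(&len, sizeof len);  // inline length prefix
        appendRaw(s.data(), s.size());
    }
    // Append the offset table, then a trailer of {tableOffset, entryCount}.
    std::vector<std::uint8_t> finish() {
        const auto tableOff = static_cast<std::uint32_t>(bytes.size());
        const auto count = static_cast<std::uint32_t>(offsets.size());
        appendRaw(offsets.data(), offsets.size() * sizeof(std::uint32_t));
        appendRaw(&tableOff, sizeof tableOff);
        appendRaw(&count, sizeof count);
        return bytes;
    }
};

// Hypothetical reader: borrows the buffer and looks fields up by index.
// A real reader would bounds-check every access; this sketch does not.
struct SketchView {
    const std::uint8_t* data;
    std::size_t size;

    std::uint32_t readU32At(std::size_t pos) const {
        std::uint32_t v;
        std::memcpy(&v, data + pos, sizeof v);
        return v;
    }
    std::uint32_t fieldOffset(std::uint32_t i) const {
        const std::uint32_t tableOff = readU32At(size - 8);
        return readU32At(tableOff + i * sizeof(std::uint32_t));
    }
    std::uint32_t u32Field(std::uint32_t i) const {
        return readU32At(fieldOffset(i));
    }
    // Returns a view into the buffer itself: no copy of the string bytes.
    std::string_view stringField(std::uint32_t i) const {
        const std::uint32_t off = fieldOffset(i);
        const std::uint32_t len = readU32At(off);
        return {reinterpret_cast<const char*>(data + off + sizeof len), len};
    }
};

inline void offsetTableExample() {
    SketchWriter w;
    w.writeU32(7);
    w.writeString("hello");
    w.writeU32(42);
    const std::vector<std::uint8_t> buf = w.finish();

    SketchView v{buf.data(), buf.size()};
    assert(v.u32Field(2) == 42);          // jumps directly to the third field
    assert(v.stringField(1) == "hello");  // borrowed from buf, not copied
}
```

The cost of this approach is the extra table and trailer bytes plus the bookkeeping during writes, which is why the PR discussion focuses on keeping the feature-off path untouched.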

Restore include/bitsery/serializer.h to the master implementation and move offset-table recording into a dedicated OffsetTableWriteSerializer used only by serializeWithOffsetTable.

This redesign means callers that are not serializing for zero-copy read stay on the plain Serializer path and pay 0 added runtime cost from the offset-table feature.

It also improves the zero-copy serialization path with cheaper recorded entries plus static-root, registry-metadata, and last-dynamic-cache-hit lookups on the offset-table side.

Measured on build-release/tests/bitsery.test.offset_table_perf:
- Sample (50k iters): quickSerialization=0.47-0.49ms, offsetTableEnabled=0.57-0.61ms
- KitchenSink (5k iters): quickSerialization=0.26-0.27ms, offsetTableEnabled=0.91-0.94ms

Also fix the GCC -Wmaybe-uninitialized std::variant deserialize path by using variant::emplace instead of temporary-variant assignment.

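The `variant::emplace` change mentioned in the commit above can be illustrated with a small sketch. The `readInto` helpers and `loadViaEmplace` are hypothetical stand-ins for a deserializer, not bitsery functions; the only library facility relied on is `std::variant::emplace<I>()`, which constructs the alternative in place and returns a reference to it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <variant>

// Hypothetical stand-ins for "deserialize into this object".
inline void readInto(int& v) { v = 42; }
inline void readInto(std::string& s) { s = "payload"; }

using Value = std::variant<int, std::string>;

// Construct the selected alternative directly inside `out`, then fill it.
// This avoids building a temporary variant and assigning it over `out`,
// which is the pattern the commit message says triggered
// -Wmaybe-uninitialized on GCC.
inline void loadViaEmplace(Value& out, std::size_t index) {
    if (index == 0) {
        readInto(out.emplace<0>());  // emplace returns a reference to the new int
    } else {
        readInto(out.emplace<1>());  // ...or to the new std::string
    }
}

inline void variantEmplaceExample() {
    Value v = std::string("old");
    loadViaEmplace(v, 0);
    assert(v.index() == 0);
    assert(std::get<0>(v) == 42);
}
```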
Remove the unused tableOffsets/payloadSize parameters from emitTableFromCapture.

Clang and AppleClang treat those dead parameters as -Wunused-parameter errors in the PR CI matrix, so the offset-table targets failed to build even though GCC was green. The helper only serves the flat capture fast path and does not need either value.

Reduce the zero-copy serialization code surface while preserving the specialized offset-table fast path.

This keeps the feature-off serialization path clean, collapses the manual offset serializer onto the specialized writer, reuses the normal table/trailer writer for capture, and shares only the shorthand/archive surface between the plain and offset serializers.

Compared with the baseline for this reduction pass, the final warm perf runs improved to:
- Sample (50k): quickSerialization 0.47ms, offsetTableEnabled 0.56-0.57ms (was 0.50-0.53ms and 0.61-0.65ms)
- KitchenSink (5k): quickSerialization 0.27-0.30ms, offsetTableEnabled 0.94-0.97ms (was 0.28-0.30ms and 1.00-1.07ms)

This pass also reduced the branch code surface: the net delta versus master dropped from +4132 lines to +3657 lines, and this commit itself is 309 insertions vs 784 deletions in include/bitsery.
Rewrite the zero-copy-read reflection path around GCC16 C++26 reflection-generated schemas, payload traversal, and table finalization. The generated path now bypasses the generic offset-table serializer for serializer-free reflected aggregates while preserving the normal serializer path for custom bitsery wire policy.

Final GCC16 measurements from tests/offset_table_perf.cpp show the reflected zero-copy-read generated dynamic path at 0.270ms for 5,000 serializations, about 54ns each. The generic bitsery serializer plus offset-table path measured 0.635ms for the same 5,000 serializations, about 127ns each, making the reflected path about 2.35x faster for that benchmark.

For small fixed-layout data, plain quickSerialization measured 0.510ms for 50,000 serializations, about 10.2ns each, while reflected direct zero-copy-read with offset-table metadata measured 0.550ms for 50,000 serializations, about 11.0ns each. In that shape, zero-copy-read driven serialization is effectively free: it emits the metadata needed for borrowed reads while staying within about 0.8ns per object of regular bitsery serialization.

This branch is intentionally GCC16-only for reflection experiments because the implementation depends on C++26 reflection support and -freflection. Clang/MSVC should stay on the non-reflection fallback until upstream Clang grows compatible reflection support.

Verification: GCC16 reflection CTest 389/389 enabled tests passed; GCC16 no-reflection offset-table subset 26/26 plus check_includes passed; disabled perf tests were run separately with pinned CPU measurements; git diff --check passed.
@victorstewart (Contributor Author) commented:

with reflection in GCC16 we can now get zero-copy-read-enabled serialization for free!

Shrink the offset-table trailer from 16 bytes to 14 bytes by removing the reserved field and keeping rootTableOff at offset 8 for cheap parsing. Preserve the 16-byte Entry stride after A/B testing showed 15-byte packed entries regressed zero-copy view reads.

A/B notes against HEAD on GCC 15.2.1 with taskset -c 3: sample payload size 108 -> 106 bytes, kitchen payload size 554 -> 552 bytes. Final compact-trailer run kept sample_view neutral at 64.849ms vs baseline 65.163/65.545ms, kitchen_verify at 234.219ms vs 231.572/234.975ms, kitchen_offset at 657.397ms vs 657.950/655.664ms, and sample_offset at 661.292ms vs warmed baseline 667.748ms.

Verification: cmake --build build-release -j48; ctest --test-dir build-release --output-on-failure -j48 passed 387/387, with the two disabled offset-table perf tests not run by CTest.
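As a rough illustration of the compact trailer described above, the sketch below reads `rootTableOff` from a 14-byte trailer and computes entry positions with a fixed 16-byte stride. Only the three constants come from the commit message. The width and byte order of `rootTableOff` (assumed here to be a host-endian 32-bit value), the function names, and the treatment of the other trailer bytes as opaque are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kTrailerSize = 14;     // from the commit message
constexpr std::size_t kRootTableOffPos = 8;  // from the commit message
constexpr std::size_t kEntryStride = 16;     // from the commit message

// Assumption: rootTableOff is a host-endian uint32_t. The other trailer
// bytes are not described in the source and are left untouched here.
inline bool readRootTableOff(const std::uint8_t* buf, std::size_t size,
                             std::uint32_t& out) {
    if (size < kTrailerSize) {
        return false;  // buffer too small to hold a trailer
    }
    const std::uint8_t* trailer = buf + (size - kTrailerSize);
    // memcpy keeps the read well-defined regardless of alignment.
    std::memcpy(&out, trailer + kRootTableOffPos, sizeof out);
    return true;
}

// With a power-of-two stride, locating entry i is a shift and an add.
inline std::size_t entryPos(std::size_t tableOff, std::size_t i) {
    return tableOff + i * kEntryStride;
}

inline void trailerExample() {
    std::uint8_t buf[20] = {};
    const std::uint32_t stored = 0x11223344u;
    // The trailer occupies the last 14 bytes: it starts at 20 - 14 = 6.
    std::memcpy(buf + 6 + kRootTableOffPos, &stored, sizeof stored);

    std::uint32_t off = 0;
    assert(readRootTableOff(buf, sizeof buf, off));
    assert(off == stored);
    assert(entryPos(32, 3) == 80);
}
```

Keeping the field at a fixed position and the entries at a fixed power-of-two stride matches the reasoning in the commit: the trailer got smaller, but the hot read path kept its simple address arithmetic.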
Rewrite the zero-copy-read branch around schema-known reflected C++ types instead of a generic self-describing offset table. The typed wire path now emits the same payload bytes as normal bitsery, with no table, header, or trailer. A first member named version remains a normal payload field and is checked on read; unversioned types use append-only compatibility; dynamic fields keep bitsery's inline size prefixes.

Delete the old offset-table adapter, registry, details implementation, inspect/view headers, reader/serializer tests, and offset-table perf harness. Add typed-wire tests for exact normal-bitsery byte parity, fixed no-header payloads, version mismatch rejection, append-only reads, dynamic inline lengths, nested reflected objects, and the no-reflection fallback.

Verification: GCC16 full suite passed 368/368 enabled tests; GCC16 debug typed-wire passed 7/7 plus check_includes; no-reflection fallback passed 1/1 plus check_includes; git diff --cached --check was clean.

Perf medians over 5 release runs versus quickSerialization:
- application-state: 1.05ms vs 1.05ms for 100k writes, 1 byte
- activity: 0.54ms vs 0.59ms for 50k writes, 24 bytes
- versioned-state: 0.58ms vs 0.53ms for 50k writes, 8 bytes
- static-sample: 0.58ms vs 0.58ms for 50k writes, 40 bytes
- dynamic-sample: 0.15ms vs 0.15ms for 5k writes, 236 bytes
- kitchen-sink: 0.31ms vs 0.31ms for 5k writes, 418 bytes
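The read-side rules in the commit above (a leading `version` member that is checked on read, and dynamic fields that keep an inline size prefix) can be sketched by hand for one fixed schema. This is not the reflection-generated code from the branch: the `Message` schema, `viewMessage`, `ReadStatus`, the host-endian 32-bit version, and the single-byte length prefix are all simplifying assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>

// Hypothetical schema: struct Message { uint32_t version; std::string name; };
constexpr std::uint32_t kExpectedVersion = 2;

enum class ReadStatus { Ok, VersionMismatch, Truncated };

struct MessageView {
    ReadStatus status;
    std::string_view name;  // borrows from the input buffer, no copy
};

inline MessageView viewMessage(const std::uint8_t* buf, std::size_t size) {
    std::uint32_t version = 0;
    if (size < sizeof version + 1) {
        return {ReadStatus::Truncated, {}};
    }
    // The version is an ordinary leading payload field, checked on read.
    std::memcpy(&version, buf, sizeof version);
    if (version != kExpectedVersion) {
        return {ReadStatus::VersionMismatch, {}};
    }
    // Simplifying assumption: the inline size prefix is a single byte.
    const std::size_t len = buf[sizeof version];
    const std::size_t start = sizeof version + 1;
    if (size - start < len) {
        return {ReadStatus::Truncated, {}};
    }
    return {ReadStatus::Ok,
            std::string_view(reinterpret_cast<const char*>(buf + start), len)};
}

inline void typedWireExample() {
    std::uint8_t buf[8] = {};
    const std::uint32_t version = 2;
    std::memcpy(buf, &version, sizeof version);
    buf[4] = 3;  // length prefix
    buf[5] = 'a';
    buf[6] = 'b';
    buf[7] = 'c';

    const MessageView view = viewMessage(buf, sizeof buf);
    assert(view.status == ReadStatus::Ok);
    assert(view.name == "abc");
}
```

Because the reader knows the schema at compile time, the buffer needs no table, header, or trailer; the view simply points into the existing payload bytes.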
@victorstewart (Contributor Author) commented:

used reflection to drop the offset table entirely... zero-copy-read buffers are now byte-identical to regular bitsery buffers at equal cost, and reflection handles the reading by leveraging the C++ type information! so there's no reason not to integrate this work now!

Now that typed wire buffers are just normal bitsery bytes, remove the old serializer split and buffer-view alias from the branch. Keep regular bitsery as the only writer and make typed_wire.h a schema-known reflected read/view layer over those bytes.

The remaining branch delta against master is intentionally small: typed_wire.h, typed_wire tests/perf, and the existing GCC16 std_optional/std_variant warning fixes needed to keep the full reflection build green.

Verification: CMake regenerated for GCC16 release/debug/no-reflection; GCC16 release build passed; full CTest passed 368/368 enabled tests; debug typed-wire passed 7/7; no-reflection fallback passed 1/1; check_includes passed in all three build trees; staged whitespace check passed with cr-at-eol allowed for restoring master CRLF serializer.h.

Read perf medians over five release runs versus quickDeserialization:
- application-state: 0.08ms vs 0.08ms for 100k reads
- activity: 0.04ms vs 0.04ms for 50k
- versioned-state: 0.04ms vs 0.05ms for 50k
- static-sample: 0.04ms vs 0.04ms for 50k
- dynamic-sample: 0.10ms vs 0.01ms for 5k
- kitchen-sink: 0.26ms vs 0.11ms for 5k
@victorstewart (Contributor Author) commented:

now this PR is nothing more than a reflection-driven zero-copy reader that leverages the C++ type information to read generic bitsery-serialized buffers

CI jobs without C++26 reflection still compile the disabled typed_wire_perf helper, which names makeTypedWireView at template definition time. The no-reflection branch only declared TypedWireView, so GCC/Clang/MSVC builds failed before tests.

Add no-reflection factory overloads that return Status::NoReflection views and cover both overloads in the fallback test.
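A sketch of the fallback shape described above follows. The names `TypedWireView`, `makeTypedWireView`, and `Status::NoReflection` come from the commit message, but their signatures here are guesses, and the sketch assumes the compiler advertises reflection through the `__cpp_impl_reflection` feature-test macro. The reflection branch is a placeholder; the point is only that both factory overloads exist and compile on every compiler, so templates that name them still build.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Status { Ok, NoReflection };

// Assumed shape of the view type; the real one is not shown in the PR text.
template <typename T>
struct TypedWireView {
    Status status;
    const std::uint8_t* data;
    std::size_t size;
    bool ok() const { return status == Status::Ok; }
};

#if defined(__cpp_impl_reflection)
constexpr bool kHasReflection = true;
#else
constexpr bool kHasReflection = false;
#endif

// Overload 1 (assumed signature): pointer plus size.
template <typename T>
TypedWireView<T> makeTypedWireView(const std::uint8_t* data, std::size_t size) {
    if constexpr (kHasReflection) {
        // Placeholder: reflection-driven schema checks would go here.
        return {Status::Ok, data, size};
    } else {
        // Without reflection the view is still constructible, but it reports
        // that it cannot be used.
        return {Status::NoReflection, data, size};
    }
}

// Overload 2 (assumed signature): any contiguous byte container.
template <typename T, typename Buffer>
TypedWireView<T> makeTypedWireView(const Buffer& buffer) {
    return makeTypedWireView<T>(buffer.data(), buffer.size());
}

struct Sample {
    int value;
};

inline void fallbackExample() {
    const std::vector<std::uint8_t> bytes(4);
    const auto fromPointer = makeTypedWireView<Sample>(bytes.data(), bytes.size());
    const auto fromBuffer = makeTypedWireView<Sample>(bytes);
    // Both overloads agree, and ok() tracks whether reflection is available.
    assert(fromPointer.status == fromBuffer.status);
    assert(fromPointer.ok() == kHasReflection);
}
```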