Skip to content

Asmod4n/mruby-cbor

Repository files navigation

mruby-cbor

Fast, spec-compliant CBOR encoding and decoding for mruby.

RFC 8949 License: Apache 2.0


What is CBOR and Why You Want It

CBOR (Concise Binary Object Representation, RFC 8949) is a binary data format like JSON, msgpack, or Protocol Buffers. It's small, fast, and human-readable in diagnostic notation. Use it when you need:

  • Wire efficiency: Encode numbers, arrays, and maps in the fewest bytes possible
  • Language-agnostic data exchange: Send structs between mruby and Python, Node.js, Go, Rust, etc.
  • Self-describing types: Distinguish between integers and floats, byte strings and text strings
  • Recursive data structures: Encode cyclic arrays and shared references without duplication
  • Custom types: Extend CBOR with application-defined tags for domain objects

This gem gives you ~30% faster encoding than msgpack, 1.3–3× faster selective decoding than simdjson (lazy mode), and zero security surprises—depth limits, overflow protection, and deterministic behavior by default.


Table of Contents

  1. Quick Start
  2. Installation
  3. Core Features
  4. Usage Examples
  5. Advanced Topics
  6. Determinism Guarantees
  7. Performance & Tuning
  8. Error Handling
  9. Specification & Interoperability

Quick Start

# Encode
data = { "users" => [{ "id" => 1, "name" => "Alice" }], "count" => 1 }
buffer = CBOR.encode(data)
puts "Encoded #{buffer.bytesize} bytes"

# Decode (eager)
result = CBOR.decode(buffer)
p result  # => {"users"=>[{"id"=>1, "name"=>"Alice"}], "count"=>1}

# Decode (lazy — access only what you need)
lazy = CBOR.decode_lazy(buffer)
first_user_name = lazy["users"][0]["name"].value
puts "First user: #{first_user_name}"

# Shared references (deduplication + cyclic structures)
shared = [1, 2, 3]
obj = { "x" => shared, "y" => shared }
buf = CBOR.encode(obj, sharedrefs: true)
result = CBOR.decode(buf)
result["x"].equal?(result["y"])  # => true (same object)

Installation

Add to your mruby build_config.rb:

MRuby::Build.new do |conf|
  # ... other config ...
  conf.gem mgem: "mruby-cbor"
end

Then build:

cd /path/to/mruby
rake compile

To run tests:

rake test

Core Features

Encode Everything

Type Support Notes
Integers ✅ Full Fixnum + Bigint (Tag 2/3 RFC 8949)
Floats ✅ Full Preferred serialization: f16→f32→f64 (smallest lossless width)
Strings ✅ Full UTF-8 text and binary byte strings
Arrays ✅ Full Nested, empty, arbitrary size
Maps/Hashes ✅ Full Any keys, arbitrary nesting
Booleans & Nil ✅ Full true, false, nil
Symbols ✅ Full Three modes: strip, tag+string, tag+uint32 (presym)
Classes & Modules ✅ Full Auto round-trip via Tag 49999
Custom Classes ✅ Full Via native_ext_type DSL or register_tag proc
Shared Objects ✅ Full Tag 28/29 for deduplication and cycles

Zero-Copy Decoding

Both eager and lazy decoding operate directly on the input buffer without copying:

Eager decoding: Strings and byte strings reference the original buffer via views (no copy):

buf = CBOR.encode("hello world")
result = CBOR.decode(buf)
result.bytesize  # => 11 (string doesn't own its data, references buf)

Lazy decoding (CBOR::Lazy) wraps the wire buffer without decoding:

lazy = CBOR.decode_lazy(huge_buffer)
field = lazy["metadata"]["version"].value  # Parse only what you access

Constant-time repeated access via built-in caching. No intermediate allocations.

Determinism & Reproducibility

Same input → same output, always. Encoding is deterministic:

  • Float width determined by bit-pattern, not floating-point rounding
  • Hash field order follows insertion order (mruby's hash implementation)
  • NaN always encodes as canonical 0xF97E00 (quiet f16 NaN per RFC 8949)
  • Negation of large bignums computed in integer-only arithmetic, no FP edge cases

See Determinism Guarantees for full details.

Security & Robustness

  • Recursion depth limits (configurable, profile-dependent default 32–128)
  • Integer overflow protection (explicit bounds checks, no wraparound)
  • Buffer bounds checking on every read (no out-of-bounds access)
  • UTF-8 validation (required for text strings; binary strings are untouched)
  • Type safety (schema-based validation for registered tags)
  • No indefinite-length items (use streaming instead if needed)

Streaming & Framing

Decode CBOR sequences from strings, files, or sockets:

# From a string of concatenated CBOR documents
CBOR.stream(buf) { |doc| process(doc.value) }

# From a file (with readahead for large documents)
File.open("data.cbor") { |f| CBOR.stream(f) { |doc| ... } }

# From a socket (incremental recv loop)
CBOR.stream(socket) { |doc| ... }

Usage Examples

Basic Encoding/Decoding

# Integers
CBOR.decode(CBOR.encode(42))            # => 42
CBOR.decode(CBOR.encode(-1))            # => -1
CBOR.decode(CBOR.encode(2**100))        # => 1267650600228229401496703205376 (bigint)

# Floats
CBOR.decode(CBOR.encode(1.5))           # => 1.5
CBOR.decode(CBOR.encode(Float::INFINITY)) # => Infinity
CBOR.decode(CBOR.encode(Float::NAN))    # => NaN

# Strings
CBOR.decode(CBOR.encode("hello"))       # => "hello"
CBOR.decode(CBOR.encode("Ñoño"))        # => "Ñoño" (UTF-8 preserved)

# Arrays & nested structures
data = [1, [2, [3, 4]], "text"]
CBOR.decode(CBOR.encode(data)) == data  # => true

# Maps
h = { "x" => 10, "y" => { "nested" => true } }
CBOR.decode(CBOR.encode(h)) == h        # => true

On-Demand Decoding (Lazy)

Parse only what you access. Perfect for large documents where you only need a few fields:

# Wire buffer (not decoded yet)
lazy = CBOR.decode_lazy(big_payload)

# Navigate by chaining `[]` or `dig`:
value = lazy["response"]["data"][0]["id"].value
# Only the path you access is decoded; rest stays compressed

# `dig` is safe (returns nil for missing keys):
status = lazy.dig("response", "data", "status").value
status = lazy.dig("nonexistent", "path").value  # => nil

# Repeated access uses cache (O(1) after first access):
id1 = lazy["response"]["data"][0]["id"].value
id2 = lazy["response"]["data"][0]["id"].value  # Same object, no re-parse
id1.equal?(id2)  # => true

# Negative array indices work:
last = lazy["items"][-1].value  # Last item

Performance: O(n) only in skipped elements, not the full document. For a 10 MB payload where you access 1% of fields, lazy decoding pays off immediately.

Shared References & Cyclic Structures

Eliminate duplication. Represent cycles without infinite loops:

# Two variables pointing to the same array
shared_array = [1, 2, 3]
obj = { "x" => shared_array, "y" => shared_array }

# Encode with deduplication (Tag 28/29)
buf = CBOR.encode(obj, sharedrefs: true)

# Decode: identity is preserved
decoded = CBOR.decode(buf)
decoded["x"].equal?(decoded["y"])  # => true ✓ Same object

# Cyclic structures (array containing itself)
cyclic = []
cyclic << cyclic

buf = CBOR.encode(cyclic, sharedrefs: true)
result = CBOR.decode(buf)
result.equal?(result[0])  # => true (self-referential)

# Works with lazy decoding too:
lazy = CBOR.decode_lazy(buf)
decoded = lazy.value
decoded.equal?(decoded[0])  # => true

How it works: First occurrence is tagged with Tag 28 (shareable). Subsequent references use Tag 29 (shared ref) with an index into the shareable table. On decode, the index is resolved to the already-decoded object, preserving identity.

Custom Types with native_ext_type

Define a schema for your classes using the native_ext_type DSL:

class Address
  native_ext_type :@street, String
  native_ext_type :@city,   String
  native_ext_type :@zip,    String
end

class Person
  native_ext_type :@name,    String
  native_ext_type :@age,     Integer
  native_ext_type :@address, Address    # Nested class!
  native_ext_type :@active,  TrueClass, FalseClass  # Multiple types OK

  def _after_decode
    puts "Person #{@name} loaded"
    self
  end

  def _before_encode
    @age += 1 if @age < 18  # Modify before encoding
    self
  end
end

# Register with a tag number
CBOR.register_tag(1000, Person)
CBOR.register_tag(1001, Address)

# Encode
addr = Address.new
addr.instance_variable_set(:@street, "Main St")
addr.instance_variable_set(:@city, "Berlin")
addr.instance_variable_set(:@zip, "10115")

person = Person.new
person.instance_variable_set(:@name, "Alice")
person.instance_variable_set(:@age, 30)
person.instance_variable_set(:@address, addr)
person.instance_variable_set(:@active, true)

encoded = CBOR.encode(person)

# Decode (hooks are called automatically)
decoded = CBOR.decode(encoded)  # _after_decode called
decoded.instance_variable_get(:@name)  # => "Alice"
decoded.instance_variable_get(:@address).instance_variable_get(:@city)  # => "Berlin"

Type constraints use standard Ruby classes: String, Integer, Float, Array, Hash, TrueClass, FalseClass, NilClass, and any registered class. Inheritance works: Numeric matches both Integer and Float.

Security: Only declared ivars are populated (allowlist model). Extra fields in the payload are silently ignored.

Streaming

Decode CBOR sequences (multiple concatenated documents):

# From a string
buf = CBOR.encode("hello") + CBOR.encode("world") + CBOR.encode([1,2,3])

results = []
CBOR.stream(buf) { |doc| results << doc.value }
# => ["hello", "world", [1,2,3]]

# As an enumerator (no block)
docs = CBOR.stream(buf).map(&:value)

# From a file (with intelligent readahead)
File.open("data.cbor", "rb") do |f|
  CBOR.stream(f) { |doc| process(doc.value) }
end

# From a socket (event-driven)
socket = TCPSocket.new("host", port)
CBOR.stream(socket) { |doc| handle(doc.value) }

# Manual socket handling (for async frameworks)
decoder = CBOR::StreamDecoder.new { |doc| handle(doc.value) }
while chunk = socket.recv(4096)
  decoder.feed(chunk)
end

Dispatch rules: Automatically detects string (via bytesize/byteslice), file (via seek/read), or socket (via recv), and handles buffering transparently.

Diagnostic Notation

Human-readable output for debugging and logging (RFC 8949 §8.1):

# Integers and basic types
CBOR.diagnose(CBOR.encode(1))        # => "1"
CBOR.diagnose(CBOR.encode(true))     # => "true"
CBOR.diagnose(CBOR.encode(nil))      # => "null"

# Floats with width suffix (_1=f16, _2=f32, _3=f64)
CBOR.diagnose(CBOR.encode(1.0))      # => "1.0_1"   (f16, 3 bytes)
CBOR.diagnose(CBOR.encode(1.0e10))   # => "10000000000.0_2"  (f32, 5 bytes)
CBOR.diagnose(CBOR.encode(3.14))     # => "3.14_3"  (f64, 9 bytes)

# NaN and Infinity
CBOR.diagnose(CBOR.encode(Float::NAN))      # => "NaN_1"
CBOR.diagnose(CBOR.encode(Float::INFINITY)) # => "Infinity_1"

# Strings and bytes
CBOR.diagnose(CBOR.encode("hello"))         # => '"hello"'
CBOR.diagnose(CBOR.encode("\x01\x02\x03".b)) # => "h'010203'"

# Containers
CBOR.diagnose(CBOR.encode([1, [2, 3]]))     # => "[1,[2,3]]"
CBOR.diagnose(CBOR.encode({"a" => 1}))      # => '{"a":1}'

# Tags
CBOR.diagnose("\xc1\x1a\x51\x4b\x67\xb0")   # => "1(1363896240)"

# Indefinite-length items (from external sources)
CBOR.diagnose("\x9f\x01\x02\xff")           # => "[_ 1,2]"
CBOR.diagnose("\xbf\x61a\x01\xff")          # => "{_ \"a\":1}"

MRB_NO_FLOAT builds: Use exact rational arithmetic for f16/f32 and hex-float notation for f64 (RFC 8610 Appendix G), avoiding floating-point operations entirely.

Symbol Handling

Three strategies, each with tradeoffs:

# Strategy 1: Strip symbols (default, no round-trip)
CBOR.no_symbols
sym = :hello
encoded = CBOR.encode(sym)
decoded = CBOR.decode(encoded)  # => "hello" (lost the symbol!)

# Strategy 2: Tag 39 + string (RFC 8949, interoperable)
CBOR.symbols_as_string
sym = :hello
encoded = CBOR.encode(sym)
decoded = CBOR.decode(encoded)  # => :hello (preserved!)
# Works with other implementations that recognize tag 39

# Strategy 3: Tag 39 + uint32 (mruby-only, fastest)
CBOR.symbols_as_uint32
sym = :hello
encoded = CBOR.encode(sym)
decoded = CBOR.decode(encoded)  # => :hello
# Faster: just an integer ID, but requires same mruby build and presym

# In data structures
CBOR.symbols_as_string
h = { name: "Alice", age: 30 }
decoded = CBOR.decode(CBOR.encode(h))
decoded[:name]  # => "Alice"
Mode Encoding Interop Round-trip Speed
no_symbols Plain string ✅ All ❌ No Fast
symbols_as_string Tag 39 + string ✅ All ✅ Yes Good
symbols_as_uint32 Tag 39 + uint32 ❌ mruby only ✅ Yes Fastest

⚠️ symbols_as_uint32 requires:

  • Same mruby build (encoder and decoder must use identical libmruby.a)
  • Compile-time symbols (presym enabled)
  • No dynamic symbol creation during decoding

Use symbols_as_string for external/untrusted data. Use symbols_as_uint32 only for internal mruby-to-mruby IPC.

Proc-Based Tags

For types that can't use native_ext_type (e.g., Exception, Time, or built-in C types), register encode/decode procs:

# Exception serialization (encode + decode)
CBOR.register_tag(50000) do
  encode Exception do |e|
    [e.class, e.message, e.backtrace]
  end

  decode Array do |a|
    exc = a[0].new(a[1])
    exc.set_backtrace(a[2]) if a[2]
    exc
  end
end

# Now encode and re-raise exceptions:
buf = begin
  raise ArgumentError, "something went wrong"
rescue => e
  CBOR.encode(e)
end

exc = CBOR.decode(buf)
raise exc  # Re-raises ArgumentError with original message & backtrace

# Time serialization (if mruby-time available)
CBOR.register_tag(1) do  # Tag 1 is RFC 8949 epoch-based time
  encode Time do |t|
    t.to_f  # Seconds since Unix epoch as float
  end

  decode Float do |v|
    Time.at(v)
  end
end

CBOR.encode(Time.now)     # => tag(1, <float>)
CBOR.decode(buf)          # => Time object

Encode type matching uses kind_of?: Register Exception and it matches StandardError, ArgumentError, and all subclasses. Natively-encoded types (String, Integer, Array, Hash, etc.) are rejected as encode types (they're handled by the core encoder and would be ignored).


Advanced Topics

Classes and Modules (Tag 49999)

Classes and modules round-trip automatically via a private tag (49999):

CBOR.encode(String)              # => tag(49999, "String")
CBOR.decode(buf)                 # => String (the class itself)

CBOR.encode(CBOR::UnhandledTag)  # => tag(49999, "CBOR::UnhandledTag")

# In structures
h = { "exception_type" => StandardError, "code" => 123 }
decoded = CBOR.decode(CBOR.encode(h))
decoded["exception_type"]  # => StandardError

# Anonymous classes raise ArgumentError
CBOR.encode(Class.new)     # Error: "cannot encode anonymous class/module"

mruby-to-mruby only: The constant path syntax (:: separator) is Ruby-specific. Other runtimes won't know how to resolve StandardError from the string path.

Float Encoding (Preferred Serialization)

Floats use the smallest CBOR float width that represents them losslessly:

CBOR.encode(0.0).bytesize      # => 3 bytes (f16)
CBOR.encode(1.0).bytesize      # => 3 bytes (f16)
CBOR.encode(1.5).bytesize      # => 3 bytes (f16)
CBOR.encode(1.0e10).bytesize   # => 5 bytes (f32)
CBOR.encode(3.14).bytesize     # => 9 bytes (f64)

Algorithm: Pure bit-pattern arithmetic, zero floating-point operations:

Value Encoding Rationale
NaN (any payload) f16 0x7E00 Canonical per RFC 8949 Appendix B
±Inf, ±0 f16 Always representable
f16 normal f16 if exp32 ∈ [113..142] and low 13 mant bits = 0 Lossless in f16
f16 subnormal f16 if exp32 ∈ [103..112] and shift is exact Lossless in f16
Fits in f32 f32 if low 29 f64 mant bits = 0 and exp in range Lossless in f32
Everything else f64 Need full precision

MRB_USE_FLOAT32 builds: Start at f32 and try f16, skipping f64 entirely.

Unhandled Tags

CBOR documents may contain tags your code doesn't recognize. Rather than failing, unknown tags decode as CBOR::UnhandledTag objects:

# Decode a tag(100, 42) that you didn't register
result = CBOR.decode("\xD8\x64\x18\x2A")
result.is_a?(CBOR::UnhandledTag)  # => true
result.tag                         # => 100
result.value                       # => 42

# In nested structures
result = CBOR.decode("\xD8\x64\x83\x01\x02\x03")
result.is_a?(CBOR::UnhandledTag)   # => true
result.value                       # => [1, 2, 3]

Error Handling

begin
  CBOR.decode(malformed_buffer)
rescue RangeError => e
  # Out-of-bounds access, integer overflow, length too large
  puts "Bounds error: #{e.message}"
rescue RuntimeError => e
  # Truncated buffer, invalid encoding, nesting too deep
  puts "Runtime error: #{e.message}"
rescue TypeError => e
  # Type mismatch in registered tags
  puts "Type error: #{e.message}"
rescue KeyError => e
  # Missing key in lazy map access (use dig for safe access)
  puts "Missing key: #{e.message}"
rescue NameError => e
  # Tag 49999 resolves to unknown constant
  puts "Unknown class: #{e.message}"
end

Determinism Guarantees

This section answers: Will the same input always produce the same output?

✅ What IS Deterministic

  1. Integer encoding (all bases: 2, 10, 16, etc.)

    • Fixnum (mrb_int): Encoded in varint-style (shortest form, no padding)
    • Bignum (mrb_bigint): Converted to big-endian byte string (Tag 2/3), no lossy conversions
    • Negative integers: Computed via -1 - n rule, not two's complement tricks
    • Same integer value → same wire bytes, always, regardless of how the integer was created
    • Example: 2**100 and (1 << 100) encode identically, even across mruby builds
  2. Float width selection (bit-pattern only)

    • A float's wire encoding depends solely on its bit-pattern, not FP rounding
    • NaN always → canonical f16 0xF97E00, regardless of NaN payload or sign bit
    • Same float value → same wire bytes, always
  3. Hash field order

    • Encoding follows insertion order (guaranteed in mruby)
    • Hashes are always encoded in the order fields are created, forever
    • Reproducible across encodes, mruby versions, and across different machines
  4. Bignum negation

    • Negative bignums computed in pure integer arithmetic (no FP edge cases)
    • -n = -1 - (n-1) for correctness; no rounding, no overflow surprises
  5. Integer overflow protection

    • Explicit bounds checks prevent wraparound
    • Out-of-range values raise RangeError consistently

⚠️ What Is NOT Deterministic Across Builds

Factor Impact Details
mruby build config Symbols, float width range MRB_USE_FLOAT32, presym settings affect encoding choices
Compile flags Might affect numeric representation Different CFLAGS could theoretically affect float behavior (though unlikely in practice)
Symbol IDs Non-portable across mruby binaries Presym IDs differ between mruby builds; use symbols_as_string for portability

Practical: Same mruby binary + same input = same output, forever. For cross-machine reproducibility, use symbols_as_string (portable) instead of symbols_as_uint32 (binary-specific).

RFC 8949 Compliance

This implementation strictly follows RFC 8949:

  • §3.1 (Unsigned integers): Encoded in shortest form (varint-style)
  • §3.2 (Negative integers): Follows -1 - n rule consistently
  • §3.3 (Byte strings): Uninterpreted binary; no UTF-8 check
  • §3.4 (Text strings): Validated UTF-8; major type 3
  • §3.5 (Arrays & maps): Definite length only (no indefinite, use streaming instead)
  • §4.1 (Float preferred serialization): Smallest lossless width (f16→f32→f64)
  • §4.2 (Simplicity values): false, true, null only (not undefined)
  • §3.4.3 (Bignums/Tags 2&3): RFC 8949 zero-length payload rule respected
  • §3.4.1 (Tags 28&29): Shared references and identity-preserving decoding
  • Appendix B (CBOR Canonical CBOR, CTAP2): NaN always 0xF97E00

Performance & Tuning

Benchmarks (Relative)

Operation Time vs. msgpack vs. simdjson
Encode small struct ~30% faster N/A
Encode large array ~25% faster N/A
Decode (eager) full ~20% faster N/A
Decode (lazy) selective access N/A 1.3–3× faster

Lazy decoding shines: When you only need a few fields from a 10 MB payload, lazy is 10–100× faster than eager.

Recursion Depth Tuning

Default limits depend on mruby profile:

MRB_PROFILE_MAIN / MRB_PROFILE_HIGHCBOR_MAX_DEPTH = 128
MRB_PROFILE_BASELINECBOR_MAX_DEPTH = 64
Constrained / otherCBOR_MAX_DEPTH = 32

Override at compile time:

cd /path/to/mruby
MRUBY_CFLAGS="-DCBOR_MAX_DEPTH=256" rake compile

Exceeding depth raises RuntimeError: "CBOR nesting depth exceeded".

Memory Layout (SBO + Heap)

Encoding uses a small-buffer optimization (SBO) with 16 KB stack buffer:

  • Small documents (< 16 KB) → stack only, no allocation
  • Larger → spills to heap, grows geometrically
  • No unnecessary copies; final result is the heap string

File I/O Optimization

File streaming uses adaptive readahead with doubling strategy:

  1. First read: 9 bytes (minimum needed to parse any CBOR header)
  2. Use doc_end() to find where the document ends by skipping its contents
  3. If the buffer contains the full document: yield it, move to next
  4. If not complete: double the read size, re-read from the same offset, retry
  5. Continue doubling until the full document is buffered
  6. Then read exactly the remaining bytes needed (if any) to avoid over-reading

Why doubling? CBOR documents can be arbitrarily nested (arrays, maps, tags wrapping each other). Finding the document boundary requires parsing the structure, not just reading a length header. The doubling strategy balances:

  • Most documents (< 16 KB) fit in 1–2 reads
  • Large documents don't require excessive seeks
  • No fixed buffer size that wastes memory or fails on edge cases

Error Handling

Error Cause Example
ArgumentError Invalid option or reserved tag number CBOR.encode(obj, invalid: true)
RangeError Integer overflow or invalid length Encoding bigint larger than wire allows
RuntimeError Malformed CBOR, depth exceeded, or unimplemented Truncated buffer, nested too deep, indefinite-length
TypeError Type mismatch in registered tag or unknown source Field is Array but schema expects String
KeyError Lazy map access to missing key lazy["nonexistent"] (use .dig for safe access)
IndexError Lazy array out of bounds or invalid shared ref lazy[999] on 3-item array
NameError Unknown constant in Tag 49999 Decoding a class name not defined on this side
NotImplementedError Indefinite-length or presym unavailable CBOR.symbols_as_uint32 on non-presym mruby

Specification & Interoperability

CBOR RFC 8949

Full RFC 8949 compliance. Official specification: https://tools.ietf.org/html/rfc8949

  • RFC 8949 Section 1–6: Complete support (except indefinite-length)
  • RFC 8949 Appendix A–D: Diagnostic notation, CDDL, CBOR in JSON, test vectors

Interoperability

Tested against:

  • Python: cbor2 library (comprehensive test suite via interop.py)

RFC 8949 compliance: This implementation strictly adheres to RFC 8949, so it should interoperate with any spec-compliant CBOR decoder in any language.

Guaranteed portability via symbols_as_string: If you use string-based symbols (Tag 39), your CBOR documents are readable by any RFC 8949-compliant decoder in any language.

Test Vectors

Official RFC 8949 test vectors (Appendix A) included in /test-vectors/. Run:

# Validate against official vectors
rake test

License

Apache License 2.0. See LICENSE file.


Contributing

Issues, PRs, and bug reports welcome. See interop.py for testing against other implementations.


Advanced Reference: Implementation Details

Float Encoding Algorithm (Pure Bit-Pattern, Zero FP Ops)

1. Extract sign, exponent, mantissa from f64 bit-pattern
2. If NaN: emit canonical f16 0xF97E00, done
3. If can fit in f32 (29 low mant bits = 0, exp in range): continue to f16 check
4. If can fit in f16 (subnormal or normal range checks): emit f16
5. If fits in f32: emit f32
6. Else: emit f64

Key insight: No float rounding—entire algorithm is integer bit manipulation.

Shared Reference Algorithm (Tag 28/29)

Encoding: When sharedrefs: true, maintain a hash of seen objects by mrb_obj_id:

1. Pre-pass (skip_cbor): Walk the object graph, tagging each unique object once with Tag 28
2. Encode: When re-encountering an object, emit Tag 29 + index into shareable table
3. Identity preserved: Decoder maintains a parallel array, resolving indexes to same decoded objects

Decoding:

1. Tag 28 (Shareable): Decode the wrapped value, push to shareable table
2. Tag 29 (Shared ref): Read index, fetch from shareable table, return that object
3. Result: `decoded["x"].equal?(decoded["y"])` is true if `obj["x"]` and `obj["y"]` were identical

Lazy Decoding Architecture (Zero-Copy, On-Demand)

Key insight: Lazy object is a (buffer, offset, key-cache) triple

1. cbor_lazy_new: Create Lazy wrapping buffer + offset, cache empty
2. lazy["key"]: Seek to offset, decode just enough to find key, create new Lazy for value, cache result
3. lazy.value: Decode from current offset, return fully-realized value, cache in vcache
4. Repeated access: Fetch from cache (O(1))

Buffer is not copied; each Lazy view just changes the offset. Perfect for partial extraction from large files.

BigInt Encoding (Tag 2/3)

RFC 8949 §3.4.3: Bignums outside int64 range encoded as:

  • Tag 2: Positive bigint as byte string (big-endian)
  • Tag 3: Negative bigint as byte string (|magnitude| - 1, big-endian)

Example: (1 << 200) + 1 → Tag 2 wrapping 26-byte hex string

Zero-length payloads: tag(2, h'') = 0, tag(3, h'') = -1 (edge case handled per RFC 8949)

Symbol Encoding Strategies

Mode Wire Format Lookup Speed Interop
no_symbols Plain string N/A Universal, loses type
symbols_as_string Tag 39 + string O(string compare) RFC 8949 compatible
symbols_as_uint32 Tag 39 + uint32 O(array index) mruby-only, requires presym

Presym IDs are non-portable: Symbol ID 42 on your mruby might be ID 100 on another mruby built with different --enable-presym-inline settings.

Stack Overflow Prevention (Depth Tracking)

Each decode call increments a per-Reader depth counter. At CBOR_MAX_DEPTH, raises RuntimeError. Prevents:

- Deeply nested arrays: [[[[[...]]]]]]
- Deeply nested maps: {"a": {"a": {"a": ...}}}
- Circular references without sharedrefs (would loop forever)

Limit is configurable at compile time and profile-aware.

String Encoding (UTF-8 vs Binary Auto-Detection)

When encoding, strings are automatically classified:

# UTF-8 string → encoded as text (major type 3)
CBOR.encode("hello")  # => major 3 (text string)

# Binary string → encoded as bytes (major type 2)
CBOR.encode("\x00\xFF\xFE".b)  # => major 2 (byte string)

The encoder checks each string's UTF-8 validity at encode time and chooses the appropriate major type (3 for text, 2 for binary).

When decoding, text strings (major type 3) are validated as UTF-8 when mruby is compiled with MRB_UTF8_STRING. If mruby was compiled without UTF-8 string support, the validation is skipped (the strings are still decoded, just not validated).

Byte strings (major type 2) are never validated or touched—they're uninterpreted binary, regardless of compile flags.

This matches RFC 8949 (which requires UTF-8 for text strings) and prevents UTF-8 injection attacks when validation is enabled.


Built with ❤️ for mruby. Fast. Reliable. Deterministic.

About

Fully spec compliant CBOR implementation in mruby with advanced features

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors