genUnicodeString generates invalid unicode #167

@martyall

The Unicode character generator picks a random CodePoint in the BMP. The Unicode string generator just generates an arbitrary array of such code points and turns it into a string. It turns out that this can generate invalid Unicode via unpaired surrogates: https://unicode.org/faq/utf_bom.html#utf16-7
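To make the failure mode concrete, here is a minimal TypeScript sketch (not the library's code) that detects unpaired surrogates in a UTF-16 string. The names `hasUnpairedSurrogate` and `stringFromBmpCodePoints` are hypothetical and exist only for illustration; the second one is a stand-in for the behaviour described above, assuming each BMP code point becomes exactly one UTF-16 code unit.

```typescript
// Returns true if the string contains a surrogate code unit that is not
// part of a well-formed high/low pair.
function hasUnpairedSurrogate(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: it must be followed by a low surrogate.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // skip the low half of a valid pair
      } else {
        return true;
      }
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Low surrogate with no preceding high surrogate.
      return true;
    }
  }
  return false;
}

// Hypothetical stand-in for the string generator: one code unit per
// BMP code point, with no check for surrogates.
function stringFromBmpCodePoints(codePoints: number[]): string {
  return codePoints.map((c) => String.fromCharCode(c)).join("");
}
```

For example, `stringFromBmpCodePoints([0x61, 0xd800, 0x62])` contains a lone high surrogate, so `hasUnpairedSurrogate` returns `true` for it, and a UTF-8 round trip of that string cannot succeed.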

One solution here would be to restrict the code points to avoid such cases. Another would be to figure out a more complicated but correct way to generate Unicode, which cannot be done CodePoint by CodePoint.
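The first option could look roughly like the following TypeScript sketch, which draws only BMP code points outside the surrogate range U+D800..U+DFFF. All names here (`BMP_SCALAR_COUNT`, `bmpScalarFromIndex`, `genBmpString`) are hypothetical, and this is an illustration of the idea rather than a proposed patch. It also never produces characters outside the BMP.

```typescript
// Number of BMP code points that are not surrogates: 0x10000 - 0x800.
const BMP_SCALAR_COUNT = 0x10000 - 0x800;

// Maps an index in [0, BMP_SCALAR_COUNT) onto a BMP code point,
// skipping the surrogate range U+D800..U+DFFF.
function bmpScalarFromIndex(index: number): number {
  return index < 0xd800 ? index : index + 0x800;
}

// Hypothetical generator: `rand` returns a float in [0, 1), like Math.random.
function genBmpString(length: number, rand: () => number = Math.random): string {
  let out = "";
  for (let i = 0; i < length; i++) {
    const index = Math.floor(rand() * BMP_SCALAR_COUNT);
    out += String.fromCharCode(bmpScalarFromIndex(index));
  }
  return out;
}
```

Because no surrogate code unit is ever emitted, every string built this way is well-formed UTF-16.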

For context, I discovered this while trying to write a QuickCheck test for UTF-8 encoding/decoding; you can see the failing test here
