
Incorrect parsing of JSON strings with surrogate pair escape sequences #158

@mbrock


The JSON string "\ud83d\udc95" decodes to a single code point (U+1F495), not two.

This is because the spec allows characters outside the Basic Multilingual Plane to be escaped as two consecutive \uXXXX sequences, each holding one 16-bit half of a UTF-16 "surrogate pair".

From RFC 4627:

To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a twelve-character sequence,
encoding the UTF-16 surrogate pair.  So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E".

But SWI-Prolog's JSON parser reads such a string as two separate characters, each of them an unpaired surrogate and therefore invalid on its own.
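
For reference, here is a minimal sketch of the arithmetic a parser needs to apply when it sees a high surrogate escape followed by a low surrogate escape. It is written in Python purely for illustration and is not the SWI-Prolog fix; the function name `combine_surrogates` is hypothetical.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point.

    Hypothetical helper for illustration; not part of any library.
    """
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Each surrogate carries 10 bits; the result is offset by 0x10000
    # because the Basic Multilingual Plane never needs a pair.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The two pairs mentioned in this issue.
two_hearts = combine_surrogates(0xD83D, 0xDC95)  # U+1F495
g_clef = combine_surrogates(0xD834, 0xDD1E)      # U+1D11E
print(f"U+{two_hearts:X}", f"U+{g_clef:X}")
```

A parser that applies this step yields a one-character string for each of the two examples, instead of two lone surrogates.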

I have fixed this in my fork and will submit a pull request.
