Date: Sat, 28 Mar 2020 10:06:02 -0400
On Sat, Mar 28, 2020 at 2:40 AM Tom Honermann <tom_at_[hidden]> wrote:
> I came across the following issues while testing an implementation of
> mbrtoc8() [1] I'm working on. The implementation uses mbrtowc() internally.
>
[ ... ]
>
> The issues are demonstrated using an example of converting one byte at a
> time, a Big5-HKSCS double byte sequence that maps to two Unicode code
> points (assume the wide execution character set is UTF-16 (or UCS2) or
> UTF32):
>
> - 0x88 0x62 => U+00CA U+0304 {LATIN CAPITAL LETTER E WITH CIRCUMFLEX}
> {COMBINING MACRON}
>

The scenario presented violates the definition of "wide character", which
indicates the relationship between values of wchar_t and the C standard
concept of a "character":

    value representable by an object of type wchar_t, capable of representing
    any character in the current locale

I doubt that wide characters should be considered the preferred solution
for dealing with UCS encodings or notions that characters are formed by
more than one minimal well-formed code unit sequence.
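
For reference, the byte-at-a-time conversion described in the quoted scenario can be sketched roughly as below. This is only an illustration, not the implementation under discussion: the helper name feed_one_byte_at_a_time and its signature are hypothetical. What it shows is that each mbrtowc() call stores at most one wchar_t, so the call that completes 0x88 0x62 has no way to hand back both U+00CA and U+0304.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not from the original message): feeds `len` bytes
   to mbrtowc() one at a time, using the current locale's encoding, and
   stores each completed wide character in `out` (capacity `cap`).
   Returns the number of wide characters stored, or (size_t)-1 on an
   encoding error. Wide characters beyond `cap` are silently dropped. */
size_t feed_one_byte_at_a_time(const char *bytes, size_t len,
                               wchar_t *out, size_t cap)
{
    mbstate_t st;
    size_t produced = 0;

    assert(bytes != NULL && out != NULL);
    memset(&st, 0, sizeof st);      /* start in the initial conversion state */

    for (size_t i = 0; i < len; ++i) {
        wchar_t wc;
        size_t r = mbrtowc(&wc, &bytes[i], 1, &st);

        if (r == (size_t)-1)
            return (size_t)-1;      /* invalid sequence */
        if (r == (size_t)-2)
            continue;               /* incomplete so far; state is kept in st */

        /* r is 0 (null character) or 1: exactly one wchar_t was stored. */
        if (produced < cap)
            out[produced++] = wc;
    }
    return produced;
}
```

In a Big5-HKSCS locale the first call, on 0x88, would be expected to return (size_t)-2, and the second call, on 0x62, can store only a single wchar_t. That single-value interface is where the one-to-two mapping collides with the definition quoted above.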
Received on 2020-03-29 12:12:01