SG16

Subject: Re: (SC22WG14.17674) mbrtowc() wording ambiguities and surprising implementation behavior
From: Hubert Tong (hubert.reinterpretcast_at_[hidden])
Date: 2020-03-28 09:06:02


On Sat, Mar 28, 2020 at 2:40 AM Tom Honermann <tom_at_[hidden]> wrote:

> I came across the following issues while testing an implementation of
> mbrtoc8() [1] I'm working on. The implementation uses mbrtowc() internally.
>
[ ... ]

>
> The issues are demonstrated using an example of converting one byte at a
> time, a Big5-HKSCS double byte sequence that maps to two Unicode code
> points (assume the wide execution character set is UTF-16 (or UCS2) or
> UTF32):
>
> - 0x88 0x62 => U+00CA U+0304 {LATIN CAPITAL LETTER E WITH CIRCUMFLEX}
> {COMBINING MACRON}
>
The scenario presented violates the definition of "wide character", which
indicates the relationship between values of wchar_t and the C standard
concept of a "character":

  value representable by an object of type wchar_t, capable of representing
  any character in the current locale

I doubt that wide characters should be considered the preferred solution
for dealing with UCS encodings or notions that characters are formed by
more than one minimal well-formed code unit sequence.
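
To make the byte-at-a-time scenario concrete, here is a minimal sketch in C.
The helper name feed_bytewise and its parameters are hypothetical and do not
come from the quoted message. It feeds input to mbrtowc() one byte per call
and records every return value. Each call can store at most one wchar_t,
which is why a double byte sequence that maps to two code points has no
obvious way to be reported through this interface. Reproducing the 0x88 0x62
case would additionally require selecting a Big5-HKSCS locale with
setlocale(); locale names are platform-specific, so none is hard-coded here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (illustration only): feeds `len` bytes to mbrtowc()
   one at a time and records each return value in `results`, which must have
   room for `len` entries. Each completed wide character is stored in `out`
   (capacity `out_cap`). Returns the number of wide characters stored.
   Stops at the first encoding error. */
size_t feed_bytewise(const char *bytes, size_t len,
                     size_t *results, wchar_t *out, size_t out_cap)
{
    assert(bytes != NULL && results != NULL && out != NULL);

    mbstate_t state;
    memset(&state, 0, sizeof state);    /* initial conversion state */

    size_t produced = 0;
    for (size_t i = 0; i < len; ++i) {
        wchar_t wc = 0;
        size_t r = mbrtowc(&wc, &bytes[i], 1, &state);
        results[i] = r;

        if (r == (size_t)-1)            /* invalid sequence: give up */
            break;
        if (r == (size_t)-2)            /* incomplete: needs more bytes */
            continue;

        /* r is 0 (null character) or 1: exactly one wide character was
           completed by this call. */
        if (produced < out_cap)
            out[produced++] = wc;
    }
    return produced;
}
```

For single byte input in the default "C" locale, every call returns 1 and
stores one wide character. For a multibyte sequence, the calls before the
last byte are expected to return (size_t)-2, and the final call has only one
wchar_t in which to report the result.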



SG16 list run by herb.sutter at gmail.com