On Sat, Mar 28, 2020 at 2:40 AM Tom Honermann <tom@honermann.net> wrote:

I came across the following issues while testing an implementation of mbrtoc8() [1] I'm working on.  The implementation uses mbrtowc() internally.

[ ... ]

The issues are demonstrated using an example that converts, one byte at a time, a Big5-HKSCS double-byte sequence that maps to two Unicode code points (assume the wide execution character set is UTF-16 (or UCS-2) or UTF-32):
  • 0x88 0x62 => U+00CA U+0304 {LATIN CAPITAL LETTER E WITH CIRCUMFLEX} {COMBINING MACRON}
The scenario presented violates the definition of "wide character", which indicates the relationship between values of wchar_t and the C standard concept of a "character":
  "value representable by an object of type wchar_t, capable of representing any character in the current locale"
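
To make the byte-at-a-time scenario concrete, here is a minimal sketch of such a loop. The helper name feed_bytes and its parameters are hypothetical and not taken from any implementation discussed here. It relies only on the standard mbrtowc() contract: each call has a single wchar_t output slot, and the return value is a byte count, 0, (size_t)-1, or (size_t)-2. Unlike mbrtoc16(), there is no (size_t)-3 return to signal that a further output unit is pending without consuming input, so a sequence such as 0x88 0x62 has no obvious way to deliver two code points through this interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (illustration only): feeds `len` bytes of `src` to
   mbrtowc() one byte at a time and stores each wide character produced in
   `out` (capacity `cap`). Returns the number of wide characters stored, or
   (size_t)-1 on an encoding error. */
size_t feed_bytes(const char *src, size_t len, wchar_t *out, size_t cap)
{
    mbstate_t st;
    size_t n = 0;

    assert(src != NULL && out != NULL);
    memset(&st, 0, sizeof st);          /* start in the initial conversion state */

    for (size_t i = 0; i < len; ++i) {
        wchar_t wc;
        size_t r = mbrtowc(&wc, &src[i], 1, &st);

        if (r == (size_t)-1)
            return (size_t)-1;          /* invalid sequence (errno is EILSEQ) */
        if (r == (size_t)-2)
            continue;                   /* incomplete: the state remembers the byte */
        if (n == cap)
            break;                      /* output buffer is full */
        out[n++] = wc;                  /* at most one wchar_t per completed call */
    }
    return n;
}
```

In the default "C" locale, feeding the two bytes of "AB" yields two wide characters. What a given implementation returns for 0x88 0x62 in a Big5-HKSCS locale is exactly the open question, so no output is shown for that case.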

I doubt that wide characters should be considered the preferred way to handle UCS encodings, or cases where a character is formed by more than one minimal well-formed code unit sequence.