[Note: Cross-posted between the WG 14 and WG 21/SG 16 reflectors]

On Sat, Mar 28, 2020 at 3:34 PM Tom Honermann <tom@honermann.net> wrote:

On 3/28/20 10:06 AM, Hubert Tong wrote:

On Sat, Mar 28, 2020 at 2:40 AM Tom Honermann <tom@honermann.net> wrote:

I came across the following issues while testing an implementation of mbrtoc8() [1] I'm working on. The implementation uses mbrtowc() internally.

[ ... ]

The issues are demonstrated by converting, one byte at a time, a Big5-HKSCS double-byte sequence that maps to two Unicode code points (assume the wide execution character set is UTF-16 (or UCS-2) or UTF-32):

0x88 0x62 => U+00CA U+0304 {LATIN CAPITAL LETTER E WITH CIRCUMFLEX} {COMBINING MACRON}
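For readers who want to reproduce the byte-at-a-time scenario, here is a minimal sketch of such a driver loop. The helper name feed_bytewise is hypothetical and not taken from the implementation under discussion. Note that mbrtowc() stores at most one wchar_t per call and, unlike mbrtoc16(), has no (size_t)-3 return value to announce a further output with no input consumed, so how the second code point of the sequence above would be delivered is exactly the open question.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not from the original message): feed `len` bytes to
   mbrtowc() one byte at a time, storing up to `cap` results in `out`.
   Returns the number of wide characters stored, or (size_t)-1 on an
   encoding error. */
size_t feed_bytewise(const char *bytes, size_t len, wchar_t *out, size_t cap)
{
    mbstate_t st;
    size_t produced = 0;

    assert(bytes != NULL && out != NULL);
    memset(&st, 0, sizeof st);              /* initial conversion state */

    for (size_t i = 0; i < len; ++i) {
        wchar_t wc;
        size_t r = mbrtowc(&wc, &bytes[i], 1, &st);

        if (r == (size_t)-1)
            return (size_t)-1;              /* invalid sequence */
        if (r == (size_t)-2)
            continue;                       /* incomplete: state keeps the partial character */
        if (produced < cap)
            out[produced++] = wc;           /* one wide character per completed call */
    }
    return produced;
}
```

To try the Big5-HKSCS case, one would call setlocale(LC_ALL, ...) with a suitable locale name first (the name is platform-specific and assumed to be installed) and pass the two bytes 0x88 0x62.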

The scenario presented violates the definition of "wide character", which indicates the relationship between values of wchar_t and the C standard concept of a "character":
value representable by an object of type wchar_t, capable of representing any character in the current locale

Indeed, but that definition of wide character in the standard contradicts long-standing existing practice (e.g., use of UTF-16 on Windows).

What I mean is that asking about the behaviour of a function in a scenario that contradicts its underlying model is unlikely to lead to helpful action in terms of interpreting the wording.

This situation is similar in some respects to the __STDC_MB_MIGHT_NEQ_WC__ one. "Long-standing existing practice" indicates that something about the standard does not serve a community of users. The standard, in the case of __STDC_MB_MIGHT_NEQ_WC__, says that there are environments where certain assumptions don't hold. Users who have to operate in such an environment can detect this and take it into account. Users who don't have to operate in such an environment can safely ignore it and be assured their program is portable within their needs.
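As a rough illustration of the "detect and take it into account" pattern, the sketch below tests for the macro at compile time. The function names basic_chars_may_differ and is_letter_a are made up for this example; the only standard facility used is the predefined macro itself.

```c
#include <wchar.h>

/* Hypothetical helper: reports whether the implementation warns that a basic
   character's wchar_t value may differ from its plain char value. */
int basic_chars_may_differ(void)
{
#ifdef __STDC_MB_MIGHT_NEQ_WC__
    return 1;   /* 'A' == L'A' is not guaranteed here */
#else
    return 0;   /* basic characters have equal char and wchar_t values */
#endif
}

/* Comparing against a wide literal works in either kind of environment,
   so code written this way never needs to consult the macro. */
int is_letter_a(wchar_t wc)
{
    return wc == L'A';
}
```

Code that instead compares a wchar_t against a narrow literal such as 'A' is the kind that would need the check.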

So, we probably need to accommodate "odd" operating environments, but would need to look for some balance so as to not complicate the situation too much.

I doubt that wide characters should be considered the preferred solution for dealing with UCS encodings or notions that characters are formed by more than one minimal well-formed code unit sequence.

I certainly agree with the first part of that statement, but not the second considering existing practice.

Just to ensure we understand each other: I did not say anything in the second part of the statement that contradicts the existence of surrogate pairs. I am pointing out that there is a technical issue with using UTF-8 as the multibyte string encoding if a character is considered to require more than a single UCS scalar value. An implementation of mblen should not return different non-negative values for successive calls with the same non-null pointer simply because the `n` parameter is changed.
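The mblen property stated above can be written down as a small check. This is only a sketch: the name mblen_is_consistent is hypothetical, and the caller is assumed to guarantee that at least max_n bytes are readable at s.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical check: returns 1 if every non-negative result of mblen(s, n),
   for n = 1..max_n, is the same value, and 0 if two of them differ.
   Assumes at least max_n bytes are readable at s. */
int mblen_is_consistent(const char *s, size_t max_n)
{
    int first = -1;                 /* first non-negative result seen so far */

    assert(s != NULL);
    for (size_t n = 1; n <= max_n; ++n) {
        (void)mblen(NULL, 0);       /* reset any shift state between probes */
        int r = mblen(s, n);

        if (r < 0)
            continue;               /* -1: invalid, or n too small to complete a character */
        if (first < 0)
            first = r;
        else if (r != first)
            return 0;               /* length changed merely because n changed */
    }
    return 1;
}
```

If a "character" could span more than one UCS scalar value under a UTF-8 multibyte encoding, mblen(s, 2) and mblen(s, 4) could plausibly both succeed with different lengths for the same s, and this check would return 0.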

Tom.