sg16: Re: [SG16] (SC22WG14.17682) mbrtowc() wording ambiguities and surprising implementation behavior

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Sat, 28 Mar 2020 16:46:20 -0400

[Note: Cross-posted between the WG 14 and WG 21/SG 16 reflectors]

On Sat, Mar 28, 2020 at 3:34 PM Tom Honermann <tom_at_[hidden]> wrote:

> On 3/28/20 10:06 AM, Hubert Tong wrote:
>
> On Sat, Mar 28, 2020 at 2:40 AM Tom Honermann <tom_at_[hidden]> wrote:
>
>> I came across the following issues while testing an implementation of
>> mbrtoc8() [1] I'm working on. The implementation uses mbrtowc() internally.
>>
> [ ... ]
>
>>
>> The issues are demonstrated using an example of converting, one byte at a
>> time, a Big5-HKSCS double-byte sequence that maps to two Unicode code
>> points (assume the wide execution character set is UTF-16 (or UCS-2) or
>> UTF-32):
>>
>> - 0x88 0x62 => U+00CA U+0304 {LATIN CAPITAL LETTER E WITH CIRCUMFLEX}
>> {COMBINING MACRON}
>>
> The scenario presented violates the definition of "wide character", which
> indicates the relationship between values of wchar_t and the C standard
> concept of a "character":
>     value representable by an object of type wchar_t, capable of
>     representing any character in the current locale
>
> Indeed, but that definition of wide character in the standard contradicts
> long standing existing practice (e.g., use of UTF-16 on Windows).
>
What I mean is that asking about the behaviour of a function in a scenario
that contradicts its underlying model is unlikely to lead to helpful action
in terms of interpreting the wording.

This situation is similar in some respects to the __STDC_MB_MIGHT_NEQ_WC__
one. "Long-standing existing practice" indicates that something about the
standard does not serve a community of users. The standard in the case of
__STDC_MB_MIGHT_NEQ_WC__ says that there are environments where certain
assumptions don't hold. Users who have to operate in such an environment
can detect and take it into account. Users who don't have to operate in
such an environment can safely ignore it and be assured their program is
portable within their needs.
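
As a minimal sketch of what "detect and take it into account" could look
like for __STDC_MB_MIGHT_NEQ_WC__: the helper names below
(mb_might_differ_from_wc, widen_basic_char) are hypothetical and only
illustrate the idea for members of the basic character set. When the
macro is not defined, the plain cast is relied upon; otherwise the
conversion goes through mbtowc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: returns 1 if the implementation says that a member
   of the basic character set need not have the same value as a wide
   character and as a plain character constant, 0 otherwise. */
int mb_might_differ_from_wc(void)
{
#ifdef __STDC_MB_MIGHT_NEQ_WC__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: widen one member of the basic character set.
   Assumes c really is a basic character set member. */
wchar_t widen_basic_char(char c)
{
#ifdef __STDC_MB_MIGHT_NEQ_WC__
    /* The values may differ, so ask the library to convert. */
    wchar_t wc = 0;
    int n;
    mbtowc(NULL, NULL, 0);        /* reset to the initial shift state */
    n = mbtowc(&wc, &c, 1);
    assert(n >= 0);               /* basic characters should always convert */
    return wc;
#else
    /* The implementation does not warn of a mismatch: 'x' == L'x'. */
    return (wchar_t)c;
#endif
}
```

A program that never needs to run where the macro is defined could skip
the first branch entirely, which is the "safely ignore it" case.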

So, we probably need to accommodate "odd" operating environments, but would
need to look for some balance so as to not complicate the situation too
much.

>
> I doubt that wide characters should be considered the preferred solution
> for dealing with UCS encodings or notions that characters are formed by
> more than one minimal well-formed code unit sequence.
>
> I certainly agree with the first part of that statement, but not the
> second considering existing practice.
>
Just to ensure we understand each other: I did not say anything in the
second part of the statement that contradicts the existence of surrogate
pairs. I am pointing out that there is a technical issue with using UTF-8 as
the multibyte string encoding if a character is considered to require more
than a single UCS scalar value. An implementation of mblen should not
return different non-negative values for successive calls with the same
non-null pointer simply because the `n` parameter is changed.
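
To make the mblen expectation concrete, here is a rough sketch of a check
for it. The function name mblen_is_consistent is hypothetical, and the
caller is assumed to guarantee that at least max_n bytes are readable at
s. Results of -1 are skipped, since they do not claim a length. If a
"character" could span more than one UCS scalar value in UTF-8, a small n
might report the length of the first scalar value while a larger n
reports the length of the whole sequence, and this check would fail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical check: returns 1 if every non-negative result of
   mblen(s, n) for n in 1..max_n is the same value, 0 if two calls
   disagree. The caller must ensure max_n bytes are readable at s. */
int mblen_is_consistent(const char *s, size_t max_n)
{
    int seen = -1;                /* first non-negative length observed */
    size_t n;

    assert(s != NULL);
    for (n = 1; n <= max_n; ++n) {
        int len;
        mblen(NULL, 0);           /* reset to the initial shift state */
        len = mblen(s, n);
        if (len < 0)
            continue;             /* invalid or incomplete within n bytes */
        if (seen >= 0 && len != seen)
            return 0;             /* same pointer, different length */
        seen = len;
    }
    return 1;
}
```

For example, mblen_is_consistent("AB", 3) is expected to hold in the "C"
locale, where each of those bytes is a complete character on its own.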

> Tom.
>
>
>

Received on 2020-03-29 13:00:44