sg16: Re: [SG16] (SC22WG14.17682) mbrtowc() wording ambiguities and surprising implementation behavior

From: Tom Honermann <tom_at_[hidden]>
Date: Sun, 29 Mar 2020 16:20:31 -0400

Due to a mail server outage at isocpp.org, the SG16 mailing list did not
receive prior emails for this thread. Anyone who wishes to view them
can find them in the WG14 archive, listed below in chronological
order.

  * http://open-std.org/jtc1/sc22/wg14/17674
  * http://open-std.org/jtc1/sc22/wg14/17677
  * http://open-std.org/jtc1/sc22/wg14/17682
  * http://open-std.org/jtc1/sc22/wg14/17684
  * http://open-std.org/jtc1/sc22/wg14/17697
  * http://open-std.org/jtc1/sc22/wg14/17698
  * http://open-std.org/jtc1/sc22/wg14/17699
  * http://open-std.org/jtc1/sc22/wg14/17703
  * http://open-std.org/jtc1/sc22/wg14/17713

Any further messages should appear on both mailing lists (except for any
responses to this particular email, as I did not copy WG14 on it).

Tom.

On 3/28/20 4:46 PM, Hubert Tong via SG16 wrote:
> [Note: Cross-posted between the WG 14 and WG 21/SG 16 reflectors]
>
> On Sat, Mar 28, 2020 at 3:34 PM Tom Honermann <tom_at_[hidden]> wrote:
>
> On 3/28/20 10:06 AM, Hubert Tong wrote:
>> On Sat, Mar 28, 2020 at 2:40 AM Tom Honermann <tom_at_[hidden]> wrote:
>>
>> I came across the following issues while testing an
>> implementation of mbrtoc8() [1] I'm working on. The
>> implementation uses mbrtowc() internally.
>>
>> [ ... ]
>>
>>
>> The issues are demonstrated using an example of converting,
>> one byte at a time, a Big5-HKSCS double-byte sequence that
>> maps to two Unicode code points (assume the wide execution
>> character set is UTF-16 (or UCS-2) or UTF-32):
>>
>> * 0x88 0x62 => U+00CA U+0304 {LATIN CAPITAL LETTER E WITH
>> CIRCUMFLEX} {COMBINING MACRON}
>>
>> The scenario presented violates the definition of "wide
>> character", which indicates the relationship between values of
>> wchar_t and the C standard concept of a "character":
>>
>>     value representable by an object of type wchar_t, capable of
>>     representing any character in the current locale
> Indeed, but that definition of wide character in the standard
> contradicts long standing existing practice (e.g., use of UTF-16
> on Windows).
>
> What I mean is that asking about the behaviour of a function in a
> scenario that contradicts its underlying model is unlikely to lead to
> helpful action in terms of interpreting the wording.
>
> This situation is similar in some respects to the
> __STDC_MB_MIGHT_NEQ_WC__ one. "Long-standing existing practice"
> indicates that something about the standard does not serve a community
> of users. The standard in the case of __STDC_MB_MIGHT_NEQ_WC__ says
> that there are environments where certain assumptions don't hold.
> Users who have to operate in such an environment can detect and take
> it into account. Users who don't have to operate in such an
> environment can safely ignore it and be assured their program is
> portable within their needs.
>
> So, we probably need to accommodate "odd" operating environments, but
> would need to look for some balance so as to not complicate the
> situation too much.
>
>>
>> I doubt that wide characters should be considered the preferred
>> solution for dealing with UCS encodings or notions that
>> characters are formed by more than one minimal well-formed code
>> unit sequence.
>>
> I certainly agree with the first part of that statement, but not
> the second considering existing practice.
>
> Just to ensure we understand each other: I did not say anything in the
> second part of the statement that contradicts the existence of
> surrogate pairs. I am pointing out that there is a technical issue of
> using UTF-8 as the multibyte string encoding if a character is
> considered to require more than a single UCS scalar value. An
> implementation of mblen should not return different non-negative
> values for successive calls with the same non-null pointer simply
> because the `n` parameter is changed.
>
> Tom.
>
>
>
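
For readers who want to reproduce the byte-at-a-time scenario quoted above, here is a minimal sketch, not a definitive test. The helper name feed_bytewise and its parameters are hypothetical. It assumes the caller has already selected the locale of interest with setlocale() (the name of a Big5-HKSCS locale varies by platform and is not given in this thread). Each mbrtowc() call can store at most one wchar_t, which is why a double-byte sequence that maps to two code points is awkward to report.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: feeds `len` bytes to mbrtowc() one byte per call.
   results[i] receives the return value of the call for byte i.
   Wide characters produced are stored in out[] (at most out_cap of them).
   Returns the number of wide characters stored. If an encoding error
   occurs, the loop stops and later results[] entries are left unset. */
size_t feed_bytewise(const char *bytes, size_t len,
                     wchar_t *out, size_t out_cap,
                     size_t *results)
{
    mbstate_t state;
    size_t produced = 0;

    assert(bytes != NULL && out != NULL && results != NULL);
    memset(&state, 0, sizeof state);      /* initial conversion state */

    for (size_t i = 0; i < len; ++i) {
        wchar_t wc = 0;
        size_t r = mbrtowc(&wc, bytes + i, 1, &state);
        results[i] = r;

        if (r == (size_t)-1)
            break;                        /* encoding error */
        if (r == (size_t)-2)
            continue;                     /* incomplete; byte absorbed into state */
        if (produced < out_cap)
            out[produced++] = wc;         /* exactly one wchar_t per completed call */
    }
    return produced;
}
```

With the locale set appropriately, passing the two bytes 0x88 0x62 and inspecting results[] and out[] shows how a given implementation reports the sequence.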

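The mblen() expectation stated at the end of the quoted text can be probed with a sketch like the following. The function name mblen_consistent is hypothetical, and the check is only an illustration under the assumption that `s` points to at least `max_n` readable bytes. It compares the non-negative results of mblen(s, n) as only `n` changes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical check: returns 1 if every non-negative result of
   mblen(s, n), for n = 1 .. max_n, is the same value; returns 0 if two
   non-negative results differ. Negative results (error or incomplete
   sequence) are ignored. `s` must have at least max_n readable bytes. */
int mblen_consistent(const char *s, size_t max_n)
{
    int seen = -1;

    assert(s != NULL);

    for (size_t n = 1; n <= max_n; ++n) {
        (void)mblen(NULL, 0);             /* reset mblen's internal shift state */
        int r = mblen(s, n);
        if (r < 0)
            continue;                     /* not a complete, valid character */
        if (seen < 0)
            seen = r;                     /* first non-negative result */
        else if (r != seen)
            return 0;                     /* result changed only because n changed */
    }
    return 1;
}
```

A result of 0 under some locale would indicate the kind of behavior the quoted paragraph argues an implementation should not exhibit.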
Received on 2020-03-29 15:23:27