sg16: Re: [SG16] Is the concept of basic execution character sets useful?

From: Corentin <corentin.jabot_at_[hidden]>
Date: Sat, 30 Jan 2021 20:16:42 +0100

On Sat, Jan 30, 2021 at 7:53 PM Hubert Tong <
hubert.reinterpretcast_at_[hidden]> wrote:

> On Sat, Jan 30, 2021 at 12:25 PM Corentin <corentin.jabot_at_[hidden]>
> wrote:
>
>>
>>
>> On Sat, Jan 30, 2021 at 5:54 PM Hubert Tong <
>> hubert.reinterpretcast_at_[hidden]> wrote:
>>
>>> On Sat, Jan 30, 2021 at 6:18 AM Corentin <corentin.jabot_at_[hidden]>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Sat, Jan 30, 2021 at 5:39 AM Hubert Tong <
>>>> hubert.reinterpretcast_at_[hidden]> wrote:
>>>>
>>>>> On Wed, Jan 27, 2021 at 3:57 AM Corentin via SG16 <
>>>>> sg16_at_[hidden]> wrote:
>>>>>
>>>>>>
>>>>>> Has consensus been found that UTF-16 is a valid wide execution
>>>>>> character set (encoding)? Are there general library facilities to handle
>>>>>> conversion of strings from wide execution character set (encoding)s with
>>>>>> characters that are encoded in more than one wchar_t code unit to other
>>>>>> encodings?
>>>>>>
>>>>>
>> I don't think so, but the status quo does not match existing practice.
>> I think we could easily fix the core language, but we may have to modify
>> the library wording because I think some functions can't deal with wide
>> multi-byte? Not sure; see https://github.com/sg16-unicode/sg16/issues/9
>> Paper needed :)
>>
> The general case would involve something like JeanHeyd's paper with those
> C library functions.
>
>>
>>>> - Each member of the basic character set is uniquely represented by
>>>> a single byte whose value, as read via a glvalue of type `char`, is
>>>> positive
>>>>
>>> I think "basic character set" above isn't just the basic source
>>> character set. I think "basic execution character set" as a term happens to
>>> be the right name for what we need (just that the current definition is not
>>> what we want; we don't want a coded character set, and there shouldn't be a
>>> "narrow" and wide version). Also: The second half should read "... single
>>> code unit whose value is positive" now that you've defined the code units
>>> appropriately for that to work.
>>>
>>
>> I meant source here.
>> What we want is for some characters, a subset of (U+0000-U+0127), to always be
>> 1/ representable and 2/ representable in 1 code unit. The indirection doesn't
>> serve much purpose (unless I am missing something).
>>
> Except to give a convenient name for the addition of BELL, BACKSPACE, and
> CARRIAGE RETURN (CR).
>
>
>>> Thanks for pointing this out explicitly. I think we have to leave the
>>> "locale-specific" around somewhere.
>>>
>>> The additional things that the current wording is probably trying to say
>>> are:
>>> In the execution environment, the library operates using locale-specific
>>> encodings for wide strings and byte strings.
>>> The characters in the basic execution character set shall be represented
>>> in each locale-specific encoding.
>>>
>>
>> I think we want to say (to match existing practice) that the execution
>> environment has an encoding / character set that is either the same as or a
>> superset of the execution character set (same values but may have extra
>> members).
>> It is unclear that "locale-specific" currently says that.
>>
> I don't think the encoding interpretation of the above (which I think was
> the intended interpretation) actually matches existing practice (except
> perhaps for the "C" locale). That different locales present in runtime
> environments may encode characters within the basic execution character set
> differently is a practical reality (web search for "PPCS variant
> characters").
>

Unfortunately, when that's the case (and I agree that's the case more often
than we'd like; another good example is Shift-JIS/Windows-1251), string
literals cannot be interpreted properly by "locale-specific" runtime
functions.
Such a runtime function expects an encoding that differs from the one used
for the string literal, so it cannot interpret the literal correctly, which
can lead to mojibake, etc.
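
To make the failure mode concrete, here is a minimal sketch, not a description of any real library function. The names decode_ascii, decode_jis_roman and decode are hypothetical, and the only encoding fact assumed is that the single-byte range of Shift-JIS (JIS X 0201 Roman) differs from ASCII at 0x5C (YEN SIGN) and 0x7E (OVERLINE). The same bytes of a literal produce different characters depending on which encoding the runtime assumes:

```cpp
#include <string>

// Hypothetical single-byte decoders, for illustration only.

// ASCII: each byte in 0x00-0x7F maps to the code point of the same value.
char32_t decode_ascii(unsigned char b) { return b; }

// JIS X 0201 Roman (the single-byte range of Shift-JIS): same as ASCII
// except for two positions.
char32_t decode_jis_roman(unsigned char b) {
    if (b == 0x5C) return U'\u00A5'; // YEN SIGN instead of REVERSE SOLIDUS
    if (b == 0x7E) return U'\u203E'; // OVERLINE instead of TILDE
    return b;
}

// Decode a byte string with whichever decoder the "locale" selects.
template <class Decoder>
std::u32string decode(const std::string& bytes, Decoder d) {
    std::u32string out;
    for (unsigned char b : bytes) out.push_back(d(b));
    return out;
}
```

Assuming an ASCII-compatible literal encoding, decode("a\\b", decode_ascii) gives back the backslash the programmer wrote, while decode("a\\b", decode_jis_roman) yields a YEN SIGN in the middle: the bytes are unchanged, only the interpretation differs.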

I think this issue should be described somewhere (in the library?) and be
specifically UB.

>
>
>>
>>>
>>>>
>>>> What do you think?
>>>>
>>> My current impression is that there may be a narrow-enough scope here
>>> that a separate paper could come out of this thread without pulling in the
>>> world.
>>>
>>
>> Agreed.
>> We may want to leave the locale-specific part out of this paper to contain
>> the scope.
>>
> If we are removing the existing words that talk about "locale-specific",
> then we aren't really leaving the locale-specific part out of the paper. I
> am not sure the existing words for "locale-specific" are all that
> salvageable given the surrounding changes that we want.
>
>
>> We may have to resolve the wchar_t part first though
>>
> AFAICT, we can implement the wording improvement for the status quo of
> wchar_t without making it more difficult to handle the larger question of
> UTF-16, etc.
>
>
>

Received on 2021-01-30 13:16:56