sg16: Re: [SG16] Handling of non-basic characters in early translation phases

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Sat, 20 Jun 2020 17:14:14 -0400

On Sat, Jun 20, 2020 at 4:42 PM Corentin Jabot <corentinjabot_at_[hidden]>
wrote:

>
>
> On Sat, Jun 20, 2020, 21:27 Hubert Tong <hubert.reinterpretcast_at_[hidden]>
> wrote:
>
>> On Sat, Jun 20, 2020 at 9:48 AM Corentin Jabot via SG16 <
>> sg16_at_[hidden]> wrote:
>>
>>>
>>>
>>> On Sat, 20 Jun 2020 at 15:16, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>>>
>>>>
>>>> Again, they don't have semantics from a Unicode viewpoint (which is
>>>> fine),
>>>> but in a larger system context, they sure have semantics (otherwise
>>>> they wouldn't have a reason to exist in the first place). How much of
>>>> those semantics is known to the compiler is a separate question.
>>>>
>>>> > A compiler could flag C0/C1 UCN escape sequences in literals, if they
>>>> wanted to.
>>>> > And again I'm trying to be pragmatic here. The work IBM is doing to
>>>> get clang to support EBCDIC is converting that EBCDIC to UTF-8.
>>>>
>> Until that compiler is relatively complete and substantial adoption of
>> it occurs, the amount of feedback available from users would be minimal.
>>
>>
>>>
>>>> Maybe that's because it's the only option under the status quo of C++,
>>>> which needs to tunnel everything through UCNs.
>>>>
>>>
>>> I think it's more about the cost/benefits of supporting that use case.
>>> I really would like to know from IBM people if and how much they are
>>> actually concerned about this point as it is driving many decisions.
>>>
>> Standardization is not meant to produce short-term decisions. The ability
>> to have the choice of differentiating between an EBCDIC control character
>> physically present in the source and a UCN physically present in the
>> source, even if they are the same C1 control according to the CDRA mapping,
>> should not be prevented by the process of standardization. The choice is a
>> point of design that the implementer should be able to make in consultation
>> with their users.
>>
>
> At the same time needs should drive design, not the other way around.
> Do people find CDRA unsuitable?
> My concern here is that there is a huge body of experience in the area of
> control characters (IBM, Unicode, ISO, ECMA), and we seem to be saying it
> is insufficient?
> And we have 10+ years of at least 2 compilers which use UTF-8 internally,
> and successfully (I just discovered that GCC compiled on an EBCDIC platform
> uses UTF-EBCDIC).
> Also we know that there is and has been an increase in the use of
> Unicode, so I don't think it's fair to say that we are taking short term
> decisions.
>
The increase in use of Unicode is precisely why the "10+ years" of
experience is not necessarily translatable to the future. A lot of
experience comes from "pure EBCDIC" and "pure extended-ASCII".

>
>> A design that requires funnelling through UCNs would mean that there are
>> no characters that cannot appear in a "Unicode" string; however, the user
>> intent may be that the EBCDIC control characters (if physically present)
>> are not okay in that context (whereas the UCN would be).
>>
>
> How would they arrive there in the first place?
>
Through the process of modifying code to increase usage of "Unicode"
strings.

> If I don't want emojis in my Unicode strings, should the compiler enforce
> it?
>
I don't think the Unicode emoji-related codepoints have the "identity"
issue.

>
> Note that if we handle UCN escape sequences at a later phase it becomes
> possible for a compiler to make the distinction between EBCDIC control
> characters and Unicode C1 escape sequences.
>
That allows the compiler to note the distinction, but it does not allow the
compiler to treat the "Unicode" string containing the physical EBCDIC
character as ill-formed if we still require the physical EBCDIC character
to be mapped to a UCN codepoint at an early phase.
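
A minimal sketch of the distinction, under assumptions that are not part of
the thread: the names Origin, Element, map_ebcdic_byte, unicode_literal_ok
and funnel_through_ucns are hypothetical, the two table entries are
illustrative rather than an authoritative CDRA mapping, and the policy shown
is only one choice an implementer might make. It is meant to show that once
every element is treated as if it had been written as a UCN, the policy has
nothing left to reject.

```cpp
#include <vector>

// Hypothetical provenance tag (not from the thread or the standard):
// records how a code point was spelled in the source file.
enum class Origin { Physical, Ucn };

struct Element {
    char32_t code_point;
    Origin origin;
};

// C1 controls occupy U+0080 through U+009F.
bool is_c1_control(char32_t c) { return c >= 0x80 && c <= 0x9F; }

// Illustrative stand-in for a CDRA-style table. Only two entries are
// filled in; a real table covers all 256 byte values.
char32_t map_ebcdic_byte(unsigned char b) {
    switch (b) {
        case 0x15: return 0x0085;  // EBCDIC NL, assumed here to map to U+0085
        case 0xC1: return U'A';    // EBCDIC capital letter A
        default:   return 0xFFFD;  // placeholder for bytes not in this sketch
    }
}

// A character physically present in the source: mapped, provenance kept.
Element from_physical(unsigned char ebcdic_byte) {
    return {map_ebcdic_byte(ebcdic_byte), Origin::Physical};
}

// A character written as a UCN such as \u0085.
Element from_ucn(char32_t value) {
    return {value, Origin::Ucn};
}

// Hypothetical implementation policy: inside a "Unicode" string literal,
// a C1 control is accepted only when it was spelled as a UCN.
bool unicode_literal_ok(const std::vector<Element>& body) {
    for (const Element& e : body) {
        if (is_c1_control(e.code_point) && e.origin == Origin::Physical) {
            return false;
        }
    }
    return true;
}

// Models a mandatory early mapping to UCNs: afterwards every element
// looks as if it had been written as a UCN.
std::vector<Element> funnel_through_ucns(std::vector<Element> body) {
    for (Element& e : body) {
        e.origin = Origin::Ucn;
    }
    return body;
}
```

With these definitions, a literal body built from from_physical(0x15) is
rejected by unicode_literal_ok, the same body built from from_ucn(0x0085) is
accepted, and the first body is also accepted after funnel_through_ucns even
though both spellings yield the same code point.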

Received on 2020-06-20 16:17:43