Re: [SG16] Handling of non-basic characters in early translation phases

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Sat, 20 Jun 2020 17:14:14 -0400
On Sat, Jun 20, 2020 at 4:42 PM Corentin Jabot <corentinjabot_at_[hidden]> wrote:

> On Sat, Jun 20, 2020, 21:27 Hubert Tong <hubert.reinterpretcast_at_[hidden]>
> wrote:
>> On Sat, Jun 20, 2020 at 9:48 AM Corentin Jabot via SG16 <
>> sg16_at_[hidden]> wrote:
>>> On Sat, 20 Jun 2020 at 15:16, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>>>> Again, they don't have semantics from a Unicode viewpoint (which is
>>>> fine),
>>>> but in a larger system context, they sure have semantics (otherwise
>>>> they wouldn't have a reason to exist in the first place). How much of
>>>> those semantics is known to the compiler is a separate question.
>>>> > A compiler could flag C0/C1 UCN escape sequences in literals, if they
>>>> > wanted to.
>>>> > And again I'm trying to be pragmatic here. The work IBM is doing to
>>>> > get Clang to support EBCDIC is converting that EBCDIC to UTF-8.
>> Until that compiler is relatively complete and substantial adoption of
>> it occurs, the amount of feedback available from users would be minimal.
>>>> Maybe that's because it's the only option under the status quo of C++,
>>>> which needs to tunnel everything through UCNs.
>>> I think it's more about the cost/benefits of supporting that use case.
>>> I really would like to know from IBM people if and how much they are
>>> actually concerned about this point as it is driving many decisions.
>> Standardization is not meant to produce short-term decisions. The ability
>> to have the choice of differentiating between an EBCDIC control character
>> physically present in the source and a UCN physically present in the
>> source, even if they are the same C1 control according to the CDRA mapping,
>> should not be prevented by the process of standardization. The choice is a
>> point of design that the implementer should be able to make in consultation
>> with their users.
> At the same time needs should drive design, not the other way around.
> Do people find CDRA unsuitable?
> My concern here is that there is a huge body of experience in the area of
> control characters (IBM, Unicode, ISO, ECMA), and we seem to be saying it
> is insufficient?
> And we have 10+ years of at least 2 compilers which use UTF-8 internally,
> and successfully (I just discovered that GCC compiled on an EBCDIC platform
> uses UTF-EBCDIC).
> Also we know that there is and has been an increase in the use of
> Unicode, so I don't think it's fair to say that we are taking short-term
> decisions.
The increase in use of Unicode is precisely why the "10+ years" of
experience is not necessarily translatable to the future. A lot of
experience comes from "pure EBCDIC" and "pure extended-ASCII".

>> A design that requires funnelling through UCNs would mean that there are
>> no characters that cannot appear in an "Unicode" string; however, the user
>> intent may be that the EBCDIC control characters (if physically present)
>> are not okay in that context (whereas the UCN would be).
> How would they arrive there in the first place?
Through the process of modifying code to increase usage of "Unicode"

> If I don't want emojis in my Unicode strings, should the compiler enforce
> it?
I don't think the Unicode emoji-related codepoints have the "identity"

> Note that if we handle UCN escape sequences at a later phase it becomes
> possible for a compiler to make the distinction between EBCDIC control
> characters and Unicode C1 escape sequences.
That allows the compiler to note the distinction, but it does not allow the
compiler to consider the "Unicode" string with the physical EBCDIC
character ill-formed if we still require the physical EBCDIC character
to be mapped to a UCN codepoint at an early phase.
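
To make the distinction concrete, here is a minimal sketch in C++. It is
only an illustration under stated assumptions: the names `Origin`,
`SourceChar`, `unicode_literal_ok`, and `funnel_through_ucns` are
hypothetical and do not come from any compiler or from the standard. It
models a translator that records whether a character was physically present
in the source or spelled as a UCN, and a policy an implementer might choose
for "Unicode" string literals.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model: how a character reached the translator.
enum class Origin { PhysicalSource, UcnEscape };

struct SourceChar {
    char32_t code_point;  // value after mapping to Unicode (e.g. via a CDRA-style table)
    Origin origin;        // only meaningful if provenance survives the early phases
};

// True for U+0000..U+001F (C0), U+007F (DEL), and U+0080..U+009F (C1).
constexpr bool is_control(char32_t cp) {
    return cp <= 0x1F || (cp >= 0x7F && cp <= 0x9F);
}

// Hypothetical implementation policy for a "Unicode" string literal:
// reject a control character that was physically present in the source,
// but accept the same code point when it was written as a UCN.
bool unicode_literal_ok(const std::vector<SourceChar>& literal) {
    for (const SourceChar& c : literal) {
        if (is_control(c.code_point) && c.origin == Origin::PhysicalSource) {
            return false;
        }
    }
    return true;
}

// Models a design in which every character is funnelled through UCNs
// early: after this step, provenance is no longer available.
std::vector<SourceChar> funnel_through_ucns(std::vector<SourceChar> literal) {
    for (SourceChar& c : literal) {
        c.origin = Origin::UcnEscape;
    }
    return literal;
}

// Usage sketch: the same C1 code point (U+0085), reached two different ways.
void check_example() {
    const std::vector<SourceChar> physical{
        {U'a', Origin::PhysicalSource}, {char32_t{0x85}, Origin::PhysicalSource}};
    const std::vector<SourceChar> escaped{
        {U'a', Origin::PhysicalSource}, {char32_t{0x85}, Origin::UcnEscape}};

    assert(!unicode_literal_ok(physical));  // physical control: rejected
    assert(unicode_literal_ok(escaped));    // UCN spelling: accepted
    // Once funnelled through UCNs, the physical case can no longer be rejected.
    assert(unicode_literal_ok(funnel_through_ucns(physical)));
}
```

In this model, the policy can only reject the literal while `origin` is
still `PhysicalSource`. Once the early mapping has run, both spellings are
indistinguishable to any later check.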

Received on 2020-06-20 16:17:43