Subject: Re: Handling of non-basic characters in early translation phases
From: Corentin Jabot (corentinjabot_at_[hidden])
Date: 2020-06-20 15:42:00
On Sat, Jun 20, 2020, 21:27 Hubert Tong <hubert.reinterpretcast_at_[hidden]> wrote:
> On Sat, Jun 20, 2020 at 9:48 AM Corentin Jabot via SG16 <
> sg16_at_[hidden]> wrote:
>> On Sat, 20 Jun 2020 at 15:16, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>>> Again, they don't have semantics from a Unicode viewpoint (which is fine),
>>> but in a larger system context, they sure have semantics (otherwise
>>> they wouldn't have a reason to exist in the first place). How much of
>>> those semantics is known to the compiler is a separate question.
>>> > A compiler could flag C0/C1 UCN escape sequences in literals, if they
>>> wanted to.
>>> > And again I'm trying to be pragmatic here. The work IBM is doing to
>>> get clang to support EBCDIC is converting that EBCDIC to UTF-8.
>> Until that compiler is relatively complete and substantial adoption of it
> occurs, the amount of feedback available from users would be minimal.
>>> Maybe that's because it's the only option under the status quo of C++,
>>> which needs to tunnel everything through UCNs.
>> I think it's more about the cost/benefits of supporting that use case.
>> I really would like to know from IBM people if and how much they are
>> actually concerned about this point as it is driving many decisions.
> Standardization is not meant to produce short-term decisions. The ability
> to have the choice of differentiating between an EBCDIC control character
> physically present in the source and a UCN physically present in the
> source, even if they are the same C1 control according to the CDRA mapping,
> should not be prevented by the process of standardization. The choice is a
> point of design that the implementer should be able to make in consultation
> with their users.
At the same time, needs should drive design, not the other way around.
Do people find CDRA unsuitable?
My concern here is that there is a huge body of experience in the area of
control characters (IBM, Unicode, ISO, ECMA), and we seem to be saying it
And we have 10+ years of experience with at least two compilers that use
UTF-8 internally, and successfully (I just discovered that GCC compiled on an
EBCDIC platform uses UTF-EBCDIC).
Also, we know that there is and has been an increase in the use of Unicode,
so I don't think it's fair to say that we are taking short-term decisions.
> A design that requires funnelling through UCNs would mean that there are
> no characters that cannot appear in an "Unicode" string; however, the user
> intent may be that the EBCDIC control characters (if physically present)
> are not okay in that context (whereas the UCN would be).
How would they arrive there in the first place?
If I don't want emojis in my Unicode strings, should the compiler enforce
Note that if we handle UCN escape sequences at a later phase, it becomes
possible for a compiler to make the distinction between EBCDIC control
characters and Unicode C1 escape sequences.