sg16: Re: [SG16] Reminder: SG16 telecon tomorrow (Wednesday, 2020-06-10)

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Tue, 9 Jun 2020 18:31:56 -0400

On Tue, Jun 9, 2020 at 5:21 PM Corentin Jabot <corentinjabot_at_[hidden]>
wrote:

>
>
> On Tue, 9 Jun 2020 at 23:06, Hubert Tong <hubert.reinterpretcast_at_[hidden]>
> wrote:
>
>> On Tue, Jun 9, 2020 at 4:59 PM Corentin Jabot <corentinjabot_at_[hidden]>
>> wrote:
>>
>>>
>>>
>>> On Tue, 9 Jun 2020 at 22:17, Hubert Tong <
>>> hubert.reinterpretcast_at_[hidden]> wrote:
>>>
>>>> On Tue, Jun 9, 2020 at 1:01 PM Corentin Jabot via SG16 <
>>>> sg16_at_[hidden]> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Tue, 9 Jun 2020 at 18:45, Steve Downey <sdowney_at_[hidden]> wrote:
>>>>>
>>>>>> One thing I have realized while working on identifiers is that after
>>>>>> conversion from whatever the sources are, lexing and parsing are symbolic.
>>>>>> That is, 'a' doesn't have a value until it's rendered into a literal. That
>>>>>> is, "The values of the members of the execution character sets and
>>>>>> the sets of additional members are locale-specific."
>>>>>> (http://eel.is/c++draft/lex.charset#3.sentence-5) really only comes
>>>>>> into play when rendering the "execution character set" into characters or
>>>>>> strings. The execution character set and the source character set exist in
>>>>>> the same logical space right now, and the "source character set" isn't what
>>>>>> is in source files today.
>>>>>>
>>>>>
>>>>> Yep, and they don't have to have a value either. Identifiers are not
>>>>> sorted, etc.
>>>>> Everything in lex is symbolic anyway; the phases don't exist in
>>>>> practice.
>>>>> However, the internal representation being isomorphic to Unicode,
>>>>> it would be possible to describe it in terms of Unicode with no observable
>>>>> behavior change.
>>>>>
>>>> I would like to allow characters not present in Unicode within
>>>> character literals, string literals, comments, and header names. More
>>>> abstractly, I would like to allow source -> encoding-used-for-output
>>>> conversion.
>>>>
>>>
>>> Do you have an example of a use case you want to support?
>>>
>> I am still evaluating the round-trip mapping for EBCDIC.
>>
>
> I believe Unicode -> EBCDIC round-trips perfectly using the process
> described in https://www.unicode.org/reports/tr16/tr16-8.html
> The tricky part is the control characters, which this TR maps to the C1
> Unicode control characters.
>
I'm not questioning the ability to round-trip. I am questioning the ability
to avoid conflating certain EBCDIC control characters with certain C1
control characters. For example, it seems clear to me that U+0096 START OF
GUARDED AREA and U+0097 END OF GUARDED AREA are paired in the intended
usage, but the mapping of these to, respectively, Numeric Backspace and
Graphic Escape does not retain semantic meaning. If such EBCDIC characters
appear within a literal that should be encoded in a Unicode encoding, I
find it questionable whether the string can be considered well-formed. I
have similar thoughts for the case where a UCN escape for such a C1 control
character appears in a string that is to be encoded in EBCDIC.
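
As a rough illustration of the distinction between round-tripping and
retaining semantic meaning, here is a minimal sketch. It assumes that
Python's "cp037" codec is representative of the kind of EBCDIC <-> Unicode
control-character mapping under discussion; other EBCDIC code pages and
converters may map these bytes differently. The helper names and the byte
values chosen for Numeric Backspace and Graphic Escape are illustrative
assumptions, not something taken from the standard or the TR.

```python
# Sketch only: "cp037" stands in for an EBCDIC <-> Unicode mapping
# (an assumption; other EBCDIC code pages and converters may differ).
import unicodedata

EBCDIC_CODEC = "cp037"  # assumed representative EBCDIC code page

# EBCDIC control bytes for the two controls named above
# (byte values are assumed for this code page).
EBCDIC_CONTROLS = {
    0x36: "Numeric Backspace",
    0x08: "Graphic Escape",
}


def to_unicode(byte_value: int) -> str:
    """Decode a single EBCDIC byte to the code point the codec assigns it."""
    return bytes([byte_value]).decode(EBCDIC_CODEC)


def round_trips(byte_value: int) -> bool:
    """True if decoding and re-encoding gives back the same byte."""
    return to_unicode(byte_value).encode(EBCDIC_CODEC) == bytes([byte_value])


def is_c1_control(ch: str) -> bool:
    """True if ch lies in the C1 control range U+0080..U+009F."""
    return 0x80 <= ord(ch) <= 0x9F


for byte_value, ebcdic_name in EBCDIC_CONTROLS.items():
    ch = to_unicode(byte_value)
    print(
        f"EBCDIC 0x{byte_value:02X} ({ebcdic_name}) -> U+{ord(ch):04X}, "
        f"category {unicodedata.category(ch)}, "
        f"C1: {is_c1_control(ch)}, round-trips: {round_trips(byte_value)}"
    )
```

The byte survives the round trip, but nothing in the resulting code point
records that it began life as an EBCDIC control rather than as the C1
control that Unicode assigns to that position.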

In other words, I do not consider the mapping (which is useful if you track
out-of-band whether the data was originally EBCDIC or not) to establish the
presence of the EBCDIC control characters in Unicode. These opinions do not
necessarily represent those of IBM.

-- HT

Received on 2020-06-09 17:35:23