Date: Tue, 9 Jun 2020 19:39:38 -0400
On Tue, Jun 9, 2020 at 7:12 PM Steve Downey <sdowney_at_[hidden]> wrote:
> While I understand what you are asking for, and I agree it doesn't seem
> unreasonable, I also don't see how it works with the machinery today.
>
I am not saying that the C++ wording today works for this by the letter
(except for heavy-handed interpretations of phase 1). I consider it to be a
bug that it doesn't.
> All characters outside the basic source character set are mapped to
> universal-character-names that are named by Unicode scalar values.
> We'd need a mechanism to get back to the completely untranslated original
> source.
>
I think this is similar to how raw string literals need some sort of
mechanism.
>
> On Tue, Jun 9, 2020, 18:32 Hubert Tong <hubert.reinterpretcast_at_[hidden]>
> wrote:
>
>> On Tue, Jun 9, 2020 at 5:21 PM Corentin Jabot <corentinjabot_at_[hidden]>
>> wrote:
>>
>>>
>>>
>>> On Tue, 9 Jun 2020 at 23:06, Hubert Tong <
>>> hubert.reinterpretcast_at_[hidden]> wrote:
>>>
>>>> On Tue, Jun 9, 2020 at 4:59 PM Corentin Jabot <corentinjabot_at_[hidden]>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Tue, 9 Jun 2020 at 22:17, Hubert Tong <
>>>>> hubert.reinterpretcast_at_[hidden]> wrote:
>>>>>
>>>>>> On Tue, Jun 9, 2020 at 1:01 PM Corentin Jabot via SG16 <
>>>>>> sg16_at_[hidden]> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, 9 Jun 2020 at 18:45, Steve Downey <sdowney_at_[hidden]> wrote:
>>>>>>>
>>>>>>>> One thing I have realized while working on identifiers is that
>>>>>>>> after conversion from whatever the sources are, lexing and parsing are
>>>>>>>> symbolic. That is, 'a' doesn't have a value until it's rendered into a
>>>>>>>> literal. That is, "The values of the members of the execution
>>>>>>>> character sets and the sets of additional members are locale-specific."
>>>>>>>> (http://eel.is/c++draft/lex.charset#3.sentence-5) really only comes
>>>>>>>> into play when rendering the "execution character set" into characters or
>>>>>>>> strings. The execution character set and the source character set exist in
>>>>>>>> the same logical space right now, and the "source character set" isn't what
>>>>>>>> is in source files today.
>>>>>>>>
>>>>>>>
>>>>>>> Yep, and they don't have to have a value either. Identifiers are not
>>>>>>> sorted, etc.
>>>>>>> Everything in lex is symbolic anyway; the phases don't exist in
>>>>>>> practice.
>>>>>>> However, the international representation being isomorphic to
>>>>>>> Unicode, it would be possible to describe it in terms of Unicode with no
>>>>>>> observable behavior change.
>>>>>>>
>>>>>> I would like to allow characters not present in Unicode within
>>>>>> character literals, string literals, comments, and header names. More
>>>>>> abstractly, I would like to allow source -> encoding-used-for-output
>>>>>> conversion.
>>>>>>
>>>>>
>>>>> Do you have an example of a use case you want to support?
>>>>>
>>>> I am still evaluating the round-trip mapping for EBCDIC.
>>>>
>>>
>>> I believe Unicode -> EBCDIC round-trips perfectly using the process
>>> described in https://www.unicode.org/reports/tr16/tr16-8.html
>>> The tricky part is the control characters, which this TR maps to the C1
>>> Unicode control characters.
>>>
>> I'm not questioning the ability to round-trip. I am questioning the
>> ability to avoid conflating certain EBCDIC control characters with certain
>> C1 control characters. For example, it seems clear to me that U+0096 START
>> OF GUARDED AREA and U+0097 END OF GUARDED AREA are paired in the intended
>> usage, but the mapping of these to, respectively, Numeric Backspace and
>> Graphic Escape does not retain semantic meaning. If such EBCDIC characters
>> appear within a literal that should be encoded in a Unicode encoding, I
>> find it potentially questionable if the string is considered well-formed. I
>> have similar thoughts for the case where a UCN escape for such a C1 control
>> character appears in a string that is to be encoded in EBCDIC.
>>
>> In other words, I do not consider the mapping (which is useful if you
>> track out-of-band whether the data was originally EBCDIC or not) to
>> establish the presence of the EBCDIC control characters in Unicode. These
>> opinions do not necessarily represent those of IBM.
>>
>> -- HT
>>
>
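The control-character point in the quoted discussion can be made concrete with a small sketch. It uses Python's built-in cp037 codec as a stand-in for a conventional EBCDIC-to-Unicode table; this is an assumption for illustration and is not the TR16 algorithm itself. The byte positions 0x36 and 0x08 and the helper names are also assumptions, not taken from the thread; the labels are the names used in the message. The sketch only shows that the mapping round-trips at the byte level while the Unicode side lands on C1 code points with unrelated meanings.

```python
# Illustrative only: cp037 stands in for "an EBCDIC code page".
# Byte positions are assumptions for this sketch; labels come from the thread.
EBCDIC_CONTROLS = {
    0x36: "Numeric Backspace",
    0x08: "Graphic Escape",
}


def to_unicode(ebcdic_byte: int) -> str:
    """Decode one EBCDIC (cp037) byte to the character the codec assigns."""
    return bytes([ebcdic_byte]).decode("cp037")


def to_ebcdic(ch: str) -> int:
    """Encode one character back to its EBCDIC (cp037) byte."""
    return ch.encode("cp037")[0]


def describe():
    """Return (byte, label, code point, round-trips?) for each control."""
    rows = []
    for byte, label in EBCDIC_CONTROLS.items():
        ch = to_unicode(byte)
        rows.append((byte, label, ord(ch), to_ebcdic(ch) == byte))
    return rows


for byte, label, code_point, ok in describe():
    print(f"EBCDIC 0x{byte:02X} ({label}) <-> U+{code_point:04X}, "
          f"round-trips: {ok}")
```

The round trip succeeding says nothing about whether U+0096 and U+0097 mean the same thing as the EBCDIC controls they are paired with, which is the distinction the message draws.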
Received on 2020-06-09 18:43:03