Date: Wed, 10 Jun 2020 04:03:18 +0200
On Wed, Jun 10, 2020, 03:38 Hubert Tong <hubert.reinterpretcast_at_[hidden]>
wrote:
> On Tue, Jun 9, 2020 at 9:03 PM Corentin Jabot <corentinjabot_at_[hidden]>
> wrote:
>
>>
>>
>> On Wed, 10 Jun 2020 at 01:39, Hubert Tong <
>> hubert.reinterpretcast_at_[hidden]> wrote:
>>
>>> On Tue, Jun 9, 2020 at 7:12 PM Steve Downey <sdowney_at_[hidden]> wrote:
>>>
>>>> While I understand what you are asking for, and I agree it doesn't seem
>>>> unreasonable, I also don't see how it works with the machinery today?
>>>>
>>> I am not saying that the C++ wording today works for this by the letter
>>> (except for heavy-handed interpretations of phase 1). I consider it to be a
>>> bug that it doesn't.
>>>
>>>
>>>> All characters outside the basic source character set are mapped to
>>>> universal-character-names that are named by Unicode scalar values.
>>>> We'd need a mechanism to get back to the completely untranslated
>>>> original source.
>>>>
>>>
>> I think we have that mechanism already.
>> We have a mapping source -> universal-character-names (which for your
>> interest is specified both by IBM and Unicode), and
>> the universal-character-names -> execution mapping, which again is fully
>> specified.
>> I think that is enough to do, if desired, a direct source -> execution
>> which is bytes preserving, as it is not observable whether it was done or
>> not.
>>
> It is round-trippable but at the cost of one-way (during compilation)
> conversions that are not semantically preserving. Even these are
> justifiable, but I think they deserve to be called out. Which is to say
> that the paper should document that these concerns were considered and not
> simply dismiss the issue.
>
Definitely, I will document that better, thanks for the feedback!
> '\u0096' becoming '\x36': I suppose this could be justified for the case
> where the user application is expected to have its output subjected to
> automatic conversion, e.g., via SSH to a non-EBCDIC terminal.
>
I think the other (implementation-defined) strategy is to make it
ill-formed as non-representable.
Ideally, the wording should leave enough wiggle room for EBCDIC platforms
to make these decisions!
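
To make the two strategies concrete, here is a minimal sketch, not
a description of what any compiler actually does. The names ControlPair,
kControlPairs, ebcdic_to_ucn, ucn_to_ebcdic and round_trips are
hypothetical. The 0x36 <-> U+0096 pair is the one discussed in this thread;
the 0x08 <-> U+0097 pair is an assumption taken from the common code page
037 mapping. An empty result stands for "not representable", which an
implementation could choose to diagnose as ill-formed.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical two-entry excerpt of an EBCDIC control <-> C1 control pairing.
// 0x36 <-> U+0096 comes from this thread; 0x08 <-> U+0097 is an assumption
// based on the code page 037 mapping.
struct ControlPair {
    std::uint8_t ebcdic;   // byte as it appears in the EBCDIC source
    char32_t     unicode;  // code point named by the universal-character-name
};

constexpr ControlPair kControlPairs[] = {
    {0x36, char32_t{0x0096}},  // Numeric Backspace <-> START OF GUARDED AREA
    {0x08, char32_t{0x0097}},  // Graphic Escape    <-> END OF GUARDED AREA
};

// Source -> universal-character-name step.
constexpr std::optional<char32_t> ebcdic_to_ucn(std::uint8_t byte) {
    for (const ControlPair& p : kControlPairs)
        if (p.ebcdic == byte) return p.unicode;
    return std::nullopt;  // outside this excerpt: treated as not representable
}

// Universal-character-name -> execution step.
constexpr std::optional<std::uint8_t> ucn_to_ebcdic(char32_t cp) {
    for (const ControlPair& p : kControlPairs)
        if (p.unicode == cp) return p.ebcdic;
    return std::nullopt;  // outside this excerpt: treated as not representable
}

// True when source -> UCN -> execution reproduces the original byte.
constexpr bool round_trips(std::uint8_t byte) {
    const auto cp = ebcdic_to_ucn(byte);
    if (!cp) return false;
    const auto back = ucn_to_ebcdic(*cp);
    return back && *back == byte;
}

static_assert(round_trips(0x36));
static_assert(!ucn_to_ebcdic(char32_t{0x0041}).has_value());
```

The round trip is byte-preserving for the listed pairs, but the table
itself says nothing about whether the paired characters mean the same
thing, which is the concern raised above.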
>
> For the much rarer case of u'<0x36>' (character literal that, in the
> physical source file, contains the EBCDIC control character) becoming
> u'\u0096': I suppose this could be justified for the case where the user
> source was originally non-EBCDIC, but subjected to conversion into EBCDIC.
>
>
>>
>>
>>> I think this is similar to how raw string literals need some sort of
>>> mechanism.
>>>
>>>
>>>>
>>>> On Tue, Jun 9, 2020, 18:32 Hubert Tong <
>>>> hubert.reinterpretcast_at_[hidden]> wrote:
>>>>
>>>>> On Tue, Jun 9, 2020 at 5:21 PM Corentin Jabot <corentinjabot_at_[hidden]>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, 9 Jun 2020 at 23:06, Hubert Tong <
>>>>>> hubert.reinterpretcast_at_[hidden]> wrote:
>>>>>>
>>>>>>> On Tue, Jun 9, 2020 at 4:59 PM Corentin Jabot <
>>>>>>> corentinjabot_at_[hidden]> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, 9 Jun 2020 at 22:17, Hubert Tong <
>>>>>>>> hubert.reinterpretcast_at_[hidden]> wrote:
>>>>>>>>
>>>>>>>>> On Tue, Jun 9, 2020 at 1:01 PM Corentin Jabot via SG16 <
>>>>>>>>> sg16_at_[hidden]> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, 9 Jun 2020 at 18:45, Steve Downey <sdowney_at_[hidden]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> One thing I have realized while working on identifiers is that
>>>>>>>>>>> after conversion from whatever the sources are, lexing and parsing are
>>>>>>>>>>> symbolic. That is, 'a' doesn't have a value until it's rendered into a
>>>>>>>>>>> literal. That is, "The values of the members of the execution
>>>>>>>>>>> character sets and the sets of additional members are
>>>>>>>>>>> locale-specific."
>>>>>>>>>>> <http://eel.is/c++draft/lex.charset#3.sentence-5> really only
>>>>>>>>>>> comes into play when rendering the "execution character set" into
>>>>>>>>>>> characters or strings. The execution character set and the source character
>>>>>>>>>>> set exist in the same logical space right now, and the "source character
>>>>>>>>>>> set" isn't what is in source files today.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Yep, and they don't have to have a value either. Identifiers are
>>>>>>>>>> not sorted, etc.
>>>>>>>>>> Everything in lex is symbolic anyway; the phases don't exist in
>>>>>>>>>> practice.
>>>>>>>>>> However, the international representation being isomorphic to
>>>>>>>>>> Unicode, it would be possible to describe it in terms of Unicode with no
>>>>>>>>>> observable behavior change.
>>>>>>>>>>
>>>>>>>>> I would like to allow characters not present in Unicode within
>>>>>>>>> character literals, string literals, comments, and header names. More
>>>>>>>>> abstractly, I would like to allow source -> encoding-used-for-output
>>>>>>>>> conversion.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Do you have an example of a use case you want to support?
>>>>>>>>
>>>>>>> I am still evaluating the round-trip mapping for EBCDIC.
>>>>>>>
>>>>>>
>>>>>> I believe Unicode -> EBCDIC round-trips perfectly using the process
>>>>>> described in https://www.unicode.org/reports/tr16/tr16-8.html
>>>>>> The tricky part is the control characters, which this TR maps to the
>>>>>> C1 Unicode control characters.
>>>>>>
>>>>> I'm not questioning the ability to round-trip. I am questioning the
>>>>> ability to avoid conflating certain EBCDIC control characters with certain
>>>>> C1 control characters. For example, it seems clear to me that U+0096 START
>>>>> OF GUARDED AREA and U+0097 END OF GUARDED AREA are paired in the intended
>>>>> usage, but the mapping of these to, respectively, Numeric Backspace and
>>>>> Graphic Escape does not retain semantic meaning. If such EBCDIC characters
>>>>> appear within a literal that should be encoded in a Unicode encoding, I
>>>>> find it potentially questionable if the string is considered well-formed. I
>>>>> have similar thoughts for the case where a UCN escape for such a C1 control
>>>>> character appears in a string that is to be encoded in EBCDIC.
>>>>>
>>>>> In other words, I do not consider the mapping (which is useful if you
>>>>> track out-of-band whether the data was originally EBCDIC or not) to
>>>>> establish the presence of the EBCDIC control characters in Unicode. These
>>>>> opinions do not necessarily represent those of IBM.
>>>>>
>>>>> -- HT
>>>>>
>>>>
Received on 2020-06-09 21:06:39