Re: [SG16] Reminder: SG16 telecon tomorrow (Wednesday, 2020-06-10)

From: Steve Downey <sdowney_at_[hidden]>
Date: Tue, 9 Jun 2020 20:03:31 -0400
On Tue, Jun 9, 2020, 19:39 Hubert Tong <hubert.reinterpretcast_at_[hidden]>
wrote:

> On Tue, Jun 9, 2020 at 7:12 PM Steve Downey <sdowney_at_[hidden]> wrote:
>
>> While I understand what you are asking for, and I agree it doesn't seem
>> unreasonable, I also don't see how it works with the machinery today.
>>
> I am not saying that the C++ wording today works for this by the letter
> (except for heavy-handed interpretations of phase 1). I consider it to be a
> bug that it doesn't.
>
>
>> All characters outside the basic source character set are mapped to
>> universal-character-names that are named by Unicode scalar values.
>> We'd need a mechanism to get back to the completely untranslated original
>> source.
>>
> I think this is similar to how raw string literals need some sort of
> mechanism.
>

Yes, a similar mechanism to places where we distinguish between
universal-character-names and the original spelling. If we can nail
something down in phase 1, it would be reasonable to allow an
implementation to not go through Unicode to transcode from actual source to
literal encoding.
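
As a rough illustration of the difference between the two paths (a toy sketch, not wording or any real implementation): every function name below is hypothetical, the "encodings" are made-up single-byte ones, and byte 0xFF is assumed to stand in for a source character that has no Unicode counterpart.

```cpp
#include <cstdint>

// Toy model only. -1 means "no mapping available".
// Assumption: source byte 0xFF has no Unicode counterpart.
inline long source_to_unicode(std::uint8_t source_byte) {
    if (source_byte == 0xFF) return -1;   // not representable in Unicode
    return source_byte;                   // toy mapping: byte value == scalar value
}

// Toy literal encoding: accepts only the scalar values 0x00..0xFE.
inline int unicode_to_literal(long scalar) {
    if (scalar < 0 || scalar > 0xFE) return -1;
    return static_cast<int>(scalar);
}

// Path A: what the current model describes, with Unicode as the pivot.
inline int transcode_via_unicode(std::uint8_t source_byte) {
    const long scalar = source_to_unicode(source_byte);
    if (scalar < 0) return -1;            // the character is lost at the pivot
    return unicode_to_literal(scalar);
}

// Path B: source -> literal encoding directly, with no pivot.
// Here the toy source and literal encodings are assumed to be the same.
inline int transcode_direct(std::uint8_t source_byte) {
    return source_byte;
}
```

For characters that Unicode does cover, both paths agree; only the direct path keeps the 0xFF character, which is the case the phase 1 wording would have to permit.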

>
>
>>
>> On Tue, Jun 9, 2020, 18:32 Hubert Tong <hubert.reinterpretcast_at_[hidden]>
>> wrote:
>>
>>> On Tue, Jun 9, 2020 at 5:21 PM Corentin Jabot <corentinjabot_at_[hidden]>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Tue, 9 Jun 2020 at 23:06, Hubert Tong <
>>>> hubert.reinterpretcast_at_[hidden]> wrote:
>>>>
>>>>> On Tue, Jun 9, 2020 at 4:59 PM Corentin Jabot <corentinjabot_at_[hidden]>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, 9 Jun 2020 at 22:17, Hubert Tong <
>>>>>> hubert.reinterpretcast_at_[hidden]> wrote:
>>>>>>
>>>>>>> On Tue, Jun 9, 2020 at 1:01 PM Corentin Jabot via SG16 <
>>>>>>> sg16_at_[hidden]> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, 9 Jun 2020 at 18:45, Steve Downey <sdowney_at_[hidden]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> One thing I have realized while working on identifiers is that
>>>>>>>>> after conversion from whatever the sources are, lexing and parsing are
>>>>>>>>> symbolic. That is, 'a' doesn't have a value until it's rendered into a
>>>>>>>>> literal. That is, "The values of the members of the execution
>>>>>>>>> character sets and the sets of additional members are
>>>>>>>>> locale-specific."
>>>>>>>>> <http://eel.is/c++draft/lex.charset#3.sentence-5> really only comes
>>>>>>>>> into play when rendering the "execution character set" into characters or
>>>>>>>>> strings. The execution character set and the source character set exist in
>>>>>>>>> the same logical space right now, and the "source character set" isn't what
>>>>>>>>> is in source files today.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Yep, and they don't have to have a value either: identifiers are
>>>>>>>> not sorted, etc.
>>>>>>>> Everything in lex is symbolic anyway; the phases don't exist in
>>>>>>>> practice.
>>>>>>>> However, the internal representation being isomorphic to
>>>>>>>> Unicode, it would be possible to describe it in terms of Unicode with no
>>>>>>>> observable behavior change.
>>>>>>>>
>>>>>>> I would like to allow characters not present in Unicode within
>>>>>>> character literals, string literals, comments, and header names. More
>>>>>>> abstractly, I would like to allow source -> encoding-used-for-output
>>>>>>> conversion.
>>>>>>>
>>>>>>
>>>>>> Do you have an example of a use case you want to support?
>>>>>>
>>>>> I am still evaluating the round-trip mapping for EBCDIC.
>>>>>
>>>>
>>>> I believe Unicode -> EBCDIC round-trips perfectly using the process
>>>> described in https://www.unicode.org/reports/tr16/tr16-8.html
>>>> The tricky part is the control characters, which this TR maps to the C1
>>>> Unicode control characters.
>>>>
>>> I'm not questioning the ability to round-trip. I am questioning the
>>> ability to avoid conflating certain EBCDIC control characters with certain
>>> C1 control characters. For example, it seems clear to me that U+0096 START
>>> OF GUARDED AREA and U+0097 END OF GUARDED AREA are paired in the intended
>>> usage, but the mapping of these to, respectively, Numeric Backspace and
>>> Graphic Escape does not retain semantic meaning. If such EBCDIC characters
>>> appear within a literal that should be encoded in a Unicode encoding, I
>>> find it questionable whether the string is considered well-formed. I
>>> have similar thoughts for the case where a UCN escape for such a C1 control
>>> character appears in a string that is to be encoded in EBCDIC.
>>>
>>> In other words, I do not consider the mapping (which is useful if you
>>> track out-of-band whether the data was originally EBCDIC or not) to
>>> establish the presence of the EBCDIC control characters in Unicode. These
>>> opinions do not necessarily represent those of IBM.
>>>
>>> -- HT
>>>
>>
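
The conflation concern quoted above can be made concrete with a small sketch. The EBCDIC byte values 0x36 (Numeric Backspace) and 0x08 (Graphic Escape) are an assumption taken from common EBCDIC code page tables such as IBM-037; the thread names only the characters. The struct and function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// One row of an EBCDIC-control -> Unicode mapping, with the meaning on each side.
struct ControlMapping {
    std::uint8_t ebcdic;        // assumed EBCDIC byte value (IBM-037 style table)
    const char*  ebcdic_name;   // meaning on the EBCDIC side
    char32_t     unicode;       // C1 code point the mapping assigns
    const char*  unicode_name;  // meaning on the Unicode side
};

// The two pairs discussed in the thread.
static const ControlMapping kMappings[] = {
    {0x36, "NUMERIC BACKSPACE", U'\u0096', "START OF GUARDED AREA"},
    {0x08, "GRAPHIC ESCAPE",    U'\u0097', "END OF GUARDED AREA"},
};

// Look up a mapping row by EBCDIC byte; returns nullptr if the byte is not listed.
inline const ControlMapping* find_by_ebcdic(std::uint8_t byte) {
    for (std::size_t i = 0; i < sizeof(kMappings) / sizeof(kMappings[0]); ++i) {
        if (kMappings[i].ebcdic == byte) return &kMappings[i];
    }
    return nullptr;
}

// True if the code point lies in the C1 control range U+0080..U+009F.
inline bool is_c1_control(char32_t c) {
    return c >= 0x80 && c <= 0x9F;
}
```

The mapping is a bijection on code values, so it round-trips, but the names on the two sides are unrelated: two paired Unicode controls land on two EBCDIC controls that have nothing to do with each other. That is why the mapping alone does not show that the EBCDIC controls exist in Unicode.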

Received on 2020-06-09 19:06:48