sg16: Re: [SG16] Reminder: SG16 telecon tomorrow (Wednesday, 2020-06-10)

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Tue, 9 Jun 2020 23:21:29 +0200

On Tue, 9 Jun 2020 at 23:06, Hubert Tong <hubert.reinterpretcast_at_[hidden]>
wrote:

> On Tue, Jun 9, 2020 at 4:59 PM Corentin Jabot <corentinjabot_at_[hidden]>
> wrote:
>
>>
>>
>> On Tue, 9 Jun 2020 at 22:17, Hubert Tong <
>> hubert.reinterpretcast_at_[hidden]> wrote:
>>
>>> On Tue, Jun 9, 2020 at 1:01 PM Corentin Jabot via SG16 <
>>> sg16_at_[hidden]> wrote:
>>>
>>>>
>>>>
>>>> On Tue, 9 Jun 2020 at 18:45, Steve Downey <sdowney_at_[hidden]> wrote:
>>>>
>>>>> One thing I have realized while working on identifiers is that after
>>>>> conversion from whatever the sources are, lexing and parsing are symbolic.
>>>>> That is, 'a' doesn't have a value until it's rendered into a literal. That
>>>>> is, "The values of the members of the execution character sets and
>>>>> the sets of additional members are locale-specific."
>>>>> <http://eel.is/c++draft/lex.charset#3.sentence-5> really only comes
>>>>> into play when rendering the "execution character set" into characters or
>>>>> strings. The execution character set and the source character set exist in
>>>>> the same logical space right now, and the "source character set" isn't what
>>>>> is in source files today.
>>>>>
>>>>
>>>> Yep, and they don't have to have a value either: identifiers are not
>>>> sorted, etc.
>>>> Everything in lex is symbolic anyway; the phases don't exist in practice.
>>>> However, the internal representation being isomorphic to Unicode,
>>>> it would be possible to describe it in terms of Unicode with no observable
>>>> behavior change.
>>>>
>>> I would like to allow characters not present in Unicode within character
>>> literals, string literals, comments, and header names. More abstractly, I
>>> would like to allow source -> encoding-used-for-output conversion.
>>>
>>
>> Do you have an example of a use case you want to support?
>>
> I am still evaluating the round-trip mapping for EBCDIC.
>

I believe Unicode -> EBCDIC round-trips perfectly using the process
described in https://www.unicode.org/reports/tr16/tr16-8.html
The tricky part is the control characters, which this TR maps to the C1
Unicode control characters.
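
To make the round-trip question concrete, here is a minimal Python sketch. It assumes Python's built-in "cp037" codec as a stand-in for an EBCDIC code page; that is a single-byte code page, not the UTF-EBCDIC transformation described in the TR, and the function names are made up for illustration. It checks every byte value for a lossless trip through Unicode and lists which bytes land on C1 controls (U+0080 to U+009F).

```python
# Sketch only. Assumption: Python's "cp037" codec stands in for "EBCDIC".
# It is a single-byte code page, not UTF-EBCDIC as specified in UTR #16.

def non_round_tripping_bytes(codec):
    """Byte values that do not survive bytes -> Unicode -> bytes."""
    failures = []
    for value in range(256):
        original = bytes([value])
        try:
            if original.decode(codec).encode(codec) != original:
                failures.append(value)
        except UnicodeError:
            # Unmapped in either direction counts as a failed round trip.
            failures.append(value)
    return failures


def bytes_decoding_to_c1(codec):
    """Byte values whose decoded character is a C1 control (U+0080..U+009F)."""
    hits = []
    for value in range(256):
        try:
            text = bytes([value]).decode(codec)
        except UnicodeError:
            continue
        if len(text) == 1 and 0x80 <= ord(text) <= 0x9F:
            hits.append(value)
    return hits


if __name__ == "__main__":
    print("failed round trips:", non_round_tripping_bytes("cp037"))
    print("bytes mapped to C1 controls:", len(bytes_decoding_to_c1("cp037")))
    # 'a' is the abstract character U+0061; its code unit depends on the encoding.
    print(hex("a".encode("ascii")[0]), hex("a".encode("cp037")[0]))
```

An empty list from non_round_tripping_bytes means every byte value of that code page survives the trip through Unicode unchanged.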

>
>
>> There are 3 scenarios:
>>
>> - The character exists in no digital encoding yet - that is the paper
>> implementation case - nothing that we can do. You can't have Klingon in
>> your C++.
>> - The character exists in a digital encoding but not in Unicode. This
>> represents a small number of characters in the Big5 encodings, almost all
>> pertaining to place and personal names. Unicode documents a mapping for
>> Windows's Big5 code page.
>> - The character has a non-unique mapping to Unicode, such that a
>> conversion source -> Unicode -> execution might be different from a
>> conversion source -> execution. In this case an implementation can convert
>> source -> execution directly (taking care of UCNs and other escape
>> sequences) - as it is otherwise not observable. This use case is actually
>> common and important, notably for Shift-JIS and ambiguities introduced by
>> Han Unification.
>>
> This last case is more broken by applying normalization. Do you have an
> example where the mapping does not work even if normalization is not
> applied?
>

Encodings like Shift-JIS have multiple mappings for the same characters,
which might prevent round-tripping.
(here is a list)
https://books.google.fr/books?id=SA92uQqTB-AC&pg=PA287&lpg=PA287&dq=shift+jis+round+trip+U2252&source=bl&ots=GMCpYVq6Iv&redir_esc=y#v=onepage&q=shift%20jis%20round%20trip%20U2252&f=false

But a C++ compiler can preserve the exact same bytes when the source and
execution encodings are the same.
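
As a rough illustration of both points, the Python sketch below assumes the built-in "cp932" codec (Microsoft's Shift-JIS variant) as a stand-in for "Shift-JIS"; the function names are hypothetical. The first function scans double-byte sequences for characters that more than one sequence decodes to, which are the ones that cannot all round-trip through Unicode. The second shows the pass-through idea: when both encodings match, the bytes are kept as they are and the ambiguity is never observed.

```python
# Sketch only. Assumption: Python's "cp932" codec (Microsoft's Shift-JIS
# variant) stands in for "Shift-JIS"; all function names are illustrative.

def duplicate_decodings(codec):
    """Characters that more than one double-byte sequence decodes to."""
    seen = {}
    for lead in range(0x81, 0xFD):
        for trail in range(0x40, 0xFD):
            raw = bytes([lead, trail])
            try:
                text = raw.decode(codec)
            except UnicodeDecodeError:
                continue
            # Skip pairs that decode as two single-byte characters.
            if len(text) == 1:
                seen.setdefault(text, []).append(raw)
    return {ch: seqs for ch, seqs in seen.items() if len(seqs) > 1}


def transcode(raw, source, target):
    """Convert literal bytes from the source to the target encoding."""
    if source == target:
        # Same encoding on both sides: keep the bytes untouched.
        return raw
    # Otherwise go through Unicode, which may collapse duplicates.
    return raw.decode(source).encode(target)


if __name__ == "__main__":
    dups = duplicate_decodings("cp932")
    print(len(dups), "characters have more than one cp932 byte sequence")
    for ch, seqs in list(dups.items())[:5]:
        print(f"U+{ord(ch):04X}", [s.hex() for s in seqs])
```

For any character reported by duplicate_decodings, at most one of its byte sequences can come back unchanged after decoding and re-encoding, so only the same-encoding path in transcode is guaranteed to be lossless.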

>
>
>>
>>
>>
>>
>>>
>>>
>>>> aka 'a' doesn't have a value but it is still the 'a' abstract character,
>>>> which is represented by U+0061 in Unicode.
>>>>
>>>> I believe that sentence to be, however, very misleading.
>>>> The execution encodings are implementation-defined rather than
>>>> locale-specific.
>>>> It becomes locale-specific at runtime, and the wording doesn't distinguish
>>>> at all between compile time and runtime.
>>>>
>>>> And yeah, the source character set is... the minimal subset of the internal
>>>> representation character set.
>>>>
>>>>
>>>>>
>>>>> On Tue, Jun 9, 2020 at 12:03 PM Corentin Jabot <
>>>>> corentinjabot_at_[hidden]> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, 9 Jun 2020 at 16:48, Tom Honermann via SG16 <
>>>>>> sg16_at_[hidden]> wrote:
>>>>>>
>>>>>>> This is your friendly reminder that an SG16 telecon will be held
>>>>>>> tomorrow, Wednesday June 10th, at 19:30 UTC (timezone conversion
>>>>>>> <https://www.timeanddate.com/worldclock/converter.html?iso=20200610T193000&p1=1440>).
>>>>>>> To attend, visit https://bluejeans.com/140274541 at the start of
>>>>>>> the meeting.
>>>>>>>
>>>>>>> The agenda for the meeting is:
>>>>>>>
>>>>>>> - Discuss terminology updates to strive for in C++23
>>>>>>> - P1859R0: Standard terminology character sets and encodings
>>>>>>> <https://wg21.link/p1859>
>>>>>>> - Establish priorities for terms to address.
>>>>>>> - Establish a methodology for drafting wording updates.
>>>>>>>
>>>>>>> Anticipated decisions to be made at this meeting include:
>>>>>>>
>>>>>>> - Prioritization of terminology updates to pursue.
>>>>>>>
>>>>>>> Prior to tomorrow's meeting, please:
>>>>>>>
>>>>>>> - review P1859R0, particularly the proposed terminology.
>>>>>>> - think of other terminology changes to be considered.
>>>>>>> - think of how we can divide up the work for making terminology
>>>>>>> updates.
>>>>>>>
>>>>>> Hey!
>>>>>> Some feedback on P1859 after a first attempt at rewording the
>>>>>> standard.
>>>>>>
>>>>>> I will start by saying that it seems entirely reasonable and useful to
>>>>>> rewrite [lex] in terms of this new
>>>>>> terminology, and I think that trying to split that work would end up
>>>>>> being counterproductive (however, the library wording, which has its own
>>>>>> definitions, could be reworded
>>>>>> independently). It is not that much work and I'm willing to do that
>>>>>> work.
>>>>>>
>>>>>> I found that I needed to use the following terms as defined by the
>>>>>> Unicode Standard
>>>>>>
>>>>>> * abstract character
>>>>>> * character set
>>>>>> * character encoding
>>>>>> * code units, codepoint
>>>>>>
>>>>>> (we can bikeshed codepoint vs scalar values in the grammar as UCNs
>>>>>> are technically scalar values)
>>>>>>
>>>>>> The notion of character repertoire was not useful, that of character
>>>>>> set is sufficient.
>>>>>>
>>>>>> The notion of basic source character set could be removed, instead
>>>>>> describing lexing after phase 1 entirely in terms of Unicode - a couple of
>>>>>> library functions would have to be reworded, as well as a note in the
>>>>>> description of user defined literals as they use "basic source character
>>>>>> set" as a proxy to describe something else.
>>>>>>
>>>>>> In particular, it is useful to separate entirely the notions of
>>>>>> source encoding (which only exists in phase 1), internal representation,
>>>>>> and literal encodings. These are 3 distinct and unrelated categories of
>>>>>> character sets and encodings, which should have no relation to each other
>>>>>> beyond the existence of a unidirectional mapping from source to internal
>>>>>> and internal to literal, so I think it would be valuable not to describe
>>>>>> them in terms of each other.
>>>>>>
>>>>>> It is useful to be able to talk about the Unicode character set
>>>>>> rather than "the character set described in ISO/IEC 10646".
>>>>>> The U+xxxx notation (+ Unicode character names) is also useful to
>>>>>> describe specific codepoints in the grammar.
>>>>>>
>>>>>> Similarly, the basic execution character set is not a very useful
>>>>>> notion, as it is only used as a mechanism to describe which
>>>>>> characters are in the execution and execution wide character sets.
>>>>>> While I didn't try to do it, I think it makes sense to rename
>>>>>> execution character set to something like narrow/wide literal
>>>>>> character sets, in the vein of what P1859 proposes.
>>>>>>
>>>>>> It is useful to be able to talk about both literal encoding and
>>>>>> literal character sets for each type of literal (a given encoding
>>>>>> implicitly represents a character set).
>>>>>>
>>>>>> The notion of dynamic encoding proposed by P1859 and its relation to
>>>>>> the literal encoding are not needed in lex and might be better described in
>>>>>> library, although a note in lex might not hurt.
>>>>>>
>>>>>> While I have not done that work yet, it seems useful to describe in
>>>>>> the grammar, in terms of Unicode codepoints, what constitutes whitespace as
>>>>>> well as a new line.
>>>>>>
>>>>>> With the exception of "character literal" (and "abstract character"),
>>>>>> it seems valuable to systematically replace the use of the vacuous term
>>>>>> "character" in the core wording.
>>>>>> That might be slightly more involved in library as "character" is
>>>>>> used all over the place, usually to mean "code unit".
>>>>>>
>>>>>> The pdf attached is meant to be illustrative of the scope of changes
>>>>>> in the core wording, and also contains a number of design changes that are
>>>>>> mostly out of scope of the terminology discussion (it is also full of
>>>>>> bugs). These design changes will appear in a paper in more detail soon™
>>>>>>
>>>>>> It notably incorporates changes from P2029 which go a long way in
>>>>>> improving the way character literals are described.
>>>>>>
>>>>>> Hope that helps,
>>>>>> Corentin
>>>>>>
>>>>>>
>>>>>>> Tom.
>>>>>>> --
>>>>>>> SG16 mailing list
>>>>>>> SG16_at_[hidden]
>>>>>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>>>>>

Received on 2020-06-09 16:24:49