Subject: Re: Reminder: SG16 telecon tomorrow (Wednesday, 2020-06-10)
From: Corentin Jabot (corentinjabot_at_[hidden])
Date: 2020-06-09 15:59:31


On Tue, 9 Jun 2020 at 22:17, Hubert Tong <hubert.reinterpretcast_at_[hidden]>
wrote:

> On Tue, Jun 9, 2020 at 1:01 PM Corentin Jabot via SG16 <
> sg16_at_[hidden]> wrote:
>
>>
>>
>> On Tue, 9 Jun 2020 at 18:45, Steve Downey <sdowney_at_[hidden]> wrote:
>>
>>> One thing I have realized while working on identifiers is that after
>>> conversion from whatever the sources are, lexing and parsing are symbolic.
>>> That is, 'a' doesn't have a value until it's rendered into a literal. That
>>> is, "The values of the members of the execution character sets and the
>>> sets of additional members are locale-specific"
>>> <http://eel.is/c++draft/lex.charset#3.sentence-5> really only comes into
>>> play when rendering the "execution character set" into characters or
>>> strings. The execution character set and the source character set exist in
>>> the same logical space right now, and the "source character set" isn't what
>>> is in source files today.
>>>
>>
>> Yep, and they don't have to have a value either: identifiers are not
>> sorted, etc.
>> Everything in lex is symbolic anyway; the phases don't exist in practice.
>> However, the internal representation being isomorphic to Unicode, it
>> would be possible to describe it in terms of Unicode with no observable
>> behavior change.
>>
> I would like to allow characters not present in Unicode within character
> literals, string literals, comments, and header names. More abstractly, I
> would like to allow source -> encoding-used-for-output conversion.
>

Do you have an example of a use case you want to support?
There are 3 scenarios:

   - The character exists in no digital encoding yet. That is the paper
   implementation case, and there is nothing we can do: you can't have
   Klingon in your C++.
   - The character exists in a digital encoding but not in Unicode. This
   covers a small number of characters in the Big5 encodings, almost all
   pertaining to place names and people's names. Unicode documents a
   mapping for Windows's Big5 code page.
   - The character has a non-unique mapping to Unicode, such that a
   conversion source -> Unicode -> execution might differ from a
   conversion source -> execution. In this case an implementation can convert
   source -> execution directly (taking care of UCNs and other escape
   sequences), as it is otherwise not observable. This use case is actually
   common and important, notably for Shift-JIS and ambiguities introduced by
   Han Unification.
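
To make the third scenario concrete, here is a minimal C++ sketch. All
tables and code unit values below are invented for illustration; they are
not real Shift-JIS, Big5, or any other code page data, and the function
names are hypothetical. The only point is that when two distinct source
characters share one Unicode code point, going through Unicode collapses
them, while a direct source -> execution table can keep them apart.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical toy encodings: every value below is made up for illustration.
using SourceUnit = std::uint16_t;  // a code unit in the source encoding
using ExecUnit   = std::uint16_t;  // a code unit in the execution encoding
using CodePoint  = char32_t;       // a Unicode code point

// Two distinct source characters map to the same Unicode code point
// (a non-unique mapping, as in the third scenario).
const std::map<SourceUnit, CodePoint> source_to_unicode = {
    {0x8101, U'\u2252'},
    {0x8102, U'\u2252'},
};

// From Unicode there is only one possible execution character to pick.
const std::map<CodePoint, ExecUnit> unicode_to_exec = {
    {U'\u2252', 0x9101},
};

// A direct table can preserve the distinction between the two characters.
const std::map<SourceUnit, ExecUnit> source_to_exec = {
    {0x8101, 0x9101},
    {0x8102, 0x9102},
};

// source -> Unicode -> execution (throws std::out_of_range if unmapped).
ExecUnit convert_via_unicode(SourceUnit s) {
    return unicode_to_exec.at(source_to_unicode.at(s));
}

// source -> execution directly (throws std::out_of_range if unmapped).
ExecUnit convert_direct(SourceUnit s) {
    return source_to_exec.at(s);
}

// Small self-check that can be called from main().
void demo() {
    // Through Unicode, the two source characters become indistinguishable.
    assert(convert_via_unicode(0x8101) == convert_via_unicode(0x8102));
    // The direct conversion keeps them distinct.
    assert(convert_direct(0x8101) != convert_direct(0x8102));
}
```

An implementation that takes the direct path would still need to handle
UCNs and escape sequences separately, which this sketch does not attempt.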

>
>
>> aka 'a' doesn't have a value but it is still the 'a' abstract character,
>> which is represented by U+0061 in Unicode.
>>
>> I believe that sentence to be, however, very misleading.
>> The execution encodings are implementation-defined rather than
>> locale-specific.
>> It becomes locale-specific at runtime, and the wording doesn't distinguish
>> at all between compile time and runtime.
>>
>> And yeah, source character set is... the minimal subset of the internal
>> representation character set
>>
>>
>>>
>>> On Tue, Jun 9, 2020 at 12:03 PM Corentin Jabot <corentinjabot_at_[hidden]>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Tue, 9 Jun 2020 at 16:48, Tom Honermann via SG16 <
>>>> sg16_at_[hidden]> wrote:
>>>>
>>>>> This is your friendly reminder that an SG16 telecon will be held
>>>>> tomorrow, Wednesday June 10th, at 19:30 UTC (timezone conversion
>>>>> <https://www.timeanddate.com/worldclock/converter.html?iso=20200610T193000&p1=1440>).
>>>>> To attend, visit https://bluejeans.com/140274541 at the start of the
>>>>> meeting.
>>>>>
>>>>> The agenda for the meeting is:
>>>>>
>>>>> - Discuss terminology updates to strive for in C++23
>>>>> - P1859R0: Standard terminology character sets and encodings
>>>>> <https://wg21.link/p1859>
>>>>> - Establish priorities for terms to address.
>>>>> - Establish a methodology for drafting wording updates.
>>>>>
>>>>> Anticipated decisions to be made at this meeting include:
>>>>>
>>>>> - Prioritization of terminology updates to pursue.
>>>>>
>>>>> Prior to tomorrow's meeting, please:
>>>>>
>>>>> - review P1859R0, particularly the proposed terminology.
>>>>> - think of other terminology changes to be considered.
>>>>> - think of how we can divide up the work for making terminology
>>>>> updates.
>>>>>
>>>> Hey!
>>>> Some feedback on P1859 after a first attempt at rewording the standard.
>>>>
>>>> I will start by saying that it seems entirely reasonable and useful to
>>>> rewrite [lex] in terms of this new
>>>> terminology, and I think that trying to split that work would end up
>>>> being counterproductive (however, the library wording, which has its own
>>>> definitions, could be reworded
>>>> independently). It is not that much work and I'm willing to do that
>>>> work.
>>>>
>>>> I found that I needed to use the following terms as defined by the
>>>> Unicode Standard
>>>>
>>>> * abstract character
>>>> * character set
>>>> * character encoding
>>>> * code units, codepoint
>>>>
>>>> (we can bikeshed codepoint vs scalar values in the grammar as UCNs are
>>>> technically scalar values)
>>>>
>>>> The notion of character repertoire was not useful; that of character
>>>> set is sufficient.
>>>>
>>>> The notion of basic source character set could be removed, instead
>>>> describing lexing after phase 1 entirely in terms of Unicode - a couple of
>>>> library functions would have to be reworded, as well as a note in the
>>>> description of user defined literals as they use "basic source character
>>>> set" as a proxy to describe something else.
>>>>
>>>> In particular, it is useful to separate entirely the notions of source
>>>> encoding (which only exists in phase 1), internal representation, and
>>>> literal encodings. These are 3 distinct and unrelated categories of
>>>> character sets and encodings, which should have no relation to each other
>>>> beyond the existence of a unidirectional mapping from source to internal
>>>> and internal to literal, so I think it would be valuable not to describe
>>>> them in terms of each other.
>>>>
>>>> It is useful to be able to talk about the Unicode character set rather
>>>> than "the character set described in ISO/IEC 10646".
>>>> The U+xxxx notation (+ Unicode character names) is also useful to
>>>> describe specific codepoints in the grammar.
>>>>
>>>> Similarly, the basic execution character set is not a very useful
>>>> notion, as it is only used as a mechanism to describe which
>>>> characters are in the execution and execution wide character sets.
>>>> While I didn't try to do it, I think it makes sense to rename execution
>>>> character set to something like narrow/wide literal character sets, in the
>>>> vein of what P1859 proposes.
>>>>
>>>> It is useful to be able to talk about both literal encoding and literal
>>>> character sets for each type of literal (a given encoding
>>>> implicitly represents a character set).
>>>>
>>>> The notion of dynamic encoding proposed by P1859 and its relation to
>>>> the literal encoding are not needed in lex and might be better described in
>>>> library, although a note in lex might not hurt.
>>>>
>>>> While I have not done that work yet, it seems useful to describe in the
>>>> grammar, in terms of Unicode codepoints, what constitutes a whitespace as
>>>> well as a new line.
>>>>
>>>> With the exception of "character literal" (and "abstract character" )
>>>> it seems valuable to systematically replace the use of the vacuous term
>>>> "character" in the core wording.
>>>> That might be slightly more involved in library as "character" is used
>>>> all over the place, usually to mean "code unit"
>>>>
>>>> The PDF attached is meant to be illustrative of the scope of changes in
>>>> the core wording, and also contains a number of design changes that are
>>>> mostly out of scope of the terminology discussion (it is also full of
>>>> bugs). These design changes will appear in a paper in more detail soon™
>>>>
>>>> It notably incorporates changes from P2029 which go a long way in
>>>> improving the way character literals are described.
>>>>
>>>> Hope that helps,
>>>> Corentin
>>>>
>>>>
>>>>> Tom.
>>>>> --
>>>>> SG16 mailing list
>>>>> SG16_at_[hidden]
>>>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>>>
>>
>