sg16: Re: [SG16] Reminder: SG16 telecon tomorrow (Wednesday, 2020-06-10)

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Tue, 9 Jun 2020 19:14:22 +0200

On Tue, 9 Jun 2020 at 19:08, Steve Downey <sdowney_at_[hidden]> wrote:

> In early implementations of modernish C, once locales had been invented,
> the C compiler's locale, distinct from the "C" locale, was what was used to
> determine the values of characters.
>

And I believe this is still the case in some implementations.
But given that the execution character set is implementation-defined, it
doesn't seem necessary to specify whether that encoding is derived from
* a flag
* the source file encoding
* the compiler locale
* some other heuristic

all of which are, I believe, strategies in use.
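
As a rough illustration (a sketch, not anything from the standard or from
P1859): the small program fragment below renders the code units of a narrow
string literal as hex. The helper names code_units_hex and
e_acute_literal_hex are made up for this example. The character is written
as a universal-character-name so that the source file encoding plays no
role; what comes out depends only on the narrow literal encoding the
implementation picked, however it picked it.

```cpp
#include <string>

// Hypothetical helper: format each code unit of a NUL-terminated narrow
// string as two uppercase hex digits, separated by single spaces.
std::string code_units_hex(const char* s) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (; *s != '\0'; ++s) {
        // Go through unsigned char so that code units >= 0x80 are not
        // sign-extended on platforms where plain char is signed.
        const unsigned char cu = static_cast<unsigned char>(*s);
        if (!out.empty()) out += ' ';
        out += digits[cu >> 4];
        out += digits[cu & 0x0F];
    }
    return out;
}

// Hypothetical helper: the code units chosen by the implementation for
// U+00E9 in a narrow string literal. The result is implementation-defined:
// "C3 A9" is what a UTF-8 literal encoding would give, "E9" is what a
// Latin-1 literal encoding would give.
std::string e_acute_literal_hex() {
    return code_units_hex("\u00e9");
}
```

Printing e_acute_literal_hex() from main shows the choice made for a given
build. With GCC, for instance, -fexec-charset changes the result without
touching the source file; MSVC has /execution-charset for the same purpose.
The program cannot tell which of the strategies listed above produced the
encoding, which is the point.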

>
> On Tue, Jun 9, 2020 at 1:01 PM Corentin Jabot <corentinjabot_at_[hidden]>
> wrote:
>
>>
>>
>> On Tue, 9 Jun 2020 at 18:45, Steve Downey <sdowney_at_[hidden]> wrote:
>>
>>> One thing I have realized while working on identifiers is that after
>>> conversion from whatever the sources are, lexing and parsing are symbolic.
>>> That is, 'a' doesn't have a value until it's rendered into a literal. That
>>> is, "The values of the members of the execution character sets and the
>>> sets of additional members are locale-specific"
>>> <http://eel.is/c++draft/lex.charset#3.sentence-5> really only comes into
>>> play when rendering the "execution character set" into characters or
>>> strings. The execution character set and the source character set exist in
>>> the same logical space right now, and the "source character set" isn't what
>>> is in source files today.
>>>
>>
>> Yep, and they don't have to have a value either. Identifiers are not
>> sorted, etc.
>> Everything in lex is symbolic anyway; the phases don't exist in practice.
>> However, the internal representation being isomorphic to Unicode, it
>> would be possible to describe it in terms of Unicode with no observable
>> behavior change.
>> aka 'a' doesn't have a value but it is still the 'a' abstract character
>> which is represented by U+0061 in Unicode.
>>
>> I believe that sentence to be, however, very misleading.
>> The execution encodings are implementation-defined rather than
>> locale-specific.
>> It becomes locale-specific at runtime, and the wording doesn't distinguish
>> at all between compile time and runtime.
>>
>> And yeah, source character set is... the minimal subset of the internal
>> representation character set
>>
>>
>>>
>>> On Tue, Jun 9, 2020 at 12:03 PM Corentin Jabot <corentinjabot_at_[hidden]>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Tue, 9 Jun 2020 at 16:48, Tom Honermann via SG16 <
>>>> sg16_at_[hidden]> wrote:
>>>>
>>>>> This is your friendly reminder that an SG16 telecon will be held
>>>>> tomorrow, Wednesday June 10th, at 19:30 UTC (timezone conversion
>>>>> <https://www.timeanddate.com/worldclock/converter.html?iso=20200610T193000&p1=1440>).
>>>>> To attend, visit https://bluejeans.com/140274541 at the start of the
>>>>> meeting.
>>>>>
>>>>> The agenda for the meeting is:
>>>>>
>>>>> - Discuss terminology updates to strive for in C++23
>>>>> - P1859R0: Standard terminology character sets and encodings
>>>>> <https://wg21.link/p1859>
>>>>> - Establish priorities for terms to address.
>>>>> - Establish a methodology for drafting wording updates.
>>>>>
>>>>> Anticipated decisions to be made at this meeting include:
>>>>>
>>>>> - Prioritization of terminology updates to pursue.
>>>>>
>>>>> Prior to tomorrow's meeting, please:
>>>>>
>>>>> - review P1859R0, particularly the proposed terminology.
>>>>> - think of other terminology changes to be considered.
>>>>> - think of how we can divide up the work for making terminology
>>>>> updates.
>>>>>
>>>>> Hey!
>>>> Some feedback on P1859 after a first attempt at rewording the standard.
>>>>
>>>> I will start by saying that it seems entirely reasonable and useful to
>>>> rewrite [lex] in terms of this new terminology, and I think that trying
>>>> to split that work would end up being counterproductive (however, the
>>>> library wording, which has its own definitions, could be reworded
>>>> independently). It is not that much work and I'm willing to do that
>>>> work.
>>>>
>>>> I found that I needed to use the following terms as defined by the
>>>> Unicode Standard
>>>>
>>>> * abstract character
>>>> * character set
>>>> * character encoding
>>>> * code units, codepoint
>>>>
>>>> (we can bikeshed codepoint vs scalar values in the grammar as UCNs are
>>>> technically scalar values)
>>>>
>>>> The notion of character repertoire was not useful, that of character
>>>> set is sufficient.
>>>>
>>>> The notion of basic source character set could be removed, instead
>>>> describing lexing after phase 1 entirely in terms of Unicode - a couple of
>>>> library functions would have to be reworded, as well as a note in the
>>>> description of user defined literals as they use "basic source character
>>>> set" as a proxy to describe something else.
>>>>
>>>> In particular, it is useful to separate entirely the notions of source
>>>> encoding (which only exists in phase 1), internal representation, and
>>>> literal encodings. These are 3 distinct and unrelated categories of
>>>> character sets and encodings, which should have no relation to each other
>>>> beyond the existence of a unidirectional mapping from source to internal
>>>> and internal to literal, so I think it would be valuable not to describe
>>>> them in terms of each other.
>>>>
>>>> It is useful to be able to talk about the Unicode character set rather
>>>> than "the character set described in ISO/IEC 10646".
>>>> The U+xxxx notation (plus Unicode character names) is also useful to
>>>> describe specific codepoints in the grammar.
>>>>
>>>> Similarly, the basic execution character set is not a very useful
>>>> notion, as it is only used as a mechanism to describe which
>>>> characters are in the execution and execution wide character sets.
>>>> While I didn't try to do it, I think it makes sense to rename execution
>>>> character set to something like narrow/wide literal character sets, in the
>>>> vein of what P1859 proposes.
>>>>
>>>> It is useful to be able to talk about both literal encoding and literal
>>>> character sets for each type of literal (a given encoding
>>>> implicitly represents a character set).
>>>>
>>>> The notion of dynamic encoding proposed by P1859 and its relation to
>>>> the literal encoding are not needed in lex and might be better described in
>>>> library, although a note in lex might not hurt.
>>>>
>>>> While I have not done that work yet, it seems useful to describe in the
>>>> grammar, in terms of Unicode codepoints, what constitutes whitespace as
>>>> well as a new line.
>>>>
>>>> With the exception of "character literal" (and "abstract character"),
>>>> it seems valuable to systematically replace the use of the vacuous term
>>>> "character" in the core wording.
>>>> That might be slightly more involved in library, as "character" is used
>>>> all over the place, usually to mean "code unit".
>>>>
>>>> The pdf attached is meant to be illustrative of the scope of changes in
>>>> the core wording, and also contains a number of design changes that are
>>>> mostly out of scope of the terminology discussion (it is also full of
>>>> bugs). These design changes will appear in a paper in more detail soon™.
>>>>
>>>> It notably incorporates changes from P2029 which go a long way in
>>>> improving the way character literals are described.
>>>>
>>>> Hope that helps,
>>>> Corentin
>>>>
>>>>
>>>>> Tom.
>>>>> --
>>>>> SG16 mailing list
>>>>> SG16_at_[hidden]
>>>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>>>
>>>>

Received on 2020-06-09 12:17:42