sg16: Re: [SG16] Reminder: SG16 telecon tomorrow (Wednesday, 2020-06-10)

From: Steve Downey <sdowney_at_[hidden]>
Date: Tue, 9 Jun 2020 12:45:09 -0400

One thing I have realized while working on identifiers is that after
conversion from whatever the sources are, lexing and parsing are symbolic.
That is, 'a' doesn't have a value until it's rendered into a literal. In
other words, the sentence "The values of the members of the execution
character sets and the sets of additional members are locale-specific."
<http://eel.is/c++draft/lex.charset#3.sentence-5> really only comes into play
when rendering the "execution character set" into characters or strings.
The execution character set and the source character set exist in the same
logical space right now, and the "source character set" isn't what is in
source files today.
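
A minimal sketch of the point about symbolic characters, assuming a C++11 or
later compiler. The names `LiteralValues`, `values_of_a`, and
`check_literal_values` are hypothetical, introduced only for illustration.
The same spelling acquires a number only once a literal kind (and therefore
an encoding) is chosen, and only the narrow value depends on the
implementation:

```cpp
#include <cassert>

// Hypothetical aggregate for illustration: the numeric value the compiler
// assigned to the same symbolic character 'a' in three literal kinds.
struct LiteralValues {
    unsigned long narrow;  // depends on the narrow literal encoding
    unsigned long utf16;   // value of u'a'
    unsigned long utf32;   // value of U'a'
};

constexpr LiteralValues values_of_a() {
    return LiteralValues{static_cast<unsigned char>('a'), u'a', U'a'};
}

// u'a' and U'a' are specified in terms of ISO/IEC 10646 code point values,
// so these checks do not depend on the execution character set.
static_assert(values_of_a().utf16 == 0x61, "u'a' is U+0061");
static_assert(values_of_a().utf32 == 0x61, "U'a' is U+0061");

// Runtime mirror of the checks above. The narrow value is deliberately not
// compared against a fixed number: it is 0x61 in ASCII-based encodings but,
// for example, 0x81 in EBCDIC code page 037.
inline void check_literal_values() {
    const LiteralValues v = values_of_a();
    assert(v.utf16 == 0x61);
    assert(v.utf32 == 0x61);
    assert(v.narrow == static_cast<unsigned char>('a'));
}
```

Nothing in the lexer needs `narrow` to be known; it only matters when the
literal is rendered.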

On Tue, Jun 9, 2020 at 12:03 PM Corentin Jabot <corentinjabot_at_[hidden]>
wrote:

>
>
> On Tue, 9 Jun 2020 at 16:48, Tom Honermann via SG16 <sg16_at_[hidden]>
> wrote:
>
>> This is your friendly reminder that an SG16 telecon will be held
>> tomorrow, Wednesday June 10th, at 19:30 UTC (timezone conversion
>> <https://www.timeanddate.com/worldclock/converter.html?iso=20200610T193000&p1=1440>).
>> To attend, visit https://bluejeans.com/140274541 at the start of the
>> meeting.
>>
>> The agenda for the meeting is:
>>
>> - Discuss terminology updates to strive for in C++23
>>   - P1859R0: Standard terminology character sets and encodings
>>     <https://wg21.link/p1859>
>>   - Establish priorities for terms to address.
>>   - Establish a methodology for drafting wording updates.
>>
>> Anticipated decisions to be made at this meeting include:
>>
>> - Prioritization of terminology updates to pursue.
>>
>> Prior to tomorrow's meeting, please:
>>
>> - review P1859R0, particularly the proposed terminology.
>> - think of other terminology changes to be considered.
>> - think of how we can divide up the work for making terminology
>> updates.
>>
> Hey!
> Some feedback on P1859 after a first attempt at rewording the standard.
>
> I will start by saying that it seems entirely reasonable and useful to
> rewrite [lex] in terms of this new terminology, and I think that trying
> to split that work would end up being counterproductive (however, the
> library wording, which has its own definitions, could be reworded
> independently). It is not that much work and I'm willing to do it.
>
> I found that I needed to use the following terms as defined by the Unicode
> Standard:
>
> * abstract character
> * character set
> * character encoding
> * code units, codepoint
>
> (We can bikeshed "codepoint" vs. "scalar value" in the grammar, as UCNs are
> technically scalar values.)
>
> The notion of character repertoire was not useful; that of character set
> is sufficient.
>
> The notion of basic source character set could be removed, instead
> describing lexing after phase 1 entirely in terms of Unicode. A couple of
> library functions would have to be reworded, as well as a note in the
> description of user-defined literals, as they use "basic source character
> set" as a proxy to describe something else.
>
> In particular, it is useful to separate entirely the notions of source
> encoding (which only exists in phase 1), internal representation, and
> literal encodings. These are three distinct and unrelated categories of
> character sets and encodings, which should have no relation to each other
> beyond the existence of a unidirectional mapping from source to internal
> and from internal to literal, so I think it would be valuable not to
> describe them in terms of each other.
>
> It is useful to be able to talk about the Unicode character set rather
> than "the character set described in ISO/IEC 10646".
> The U+xxxx notation (plus Unicode character names) is also useful to
> describe specific codepoints in the grammar.
>
> Similarly, the basic execution character set is not a very useful notion,
> as it is only used as a mechanism to describe which
> characters are in the execution and execution wide character sets.
> While I didn't try to do it, I think it makes sense to rename execution
> character set to something like narrow/wide literal character sets, in the
> vein of what P1859 proposes.
>
> It is useful to be able to talk about both literal encoding and literal
> character sets for each type of literal (a given encoding
> implicitly represents a character set).
>
> The notion of dynamic encoding proposed by P1859 and its relation to the
> literal encoding are not needed in [lex] and might be better described in
> the library wording, although a note in [lex] might not hurt.
>
> While I have not done that work yet, it seems useful to describe in the
> grammar, in terms of Unicode codepoints, what constitutes whitespace as
> well as a new line.
>
> With the exception of "character literal" (and "abstract character"), it
> seems valuable to systematically replace the use of the vacuous term
> "character" in the core wording.
> That might be slightly more involved in the library, as "character" is used
> all over the place, usually to mean "code unit".
>
> The PDF attached is meant to be illustrative of the scope of changes in
> the core wording, and also contains a number of design changes that are
> mostly out of scope of the terminology discussion (it is also full of
> bugs). These design changes will appear in a paper in more detail soon™.
>
> It notably incorporates changes from P2029 which go a long way in
> improving the way character literals are described.
>
> Hope that helps,
> Corentin
>
>
>> Tom.
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>
>
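
The Unicode terms listed in the quoted message (code point, scalar value,
code unit) can be made concrete with a short sketch. This is an illustration
under stated assumptions, not wording from P1859 or the attached draft: the
helper names `is_code_point`, `is_scalar_value`, `to_utf8`, and
`check_examples` are hypothetical, and UTF-8 is picked only as one example of
a character encoding that maps scalar values to code units.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// A code point is any value in the Unicode codespace, U+0000..U+10FFFF.
constexpr bool is_code_point(std::uint32_t v) { return v <= 0x10FFFF; }

// A scalar value is a code point that is not a surrogate (U+D800..U+DFFF).
// A universal-character-name may not name a surrogate, which is why UCNs
// are, strictly speaking, scalar values.
constexpr bool is_scalar_value(std::uint32_t v) {
    return is_code_point(v) && !(v >= 0xD800 && v <= 0xDFFF);
}

// Encode one scalar value as a sequence of UTF-8 code units.
// Returns an empty string when v is not a scalar value.
inline std::string to_utf8(std::uint32_t v) {
    std::string out;
    if (!is_scalar_value(v)) return out;
    if (v < 0x80) {
        out += static_cast<char>(v);
    } else if (v < 0x800) {
        out += static_cast<char>(0xC0 | (v >> 6));
        out += static_cast<char>(0x80 | (v & 0x3F));
    } else if (v < 0x10000) {
        out += static_cast<char>(0xE0 | (v >> 12));
        out += static_cast<char>(0x80 | ((v >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (v & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (v >> 18));
        out += static_cast<char>(0x80 | ((v >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((v >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (v & 0x3F));
    }
    return out;
}

// One code point, several code units: U+20AC takes three UTF-8 code units.
inline void check_examples() {
    assert(is_code_point(0xD800) && !is_scalar_value(0xD800));
    assert(to_utf8(0x20AC) == "\xE2\x82\xAC");
    assert(to_utf8(0xD800).empty());
}
```

The split between `is_code_point` and `is_scalar_value` is the distinction
being bikeshedded for the grammar; `to_utf8` shows why "character" is
ambiguous once a single code point becomes several code units.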

Received on 2020-06-09 11:48:30