Re: [SG16] Reminder: SG16 telecon tomorrow (Wednesday, 2020-06-10)

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Tue, 9 Jun 2020 18:02:47 +0200
On Tue, 9 Jun 2020 at 16:48, Tom Honermann via SG16 <sg16_at_[hidden]>
wrote:

> This is your friendly reminder that an SG16 telecon will be held tomorrow,
> Wednesday June 10th, at 19:30 UTC (timezone conversion
> <https://www.timeanddate.com/worldclock/converter.html?iso=20200610T193000&p1=1440>).
> To attend, visit https://bluejeans.com/140274541 at the start of the
> meeting.
>
> The agenda for the meeting is:
>
> - Discuss terminology updates to strive for in C++23
> - P1859R0: Standard terminology character sets and encodings
> <https://wg21.link/p1859>
> - Establish priorities for terms to address.
> - Establish a methodology for drafting wording updates.
>
> Anticipated decisions to be made at this meeting include:
>
> - Prioritization of terminology updates to pursue.
>
> Prior to tomorrow's meeting, please:
>
> - review P1859R0, particularly the proposed terminology.
> - think of other terminology changes to be considered.
> - think of how we can divide up the work for making terminology
> updates.
>
Hey!
Some feedback on P1859 after a first attempt at rewording the standard.

I will start by saying that it seems entirely reasonable and useful to
rewrite [lex] in terms of this new terminology, and I think that trying to
split that work would end up being counterproductive (however, the library
wording, which has its own definitions, could be reworded independently).
It is not that much work and I'm willing to do it.

I found that I needed to use the following terms as defined by the Unicode
Standard:

* abstract character
* character set
* character encoding
* code unit, code point

(we can bikeshed code point vs scalar value in the grammar, as UCNs are
technically scalar values)
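
To make the distinction between these Unicode terms concrete, here is a
minimal sketch. The helper names are hypothetical and come from neither
P1859 nor the standard. It relies only on the Unicode definitions: code
points range over U+0000..U+10FFFF, and scalar values are the code points
that are not surrogates.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helpers for illustration only.

// A code point is any value in the Unicode codespace, U+0000..U+10FFFF.
constexpr bool is_code_point(std::uint32_t v) { return v <= 0x10FFFF; }

// A scalar value is a code point that is not a surrogate (U+D800..U+DFFF).
constexpr bool is_scalar_value(std::uint32_t v) {
    return is_code_point(v) && !(v >= 0xD800 && v <= 0xDFFF);
}

// Number of UTF-8 code units needed to encode one scalar value,
// or 0 if v is not a scalar value.
constexpr std::size_t utf8_code_units(std::uint32_t v) {
    return !is_scalar_value(v) ? 0
         : v < 0x80            ? 1
         : v < 0x800           ? 2
         : v < 0x10000         ? 3
                               : 4;
}

// U+D800 is a code point but not a scalar value.
static_assert(is_code_point(0xD800) && !is_scalar_value(0xD800), "");
// U+00E9 is one code point, encoded as two UTF-8 code units.
static_assert(utf8_code_units(0x00E9) == 2, "");
// U+1F600 is one code point, encoded as four UTF-8 code units.
static_assert(utf8_code_units(0x1F600) == 4, "");
```

Since a universal-character-name may not designate a surrogate, every UCN
would satisfy is_scalar_value in this sketch, which is the point of the
bikeshed remark above.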

The notion of character repertoire was not useful; that of character set is
sufficient.

The notion of basic source character set could be removed, instead
describing lexing after phase 1 entirely in terms of Unicode. A couple of
library functions would have to be reworded, as well as a note in the
description of user-defined literals, as they use "basic source character
set" as a proxy to describe something else.

In particular, it is useful to separate entirely the notions of source
encoding (which only exists in phase 1), internal representation, and
literal encodings. These are 3 distinct and unrelated categories of
character sets and encodings, which should have no relation to each other
beyond the existence of a unidirectional mapping from source to internal
and from internal to literal, so I think it would be valuable not to
describe them in terms of each other.

It is useful to be able to talk about the Unicode character set rather than
"the character set described in ISO/IEC 10646".
The U+xxxx notation (plus Unicode character names) is also useful to
describe specific code points in the grammar.

Similarly, the basic execution character set is not a very useful notion as
it is only used as a mechanism to describe which characters are in the
execution and execution wide character sets.
While I didn't try to do it, I think it makes sense to rename execution
character set to something like narrow/wide literal character sets, in the
vein of what P1859 proposes.

It is useful to be able to talk about both literal encoding and literal
character sets for each type of literal (a given encoding
implicitly represents a character set).
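
As an illustration of why each kind of literal needs its own encoding, the
sketch below counts the code units that different string literals use for
the same code point. The helper code_unit_count is a hypothetical name. The
assertions cover only the u8, u and U literals, whose encodings are fixed
as UTF-8, UTF-16 and UTF-32. The narrow and wide literal encodings are
implementation-defined, so nothing is asserted about them.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null code unit.
template <class CharT, std::size_t N>
constexpr std::size_t code_unit_count(const CharT (&)[N]) {
    return N - 1;
}

// The same code point, U+00E9, under three literal encodings.
static_assert(code_unit_count(u8"\u00E9") == 2, "UTF-8: two code units");
static_assert(code_unit_count(u"\u00E9") == 1, "UTF-16: one code unit");
static_assert(code_unit_count(U"\u00E9") == 1, "UTF-32: one code unit");

// A code point outside the BMP needs a surrogate pair in UTF-16.
static_assert(code_unit_count(u"\U0001F600") == 2, "UTF-16: two code units");

// code_unit_count("\u00E9") and code_unit_count(L"\u00E9") depend on the
// implementation-defined narrow and wide literal encodings.
```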

The notion of dynamic encoding proposed by P1859 and its relation to the
literal encoding are not needed in lex and might be better described in
library, although a note in lex might not hurt.

While I have not done that work yet, it seems useful to describe in the
grammar, in terms of Unicode code points, what constitutes whitespace as
well as a new line.
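
A rough sketch of what such a description could look like when written as
code point predicates. The function names are hypothetical, and the set of
code points is an assumption: it simply mirrors the characters [lex]
currently treats as whitespace (space, horizontal tab, vertical tab, form
feed, new-line). It is not proposed wording.

```cpp
#include <cstdint>

// Hypothetical predicates over Unicode code points. The chosen set is an
// assumption for illustration, not wording from P1859 or the standard.
constexpr bool is_new_line(std::uint32_t cp) {
    return cp == 0x000A;  // U+000A LINE FEED
}

constexpr bool is_whitespace(std::uint32_t cp) {
    return cp == 0x0009   // U+0009 CHARACTER TABULATION
        || cp == 0x000B   // U+000B LINE TABULATION
        || cp == 0x000C   // U+000C FORM FEED
        || cp == 0x0020   // U+0020 SPACE
        || is_new_line(cp);
}

static_assert(is_whitespace(0x0020), "");
static_assert(is_whitespace(0x000A) && is_new_line(0x000A), "");
// U+00A0 NO-BREAK SPACE is outside this assumed set.
static_assert(!is_whitespace(0x00A0), "");
```

Whether other code points (for example U+000D or U+2028) belong in either
set is exactly the kind of question the wording would have to settle.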

With the exception of "character literal" (and "abstract character"), it
seems valuable to systematically replace the use of the vacuous term
"character" in the core wording.
That might be slightly more involved in the library, as "character" is used
all over the place, usually to mean "code unit".

The PDF attached is meant to be illustrative of the scope of changes in the
core wording, and also contains a number of design changes that are
mostly out of scope of the terminology discussion (it is also full of
bugs). These design changes will appear in a paper in more detail soon™.

It notably incorporates changes from P2029 which go a long way in improving
the way character literals are described.

Hope that helps,
Corentin


> Tom.
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2020-06-09 11:06:10