I was unclear, unfortunately. The standard term "basic source character set" names the only source character set. The other things representing characters that will be translated are escape sequences and universal character names, both of which are sequences of basic source characters. There are a few places in the standard that mention "source character set" without the "basic" qualification, but it's pretty clear that the same set is meant.
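To make that concrete, here is a minimal sketch that checks that the spelling of a universal character name uses only basic source characters. It assumes the 96-character list from [lex.charset]p1 as worded in C++17/C++20; the names `basic_source_characters`, `is_basic_source_character`, and `all_basic` are hypothetical helpers for illustration, not anything from the standard.

```cpp
#include <cassert>
#include <string_view>

// Assumed list: the 96 basic source characters of [lex.charset]p1
// (C++17/C++20 wording): 5 whitespace characters, 52 letters,
// 10 digits, and 29 punctuation characters.
inline constexpr std::string_view basic_source_characters =
    " \t\v\f\n"
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'";
static_assert(basic_source_characters.size() == 96);

// Hypothetical helper: is c a member of the basic source character set?
constexpr bool is_basic_source_character(char c) {
    return basic_source_characters.find(c) != std::string_view::npos;
}

// Hypothetical helper: is every character of a spelling a basic source character?
constexpr bool all_basic(std::string_view spelling) {
    for (char c : spelling)
        if (!is_basic_source_character(c)) return false;
    return true;
}

// The six characters \ u 0 0 E 9 that spell a universal character name.
static_assert(all_basic("\\u00E9"));
// The two characters \ n that spell an escape sequence.
static_assert(all_basic("\\n"));
// '@' is not in the C++17/C++20 basic source character set.
static_assert(!is_basic_source_character('@'));
```

The checks are done at compile time, so a translation unit containing this block only compiles if the spellings really are made of basic source characters.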

The execution character set, on the other hand, comes in two forms: the basic execution character set, which corresponds to the basic source character set, and an extended superset, which is often referred to without qualification.

Also, "source character set" is not the same as what's actually in a source file, which leads to further confusion. That mapping happens in translation phase 1. If we were to switch to a code point basis, this would be a decode from the source file encoding to code points, plus some 'implementation-defined' handwaving to preserve wetware implementations.
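As an illustration of what "a decode from source file encoding to code points" could look like, here is a minimal sketch that assumes the source file encoding is UTF-8. That choice is an assumption for the example only (the real mapping is implementation-defined), and `decode_utf8` is a hypothetical name.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical phase 1 step, assuming a UTF-8 source file: decode bytes
// into code points. Returns std::nullopt for malformed input (bad lead or
// continuation byte, truncation, overlong form, surrogate, or > U+10FFFF).
std::optional<std::vector<char32_t>> decode_utf8(std::string_view bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        char32_t min = 0;  // smallest code point allowed for this length
        if (lead < 0x80)                { len = 1; cp = lead;        min = 0x0; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min = 0x10000; }
        else return std::nullopt;  // stray continuation byte or invalid lead

        if (bytes.size() - i < len) return std::nullopt;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return std::nullopt;  // not a continuation
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return std::nullopt;  // overlong, out of range, or surrogate
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

For example, decoding the bytes `61 C3 A9` yields the two code points U+0061 and U+00E9. Other source encodings would need their own decoder, which is where the implementation-defined part comes in.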

And C++26 because I think it will take more than a couple of years to understand all the implications of switching from "source characters" and "universal character names" to "code points", at least unless we just start over, which is equally scary.

On Wed, May 27, 2020 at 1:36 AM Tom Honermann <tom@honermann.net> wrote:

On 5/27/20 12:08 AM, Steve Downey via SG16 wrote:

With respect to P1859:
-Basic source character set
-: The abstract characters that must be representable in the _character set_ used for source code
+Source character set
+: The abstract characters that must be representable in the internal _character set_ used after phase 1 of translation. All characters not in the source character set are converted to universal-character-names, which are made up of characters from the basic character set. The abstract parser only sees characters in the source character set.

There is no "Basic" source character set. There is the character set that the lexer and parser use, which is available after the implementation-defined conversion from whatever was presented as source.
I don't think anyone understands that, outside CWG.
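A rough sketch of the model described in the proposed definition above: code points outside the basic set are respelled as universal-character-names, so later phases only see basic characters. The function names are hypothetical, the basic set is the C++17/C++20 list from [lex.charset]p1, the host is assumed to use an ASCII-compatible encoding for basic characters, and the sketch ignores the standard's restrictions on which universal-character-names are permitted where.

```cpp
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical helper: spell one code point as a universal-character-name,
// using \uXXXX for the BMP and \UXXXXXXXX otherwise.
std::string spell_as_ucn(char32_t cp) {
    char buf[11];  // "\UXXXXXXXX" plus the terminating NUL
    if (cp <= 0xFFFF)
        std::snprintf(buf, sizeof buf, "\\u%04lX", static_cast<unsigned long>(cp));
    else
        std::snprintf(buf, sizeof buf, "\\U%08lX", static_cast<unsigned long>(cp));
    return buf;
}

// Hypothetical model of the end of phase 1: basic characters pass through,
// everything else becomes a universal-character-name.
// Assumes basic characters have their ASCII values on the host.
std::string to_internal_spelling(std::u32string_view code_points) {
    constexpr std::string_view basic =
        " \t\v\f\n"
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789"
        "_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'";
    std::string out;
    for (char32_t cp : code_points) {
        const bool is_basic =
            cp < 0x80 &&
            basic.find(static_cast<char>(cp)) != std::string_view::npos;
        if (is_basic)
            out += static_cast<char>(cp);
        else
            out += spell_as_ucn(cp);
    }
    return out;
}
```

With this model, the code points of `café` come out as the ten characters `caf\u00E9`, all of them basic.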

I'm not sure I'm following. The standard does define a "basic source character set" in [lex.charset]p1. Are you proposing that it be renamed to just "source character set"?

I think we will need a term for the encoding of source files. We could use source file encoding for that, but I'm a little concerned about these two terms being confused.

I still believe it would be useful to have more precision around the values emitted into narrow, wide, and uN literals from the execution character set, and around what happens when that fails.

I think P2029 may address that.
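For reference, a small sketch of what is already pinned down today: the code units of `u8`, `u`, and `U` string literals follow from UTF-8, UTF-16, and UTF-32, whereas the values in narrow and wide literals depend on the implementation's execution character sets, so the sketch makes no claims about them. `code_unit_count` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// not counting the terminating null.
template <class CharT, std::size_t N>
constexpr std::size_t code_unit_count(const CharT (&)[N]) {
    return N - 1;
}

// u8 literal: U+00E9 is the two UTF-8 code units C3 A9.
// (The element type is char in C++17 and char8_t in C++20.)
static_assert(code_unit_count(u8"\u00E9") == 2);
static_assert(static_cast<unsigned char>(u8"\u00E9"[0]) == 0xC3);
static_assert(static_cast<unsigned char>(u8"\u00E9"[1]) == 0xA9);

// u literal: U+1F600 is the UTF-16 surrogate pair D83D DE00.
static_assert(code_unit_count(u"\U0001F600") == 2);
static_assert(u"\U0001F600"[0] == 0xD83D);
static_assert(u"\U0001F600"[1] == 0xDE00);

// U literal: one UTF-32 code unit per code point.
static_assert(code_unit_count(U"\U0001F600") == 1);
static_assert(U"\U0001F600"[0] == 0x1F600);

// Narrow "\u00E9" and wide L"\u00E9" are deliberately not checked here:
// their values are implementation-defined.
```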

Perhaps for 26 we could rewrite entirely in terms of processing code points and occasionally "original spelling". It would be nice if the logical model were closer to what the physical model is.

Why wait for 26? :)

Tom.

On Tue, May 26, 2020 at 1:37 PM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:

This is your friendly reminder that an SG16 telecon will be held tomorrow, Wednesday May 27th, at 19:30 UTC. To attend, visit https://bluejeans.com/140274541 at the start of the meeting.

Steve will circulate a draft revision of P1949 on the SG16 mailing list today.

The agenda for the meeting is:

* D1949R4: C++ Identifier Syntax using Unicode Standard Annex 31
  * Review updates since the April 22nd review.
* Discuss terminology updates to strive for in C++23
  * P1859R0: Standard terminology character sets and encodings
  * Establish priorities for terms to address.
  * Establish a methodology for drafting wording updates.

Anticipated decisions to be made at this meeting include:

* Whether to forward the new draft revision of P1949 to EWG.

Prior to tomorrow's meeting, please:

* review Steve's draft revision.
* review P1859R0, particularly the proposed terminology.
* think of other terminology changes to be considered.
* think of how we can divide up the work for making terminology updates.

Tom.

--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16