On Tue, Jun 9, 2020 at 4:59 PM Corentin Jabot <corentinjabot@gmail.com> wrote:

On Tue, 9 Jun 2020 at 22:17, Hubert Tong <hubert.reinterpretcast@gmail.com> wrote:
On Tue, Jun 9, 2020 at 1:01 PM Corentin Jabot via SG16 <sg16@lists.isocpp.org> wrote:

On Tue, 9 Jun 2020 at 18:45, Steve Downey <sdowney@gmail.com> wrote:
One thing I have realized while working on identifiers is that after conversion from whatever the sources are, lexing and parsing are symbolic. That is, 'a' doesn't have a value until it's rendered into a literal. So the sentence "The values of the members of the execution character sets and the sets of additional members are locale-specific." (http://eel.is/c++draft/lex.charset#3.sentence-5) really only comes into play when rendering the "execution character set" into characters or strings. The execution character set and the source character set exist in the same logical space right now, and the "source character set" isn't what is in source files today.
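
To make the "symbolic until rendered" point concrete, here is a minimal sketch, assuming a C++11 or later compiler. The function names are made up for illustration. The value of U'a' is fixed as the code point U+0061, while the value of 'a' is whatever the implementation's narrow literal encoding assigns (0x61 with an ASCII-compatible encoding, 0x81 with EBCDIC code page 037).

```cpp
// Hypothetical helper names; only the literal semantics come from the standard.

// The abstract character 'a' rendered as a char32_t literal:
// the value is always its code point, U+0061.
constexpr unsigned long utf32_value_of_a() {
    return static_cast<unsigned long>(U'a');
}

// The same abstract character rendered as an ordinary (narrow) literal:
// the value depends on the implementation's narrow literal encoding.
constexpr unsigned long narrow_value_of_a() {
    return static_cast<unsigned long>(static_cast<unsigned char>('a'));
}

static_assert(utf32_value_of_a() == 0x61UL, "U'a' is the code point U+0061");
```

Only the first function has a portable value; the second is where the "execution character set" actually enters the picture.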

Yep, and they don't have to have a value either. Identifiers are not sorted, etc.
Everything in lex is symbolic anyway; the phases don't exist in practice.
However, the internal representation being isomorphic to Unicode, it would be possible to describe it in terms of Unicode with no observable behavior change.
I would like to allow characters not present in Unicode within character literals, string literals, comments, and header names. More abstractly, I would like to allow source -> encoding-used-for-output conversion.

Do you have an example of a use case you want to support?

I am still evaluating the round-trip mapping for EBCDIC.

There are 3 scenarios:
The character exists in no digital encoding yet - that is the paper implementation case - and there is nothing that we can do. You can't have Klingon in your C++.
The character exists in a digital encoding but not in Unicode. This covers a small number of Big5 characters, almost all pertaining to place and people names. Unicode documents a mapping for Windows's Big5 code page.
The character has a non-unique mapping to Unicode, such that a conversion source -> Unicode -> execution might be different from a conversion source -> execution. In this case an implementation can convert source -> execution directly (taking care of UCNs and other escape sequences), as it is otherwise not observable. This use case is actually common and important, notably for Shift-JIS and ambiguities introduced by Han Unification.
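
The third scenario can be sketched with a toy model. Everything below is invented for illustration: the byte values 0xA1 and 0xB1, the choice of U+FFE2, and all function names are placeholders, not entries from any real Shift-JIS or Big5 table. The sketch assumes the source and execution encodings are the same legacy encoding, in which two distinct bytes map to one Unicode code point.

```cpp
// Toy model only: all byte values and mappings here are hypothetical.

// Source encoding -> Unicode. Two distinct source bytes collapse
// onto the same code point (the non-unique mapping).
constexpr char32_t source_to_unicode(unsigned char b) {
    return b < 0x80 ? static_cast<char32_t>(b)
         : (b == 0xA1 || b == 0xB1) ? static_cast<char32_t>(0xFFE2)
         : static_cast<char32_t>(0xFFFD);  // unmapped in this toy table
}

// Unicode -> execution encoding. The code point has one preferred byte.
constexpr unsigned char unicode_to_execution(char32_t c) {
    return c < 0x80 ? static_cast<unsigned char>(c)
         : c == 0xFFE2 ? static_cast<unsigned char>(0xA1)
         : static_cast<unsigned char>('?');  // unmapped in this toy table
}

// Conversion that pivots through Unicode.
constexpr unsigned char source_to_execution_via_unicode(unsigned char b) {
    return unicode_to_execution(source_to_unicode(b));
}

// Direct conversion: same encoding on both sides, so bytes pass through.
constexpr unsigned char source_to_execution_direct(unsigned char b) {
    return b;
}

static_assert(source_to_execution_direct(0xB1) == 0xB1,
              "direct conversion preserves the original byte");
static_assert(source_to_execution_via_unicode(0xB1) == 0xA1,
              "pivoting through Unicode picks the preferred byte instead");
```

In this model the two paths disagree only for the duplicated byte, which is why converting directly is an option for an implementation when the difference matters.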

This last case is more broken by applying normalization. Do you have an example where the mapping does not work even if normalization is not applied?

aka 'a' doesn't have a value, but it is still the 'a' abstract character, which is represented by U+0061 in Unicode.

I believe that sentence to be, however, very misleading.
The execution encodings are implementation-defined rather than locale-specific.
It becomes locale-specific at runtime, and the wording doesn't distinguish at all between compile time and runtime.

And yeah, source character set is... the minimal subset of the internal representation character set

On Tue, Jun 9, 2020 at 12:03 PM Corentin Jabot <corentinjabot@gmail.com> wrote:

On Tue, 9 Jun 2020 at 16:48, Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:

This is your friendly reminder that an SG16 telecon will be held tomorrow, Wednesday June 10th, at 19:30 UTC (timezone conversion). To attend, visit https://bluejeans.com/140274541 at the start of the meeting.

The agenda for the meeting is:

Discuss terminology updates to strive for in C++23

P1859R0: Standard terminology character sets and encodings

Establish priorities for terms to address.

Establish a methodology for drafting wording updates.

Anticipated decisions to be made at this meeting include:

Prioritization of terminology updates to pursue.

Prior to tomorrow's meeting, please:

review P1859R0, particularly the proposed terminology.

think of other terminology changes to be considered.

think of how we can divide up the work for making terminology updates.
Hey!
Some feedback on P1859 after a first attempt at rewording the standard.

I will start by saying that it seems entirely reasonable and useful to rewrite [lex] in terms of this new
terminology, and I think that trying to split that work would end up being counterproductive (however, the library wording, which has its own definitions, could be reworded
independently). It is not that much work and I'm willing to do that work.

I found that I needed to use the following terms as defined by the Unicode Standard

* abstract character
* character set
* character encoding
* code units, codepoint

(we can bikeshed codepoint vs scalar values in the grammar as UCNs are technically scalar values)
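
For reference, the codepoint vs scalar value distinction in the Unicode Standard comes down to the surrogate range. A minimal sketch, with function names chosen here for illustration:

```cpp
// Function names are illustrative; the ranges are those of the Unicode Standard.

// A code point is any value in the Unicode codespace, 0..0x10FFFF.
constexpr bool is_code_point(char32_t c) {
    return c <= 0x10FFFF;
}

// Surrogate code points are reserved for UTF-16 and are not scalar values.
constexpr bool is_surrogate(char32_t c) {
    return c >= 0xD800 && c <= 0xDFFF;
}

// A scalar value is any code point that is not a surrogate.
constexpr bool is_scalar_value(char32_t c) {
    return is_code_point(c) && !is_surrogate(c);
}

static_assert(is_code_point(0xD800) && !is_scalar_value(0xD800),
              "a surrogate is a code point but not a scalar value");
```

A UCN cannot name a surrogate, which is why "scalar value" is the technically accurate term for what a UCN designates.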

The notion of character repertoire was not useful, that of character set is sufficient.

The notion of basic source character set could be removed, instead describing lexing after phase 1 entirely in terms of Unicode - a couple of library functions would have to be reworded, as well as a note in the description of user defined literals as they use "basic source character set" as a proxy to describe something else.

In particular, it is useful to separate entirely the notions of source encoding (which only exists in phase 1), internal representation, and literal encodings. These are 3 distinct and unrelated categories of character sets and encodings, which should have no relation to each other beyond the existence of a unidirectional mapping from source to internal and from internal to literal, so I think it would be valuable not to describe them in terms of each other.

It is useful to be able to talk about the Unicode character set rather than "the character set described in ISO/IEC 10646".
The U+xxxx notation (+ Unicode character names) is also useful to describe specific codepoints in the grammar.

Similarly, the basic execution character set is not a very useful notion, as it is only used as a mechanism to describe which
characters are in the execution and execution wide character sets.
While I didn't try to do it, I think it makes sense to rename the execution character sets to something like narrow/wide literal character sets, in the vein of what P1859 proposes.

It is useful to be able to talk about both literal encoding and literal character sets for each type of literal (a given encoding implicitly represents a character set).

The notion of dynamic encoding proposed by P1859 and its relation to the literal encoding are not needed in lex and might be better described in library, although a note in lex might not hurt.

While I have not done that work yet, it seems useful to describe in the grammar, in terms of Unicode codepoints, what constitutes whitespace as well as a new line.
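
As a rough idea of what that could look like, here is a sketch that spells out the five whitespace characters the standard currently names in prose (space, horizontal tab, new-line, vertical tab, form feed) as codepoints. The function names are hypothetical, and treating exactly this set as whitespace is an assumption; it is not proposed wording.

```cpp
// Hypothetical names; the set below is an assumption based on the five
// whitespace characters the current wording lists in prose.

// New-line after phase 1: U+000A LINE FEED.
constexpr bool is_new_line(char32_t c) {
    return c == 0x000A;
}

// Whitespace: U+0020 SPACE, U+0009 CHARACTER TABULATION,
// U+000B LINE TABULATION, U+000C FORM FEED, plus new-line.
constexpr bool is_whitespace(char32_t c) {
    return c == 0x0020 || c == 0x0009 || c == 0x000B || c == 0x000C
        || is_new_line(c);
}

static_assert(is_whitespace(U' ') && !is_whitespace(U'a'),
              "space is whitespace, a letter is not");
```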

With the exception of "character literal" (and "abstract character"), it seems valuable to systematically replace the use of the vacuous term "character" in the core wording.
That might be slightly more involved in the library, as "character" is used all over the place, usually to mean "code unit".
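
A small illustration of why "character" and "code unit" are not interchangeable, assuming a C++11 or later compiler. The helper name is made up for this example; the counts follow from UTF-8, UTF-16 and UTF-32.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_unit_count(const CharT (&)[N]) {
    return N - 1;
}

// One abstract character each, but a varying number of code units.
static_assert(code_unit_count(u8"\u00E9") == 2,
              "U+00E9 takes two UTF-8 code units");
static_assert(code_unit_count(u"\U0001F600") == 2,
              "U+1F600 takes a surrogate pair in UTF-16");
static_assert(code_unit_count(U"\U0001F600") == 1,
              "U+1F600 is a single UTF-32 code unit");
```

Library functions that count or index "characters" are counting what this helper counts.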

The pdf attached is meant to be illustrative of the scope of changes in the core wording, and also contains a number of design changes that are
mostly out of scope of the terminology discussion (it is also full of bugs). These design changes will appear in a paper in more detail soon™

It notably incorporates changes from P2029 which go a long way in improving the way character literals are described.

Hope that helps,
Corentin

Tom.
--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16
