Hello, 
Thanks Jens, it feels like we are making progress!
A few comments below.

On Sat, 20 Jun 2020 at 09:31, Jens Maurer via SG16 <sg16@lists.isocpp.org> wrote:

I've had a look at the C99 rationale (thanks to Hubert for the hint)
with respect to handling non-basic characters in the early
translation phases.

http://www.open-std.org/jtc1/sc22/wg14/www/C99RationaleV5.10.pdf
section 5.2.1

The terminology used is a bit outdated, for example the term
"collation sequence" appears to refer to "code point", but the
choice of options seems informative:

A. Convert everything to UCNs in basic source characters as soon as possible, that is, in
translation phase 1.
B. Use native encodings where possible, UCNs otherwise.
C. Convert everything to wide characters as soon as possible using an internal encoding that
encompasses the entire source character set and all UCNs.


C++ has chosen model A, C has chosen model B.
The express intent is that which model is chosen is unobservable
for a conforming program.

Problems that will be solved with model B:
 - raw string literals don't need some funny "reversal"
 - stringizing can use the original spelling reliably
 
 - fringe characters / encodings beyond Unicode can be transparently passed
through string literals

There is no such thing :)
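
On the stringizing point above, a small sketch of what model B buys us (hypothetical example, UTF-8 source assumed):

    #define STR(x) #x
    // Under model A, 'café' is notionally rewritten to 'caf\u00e9' in
    // phase 1, and the # operator must undo that rewrite to recover the
    // spelling actually written. Under model B the spelling is simply
    // kept as-is.
    const char a[] = STR(café);      // "café"
    const char b[] = STR(caf\u00e9); // spelling preserved as "caf\u00e9";
                                     // the UCN is converted later, so this
                                     // is also "café" at run time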
 

In short, C++ should switch to a model B', omitting any mention of "encoding"
or "multibyte characters" for the early phases.

Details:

 - Define "source character set" as having the following distinct elements:

    * all Unicode characters (where character means "as identified by a code point")

    * invented/hypothetical characters for all other Unicode code points
(where "Unicode code point" means integer values in the range [0..0x10ffff],
excluding [0xd800-0xdfff])
Rationale: We want to be forward-compatible with future Unicode standards
that add more characters (and thus more assigned code points).

    * an implementation-defined set of additional elements
(this is empty in a Unicode-only world)

Again, two issues:
  * This describes an internal encoding, not a source encoding. We should not talk about "source" past phase 1.
  * There is no use case for a superset of Unicode. I described the EBCDIC control character issue to the Unicode mailing list, and it was qualified as "daft". Every character that a C++ compiler ever has, does, or will care about has a mapping to Unicode.
 
 - Define "basic source character set" as a subset of the "source character set"
with an explicit list of Unicode characters.

There is no need for that construct. I would actually prefer we don't try to define the execution character set in terms of a basic one that is tied to the internal
representation. We do, however, need to specify that the grammar uses characters from Basic Latin, in case anybody is confused.

Strongly agreed about being explicit about Unicode characters!

 - Translation phase 1 is reduced to

"Physical source file characters are mapped, in an implementation-defined manner,
to the <del>basic</del> source character set (introducing new-line characters for
end-of-line indicators) if necessary. The set of physical source file characters
accepted is implementation-defined."

I think "introducing new-line characters for end-of-line indicators" is confusing with respect to the reversal in raw string literals.
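
A sketch of the interaction, as I read the current wording:

    // Transformations from phases 1 and 2 are reverted inside raw string
    // literals, except that end-of-line indicators stay mapped to
    // new-line characters:
    const char* a = R"(line1
    line2)";   // contains a plain new-line between the two lines, even
               // if the source file used CR LF end-of-line indicators
    const char* b = R"(back\
    slash)";   // line splicing IS reverted here: the backslash and the
               // new-line both remain part of the literal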
 
 - Modify the "identifier" lexing treatment to handle (non-basic)
source characters and equivalent UCNs the same; we can't fold
UCNs to source characters just yet because of preprocessor
stringizing, which wants to recover the "original spelling".

That's a good point I hadn't considered.
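
To spell out the constraint we need to keep (sketch, assuming the implementation accepts é in identifiers):

    int caf\u00e9 = 0;
    int* p = &café;   // OK today: both spellings name the same identifier.
                      // The new treatment must keep this equivalence while
                      // still letting # recover whichever spelling was used.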
 
 - Add a new phase 4+ that translates UCNs everywhere except
in raw string literals to (non-basic) source characters.
(This is needed to retain the status quo behavior that a UCN
cannot be formed by concatenating string literals.)

Is there value in not doing it explicitly for identifiers and string literals?
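
For reference, the status quo that the new phase 4+ is preserving (sketch):

    const char* a = "\u00e9";   // a single UCN: é
    // UCNs are recognized before adjacent string literals are
    // concatenated, so a UCN can never be assembled from pieces:
    // "\u00" "e9" does not form \u00e9 (and "\u00" on its own is not
    // even a valid escape sequence).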
 

 - Reverse the order of translation phases 5 and 6: We should concatenate
string literals first so that (e.g.) combining marks are actually next
to the character they apply to before converting to the execution
encoding.  For example, in string literals, we want to allow Latin-1
encoding of umlauts expressed as a Unicode base vowel plus combining mark,
if an implementation so chooses. 

+1. (I talk about that issue in https://wg21.link/p2178r0)

> For example, in string literals, we want to allow Latin-1 encoding of umlauts expressed as a Unicode base vowel plus combining mark, if an implementation so chooses. 

I think people on the mailing list agreed that each individual c-char should be encoded independently (I thought that was your opinion too?), and I have come to agree with that.
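
To make the ordering issue concrete (sketch, assuming a Latin-1 literal encoding):

    // With concatenation done before encoding, the base letter and the
    // combining mark are adjacent before conversion, so an implementation
    // could, if it chose, encode the pair as a single byte:
    const char* s = "u" "\u0308";  // 'u' + COMBINING DIAERESIS (U+0308)
    // ...conceivably encoded as the one Latin-1 code 0xFC ('ü'); encoding
    // each c-char independently would rule that out.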

 

 - In phase 5, we should go to "literal encoding" right away:
There is no point in discussing a "character set" here; all
we're interested in is a (sequence of) integer values that end
up in the execution-time scalar value or array object corresponding
to the source-code literal.

Yep, agreed, as long as you find a way to describe that the encoding preserves semantics.
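
Concretely (sketch; the exact values are of course implementation-defined):

    const char s[] = "é";
    // Under a UTF-8 literal encoding:   sizeof(s) == 3 (0xC3 0xA9, NUL)
    // Under a Latin-1 literal encoding: sizeof(s) == 2 (0xE9, NUL)
    // All [lex] needs to pin down is which integer values end up in s.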
 

 - Any mention of "locale-dependent" during compilation should
be removed: Either this is subsumed by "implementation-defined"
in phase 1, or it's a concept referring to the runtime locale,
which is purely a library I/O matter.

:)
 

 - Carefully review [lex] and [cpp] for further fall-out adjustments.
The trouble is that several papers addressing [lex] are in flight,
for example P2029, which doesn't help contain the conflicts.

Yep, organizationally this is a bit of a nightmare :(
 

This approach does fix the UCN reversal in raw string literals, but does
not fix the line splicing reversal for same.  The latter is a separate
can of worms, in my view.
 
A much less confusing can of worms.

As a matter of editorial clarity, we should use the prefix "Unicode" for
any term we intend to use unmodified from the Unicode standard,
e.g. "Unicode code point".

We should never use the term "code point" without reference to an explicit character set, nor "code unit" without reference to an explicit encoding.
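
E.g. (sketch, using C++20 char8_t): U+00E9 is one Unicode code point, but its code unit count depends entirely on the encoding:

    const char8_t  a[] = u8"\u00e9"; // two UTF-8 code units (+ NUL): sizeof(a) == 3
    const char16_t b[] = u"\u00e9";  // one UTF-16 code unit (+ NUL)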
 
If the term "character set" is too loaded and transports more meaning
than the intended "(abstract) set of (abstract) characters", [lex.charset]
needs a larger rewrite.  I'm not sold on that.

No, it's fine
 

Jens