On 20/06/2020 10.05, Corentin Jabot wrote:
> On Sat, 20 Jun 2020 at 09:31, Jens Maurer via SG16 wrote:
> section 5.2.1
> A. Convert everything to UCNs in basic source characters as soon as possible, that is, in
> translation phase 1.
> B. Use native encodings where possible, UCNs otherwise.
> C. Convert everything to wide characters as soon as possible using an internal encoding that
> encompasses the entire source character set and all UCNs.
> C++ has chosen model A, C has chosen model B.
> The express intent is that which model is chosen is unobservable
> for a conforming program.
> Problems that will be solved with model B:
> - raw string literals don't need some funny "reversal"
> - stringizing can use the original spelling reliably
> - fringe characters / encodings beyond Unicode can be transparently passed
> through string literals
> There is no such thing :)
That seems a point of disagreement. I thought Tom had a list of situations
quite recently where "Unicode alone" isn't enough, for example for certain
Big5 characters or Shift-JIS encodings, if you are in a world that just
wants transparent pass-through in string literals.
I agree on supporting transparent pass-through (note that not all compilers do that)
There are 3 different things here, which should not be mixed up:
- Can a character be represented in the Unicode code space? The exception is a small number of Big5 characters that no compiler supports (there exist many character sets called "Big5"; Windows' "Big5" code page has a complete mapping to Unicode, as does HKSCS, and both are widely used).
- Can a character be represented in the Unicode code space by a character that has the exact same semantics? We can argue whether that is the case for a subset of EBCDIC control characters that map to Unicode code points whose semantics are "application defined", but there is a mapping. The wider point is that Unicode control characters do not have semantics of their own, and in that case Unicode acts as a pass-through. The scenario in which an implementation wants to support the semantics of control characters from multiple source character sets is exotic at best. Nevertheless, the good people of Unicode pointed me towards http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-048.pdf which describes a general mechanism for encoding arbitrary numbers of control character sequences, should an implementation want to do that.
- Can a character UNIQUELY map TO Unicode, and can a character UNIQUELY map FROM Unicode? This is not the case for, for example, Shift-JIS, which for historical reasons has duplicated characters; Unicode itself also has duplicate characters. Because of this duplication, a naive implementation that follows the phases in order and does a verbatim mapping source -> Unicode -> execution may lose information (namely, which of the duplicated characters was used). The information lost is the byte value rather than the semantics.
There is a very important question here:
- Should the standard mandate that byte values are preserved? (I think this would put severe constraints on implementations.)
If we do believe that, then talking about source encoding from phase 1 through phase 5 is useful. Otherwise, if we believe that preservation of byte values is a matter of QoI, an implementation can choose any path through phases 1 to 5 as long as the _semantics_ are preserved.
In phase 1, if one source character can be represented by more than one Unicode code point sequence, an implementation can choose which (in a scenario where phase 1 is semantics-preserving, which we haven't decided to do yet). For example, Å (ANGSTROM SIGN) can become either U+212B or U+00C5 in phase 1.
Similarly, ≒ (U+2252, APPROXIMATELY EQUAL TO OR THE IMAGE OF) can be encoded in phase 5 as either 0x8790 or 0x81e0; an implementation can choose which.
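Both directions of duplication are easy to observe. A small Python sketch (cp932 is Microsoft's Shift-JIS variant; Python is used here simply because it ships a codec for it):

```python
import unicodedata

# One character, two code point sequences: U+212B (ANGSTROM SIGN) is
# canonically equivalent to U+00C5; NFC maps the former onto the latter.
assert unicodedata.normalize("NFC", "\u212b") == "\u00c5"

# One code point, two byte sequences: both 0x81E0 (the JIS X 0208
# position) and 0x8790 (the NEC duplicate) decode to U+2252 ...
assert b"\x81\xe0".decode("cp932") == "\u2252"
assert b"\x87\x90".decode("cp932") == "\u2252"

# ... but re-encoding has to pick one, so a verbatim
# source -> Unicode -> execution pipeline loses the original byte value.
assert "\u2252".encode("cp932") == b"\x81\xe0"
```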
If the internal character set we choose in the standard is Unicode, the wording would lose the source information, so we would not be able to prescribe byte values, but an implementation could preserve them if it desired.
Overall the source question can be abstracted away:
Given the string literal "\u2252", assuming a Shift-JIS narrow encoding, what should its byte value in the program be?
Also, C seems to support such transparent pass-through, and I think there
is value in keeping the C and C++ lexing behaviors as close together as possible.
> In short, C++ should switch to a model B', omitting any mention of "encoding"
> or "multibyte characters" for the early phases.
> - Define "source character set" as having the following distinct elements:
> * all Unicode characters (where character means "as identified by a code point")
> * invented/hypothetical characters for all other Unicode code points
> (where "Unicode code point" means integer values in the range [0..0x10ffff],
> excluding [0xd800-0xdfff])
> Rationale: We want to be forward-compatible with future Unicode standards
> that add more characters (and thus more assigned code points).
> * an implementation-defined set of additional elements
> (this is empty in a Unicode-only world)
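As an aside, the quoted code point range can be probed mechanically. A small Python sketch (Python used here only as a convenient way to query Unicode behavior):

```python
# Surrogate code points (U+D800..U+DFFF) are code points but can never be
# encoded as characters: a conforming UTF-8 encoder must reject them,
# which is why the proposed source character set excludes that range.
try:
    "\ud800".encode("utf-8")
    surrogate_rejected = False
except UnicodeEncodeError:
    surrogate_rejected = True
assert surrogate_rejected

# Every other value in [0, 0x10FFFF] round-trips, even currently
# unassigned ones -- the forward compatibility the rationale asks for.
assert chr(0x10FFFD).encode("utf-8").decode("utf-8") == chr(0x10FFFD)
```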
> Again, 2 issues:
> * This describes an internal encoding, not a source encoding. We should not talk about "source" past phase 1
It's still "source code", maybe internally represented, as opposed to
compiled machine code. Given that we already have the term "(basic) source
character set" in the standard, I don't see a need to invent something new.
I'm particularly non-enthused about the phrase "internal encoding" (internal
relative to what?)
The goal is to make it clear that it is not the encoding of source files.
> * There is no use case for a super set of Unicode. I described the EBCDIC control character issue to the Unicode mailing list, it was qualified as "daft".
As I said earlier, it appears that Unicode says that control characters
are essentially out-of-scope for them (which I sympathize with, from their
viewpoint), so I would not turn to Unicode for insight how to handle
EBCDIC control characters that don't have a semantic equivalent in Unicode.
In an EBCDIC-only world, I think there is a real conflict between
an EBCDIC control character mapped to a C1 control character in phase 1
and the presence of a UCN naming that same control character somewhere
in the original source code. The presence of the UCN may or may not
be intentional; I would like to allow implementations to flag this situation.
Their position is that a compiler cannot know what the semantics of a code point that is a C0 or C1
control character are, as those code points have no semantics.
A compiler could flag C0/C1 UCN escape sequences in literals, if it wanted to.
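The "no semantics" claim is visible in the Unicode character database itself; a quick Python check (an illustration, not normative wording):

```python
import unicodedata

# U+0085 (a C1 control, NEL) has general category "Cc" and, like every
# C0/C1 control character, no Unicode character name of its own:
# Unicode deliberately assigns control characters no semantics.
assert unicodedata.category("\u0085") == "Cc"

try:
    unicodedata.name("\u0085")   # raises: control characters are nameless
    has_name = True
except ValueError:
    has_name = False
assert not has_name
```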
And again, I'm trying to be pragmatic here. The work IBM is doing to get Clang to support EBCDIC converts that EBCDIC to UTF-8.
> All characters that a C++ compiler ever has cared about, does care about, or will care about have a mapping to Unicode.
But possibly not a unique mapping from Unicode back to the original character,
which seems useful for transparent string-literal pass-through.
Tom, I think the question whether there should be allowance for pass-through of
characters beyond Unicode should be up for a straw poll at the next telecon so
that we can make progress here.
Again, for the record, my position is "should be allowed, not mandated".
But I agree polling might help.
> - Define "basic source character set" as a subset of the "source character set"
> with an explicit list of Unicode characters.
> There is no need for that construct -
These are the use-cases for the term "basic source character set":
- keywords are spelled in the basic source character set
- basic source characters can be represented in a single byte in plain "char" literals
- UCNs denoting characters in the basic source character set are ill-formed
- Timezone parsing (the table in [time.parse], flag %Z)
- do_widen / do_narrow [locale.ctype.virtuals]
So, the term seems to be useful as a descriptive tool when we're intentionally
referring to that subset.
I agree that there is a need for a term, but many of these can be better described in terms of, for example, a "basic literal character set" or, more accurately, a "basic literal character repertoire".
> I would actually prefer we don't try to define the execution character set in terms of a basic one which is tied to the internal representation.
I don't think the standard needs to talk about an internal "representation",
understood as specific code point values, at all, so I don't see the confusion.
We would be using "source" to refer to something that is not related to the source files.
I don't think we need an execution character set per se, but it seems worthwhile to
be able to say "for this particular small set of ASCII characters, special constraints
for the literal encoding/representation apply".
And I think we might need slightly different definitions for some of the points you cited (which I was aware of; I meant that these things would need to be described differently, not that removing the term would have no ripple effect). My goal is that literals are not defined in terms of the source.
> We need, however, to specify that the grammar uses characters from Basic Latin, in case anybody is confused.
> - Translation phase 1 is reduced to
> "Physical source file characters are mapped, in an implementation-defined manner,
> to the <del>basic</del> source character set (introducing new-line characters for
> end-of-line indicators) if necessary. The set of physical source file characters
> accepted is implementation-defined."
> I think the "introducing new-line characters for end-of-line indicators" is confusing for reversal in raw string literals
I'm defining the new-line question as out-of-scope for this particular discussion.
> - Add a new phase 4+ that translates UCNs everywhere except
> in raw string literals to (non-basic) source characters.
> (This is needed to retain the status quo behavior that a UCN
> cannot be formed by concatenating string literals.)
> Is there value in not doing it for identifiers and string literals explicitly?
"identifier" is ambiguous between phase 4 and phase 7 identifiers.
We can't translate UCNs in phase 4 (due to stringizing), but we want
a single spelling in phase 7 (so no confusion arises what goes into
linker symbols etc). Previously, the single spelling was "UCNs
everywhere"; now, the single spelling is "(extended) characters everywhere".
Right, I guess identifiers can be handled in phase 4+ or 7-.
Which means that pp-identifiers and pp-tokens can be composed of both UCN escape sequences and XID_Start/XID_Continue code points.
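As an aside, whether a code point is XID_Start/XID_Continue can be probed with Python, whose identifier grammar also follows UAX #31 (an analogy for illustration, not the C++ rules themselves):

```python
# str.isidentifier() checks XID_Start for the first code point and
# XID_Continue for the rest, mirroring the pp-identifier shape above.
assert "é".isidentifier()           # a letter: XID_Start
assert "e\u0301".isidentifier()     # U+0301 COMBINING ACUTE ACCENT: XID_Continue
assert not "\u0301".isidentifier()  # ...but not XID_Start
assert not "2x".isidentifier()      # digits cannot start an identifier
```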
For string and char literals, it simplifies the treatment a bit,
because we only have to discuss an abstract "source character"
instead of branching off into "oh, and we're mapping UCNs here"
every so often.
Hm... I'm wondering whether ## token concatenation can form a UCN,
e.g. via bla\ ## u ## 0301 . We should make that ill-formed
or somehow prevent interpretation as a UCN.
#define CONCAT(x,y) x##y
Is valid in all implementations I tested; it is implementation-defined in the standard.
Do you see a reason not to allow it? In particular, as we move UCN handling later
in the process, it would make sense to allow these escape sequences to be created in phases 2 and 4 (this might be evolutionary; there is a paper).
>> For example, in string literals, we want to allow Latin-1 encoding of umlauts expressed as a Unicode base vowel plus combining mark, if an implementation so chooses.
> I think people on the mailing list agreed that individual c-chars should be encoded independently (I thought that was your opinion too?), which I have come to agree with.
Fine with me.
> - In phase 5, we should go to "literal encoding" right away:
> There is no point in discussing a "character set" here; all
> we're interested in is a (sequence of) integer values that end
> up in the execution-time scalar value or array object corresponding
> to the source-code literal.
> Yep, agreed, as long as you find a way to describe that the encoding preserves semantics.
What semantic? A string literal consists of a sequence of source characters.
At the end, we get a sequence of integer values in an array object.
We can certainly weave a "corresponding" into the process, but that's
essentially vacuous handwaving from a normative standpoint.
(The preceding statement only applies to char and wchar_t encodings,
of course, not to the well-defined UTF-x encodings.)
In particular, an implementation can do any conversion it wants, including replacing characters that have no representation with some other character.
This is currently implementation-defined behavior, which I would like to make ill-formed (I know it's evolutionary; there is a paper).
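For comparison, the two possible policies (silent substitution versus rejection) can be modeled with Python's codec error handlers; this is an analogy, not compiler behavior:

```python
# Silent substitution: characters with no representation in the target
# encoding are replaced -- the implementation-defined behavior above.
assert "caf\u00e9".encode("ascii", errors="replace") == b"caf?"

# Rejection: the strict handler corresponds to making it ill-formed.
try:
    "caf\u00e9".encode("ascii")
    strict_rejects = False
except UnicodeEncodeError:
    strict_rejects = True
assert strict_rejects
```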