On Wed, Jun 3, 2020, 05:19 Tom Honermann <tom@honermann.net> wrote:

On 6/2/20 7:57 AM, Corentin Jabot via SG16 wrote:

On Tue, Jun 2, 2020, 13:34 Alisdair Meredith via SG16 <sg16@lists.isocpp.org> wrote:

Translation phase 1 maps source code to either a member of the
basic character set, or a UCN corresponding to that character.
What if there is no such UCN? Is that undefined behavior, or is
the program ill-formed? I can find nothing on this in [lex.phases]
where we describe processing the source through an implementation-defined
character mapping.

When we get to [lex.charset] we can see it is clearly ill-formed if
the produced UCN is invalid - is that supposed to be the resolution
here? Source must always map to a UCN, but the UCN need not
be valid, so we get an error parsing the (implied) UCN in a later
phase?

One more reason I want to rewrite phase 1.

2 things should be specified here:

> Any source file character not in the basic source character set is replaced by the
universal-character-name that designates that character.

This is wrong: a character may map to a sequence of UCNs, not a single UCN.

Good point. This can happen with a few legacy encodings. For example, Big5-HKSCS includes a few characters that each map to a pair of Unicode code points.
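To make the pair case concrete, here is a minimal sketch of how such a character would expand into a UCN sequence. The four byte codes and their code point pairs follow the Big5 index published in the WHATWG Encoding Standard; the function names `decode_hkscs_pair` and `to_ucn_sequence` are hypothetical and exist only for this illustration, not in any compiler.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: returns the code points for the four Big5-HKSCS
// double-byte codes that decode to a base letter plus a combining mark.
// Any other code yields an empty vector in this sketch.
std::vector<char32_t> decode_hkscs_pair(std::uint16_t code) {
    switch (code) {
        case 0x8862: return {0x00CA, 0x0304}; // E circumflex + combining macron
        case 0x8864: return {0x00CA, 0x030C}; // E circumflex + combining caron
        case 0x88A3: return {0x00EA, 0x0304}; // e circumflex + combining macron
        case 0x88A5: return {0x00EA, 0x030C}; // e circumflex + combining caron
        default:     return {};
    }
}

// Spells a code point sequence as universal-character-names, using the
// \uXXXX form for BMP code points and \UXXXXXXXX otherwise.
std::string to_ucn_sequence(const std::vector<char32_t>& code_points) {
    std::string out;
    char buf[16];
    for (char32_t cp : code_points) {
        if (cp <= 0xFFFF) {
            std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(cp));
        } else {
            std::snprintf(buf, sizeof buf, "\\U%08X", static_cast<unsigned>(cp));
        }
        out += buf;
    }
    return out;
}
```

With this sketch, the single source character 0x8862 becomes two UCNs, which is exactly what the "replaced by the universal-character-name" wording cannot express.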

A program containing characters that have no representation in Unicode should be ill-formed, with the caveat that implementers can do _anything_ in phase 0.

Phase 0?

Sorry, let me clarify. Whatever we specify, we can't prevent implementers from doing transformations before phase 1, so phase 1 is mostly guidance.

I don't think there is a way around that, nor should there be

Since source file encoding and phase 1 translation are implementation-defined, I don't think the standard can really say much about this scenario. But it seems pretty clear that, post phase 1, the standard specifies no facilities to handle a non-Unicode character.

Note that the existence of a mapping is different from the validity of a UCN

It is an implementation strategy to map characters without representation to nothing.

Other valid strategies would be to use the PUA to represent these characters

To give you an idea of where I want to be, here is a very early draft of what I think phases 1 and 2 should do, pending a couple of design changes that EWG would have to look at:

1. If the physical source character set is the Unicode character set, each code point in the
source file is converted to the internal representation of that same code point. Codepoints
that are surrogate codepoints or invalid codepoints are ill-formed.
Otherwise, each abstract character in the source file is mapped in an
implementation-defined manner to a sequence of Unicode codepoints representing the same
abstract character (introducing new-line characters for end-of-line indicators if necessary).
An implementation may use any internal encoding able to represent uniquely any Unicode
codepoint. If an abstract character in the source file is not representable in the
Unicode character set, the program is ill-formed.
An implementation supports source files representing a sequence of UTF-8 code units.
Any additional physical source file character sets accepted are implementation-defined.
How the character set of a source file is determined is implementation-defined.

2. Each implementation-defined line termination sequence of characters is replaced by a
LINE FEED character (U+000A). Each instance of a BACKSLASH (\) immediately
followed by a LINE FEED or at the end of a file is deleted, splicing physical source
lines to form logical source lines. Only the last backslash on any physical source line shall
be eligible for being part of such a splice. Except for splices reverted in a raw string literal,
if a splice results in a codepoint sequence that matches the syntax of a
universal-character-name, the behavior is implementation-defined. A source file that is not
empty and that does not end in a LINE FEED, or that ends in a LINE FEED immediately preceded
by a BACKSLASH before any such splicing takes place, shall be processed as if an
additional LINE FEED were appended to the file.
Sequences of whitespace codepoints at the end of each line are removed.
Each universal-character-name is replaced by the Unicode codepoint it designates.
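As an illustration of the "surrogate or invalid codepoints are ill-formed" rule in item 1, here is a minimal sketch, assuming the source has already been decoded into 32-bit values. The names `is_scalar_value` and `first_ill_formed` are hypothetical and are not part of the draft wording.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// True if cp is a Unicode scalar value: within U+0000..U+10FFFF and
// outside the surrogate range U+D800..U+DFFF.
constexpr bool is_scalar_value(std::uint32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Returns the index of the first value that would make the program
// ill-formed under the draft rule, or std::nullopt if every value is
// acceptable. Assumption: `source` holds already-decoded code points.
std::optional<std::size_t> first_ill_formed(const std::vector<std::uint32_t>& source) {
    for (std::size_t i = 0; i < source.size(); ++i) {
        if (!is_scalar_value(source[i])) {
            return i;
        }
    }
    return std::nullopt;
}
```

A real front end would normally perform this check while decoding UTF-8 rather than as a separate pass; the separate function only keeps the rule visible.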

The sentence regarding the removal of trailing whitespace doesn't specify whether the removal occurs for physical or logical lines (before or after splicing).
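The two readings give different results when a backslash is followed by trailing whitespace. Here is a minimal sketch of both orderings, assuming line endings are already LINE FEED and treating only space and tab as whitespace. Both function names are hypothetical, and the sketch ignores raw string literals and the backslash-at-end-of-file case.

```cpp
#include <cstddef>
#include <string>

// Removes runs of spaces and tabs that immediately precede each '\n'.
std::string trim_trailing_whitespace(const std::string& in) {
    std::string out;
    for (char c : in) {
        if (c == '\n') {
            while (!out.empty() && (out.back() == ' ' || out.back() == '\t')) {
                out.pop_back();
            }
        }
        out += c;
    }
    return out;
}

// Deletes each backslash that is immediately followed by '\n', together
// with that '\n', joining physical lines into logical lines.
std::string splice_lines(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\\' && i + 1 < in.size() && in[i + 1] == '\n') {
            ++i; // skip the line feed as well
            continue;
        }
        out += in[i];
    }
    return out;
}
```

For the input `abc\`, two spaces, line feed, `def`, line feed: trimming physical lines first and then splicing yields `abcdef` on one line, whereas splicing first leaves the backslash in place because it is not immediately followed by a line feed, and trimming afterwards produces `abc\` and `def` on separate lines.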

Wording concerns aside, I think it would be helpful to list the intended behavioral changes.

Here: removal of trailing whitespace is mandated, and UTF-8 support is mandated.

Tom.

Corentin

AlisdairM
--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16