On 6/15/20 7:14 AM, Corentin via SG16 wrote:

Hey.

It occurred to me that while we discuss terminology, we have not clearly stated our objectives.

Ultimately, the observed behavior should be identical in most respects to what implementers do today. In particular, we don't want to change the mapping for a well-formed program, to avoid silent changes, and we want to keep supporting the set of source encodings supported by existing implementations.

Agreed.

But (I think) we should

- Tighten the specification to describe a semantic mapping rather than an implementation-defined one, while also making sure that mappings prescribed by vendors or by Unicode, such as UTF-EBCDIC, are not accidentally forbidden.

That sounds good to me. However, as we've recently learned, there are certain implementation shenanigans that need to be accounted for:

gcc stripping white space following a line continuation character (I think you intend to adopt this behavior in general though).
trigraphs
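
To make the first item concrete, here is a minimal sketch of line splicing with and without tolerance for blanks between the backslash and the newline. The function name `splice_lines` and the `lenient` flag are hypothetical and only model the behavior described above; this is not gcc's actual code.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical model of line splicing (translation phase 2).
// lenient == false: only a backslash immediately followed by a newline splices.
// lenient == true:  spaces and tabs between the backslash and the newline are
//                   dropped as well (the gcc-like behavior discussed above).
std::string splice_lines(std::string_view src, bool lenient) {
    std::string out;
    std::size_t i = 0;
    while (i < src.size()) {
        if (src[i] == '\\') {
            std::size_t j = i + 1;
            if (lenient) {
                while (j < src.size() && (src[j] == ' ' || src[j] == '\t')) ++j;
            }
            if (j < src.size() && src[j] == '\n') {
                i = j + 1;  // drop backslash, any skipped blanks, and the newline
                continue;
            }
        }
        out.push_back(src[i++]);
    }
    return out;
}
```

With `lenient` set, the input `a\` followed by a space and a newline, then `b`, becomes `ab`; without it, the same input is left untouched, so the two models disagree on where the logical line ends.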

- Make the program ill-formed if the source file contains invalid code unit sequences or unmappable characters (the latter of which is not a concern for existing implementations)

Existing practice is to allow ill-formed code unit sequences in portions of the source code. For example, ill-formed UTF-8 in comments or string literals.

Hubert has specifically requested better support for unmappable characters, so I don't agree with the parenthetical.
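
For reference, this is roughly what "invalid code unit sequence" would mean for a UTF-8 source file. It is a sketch that follows the well-formedness table in the Unicode Standard; the function name `is_well_formed_utf8` is mine, and nothing here says where in the source (comments, literals) such a check would apply.

```cpp
#include <cstddef>
#include <string_view>

// Returns true if `s` is well-formed UTF-8: no overlong forms, no encoded
// surrogates, nothing above U+10FFFF, and no truncated sequences.
bool is_well_formed_utf8(std::string_view s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 <= 0x7F) { ++i; continue; }       // ASCII
        std::size_t len = 0;
        unsigned char lo = 0x80, hi = 0xBF;      // allowed range for byte 2
        if (b0 >= 0xC2 && b0 <= 0xDF) {
            len = 2;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            len = 3;
            if (b0 == 0xE0) lo = 0xA0;           // rejects overlong forms
            if (b0 == 0xED) hi = 0x9F;           // rejects surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            len = 4;
            if (b0 == 0xF0) lo = 0x90;           // rejects overlong forms
            if (b0 == 0xF4) hi = 0x8F;           // rejects > U+10FFFF
        } else {
            return false;                        // 0x80..0xC1 or 0xF5..0xFF
        }
        if (s.size() - i < len) return false;    // truncated sequence
        unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
        if (b1 < lo || b1 > hi) return false;
        for (std::size_t k = 2; k < len; ++k) {
            unsigned char b = static_cast<unsigned char>(s[i + k]);
            if (b < 0x80 || b > 0xBF) return false;
        }
        i += len;
    }
    return true;
}
```

A lone `0xE9` byte (é in ISO-8859-1) fails this check, which is the typical way ill-formed UTF-8 ends up in comments and string literals today.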

- Specify that Unicode-encoded (and especially UTF-8-encoded) files preserve the sequence of code points in phase 1 (i.e., no normalization)

Agreed.

- Mandate support for UTF-8 files

I strongly want this split out to a separate paper as this is evolutionary, not a wording fix or tightening of specification; this is a separable concern (though one that should be kept in mind when drafting wording for the other issues).

- Find a better wording mechanism for the handling of raw string literals, which most likely means that we want to handle universal-character-names that appear verbatim in the source files as escape sequences differently from what phase 1 calls extended characters.

Something along those lines, yes.
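
As a reminder of the observable behavior that any new wording has to preserve, here is a small compile-time check. The helper names `ucn_units` and `raw_units` are only for illustration.

```cpp
#include <cstddef>

// Outside a raw string literal, \u00E9 is a universal-character-name and
// denotes one character (U+00E9), which UTF-8 encodes as two code units.
static_assert(sizeof(u8"\u00E9") == 3, "2 code units + terminator");

// Inside a raw string literal, the same six source characters stay verbatim:
// backslash, 'u', '0', '0', 'E', '9'.
static_assert(sizeof(R"(\u00E9)") == 7, "6 characters + terminator");

// Illustrative helpers returning the lengths without the terminator.
constexpr std::size_t ucn_units() { return sizeof(u8"\u00E9") - 1; }
constexpr std::size_t raw_units() { return sizeof(R"(\u00E9)") - 1; }
```

The current wording gets this result by converting extended characters to universal-character-names in phase 1 and then reverting that inside raw string literals, which is the mechanism the bullet above proposes to replace.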

- Neither actively prevent nor mandate that, if the source encoding and the execution or wide execution encoding are the same, the sequence of bytes in string literals is preserved.

I'm having trouble connecting this with the other suggestions. If phase 1 is re-specified as suggested above (e.g., semantic mapping), then I'm not sure what this gains or prevents.

Did I miss anything?

This is probably a separable concern, but we did discuss the possibility of decomposed UTF-8 sequences encountered in source input getting mapped to a composed character in string literals for non-Unicode execution character sets (e.g., decomposed é getting mapped to composed é in ISO-8859-1).
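
A rough sketch of what such a mapping could look like follows. Everything here is an assumption for illustration: `to_latin1` is a hypothetical helper, and its one-entry composition rule stands in for real canonical composition data.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: encode a code point sequence as ISO-8859-1.
// The only composition it knows is <U+0065, U+0301> -> U+00E9; a real
// implementation would need full Unicode composition tables.
// Returns std::nullopt if any resulting code point has no ISO-8859-1 mapping.
std::optional<std::string> to_latin1(std::u32string_view in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t c = in[i];
        if (c == 0x0065 && i + 1 < in.size() && in[i + 1] == 0x0301) {
            c = 0x00E9;  // compose e + combining acute accent into é
            ++i;         // consume the combining mark
        }
        if (c > 0xFF) return std::nullopt;  // unmappable in ISO-8859-1
        out.push_back(static_cast<char>(static_cast<unsigned char>(c)));
    }
    return out;
}
```

Under this sketch, both `U"e\u0301"` and `U"\u00E9"` yield the single byte `0xE9`, whereas without the composition step the decomposed form would be unmappable because U+0301 has no ISO-8859-1 equivalent.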

Tom.