On Mon, 15 Jun 2020 at 17:17, Tom Honermann <tom@honermann.net> wrote:

On 6/15/20 7:14 AM, Corentin via SG16 wrote:

Hey.

It occurred to me that while we discuss terminology, we have not clearly stated our objectives.

Ultimately, the observed behavior should be identical in most respects to what implementers do today. In particular, we don't want to change the mapping for a well-formed program, to avoid silent changes, and we want to keep supporting the set of source encodings supported by existing implementations.

Agreed.

But (I think) we should

- Tighten the specification to describe a semantic rather than implementation-defined mapping, while also making sure that mappings prescribed by vendors or by Unicode, such as UTF-EBCDIC, are not accidentally forbidden.

That sounds good to me. However, as we've recently learned, there are certain implementation shenanigans that need to be accounted for:

gcc stripping whitespace following a line continuation character (I think you intend to adopt this behavior in general though).

trigraphs

Whitespace stripping will be an EWG proposal in that mailing.

Hopefully trigraphs can either happen in an undocumented phase 0, or be understood as part of the "semantic mapping".
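To make the whitespace point concrete, here is a minimal sketch of the two ways a line continuation can be detected. The function name ends_in_continuation and its flag are hypothetical, invented for illustration, and are not taken from any compiler. It only models the behavior attributed to gcc in this thread: horizontal whitespace between the backslash and the newline is ignored.

```cpp
#include <string_view>

// Hypothetical predicate: does this physical line (given without its
// newline) end in a line continuation?
// ignore_trailing_ws == false: the backslash must be the last character.
// ignore_trailing_ws == true:  spaces and tabs after the backslash are
//                              skipped first (the behavior discussed above).
constexpr bool ends_in_continuation(std::string_view line,
                                    bool ignore_trailing_ws) {
    if (ignore_trailing_ws) {
        while (!line.empty() && (line.back() == ' ' || line.back() == '\t'))
            line.remove_suffix(1);
    }
    return !line.empty() && line.back() == '\\';
}

// Backslash directly before the newline: a continuation under both rules.
static_assert(ends_in_continuation("#define X 1 + \\", false));
// Backslash followed by one space: the two rules disagree.
static_assert(!ends_in_continuation("#define X 1 + \\ ", false));
static_assert(ends_in_continuation("#define X 1 + \\ ", true));
```

The last two assertions show the case where the same source bytes produce a different logical line depending on the rule, which is why the behavior needs to be pinned down one way or the other.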

- Make the program ill-formed if the source file contains invalid code unit sequences or unmappable characters (the latter of which is not a concern for existing implementations)

Existing practice is to allow ill-formed code unit sequences in portions of the source code. For example, ill-formed UTF-8 in comments or string literals.

I agree that it is evolutionary.
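For reference, "invalid code unit sequence" in the UTF-8 case can be sketched as below. This is an illustrative checker under my own naming (is_well_formed_utf8 is hypothetical), not what any implementation does in phase 1. It rejects stray or missing continuation bytes, overlong forms, surrogates, and values above U+10FFFF.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: true if `bytes` is a well-formed UTF-8 sequence.
constexpr bool is_well_formed_utf8(std::string_view bytes) {
    std::size_t i = 0;
    while (i < bytes.size()) {
        const auto b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;   // total length of this encoded sequence
        char32_t cp = 0;       // code point decoded so far
        char32_t min = 0;      // smallest code point allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
        else { return false; }  // stray continuation byte or invalid lead byte

        if (bytes.size() - i < len) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min) return false;                      // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode
        i += len;
    }
    return true;
}

static_assert(is_well_formed_utf8("caf\xC3\xA9"));  // U+00E9 as two bytes
static_assert(!is_well_formed_utf8("caf\xE9"));     // a lone Latin-1 byte
```

Running a check like this over the whole file would reject input that, per the existing practice mentioned above, is accepted today when the ill-formed bytes sit in a comment or string literal.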

Hubert has specifically requested better support for unmappable characters, so I don't agree with the parenthetical.

I don't think that's a fair characterisation. Again, there is a mapping for all characters in EBCDIC. That mapping is prescriptive rather than semantic, but both Unicode and IBM agree on it (the code points they map to do not have any associated semantics and are meant to be used that way). The wording trick will be to make sure we don't prevent that mapping.

The only case for which no mapping exists at all is a subset of Big5 characters (and maybe some exotic character sets which I have yet to learn about).

- Specify that Unicode-encoded (and especially UTF-8-encoded) files preserve the sequence of code points in phase 1 (aka no normalization)

Agreed.
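As a small illustration of what "no normalization" preserves: the precomposed and decomposed spellings of é are different code point sequences, and they should stay different after phase 1. The sketch below spells both with universal-character-names so that it does not depend on how this message is encoded. The variable names are mine.

```cpp
#include <string_view>

// U+00E9 LATIN SMALL LETTER E WITH ACUTE (precomposed form).
constexpr std::u32string_view composed = U"\u00E9";
// U+0065 followed by U+0301 COMBINING ACUTE ACCENT (decomposed form).
constexpr std::u32string_view decomposed = U"e\u0301";

// The two spellings render alike but are distinct sequences.
static_assert(composed.size() == 1);
static_assert(decomposed.size() == 2);
static_assert(composed != decomposed);
```

If phase 1 normalized its input to NFC, a decomposed é typed directly in a UTF-8 source file would end up as the one-element sequence instead, silently changing the literal.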

- Mandate support for UTF-8 files

I strongly want this split out to a separate paper as this is evolutionary, not a wording fix or tightening of specification; this is a separable concern (though one that should be kept in mind when drafting wording for the other issues).

I meant to list the end goals, some of them being evolutionary indeed.

- Find a better wording mechanism for the handling of raw string literals, which most likely means that we want to handle universal-character names that appear in the source files as verbatim escape sequences differently than what phase 1 calls extended characters.

Something along those lines, yes.
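The observable behavior that any new wording has to keep can be sketched as follows: a universal-character-name written out in an ordinary literal denotes one character, while the same spelling inside a raw string literal stays as six separate characters. The variable names are only for illustration.

```cpp
#include <string_view>

// Ordinary literal: the escape denotes the single code point U+00E9.
constexpr std::u32string_view cooked = U"\u00E9";
// Raw literal: backslash, 'u', '0', '0', 'E', '9' are kept verbatim.
constexpr std::u32string_view raw = UR"(\u00E9)";

static_assert(cooked.size() == 1);
static_assert(raw.size() == 6);
static_assert(raw == U"\\u00E9");  // same six characters, spelled with an escaped backslash
```

An é typed directly into a raw string literal must, by contrast, come out as the single code point U+00E9, so the wording needs some way to tell an extended character in the file apart from a universal-character-name that the user spelled out.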

- Neither actively prevent nor mandate that, if the source and the execution or wide execution encoding are the same, the sequence of bytes in a string literal is preserved.

I'm having trouble connecting this with the other suggestions. If phase 1 is re-specified as suggested above (e.g., semantic mapping), then I'm not sure what this gains or prevents.

I think it would be very hard for the wording to prevent implementers from preserving the bytes of a string literal in this case, but we shouldn't try :)

Did I miss anything?

This is probably a separable concern, but we did discuss the possibility of decomposed UTF-8 sequences encountered in source input getting mapped to a composed character in string literals for non-Unicode execution character sets (e.g., decomposed é getting mapped to composed é in ISO-8859-1).

Yes, I was trying to focus on source -> internal conversion
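For the record, the decomposed-to-composed case could look roughly like the sketch below. It is a toy under loud assumptions: compose_pair and to_latin1 are hypothetical names, the composition table holds three hand-picked pairs rather than real Unicode data, and it relies only on the fact that ISO-8859-1 byte values coincide with U+0000 through U+00FF.

```cpp
// Toy composition table (an assumption for this example, not Unicode data):
// returns the precomposed code point for base + combining mark, or 0.
constexpr char32_t compose_pair(char32_t base, char32_t mark) {
    if (mark == U'\u0301') {  // COMBINING ACUTE ACCENT
        switch (base) {
            case U'e': return U'\u00E9';
            case U'a': return U'\u00E1';
            case U'E': return U'\u00C9';
        }
    }
    return 0;
}

// Hypothetical conversion of one code point, optionally followed by a
// combining mark, to an ISO-8859-1 byte value. Returns -1 if unmappable.
constexpr int to_latin1(char32_t cp, char32_t following_mark = 0) {
    if (following_mark != 0) {
        const char32_t composed = compose_pair(cp, following_mark);
        return (composed != 0 && composed <= 0xFF) ? static_cast<int>(composed)
                                                   : -1;
    }
    return cp <= 0xFF ? static_cast<int>(cp) : -1;
}

static_assert(to_latin1(U'\u00E9') == 0xE9);       // already composed
static_assert(to_latin1(U'e', U'\u0301') == 0xE9); // composed during conversion
static_assert(to_latin1(U'\u0301') == -1);         // a lone combining mark has no Latin-1 byte
```

Note that this happens when converting to the execution character set, not in phase 1, which fits with keeping it separate from the source to internal conversion.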

Tom.