Subject: Re: What do we want from source to internal conversion?
From: Tom Honermann (tom_at_[hidden])
Date: 2020-06-15 10:17:42
On 6/15/20 7:14 AM, Corentin via SG16 wrote:
> It occurred to me that while we discuss terminology, we have
> not clearly stated our
> Ultimately the observed behavior should be identical in most respects
> to what implementers do today and in particular we don't want to
> change the mapping for a well-formed program, to avoid silent changes,
> and we want to keep supporting the set of source encodings supported by
> existing implementations.
> But (I think) we should
> - Tighten the specification to describe a semantic rather than
> implementation defined mapping, while also making sure that mappings
> prescribed by vendors, or Unicode, such as UTF-EBCDIC are
> not accidentally forbidden.
That sounds good to me. However, as we've recently learned, there are
certain implementation shenanigans that need to be accounted for:
* gcc stripping white space following a line continuation character (I
think you intend to adopt this behavior in general though).
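
To illustrate the line continuation point, here is a minimal sketch in Python of phase 2 line splicing in a strict form and in a lenient form that also drops white space between the backslash and the newline. The function name `splice_lines` and its flag are hypothetical, and this only models the observable difference; it is not how gcc is implemented.

```python
import re


def splice_lines(source: str, allow_trailing_whitespace: bool) -> str:
    """Remove backslash-newline pairs, roughly as translation phase 2 does.

    When allow_trailing_whitespace is True, spaces and tabs between the
    backslash and the newline are removed as well (the lenient behavior
    discussed above). This is an illustrative model only.
    """
    if allow_trailing_whitespace:
        pattern = r"\\[ \t]*\n"
    else:
        pattern = r"\\\n"
    return re.sub(pattern, "", source)


# A backslash followed by one space and then a newline.
example = "#define X 1 + \\ \n 2\n"

strict = splice_lines(example, allow_trailing_whitespace=False)
lenient = splice_lines(example, allow_trailing_whitespace=True)
```

Under the strict rule the input is left unchanged, because the backslash is not immediately followed by a newline. Under the lenient rule the two physical lines are joined into `#define X 1 +  2`.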
> - Make the program ill-formed if the source file contains invalid code
> unit sequences or unmappable characters (the latter of which is not a
> concern for existing implementations)
Existing practice is to allow ill-formed code unit sequences in portions
of the source code. For example, ill-formed UTF-8 in comments or string
Hubert has specifically requested better support for unmappable
characters, so I don't agree with the parenthetical.
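
The two failure modes in the quoted bullet can be told apart mechanically. A minimal sketch in Python, assuming a UTF-8 source file and using Latin-1 as a stand-in for some non-Unicode target character set; the function name `classify_source_bytes` and the choice of Latin-1 are assumptions made for illustration only.

```python
def classify_source_bytes(data: bytes, target_encoding: str = "latin-1") -> str:
    """Classify raw source bytes under a strict reading of phase 1.

    Returns "ill-formed" if the bytes are not valid UTF-8, "unmappable" if
    they decode but contain a character the assumed target encoding cannot
    represent, and "ok" otherwise.
    """
    try:
        text = data.decode("utf-8")  # strict by default: raises on bad input
    except UnicodeDecodeError:
        return "ill-formed"
    try:
        text.encode(target_encoding)
    except UnicodeEncodeError:
        return "unmappable"
    return "ok"


well_formed = classify_source_bytes(b"// caf\xc3\xa9")   # U+00E9 as UTF-8
truncated = classify_source_bytes(b"// caf\xe9")         # lone lead byte
euro_sign = classify_source_bytes(b"\xe2\x82\xac")       # U+20AC, not in Latin-1
```

Here `well_formed` is `"ok"`, `truncated` is `"ill-formed"`, and `euro_sign` is `"unmappable"`. A check like this applied to the whole file would reject the ill-formed comment case that existing practice currently accepts, which is the tension raised above.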
> - Specify that Unicode-encoded (and especially UTF-8 encoded) files
> preserve the sequence of code points in phase 1 (aka no normalization)
> - Mandate support for UTF-8 files
I strongly want this split out to a separate paper as this is
evolutionary, not a wording fix or tightening of specification; this is
a separable concern (though one that should be kept in mind when
drafting wording for the other issues).
> - Find a better wording mechanism for the handling of raw
> string literals, which most likely means that we want to handle
> universal-character names that appear in the source files as verbatim
> escape sequences differently than what phase 1 calls extended characters.
Something along those lines, yes.
> - Not actively prevent, nor mandate, that, if the source and the
> execution or wide execution encoding are the same, the sequence of
> bytes in a string literal is preserved.
I'm having trouble connecting this with the other suggestions. If phase
1 is re-specified as suggested above (e.g., semantic mapping), then I'm
not sure what this gains or prevents.
> Did I miss anything?
This is probably a separable concern, but we did discuss the possibility
of decomposed UTF-8 sequences encountered in source input getting mapped
to a composed character in string literals for non-Unicode execution
character sets (e.g., decomposed é getting mapped to composed é in
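
The composed versus decomposed distinction can be made concrete with a short Python sketch. It assumes Latin-1 as an example of a non-Unicode execution character set, and the helper `to_execution_encoding` is hypothetical; whether an implementation should compose before mapping is exactly the open question, so the `compose` flag only shows the two possible outcomes.

```python
import unicodedata

decomposed = "e\u0301"  # U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT
composed = "\u00e9"     # U+00E9 LATIN SMALL LETTER E WITH ACUTE

# Different code point sequences, hence different UTF-8 code unit sequences.
decomposed_utf8 = decomposed.encode("utf-8")  # b"e\xcc\x81"
composed_utf8 = composed.encode("utf-8")      # b"\xc3\xa9"


def to_execution_encoding(text: str, encoding: str = "latin-1",
                          compose: bool = False) -> bytes:
    """Map text to an assumed execution encoding, optionally applying NFC first."""
    if compose:
        text = unicodedata.normalize("NFC", text)
    return text.encode(encoding)  # raises UnicodeEncodeError if unmappable


composed_result = to_execution_encoding(decomposed, compose=True)  # b"\xe9"

try:
    to_execution_encoding(decomposed, compose=False)
    unmappable_without_composition = False
except UnicodeEncodeError:
    # U+0301 has no Latin-1 representation on its own.
    unmappable_without_composition = True
```

Without composition the combining accent is unmappable in Latin-1; with NFC applied first the pair maps to the single byte 0xE9.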