Subject: Re: What do we want from source to internal conversion?
From: Corentin (corentin.jabot_at_[hidden])
Date: 2020-06-15 10:38:36
On Mon, 15 Jun 2020 at 17:17, Tom Honermann <tom_at_[hidden]> wrote:
> On 6/15/20 7:14 AM, Corentin via SG16 wrote:
> It occurred to me that while we discuss terminology, we have
> not clearly stated our
> Ultimately the observed behavior should be identical in most respects to
> what implementers do today. In particular, we don't want to change the
> mapping for a well-formed program, to avoid silent changes, and we want to
> keep supporting the set of source encodings supported by existing
> But (I think) we should
> - Tighten the specification to describe a semantic rather than
> implementation-defined mapping, while also making sure that mappings
> prescribed by vendors or by Unicode, such as UTF-EBCDIC, are
> not accidentally forbidden.
> That sounds good to me. However, as we've recently learned, there are
> certain implementation shenanigans that need to be accounted for:
> - gcc stripping white space following a line continuation character (I
> think you intend to adopt this behavior in general though).
> - trigraphs
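
For readers unfamiliar with the first item, here is a minimal sketch of line splicing that tolerates horizontal whitespace between the backslash and the newline. The function name `splice_lines` is hypothetical, the sketch only handles `\n` line endings, and it is not a description of how GCC actually implements this.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: joins physical lines that end in a backslash,
// also accepting spaces or tabs between the backslash and the newline.
std::string splice_lines(std::string_view src) {
    std::string out;
    std::size_t i = 0;
    while (i < src.size()) {
        if (src[i] == '\\') {
            // Look past any horizontal whitespace after the backslash.
            std::size_t j = i + 1;
            while (j < src.size() && (src[j] == ' ' || src[j] == '\t')) ++j;
            if (j < src.size() && src[j] == '\n') {
                i = j + 1;  // drop the backslash, the whitespace and the newline
                continue;
            }
        }
        out.push_back(src[i]);
        ++i;
    }
    return out;
}
```

A backslash that is not followed (after optional whitespace) by a newline is left untouched, so escape sequences inside a line are unaffected.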
> Whitespace stripping will be an EWG proposal in that mailing.
Hopefully trigraphs can either happen in an undocumented phase 0, or be
understood as part of "semantic mapping"
> - Make the program ill-formed if the source file contains invalid code
> unit sequences or unmappable characters (the latter of which is not a
> concern for existing implementations)
> Existing practice is to allow ill-formed code unit sequences in portions
> of the source code. For example, ill-formed UTF-8 in comments or string
I agree that it is evolutionary.
> Hubert has specifically requested better support for unmappable
> characters, so I don't agree with the parenthetical.
I don't think that's a fair characterisation. Again, there is a mapping for
all characters in EBCDIC. That mapping is prescriptive rather than
semantic, but both Unicode and IBM agree on that mapping (the code points
they map to have no associated semantics whatsoever and are meant to be
used that way). The wording trick will be to make sure we don't prevent
The only case for which no mapping exists at all is a subset of Big5
characters (and maybe some exotic character sets which I have yet to
> - Specify that Unicode-encoded (and especially UTF-8-encoded) files
> preserve the sequence of code points in phase 1 (aka no normalization)
> - Mandate support for UTF-8 files
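
To make "reject invalid code unit sequences" and "preserve the sequence of code points" concrete, here is a minimal sketch of a strict UTF-8 decoder. The name `decode_utf8` is hypothetical and this is only an illustration under those two goals, not proposed wording or any compiler's actual phase 1. It returns no value for ill-formed input and otherwise returns the code points exactly as encoded, with no normalization.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical helper: strict UTF-8 decoding.
// Returns std::nullopt for any ill-formed sequence; never normalizes.
std::optional<std::vector<char32_t>> decode_utf8(std::string_view bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const auto b0 = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        char32_t min = 0;     // smallest code point allowed for this length
        std::size_t len = 0;  // total number of code units in the sequence
        if (b0 < 0x80)                { cp = b0;        len = 1; min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min = 0x10000; }
        else return std::nullopt;  // stray continuation or invalid lead byte
        if (i + len > bytes.size()) return std::nullopt;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;  // not a continuation
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min) return std::nullopt;                        // overlong form
        if (cp > 0x10FFFF) return std::nullopt;                   // out of range
        if (cp >= 0xD800 && cp <= 0xDFFF) return std::nullopt;    // surrogate
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

With this sketch, a decomposed "e" followed by U+0301 stays as two code points and a precomposed U+00E9 stays as one, which is the "no normalization" property the list item asks for.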
> I strongly want this split out to a separate paper as this is
> evolutionary, not a wording fix or tightening of specification; this is a
> separable concern (though one that should be kept in mind when drafting
> wording for the other issues).
I meant to list the end goals, some of them being evolutionary indeed
> - Find a better wording mechanism for the handling of raw string literals,
> which most likely means that we want to handle universal-character names
> that appear in the source files as verbatim escape sequences differently
> than what phase 1 calls extended characters.
> Something along those lines, yes.
> - Neither actively prevent nor mandate that, if the source and the execution
> or wide execution encoding are the same, the sequence of bytes in string
> literals is preserved.
> I'm having trouble connecting this with the other suggestions. If phase 1
> is re-specified as suggested above (e.g., semantic mapping), then I'm not
> sure what this gains or prevents.
I think it would be very hard for the wording to prevent implementers from
preserving the bytes of string literals in this case, but we shouldn't try :)
> Did I miss anything?
> This is probably a separable concern, but we did discuss the possibility
> of decomposed UTF-8 sequences encountered in source input getting mapped to
> a composed character in string literals for non-Unicode execution character
> sets (e.g., decomposed é getting mapped to composed é in ISO-8859-1).
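
As an illustration of the decomposed-to-composed case, here is a minimal sketch of converting a code point sequence to ISO-8859-1 while composing a base letter with a following U+0301 COMBINING ACUTE ACCENT. The names `compose_acute` and `to_latin1` are hypothetical, and the four-entry table is a deliberately incomplete stand-in for real Unicode composition (NFC); it is not something any implementation is known to do in this form.

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Tiny, deliberately incomplete table (assumption for illustration only):
// base letter + U+0301 -> precomposed ISO-8859-1 character, or 0 if unknown.
char32_t compose_acute(char32_t base) {
    switch (base) {
        case U'A': return 0xC1;  // Á
        case U'E': return 0xC9;  // É
        case U'a': return 0xE1;  // á
        case U'e': return 0xE9;  // é
        default:   return 0;     // no composition known to this sketch
    }
}

// Hypothetical helper: returns std::nullopt if any character is unmappable.
std::optional<std::string> to_latin1(const std::u32string& cps) {
    std::string out;
    for (std::size_t i = 0; i < cps.size(); ++i) {
        char32_t cp = cps[i];
        if (i + 1 < cps.size() && cps[i + 1] == 0x0301) {
            if (char32_t composed = compose_acute(cp)) {
                cp = composed;
                ++i;  // consume the combining accent as well
            }
        }
        if (cp > 0xFF) return std::nullopt;  // not representable in ISO-8859-1
        out.push_back(static_cast<char>(static_cast<unsigned char>(cp)));
    }
    return out;
}
```

Both the decomposed and the precomposed spelling of é end up as the single byte 0xE9, while a combination the table does not know (for example "x" followed by U+0301) is reported as unmappable rather than silently altered.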
Yes, I was trying to focus on source -> internal conversion