sg16: Re: [SG16] What do we want from source to internal conversion?

From: Tom Honermann <tom_at_[hidden]>
Date: Mon, 15 Jun 2020 11:17:42 -0400

On 6/15/20 7:14 AM, Corentin via SG16 wrote:
> Hey.
>
> It occurred to me that while we discuss terminology, we have
> not clearly stated our objectives.
>
> Ultimately the observed behavior should be identical in most respects
> to what implementers do today and in particular we don't want to
> change the mapping for a well-formed program, to avoid silent changes,
> and we want to keep supporting the set of source encodings supported by
> existing implementations.
Agreed.
>
> But (I think) we should
>
> - Tighten the specification to describe a semantic rather than
> implementation-defined mapping, while also making sure that mappings
> prescribed by vendors, or Unicode, such as UTF-EBCDIC are
> not accidentally forbidden.

That sounds good to me. However, as we've recently learned, there are
certain implementation shenanigans that need to be accounted for:

  * gcc stripping white space following a line continuation character (I
    think you intend to adopt this behavior in general though).
  * trigraphs
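
To make the line continuation point concrete, here is a rough sketch of
phase 2 splicing with and without the whitespace stripping. The name
splice_lines and its flag are made up for illustration; this models the
observable difference only, not how any compiler implements it, and it
ignores trigraphs entirely.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: remove backslash-newline pairs as in translation phase 2.
// When strip_trailing_ws is true, spaces and tabs between the backslash and the
// newline are ignored, which approximates the gcc behavior mentioned above.
std::string splice_lines(std::string_view src, bool strip_trailing_ws) {
    std::string out;
    std::size_t pos = 0;
    while (pos < src.size()) {
        const std::size_t nl = src.find('\n', pos);
        const bool has_nl = nl != std::string_view::npos;
        const std::string_view line =
            src.substr(pos, has_nl ? nl - pos : std::string_view::npos);

        // Find the end of the line, optionally skipping trailing blanks.
        std::size_t end = line.size();
        if (strip_trailing_ws) {
            while (end > 0 && (line[end - 1] == ' ' || line[end - 1] == '\t')) {
                --end;
            }
        }

        if (has_nl && end > 0 && line[end - 1] == '\\') {
            // Continuation: drop the backslash, any skipped blanks, and the newline.
            out.append(line.substr(0, end - 1));
        } else {
            out.append(line);
            if (has_nl) out.push_back('\n');
        }
        pos = has_nl ? nl + 1 : src.size();
    }
    return out;
}
```

With strip_trailing_ws set to false, a backslash followed by a space and
a newline is left alone; with it set to true, the two physical lines are
joined. That is exactly the kind of difference new wording would need to
either permit or prescribe.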

> - Make the program ill-formed if the source file contains invalid code
> unit sequences or unmappable characters (the latter of which is not a
> concern for existing implementations)

Existing practice is to allow ill-formed code unit sequences in portions
of the source code. For example, ill-formed UTF-8 in comments or string
literals.
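
For reference, "ill-formed" here means a byte sequence that fails the
UTF-8 well-formedness rules in the Unicode Standard. A minimal sketch of
such a check follows; is_well_formed_utf8 is a hypothetical name, and
this is not a claim about how any compiler validates its input.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: true if bytes is a well-formed UTF-8 sequence.
// Rejects overlong forms, surrogates, values above U+10FFFF, and truncation.
bool is_well_formed_utf8(std::string_view bytes) {
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        if (b0 <= 0x7F) { ++i; continue; }          // single-byte (ASCII)

        std::size_t len = 0;
        unsigned char lo = 0x80, hi = 0xBF;         // allowed range for byte 2
        if (b0 >= 0xC2 && b0 <= 0xDF) {
            len = 2;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            len = 3;
            if (b0 == 0xE0) lo = 0xA0;              // excludes overlong forms
            if (b0 == 0xED) hi = 0x9F;              // excludes surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            len = 4;
            if (b0 == 0xF0) lo = 0x90;              // excludes overlong forms
            if (b0 == 0xF4) hi = 0x8F;              // excludes > U+10FFFF
        } else {
            return false;                           // 80..C1 and F5..FF never lead
        }

        if (n - i < len) return false;              // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if (k == 1) {
                if (b < lo || b > hi) return false;
            } else if (b < 0x80 || b > 0xBF) {
                return false;
            }
        }
        i += len;
    }
    return true;
}
```

A comment such as "// caf\xE9" (a lone Latin-1 byte in a file otherwise
treated as UTF-8) fails this check, yet existing practice accepts it.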

Hubert has specifically requested better support for unmappable
characters, so I don't agree with the parenthetical.

>
> - Specify that Unicode-encoded (and especially UTF-8-encoded) files
> preserve the sequence of code points in phase 1 (aka no normalization)
Agreed.
> - Mandate support for UTF-8 files
I strongly want this split out to a separate paper as this is
evolutionary, not a wording fix or tightening of specification; this is
a separable concern (though one that should be kept in mind when
drafting wording for the other issues).
> - Find a better wording mechanism for the handling of raw
> string literals, which most likely means that we want to handle
> universal-character names that appear in the source files as verbatim
> escape sequences differently than what phase 1 calls extended characters.
Something along those lines, yes.
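
The observable behavior any new wording has to keep for raw literals
can be checked directly. A small sketch, assuming C++17 or later (the
variable name raw is just for illustration):

```cpp
#include <string_view>

// In a u8 literal, \u00e9 is a universal-character-name for U+00E9.
// sizeof counts the two UTF-8 code units plus the terminating null.
static_assert(sizeof(u8"\u00e9") == 3, "UCN encoded as UTF-8: C3 A9 00");

// In a raw literal the same six source characters are kept verbatim.
constexpr std::string_view raw = R"(\u00e9)";
static_assert(raw.size() == 6, "backslash, u, 0, 0, e, 9");
static_assert(raw == "\\u00e9", "no UCN replacement inside a raw literal");
```

So a universal-character-name written in the source has to remain
distinguishable from an extended character that phase 1 converted,
otherwise the raw literal case cannot be specified without a "revert"
step.
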
> - Not actively prevent, nor mandate, that, if the source and the
> execution or wide execution encoding are the same, the sequence of
> bytes in string literals is preserved.
I'm having trouble connecting this with the other suggestions. If phase
1 is re-specified as suggested above (e.g., semantic mapping), then I'm
not sure what this gains or prevents.
>
> Did I miss anything?
>
This is probably a separable concern, but we did discuss the possibility
of decomposed UTF-8 sequences encountered in source input getting mapped
to a composed character in string literals for non-Unicode execution
character sets (e.g., decomposed é (U+0065 U+0301) getting mapped to
composed é (U+00E9, byte 0xE9) in ISO-8859-1).
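
As a sketch of what such a mapping could look like (purely
illustrative; to_latin1_composing is a made-up name, and only the single
pair e + U+0301 is composed here, where a real implementation would
presumably apply full canonical composition):

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: convert code points to ISO-8859-1 bytes, composing
// 'e' followed by U+0301 COMBINING ACUTE ACCENT into U+00E9 first.
// Returns std::nullopt if a code point has no ISO-8859-1 representation.
std::optional<std::string> to_latin1_composing(std::u32string_view cps) {
    std::string out;
    for (std::size_t i = 0; i < cps.size(); ++i) {
        char32_t c = cps[i];
        if (c == U'e' && i + 1 < cps.size() && cps[i + 1] == 0x0301) {
            c = 0x00E9;   // composed form
            ++i;          // consume the combining mark
        }
        if (c > 0xFF) return std::nullopt;   // unmappable in ISO-8859-1
        out.push_back(static_cast<char>(static_cast<unsigned char>(c)));
    }
    return out;
}
```

Here U"cafe\u0301" maps to the five bytes "caf\xE9" plus nothing else,
while U"e\u0302" (combining circumflex) is reported as unmappable
because this sketch has no composition entry for it.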

Tom.

Received on 2020-06-15 10:20:53