sg16: Re: [SG16] P2295R3 Support for UTF-8 as a portable source file encoding

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Thu, 6 May 2021 16:51:34 +0200

On 06/05/2021 10.23, Corentin wrote:
> Thanks for your feedback!
> New draft: https://isocpp.org/files/papers/D2295R4.pdf
>
> On Fri, Apr 30, 2021 at 9:07 AM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:

> This mixes two levels of "shall"s. The first is a requirement
> on the file, the second is a requirement on the implementation.
> Better to disentangle the two.
>
> Also, I suggest dropping the second half of that sentence.
> What would we lose? There is no permission elsewhere for the
> implementation to mess with the contents of UTF-8 files,
> so better not confuse Charlie. :-)
>
>
> If we don't say that, we never say what happens to the content of the file.
> And this is an important part of the paper.

>
>
> Suggested rewrite of the entire paragraph:
>
> The encoding scheme of a physical source file is determined
> in an implementation-defined manner. An implementation shall
> support (possibly among others) the UTF-8 encoding scheme.

I still think it's better to put the "shall support" requirement
here, where the "determination" is. This also makes it possible
to omit the definition of the term "UTF-8 file" entirely, which makes
Alisdair happier.

> If the encoding scheme of a physical source file is determined
> to be UTF-8, the physical source file shall consist of a well-formed
> sequence of UTF-8 code units as specified by ISO/IEC 10646.
> The sequence of source file characters is the sequence of characters
> encoded by the UTF-8 code units of the physical source file.
>
>
> The term "character" is still completely vacuous. Worse, using the UCS definition, this imposes a requirement
> that the scalar values be assigned, which is not the intent. I think it would be great to avoid ambiguous terms wherever we can!

This will mostly be resolved by the merge with my "translation character set" paper,
but I think we should be clear what the basis of the following phases of translation
is.

For non-UTF-8 source files, we map to the "basic source character set", but we don't
seem to have a name for the sequence of things that comes out of that mapping.

For UTF-8 source files, we should specify what the mapping to the sequence of things
is. From then on, processing is the same for both variants.
"shall be preserved" does not seem to specify a mapping from X to Y
in a sufficiently complete manner.
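
To illustrate what a fully specified mapping from X to Y could look like for the UTF-8 case, here is a minimal sketch, not proposed wording and not any implementation's actual behavior. It assumes the "sequence of things" is a sequence of Unicode scalar values, and the helper name decode_utf8 is hypothetical. Ill-formed input (overlong forms, surrogates, values above U+10FFFF, truncated sequences) is rejected; whether a scalar value is assigned is deliberately not checked.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical helper: maps the bytes of a physical source file to a
// sequence of Unicode scalar values, or std::nullopt if the bytes are
// not a well-formed UTF-8 code unit sequence.
std::optional<std::vector<char32_t>> decode_utf8(std::string_view bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const auto b0 = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;    // scalar value being assembled
        char32_t min = 0;   // smallest value allowed for this length
        std::size_t len = 0;
        if (b0 < 0x80)                { cp = b0;        len = 1; min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min = 0x10000; }
        else return std::nullopt;     // stray continuation or invalid lead byte

        if (bytes.size() - i < len) return std::nullopt;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min) return std::nullopt;                      // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return std::nullopt;  // surrogate
        if (cp > 0x10FFFF) return std::nullopt;                 // beyond Unicode range

        out.push_back(cp);  // unassigned scalar values are accepted as-is
        i += len;
    }
    return out;
}
```

With a mapping of this shape, the later phases of translation could consume the resulting sequence without knowing whether it came from a UTF-8 file or from some other implementation-defined encoding.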

Jens

Received on 2021-05-06 09:51:40