On Thu, May 6, 2021 at 10:51 AM Jens Maurer via SG16 <sg16@lists.isocpp.org> wrote:

On 06/05/2021 10.23, Corentin wrote:
> Thanks for your feedback!
> New draft https://isocpp.org/files/papers/D2295R4.pdf
>
> On Fri, Apr 30, 2021 at 9:07 AM Jens Maurer <Jens.Maurer@gmx.net> wrote:

> This mixes two levels of "shall"s. The first is a requirement
> on the file, the second is a requirement on the implementation.
> Better disentangle the two.
>
> Also, I suggest dropping the second half of that sentence.
> What would we lose? There is no permission elsewhere for the
> implementation to mess with the contents of UTF-8 files,
> so better not confuse Charlie. :-)
>
>
> If we don't say that, we never say what happens to the content of the file.
> And this is an important part of the paper.

>
>
> Suggested rewrite of the entire paragraph:
>
> The encoding scheme of a physical source file is determined
> in an implementation-defined manner. An implementation shall
> support (possibly among others) the UTF-8 encoding scheme.

I still think it's better to put the "shall support" requirement
here, where the "determination" is. This also makes it possible
to omit the definition of the term UTF-8 file entirely, which makes Alisdair
happier.

I am also going to be happier without the definition. The definition remains problematic for the reason I stated before: further references to the term apply to any file that is coincidentally a well-formed sequence of UTF-8 code units.

Also, the wording could be missing the mark on requiring implementations to be capable of accepting UTF-8 source files "whether or not they begin with a U+FEFF byte order mark", or more generally "without required modification of the source file".

Suggestion:

An implementation shall provide for processing physical source files as having a UTF-8 encoding scheme without restriction, other than resource limits ([implimits]), upon the content of the physical source file.

This has the property that it must be possible to get the compiler to give you a diagnostic for malformed UTF-8, instead of having the compiler simply decide the file wasn't UTF-8 after all (and process it without producing the diagnostic).
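
To illustrate the kind of check I have in mind, here is a rough sketch (not proposed wording, and not how any particular compiler does it). The function name `first_ill_formed_utf8` is hypothetical. It reports where the first ill-formed sequence starts, so that a front end told to treat the file as UTF-8 can diagnose it rather than silently fall back to another encoding. A leading U+FEFF is itself well-formed UTF-8 (EF BB BF), so files pass with or without it.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper: returns the byte offset of the first ill-formed
// UTF-8 sequence, or std::nullopt if the whole input is well-formed.
// The byte ranges follow the well-formedness table of the Unicode Standard
// (no overlong forms, no surrogates, nothing above U+10FFFF).
std::optional<std::size_t> first_ill_formed_utf8(std::string_view bytes) {
    const std::size_t n = bytes.size();
    auto byte = [&](std::size_t k) {
        return static_cast<unsigned char>(bytes[k]);
    };

    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = byte(i);
        if (b0 <= 0x7F) { ++i; continue; }   // single code unit

        std::size_t len = 0;
        unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the 2nd byte
        if (b0 >= 0xC2 && b0 <= 0xDF)      { len = 2; }
        else if (b0 == 0xE0)               { len = 3; lo = 0xA0; } // no overlongs
        else if (b0 >= 0xE1 && b0 <= 0xEC) { len = 3; }
        else if (b0 == 0xED)               { len = 3; hi = 0x9F; } // no surrogates
        else if (b0 >= 0xEE && b0 <= 0xEF) { len = 3; }
        else if (b0 == 0xF0)               { len = 4; lo = 0x90; } // no overlongs
        else if (b0 >= 0xF1 && b0 <= 0xF3) { len = 4; }
        else if (b0 == 0xF4)               { len = 4; hi = 0x8F; } // <= U+10FFFF
        else return i;                       // 80..C1 and F5..FF never lead

        if (n - i < len) return i;           // truncated sequence
        if (byte(i + 1) < lo || byte(i + 1) > hi) return i;
        for (std::size_t k = 2; k < len; ++k)
            if (byte(i + k) < 0x80 || byte(i + k) > 0xBF) return i;
        i += len;
    }
    return std::nullopt;
}
```

With this, an overlong encoding such as C0 80 in a file processed as UTF-8 yields an offset to diagnose, instead of being a reason to reinterpret the file in some other encoding.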

> If the encoding scheme of a physical source file is determined
> to be UTF-8, the physical source file shall consist of a well-formed
> sequence of UTF-8 code units as specified by ISO/IEC 10646.
> The sequence of source file characters is the sequence of characters
> encoded by the UTF-8 code units of the physical source file.
>
>
> The term character is still completely vacuous. Worse, using the UCS definition, this imposes a requirement
> that the scalar values be assigned, which is not the intent. I think it would be great to avoid ambiguous terms where we can!

This will mostly be resolved by the merge with my "translation character set" paper,
but I think we should be clear what the basis of the following phases of translation
is.

For non-UTF-8 source files, we map to the "basic source character set", but we don't
seem to have a name for the sequence of things that comes out of that mapping.

For UTF-8 source files, we should specify what the mapping to the sequence of things
is. From then on, processing is the same for both variants.
"shall be preserved" does not seem to specify a mapping from X to Y
in a sufficiently complete manner.
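
As a rough illustration of a fully specified mapping (a sketch only, not proposed wording): each well-formed UTF-8 code unit sequence maps to exactly one UCS scalar value, whether or not that value is assigned. The name `decode_well_formed_utf8` is hypothetical, and the function assumes ill-formed input has already been diagnosed. It makes no decision about a leading U+FEFF, which simply decodes like any other scalar value.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper. Precondition: `bytes` is a well-formed UTF-8 code
// unit sequence (ill-formed input is assumed to be rejected earlier).
// Returns the corresponding sequence of UCS scalar values.
std::u32string decode_well_formed_utf8(std::string_view bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const auto b0 = static_cast<unsigned char>(bytes[i]);

        // The lead byte determines the sequence length and its payload bits.
        std::size_t len = 1;
        char32_t value = b0;
        if (b0 >= 0xF0)      { len = 4; value = b0 & 0x07; }
        else if (b0 >= 0xE0) { len = 3; value = b0 & 0x0F; }
        else if (b0 >= 0xC0) { len = 2; value = b0 & 0x1F; }

        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len; ++k) {
            const auto cont = static_cast<unsigned char>(bytes[i + k]);
            value = (value << 6) | (cont & 0x3F);
        }

        out.push_back(value);
        i += len;
    }
    return out;
}
```

The point of the sketch is that the output is a sequence of scalar values, with no reference to "characters" or to whether a value is assigned. That sequence is what the later phases would operate on.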

Jens