Subject: Re: P2295R3 Support for UTF-8 as a portable source file encoding
From: Hubert Tong (hubert.reinterpretcast_at_[hidden])
Date: 2021-05-11 23:26:52


On Thu, May 6, 2021 at 10:51 AM Jens Maurer via SG16 <sg16_at_[hidden]>
wrote:

> On 06/05/2021 10.23, Corentin wrote:
> > Thanks for your feedback!
> > New draft https://isocpp.org/files/papers/D2295R4.pdf
> >
> > On Fri, Apr 30, 2021 at 9:07 AM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>
> > This mixes two levels of "shall"s. The first is a requirement
> > on the file, the second is a requirement on the implementation.
> > Better disentangle the two.
> >
> > Also, I suggest to drop the second half of that sentence.
> > What would we lose? There is no permission elsewhere for the
> > implementation to mess with the contents of UTF-8 files,
> > so better not confuse Charlie. :-)
> >
> >
> > If we don't say that, we never say what happens to the content of the
> > file. And this is an important part of the paper.
>
> >
> >
> > Suggested rewrite of the entire paragraph:
> >
> > The encoding scheme of a physical source file is determined
> > in an implementation-defined manner. An implementation shall
> > support (possibly among others) the UTF-8 encoding scheme.
>
> I still think it's better to put the "shall support" requirement
> here, where the "determination" is. This also makes it possible
> to totally omit the definition of the term UTF-8 file, which makes Alisdair
> happier.
>

I am also going to be happier without the definition. The definition
remains problematic for the reason I stated before: further references to
the term apply to any file that is coincidentally a well-formed sequence
of UTF-8 code units.

Also, the wording could be missing the mark on requiring implementations to
be capable of accepting UTF-8 source files "whether or not they begin with
a U+FEFF byte order mark", or more generally "without required modification
of the source file".

Suggestion:
An implementation shall provide for processing physical source files as
having a UTF-8 encoding scheme without restriction, other than resource
limits ([implimits]), upon the content of the physical source file.

This has the property that it must be possible to get the compiler to give
you a diagnostic for malformed UTF-8 instead of having the compiler simply
decide the file wasn't UTF-8 after all (and process without producing the
diagnostic).
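
To illustrate the distinction, here is a minimal C++17 sketch of a front end that has been told a file is UTF-8 and therefore reports the first ill-formed sequence instead of silently treating the file as some other encoding. The names `first_ill_formed_utf8` and `strip_bom` are hypothetical and do not come from the paper or from any implementation. The byte ranges follow the Unicode Standard's table of well-formed UTF-8 byte sequences.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical helper: drop a leading U+FEFF (bytes EF BB BF) if present,
// so a file is accepted whether or not it begins with a byte order mark.
std::string_view strip_bom(std::string_view bytes) {
    if (bytes.size() >= 3 && bytes.substr(0, 3) == "\xEF\xBB\xBF") {
        bytes.remove_prefix(3);
    }
    return bytes;
}

// Hypothetical helper: return the byte offset of the first ill-formed
// UTF-8 sequence, or std::nullopt if the whole buffer is well-formed.
// A caller would turn a returned offset into a diagnostic.
std::optional<std::size_t> first_ill_formed_utf8(std::string_view bytes) {
    const std::size_t n = bytes.size();
    auto byte = [&](std::size_t k) {
        return static_cast<std::uint8_t>(bytes[k]);
    };
    std::size_t i = 0;
    while (i < n) {
        const std::uint8_t b0 = byte(i);
        if (b0 <= 0x7F) {  // single-byte (ASCII) sequence
            ++i;
            continue;
        }
        std::size_t len = 0;
        std::uint8_t lo = 0x80, hi = 0xBF;  // allowed range for the second byte
        if (b0 >= 0xC2 && b0 <= 0xDF) {
            len = 2;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            len = 3;
            if (b0 == 0xE0) lo = 0xA0;  // reject overlong 3-byte forms
            if (b0 == 0xED) hi = 0x9F;  // reject surrogates U+D800..U+DFFF
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            len = 4;
            if (b0 == 0xF0) lo = 0x90;  // reject overlong 4-byte forms
            if (b0 == 0xF4) hi = 0x8F;  // reject values above U+10FFFF
        } else {
            return i;  // 80..C1 and F5..FF never start a sequence
        }
        if (n - i < len) return i;  // truncated sequence at end of input
        const std::uint8_t b1 = byte(i + 1);
        if (b1 < lo || b1 > hi) return i;
        for (std::size_t k = 2; k < len; ++k) {
            const std::uint8_t b = byte(i + k);
            if (b < 0x80 || b > 0xBF) return i;  // not a continuation byte
        }
        i += len;
    }
    return std::nullopt;
}
```

With something like this, an implementation in its UTF-8 mode would call `first_ill_formed_utf8(strip_bom(contents))` and diagnose any returned offset, rather than using a failed check as evidence that the file was not UTF-8 in the first place.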

>
> > If the encoding scheme of a physical source file is determined
> > to be UTF-8, the physical source file shall consist of a well-formed
> > sequence of UTF-8 code units as specified by ISO/IEC 10646.
> > The sequence of source file characters is the sequence of characters
> > encoded by the UTF-8 code units of the physical source file.
> >
> >
> > The term character is still completely vacuous. Worse, using the UCS
> > definition, this puts a requirement that the scalar values are assigned,
> > which is not the intent. I think it would be great to avoid using
> > ambiguous terms where we can avoid it!
>
> This will mostly be resolved by the merge with my "translation character
> set" paper, but I think we should be clear what the basis of the following
> phases of translation is.
>
> For non-UTF-8 source files, we map to the "basic source character set",
> but we don't seem to have a name for the sequence of things that comes
> out of that mapping.
>
> For UTF-8 source files, we should specify what the mapping to the sequence
> of things is. From then on, processing is the same for both variants.
> "shall be preserved" does not seem to specify a mapping from X to Y
> in a sufficiently complete manner.
>
> Jens