Subject: Re: P2295R3 Support for UTF-8 as a portable source file encoding
From: Peter Brett (pbrett_at_[hidden])
Date: 2021-05-06 03:26:42

Hi Corentin,

This looks good to me. Please consider adding explicit notation in the wording diff to indicate where paragraph breaks have been added relative to the working draft.


From: SG16 <sg16-bounces_at_[hidden]> On Behalf Of Corentin via SG16
Sent: 06 May 2021 09:23
To: Jens Maurer <Jens.Maurer_at_[hidden]>
Cc: Corentin <corentin.jabot_at_[hidden]>; SG16 <sg16_at_[hidden]>
Subject: Re: [SG16] P2295R3 Support for UTF-8 as a portable source file encoding

Thanks for your feedback!
New draft: https://isocpp.org/files/papers/D2295R4.pdf

On Fri, Apr 30, 2021 at 9:07 AM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
On 29/04/2021 09.34, Corentin via SG16 wrote:
> Per request in yesterday's meeting,
> here is P2295R3 Support for UTF-8 as a portable source file encoding.
> I am looking forward to your feedback
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p2295r3.pdf

- I'm not in favor of replacing "physical source file" with something else.
This is pre-existing terminology and the prose section of the paper does
not mention any issues with the term. Also, I don't remember any issues
with the specific term to have been voiced.

- The text should use "physical source file" consistently and not
abbreviate it to "source file" on occasion.

 - "is a source file encoded with ..."
Does that mean there is a source file, and it's encoded as part of
the process? Maybe "whose encoding scheme is UTF-8..."

 - "defined in ISO/IEC 10646" -> "specified by ..."

 - Do not italicize defined terms more than once (which is where
the definition of the term is).

 - "An implementation shall support UTF-8 files." Move to the end of
the paragraph.

 - "If the source file" -> "If a physical source file"

 - "If the source file is determined to be a UTF-8 file, it shall
represent a well-formed sequence of UTF-8 code units and the scalar
value of each source character shall be preserved."

This mixes two levels of "shall"s. The first is a requirement
on the file, the second is a requirement on the implementation.
Better disentangle the two.

Also, I suggest dropping the second half of that sentence.
What would we lose? There is no permission elsewhere for the
implementation to mess with the contents of UTF-8 files,
so better not confuse Charlie. :-)

If we don't say that, we never say what happens to the content of the file.
And this is an important part of the paper.

Suggested rewrite of the entire paragraph:

The encoding scheme of a physical source file is determined
in an implementation-defined manner. An implementation shall
support (possibly among others) the UTF-8 encoding scheme.

If the encoding scheme of a physical source file is determined
to be UTF-8, the physical source file shall consist of a well-formed
sequence of UTF-8 code units as specified by ISO/IEC 10646.
The sequence of source file characters is the sequence of characters
encoded by the UTF-8 code units of the physical source file.

The term character is still completely vacuous. Worse, using the UCS definition, this puts a requirement
that the scalar values are assigned, which is not the intent. I think it would be great to avoid ambiguous terms wherever we can!
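
To make the distinction above concrete, here is a minimal C++ sketch of what "a well-formed sequence of UTF-8 code units" decoded to scalar values could look like. The function name decode_utf8 is hypothetical and is not taken from the paper or from any implementation. The sketch rejects ill-formed input (overlong forms, surrogates, values above U+10FFFF, truncated sequences) but deliberately does not check whether a scalar value is assigned, which matches the stated intent.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical helper (illustration only): decode UTF-8 code units into
// Unicode scalar values. Returns std::nullopt if the input is ill-formed.
// No check is made for whether a scalar value is an assigned character.
std::optional<std::vector<char32_t>> decode_utf8(std::string_view bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;      // scalar value being assembled
        std::size_t len = 0;  // number of code units in this sequence
        char32_t min = 0;     // smallest value allowed for this length
        if (b0 < 0x80)                { cp = b0;        len = 1; min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min = 0x10000; }
        else return std::nullopt;     // stray continuation or invalid lead byte

        if (bytes.size() - i < len) return std::nullopt;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;  // not a continuation
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min) return std::nullopt;                      // overlong form
        if (cp > 0x10FFFF) return std::nullopt;                 // out of range
        if (cp >= 0xD800 && cp <= 0xDFFF) return std::nullopt;  // surrogate
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Under these assumptions, an unassigned but well-formed scalar value passes through unchanged, while an ill-formed file is reported as an error rather than silently repaired.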

 - In the next paragraph, the parenthetical about new-lines should
use commas instead (it's normatively relevant).

 - "Any source file character not in the basic source character set is
replaced by the universal-character-name that designates that character."

is applicable to both cases, so should be in a separate paragraph.
But that's going to be fixed by my translation character set paper.
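
For readers unfamiliar with the quoted sentence, the replacement it describes can be sketched roughly as follows in C++. This illustrates the wording under discussion only, not how any compiler implements it. The names is_basic_source_char and to_basic_source are hypothetical, and the 96-character set is the basic source character set as I understand the working draft of that time.

```cpp
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical helper: membership in the 96-character basic source
// character set (assumption: the set as specified in the C++20-era draft).
bool is_basic_source_char(char32_t c) {
    constexpr std::u32string_view basic =
        U" \t\v\f\n"
        U"abcdefghijklmnopqrstuvwxyz"
        U"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        U"0123456789"
        U"_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'";
    return basic.find(c) != std::u32string_view::npos;
}

// Hypothetical helper: replace every scalar value outside the basic source
// character set with a universal-character-name (\uXXXX or \UXXXXXXXX).
std::string to_basic_source(std::u32string_view scalars) {
    std::string out;
    for (char32_t c : scalars) {
        if (is_basic_source_char(c)) {
            out.push_back(static_cast<char>(c));
        } else {
            char buf[11];  // "\U" + 8 hex digits + terminating NUL
            if (c <= 0xFFFF)
                std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(c));
            else
                std::snprintf(buf, sizeof buf, "\\U%08X", static_cast<unsigned>(c));
            out += buf;
        }
    }
    return out;
}
```

Because the function takes scalar values as input, the same step applies however the physical source file was encoded, which is why the sentence fits both cases.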

