Re: [SG16] P2295R3 Support for UTF-8 as a portable source file encoding

From: Peter Brett <pbrett_at_[hidden]>
Date: Thu, 6 May 2021 08:26:42 +0000
Hi Corentin,

This looks good to me. Please consider adding explicit notation in the wording diff to indicate where paragraph breaks have been added relative to the working draft.

                    Peter

From: SG16 <sg16-bounces_at_[hidden]> On Behalf Of Corentin via SG16
Sent: 06 May 2021 09:23
To: Jens Maurer <Jens.Maurer_at_[hidden]>
Cc: Corentin <corentin.jabot_at_[hidden]>; SG16 <sg16_at_[hidden]>
Subject: Re: [SG16] P2295R3 Support for UTF-8 as a portable source file encoding

Thanks for your feedback!
New draft: https://isocpp.org/files/papers/D2295R4.pdf

On Fri, Apr 30, 2021 at 9:07 AM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
On 29/04/2021 09.34, Corentin via SG16 wrote:
> Per request in yesterday's meeting,
> here is P2295R3 Support for UTF-8 as a portable source file encoding.
>
> I am looking forward to your feedback
>
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p2295r3.pdf


- I'm not in favor of replacing "physical source file" with something else.
This is pre-existing terminology and the prose section of the paper does
not mention any issues with the term. Also, I don't remember anyone
voicing issues with this specific term.

- The text should use "physical source file" consistently and not
abbreviate it to "source file" on occasion.

 - "is a source file encoded with ..."
Does that mean there is a source file, and it's encoded as part of
the process? Maybe "whose encoding scheme is UTF-8..."

 - "defined in ISO/IEC 10646" -> "specified by ..."

 - Italicize a defined term only once, at the point where the term
is defined.

 - "An implementation shall support UTF-8 files." Move to the end of
the paragraph.

 - "If the source file" -> "If a physical source file"

 - "If the source file is determined to be a UTF-8 file, it shall
represent a well-formed sequence of UTF-8 code units and the scalar
value of each source character shall be preserved."

This mixes two levels of "shall"s. The first is a requirement
on the file, the second is a requirement on the implementation.
Better to disentangle the two.

Also, I suggest dropping the second half of that sentence.
What would we lose? There is no permission elsewhere for the
implementation to mess with the contents of UTF-8 files,
so better not confuse Charlie. :-)

If we don't say that, we never say what happens to the content of the file.
And this is an important part of the paper.


Suggested rewrite of the entire paragraph:

The encoding scheme of a physical source file is determined
in an implementation-defined manner. An implementation shall
support (possibly among others) the UTF-8 encoding scheme.

If the encoding scheme of a physical source file is determined
to be UTF-8, the physical source file shall consist of a well-formed
sequence of UTF-8 code units as specified by ISO/IEC 10646.
The sequence of source file characters is the sequence of characters
encoded by the UTF-8 code units of the physical source file.

The term "character" is still completely vacuous. Worse, under the UCS definition, this would require
that the scalar values be assigned, which is not the intent. I think it would be great to avoid ambiguous terms wherever we can!
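To make the distinction above concrete, here is a minimal sketch of a decoder that checks only for well-formedness and yields scalar values, without consulting any table of assigned characters. The function name `decode_utf8` and its interface are assumptions made for illustration; nothing like it appears in the paper or the proposed wording.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical helper (not from P2295): decodes a byte sequence as UTF-8.
// Returns the Unicode scalar values, or std::nullopt if the input is not a
// well-formed UTF-8 code unit sequence. It deliberately does NOT check
// whether a scalar value is assigned to a character.
std::optional<std::vector<char32_t>> decode_utf8(std::string_view bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;   // scalar value being assembled
        std::size_t len = 0;
        char32_t min = 0;  // smallest value allowed for this length
        if (b0 < 0x80)                { cp = b0;        len = 1; min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min = 0x10000; }
        else return std::nullopt;  // stray continuation or invalid lead byte

        if (len > bytes.size() - i) return std::nullopt;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;  // not a continuation
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min) return std::nullopt;                      // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return std::nullopt;  // surrogate
        if (cp > 0x10FFFF) return std::nullopt;                 // out of range
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

With this sketch, the bytes `CD B8` decode to U+0378 even though that code point is, as far as I know, unassigned, while `C0 80` (overlong) and `ED A0 80` (a surrogate) are rejected as ill-formed.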




 - In the next paragraph, the parenthetical about new-lines should
use commas instead (it's normatively relevant).

 - "Any source file character not in the basic source character set is
replaced by the universal-character-name that designates that character."

is applicable to both cases, so should be in a separate paragraph.
But that's going to be fixed by my translation character set paper
anyway.
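
For readers unfamiliar with the quoted sentence, the replacement it describes can be sketched roughly as follows. This is an illustration only, assuming the 96-character basic source character set of C++20 and input that has already been decoded to scalar values; the names `is_basic_source_char` and `to_ucns` are hypothetical, and real implementations need not perform this step literally.

```cpp
#include <cstdio>
#include <string>
#include <string_view>

// Assumption: the basic source character set as listed in C++20 (96 characters).
bool is_basic_source_char(char32_t c) {
    static constexpr std::u32string_view basic =
        U" \t\v\f\n"
        U"abcdefghijklmnopqrstuvwxyz"
        U"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        U"0123456789"
        U"_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'";
    return basic.find(c) != std::u32string_view::npos;
}

// Hypothetical helper: replaces every character outside the basic source
// character set with a universal-character-name (\uXXXX or \UXXXXXXXX).
std::string to_ucns(std::u32string_view chars) {
    std::string out;
    for (char32_t c : chars) {
        if (is_basic_source_char(c)) {
            out += static_cast<char>(c);  // basic characters pass through
        } else {
            char buf[11];  // backslash, 'U', 8 hex digits, terminator
            if (c <= 0xFFFF)
                std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(c));
            else
                std::snprintf(buf, sizeof buf, "\\U%08X", static_cast<unsigned>(c));
            out += buf;
        }
    }
    return out;
}
```

The sketch applies the same replacement regardless of how the scalar values were obtained, which is the reason the sentence is said to apply to both cases.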


Jens

Received on 2021-05-06 03:26:51