Re: [SG16] P2295R3 Support for UTF-8 as a portable source file encoding

From: Corentin <corentin.jabot_at_[hidden]>
Date: Thu, 6 May 2021 10:48:09 +0200
On Thu, May 6, 2021 at 10:26 AM Peter Brett <pbrett_at_[hidden]> wrote:

> Hi Corentin,
>
>
>
> This looks good to me. Please consider adding explicit notation in the
> wording diff to indicate where paragraph breaks have been added relative to
> the working draft.
>
Done


> Peter
>
>
>
> *From:* SG16 <sg16-bounces_at_[hidden]> *On Behalf Of *Corentin via
> SG16
> *Sent:* 06 May 2021 09:23
> *To:* Jens Maurer <Jens.Maurer_at_[hidden]>
> *Cc:* Corentin <corentin.jabot_at_[hidden]>; SG16 <sg16_at_[hidden]>
> *Subject:* Re: [SG16] P2295R3 Support for UTF-8 as a portable source file
> encoding
>
>
>
> Thanks for your feedback!
>
> New draft https://isocpp.org/files/papers/D2295R4.pdf
>
>
>
> On Fri, Apr 30, 2021 at 9:07 AM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>
> On 29/04/2021 09.34, Corentin via SG16 wrote:
> > Per request in yesterday's meeting,
> > here is P2295R3 Support for UTF-8 as a portable source file encoding.
> >
> > I am looking forward to your feedback
> >
> > http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p2295r3.pdf
> >
>
>
> - I'm not in favor of replacing "physical source file" with something else.
> This is pre-existing terminology and the prose section of the paper does
> not mention any issues with the term. Also, I don't remember any issues
> with the specific term having been voiced.
>
> - The text should use "physical source file" consistently and not
> abbreviate it to "source file" on occasion.
>
> - "is a source file encoded with ..."
> Does that mean there is a source file, and it's encoded as part of
> the process? Maybe "whose encoding scheme is UTF-8..."
>
> - "defined in ISO/IEC 10646" -> "specified by ..."
>
> - Italicize a defined term only once, at the point where the term
> is defined.
>
> - "An implementation shall support UTF-8 files." Move to the end of
> the paragraph.
>
> - "If the source file" -> "If a physical source file"
>
> - "If the source file is determined to be a UTF-8 file, it shall
> represent a well-formed sequence of UTF-8 code units and the scalar
> value of each source character shall be preserved."
>
> This mixes two levels of "shall"s. The first is a requirement
> on the file, the second is a requirement on the implementation.
> Better disentangle the two.
>
> Also, I suggest dropping the second half of that sentence.
> What would we lose? There is no permission elsewhere for the
> implementation to mess with the contents of UTF-8 files,
> so better not confuse Charlie. :-)
>
>
>
> If we don't say that, we never say what happens to the content of the file.
>
> And this is an important part of the paper.
>
>
>
> Suggested rewrite of the entire paragraph:
>
> The encoding scheme of a physical source file is determined
> in an implementation-defined manner. An implementation shall
> support (possibly among others) the UTF-8 encoding scheme.
>
> If the encoding scheme of a physical source file is determined
> to be UTF-8, the physical source file shall consist of a well-formed
> sequence of UTF-8 code units as specified by ISO/IEC 10646.
> The sequence of source file characters is the sequence of characters
> encoded by the UTF-8 code units of the physical source file.
>
>
>
> The term "character" is still completely vacuous. Worse, under the UCS
> definition, this puts a requirement that the scalar values be assigned,
> which is not the intent. I think it would be great to avoid ambiguous
> terms wherever we can!
>
>
>
>
>
>
>
> - In the next paragraph, the parenthetical about new-lines should
> use commas instead (it's normatively relevant).
>
> - "Any source file character not in the basic source character set is
> replaced by the universal-character-name that designates that character."
>
> is applicable to both cases, so should be in a separate paragraph.
> But that's going to be fixed by my translation character set paper
> anyway.
>
>
> Jens
>
>
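
The "well-formed sequence of UTF-8 code units" requirement discussed above can be illustrated with a minimal C++ sketch. The function name decode_utf8 is hypothetical and comes neither from P2295 nor from any compiler; it only shows one way a check of this kind might look. It rejects overlong encodings, surrogates, truncated sequences, and values above U+10FFFF, and it deliberately does not check whether a scalar value is assigned, in line with the intent stated in the reply above.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical helper (not from P2295): decodes `bytes` as UTF-8 and returns
// the Unicode scalar values, or std::nullopt if the sequence is ill-formed.
std::optional<std::vector<char32_t>> decode_utf8(std::string_view bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const auto b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;  // total bytes in this encoded sequence
        char32_t cp = 0;      // scalar value being assembled
        char32_t min = 0;     // smallest value allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0;        min = 0x0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
        else return std::nullopt;  // stray continuation byte or invalid lead

        if (bytes.size() - i < len) return std::nullopt;  // truncated sequence

        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;  // not a continuation
            cp = (cp << 6) | (b & 0x3F);
        }

        if (cp < min) return std::nullopt;                      // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return std::nullopt;  // surrogate
        if (cp > 0x10FFFF) return std::nullopt;                 // out of range

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

For example, decode_utf8("\xC3\xA9") yields the single scalar value U+00E9, while decode_utf8("\xC0\xAF") yields std::nullopt because C0 AF is an overlong encoding. Under the suggested wording, a source file containing the latter would simply not meet the requirement on UTF-8 files.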

Received on 2021-05-06 03:48:22