SG16

Subject: Re: Wording for P2295 based on P2314
From: Corentin (corentin.jabot_at_[hidden])
Date: 2021-06-14 05:53:59


On Mon, Jun 14, 2021 at 12:44 PM Peter Brett <pbrett_at_[hidden]> wrote:

> Hi Corentin,
>
>
>
> Thank you for all the helpful feedback!
>
>
>
> In the wording, I am attempting to draw a distinction between “what is the
> encoding scheme associated with the file” and “what does the file actually
> contain.” As an analogy, a C++ source file is still a C++ source file even
> when it contains syntax errors that prevent it compiling; for example,
> deleting the last ‘}’ in a C++ source file doesn’t stop it from being a C++
> source file. As another analogy, the encoding scheme associated with a
> string literal is the literal encoding, but the string literal does not
> actually have to be valid with respect to that encoding scheme.
>
>
>
> I think we need to explicitly state that there is a way for a user to tell
> the compiler that a source file is UTF-8. This is to make sure that
> implementations cannot have, “I’ll look at the file contents and guess,” as
> the only mechanism for determining the encoding. Several SG-16 participants
> have said that it is absolutely essential to have a way to tell the
> compiler, “No, I am totally convinced that I am giving you UTF-8 and I want
> you to produce an error if it isn’t.”
>

How about saying that then?

The encoding scheme of a source file is determined in an
implementation-defined manner. An implementation shall provide a mechanism
to determine the encoding of a source file that is independent of its
content.
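
As a rough illustration of what "independent of its content" could mean in
practice, here is a minimal C++ sketch. The names `SourceEncoding` and
`associate_encoding` are hypothetical and are not taken from P2295 or from any
compiler; `explicit_choice` stands in for whatever the user told the
implementation (for example, a command-line option). The BOM check in the
fallback path is only an assumed example of a content-based guess.

```cpp
#include <optional>
#include <string_view>

// Hypothetical set of outcomes; a real implementation may support many more.
enum class SourceEncoding { Utf8, ImplementationDefault };

// Associates an encoding scheme with a source file.
// If the user stated the encoding explicitly, that statement wins and the
// bytes are never consulted. Guessing from content is only a fallback.
SourceEncoding associate_encoding(std::optional<SourceEncoding> explicit_choice,
                                  std::string_view file_bytes) {
    if (explicit_choice) {
        return *explicit_choice;  // independent of the file's content
    }
    // Assumed fallback heuristic: a leading UTF-8 byte order mark.
    if (file_bytes.substr(0, 3) == "\xEF\xBB\xBF") {
        return SourceEncoding::Utf8;
    }
    return SourceEncoding::ImplementationDefault;
}
```

With this shape, a file that the user declared to be UTF-8 stays associated
with UTF-8 even when its bytes are not valid UTF-8, which is what lets a later
step diagnose it as ill-formed rather than silently picking another encoding.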

>
> I’m going to tweak the wording to say that we ‘associate’ an encoding with
> the source file.
>
>
>
> I’m then attempting to say that UTF-8 source files actually have to
> contain UTF-8, and also that there is absolutely no “mapping” involved; the
> contents of the source file are already ready for phase 2 (i.e. they are
> **already** in the translation character set).
>
>
>
> Finally, I’ve left the wording w.r.t. “anything else” completely
> unchanged, so that it remains clear that implementations don’t have to
> change the EBCDIC/ISO-8859-1/Big5 path through phase 1 after this paper is
> applied.
>
>
>
> I agree that this wording definitely contains more words than necessary
> and could eventually go on a diet, but I’m currently trying to be very
> clear rather than concise. I don’t mind using as much repetition and/or
> redundancy as necessary in order to be unambiguous.
>
>
>
> Here’s a new proposed wording based on P2314, and I hope you think it is
> an improvement:
>
>
>
> 1. An encoding scheme is associated with a physical source file in an
> implementation-defined manner. An implementation shall support the UTF-8
> encoding scheme. An implementation shall define a mechanism for specifying
> that UTF-8 is the encoding scheme associated with a physical source file.
>
>
>
> If a physical source file’s associated encoding scheme is UTF-8, then it
> shall be a well-formed sequence of translation character set elements
> encoded as UTF-8 code units. [ *Note 1*: The result of phase 1 is the
> exact sequence of UCS scalar values present in the file, with no
> substitutions, modifications or corrections. — *end note*]
>
>
>
> If a physical source file’s associated encoding scheme is not UTF-8, then
> physical source file characters are mapped, in an implementation-defined
> manner, to the translation character set (introducing new-line characters
> for end-of-line indicators). The set of physical source file characters
> accepted is implementation-defined.
>
>
>
> 2. If the first character is U+FEFF BYTE ORDER MARK, it is deleted. ...
>
>
>
> I’m not sure we can cut this down without introducing ambiguities or
> removing important elements. If you’re still unhappy with this, then I
> guess we’re stuck. Maybe someone else can have a go.
>
>
>
> Best wishes,
>
>
>
> Peter
>
>
>
>
>
>
>
> *From:* Corentin <corentin.jabot_at_[hidden]>
> *Sent:* 14 June 2021 11:11
> *To:* Peter Brett <pbrett_at_[hidden]>
> *Cc:* SG16 <sg16_at_[hidden]>
> *Subject:* Re: Wording for P2295 based on P2314
>
>
>
> On Tue, Jun 8, 2021 at 6:49 PM Peter Brett <pbrett_at_[hidden]> wrote:
>
> Hi Corentin,
>
> In our most recent meeting on 2021-05-26, I was asked to reword
> your unpublished D2295R4 "Support for UTF-8 as a portable source file
> encoding" based on the most recent revision of P2314 "Character sets and
> encodings" (currently R2).
>
> [lex.phases] as modified by P2314:
>
> > 1. Physical source file characters are mapped, in an
> > implementation-defined manner, to the translation character set
> > (introducing new-line characters for end-of-line indicators). The
> > set of physical source file characters accepted is
> > implementation-defined.
>
> [lex.charset] as modified by P2314:
>
> > 1. The translation character set consists of the following elements:
> >
> > - each character named by ISO/IEC 10646, as identified by its unique
> > UCS scalar value, and
> > - a distinct character for each UCS scalar value where no named
> > character is assigned
>
> As I understand it, the design intent for P2295 is as follows:
>
> - UTF-8 source files shall be supported
>
> - Users shall be able to specify that source files are to be assumed to
> be UTF-8 encoded.
>
> - Files that were assumed to be UTF-8 encoded but contained some non-UTF-8
> content shall be ill-formed.
>
> - The contents of UTF-8 source files shall be transmitted to phase 2 of
> translation verbatim. There's no implementation freedom to mess with
> it.
>
> My suggested approach for [lex.phases] is as follows. Let's take
> advantage of the fact that P2314 defines the translation character set
> as *exactly* the set of UCS scalar values to completely elide the
> mapping step from phase 1 of translation when processing UTF-8 source
> files.
>
> 1. The encoding scheme of a physical source file is determined in an
> implementation-defined manner. An implementation shall support
> the UTF-8 encoding scheme. An implementation shall define a
> mechanism for specifying that UTF-8 is the encoding scheme for a
> physical source file.
>
> If the encoding scheme of a physical source file is UTF-8, then
> it shall be a well-formed sequence of translation character set
> elements encoded as UTF-8 code units.
>
>
>
> At the very least this should be "If the encoding scheme of a physical
> source file is *DETERMINED TO BE* UTF-8".
>
> Not sure the rest makes sense as it just redefines UTF-8.
>
> Thank you for not using the term character though :)
>
>
>
> I am still unclear as to whether this wording is sufficient to prevent an
> implementation from doing a rewrite.
>
> I will trust you that it is.
>
>
>
>
> If the encoding scheme of a physical source file is not UTF-8,
> then physical source file characters are mapped, in an
> implementation-defined manner, to the translation character set
> (introducing new-line characters for end-of-line indicators).
> The set of physical source file characters accepted is
> implementation-defined.
>
>
>
> That last sentence doesn't mean anything.
>
> We need to keep something along the lines of "An implementation shall
> support the UTF-8 encoding scheme. The set of additional encoding schemes
> is implementation-defined."
>
> Or "The set of encoding schemes supported by the implementation is
> implementation-defined, but shall contain UTF-8". Or something like that.
>
>
>
>
> 2. If the first character is U+FEFF BYTE ORDER MARK, it is
> deleted. ...
>
> What do you think?
>
> Best regards,
>
> Peter
>
>

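The design intent quoted above (ill-formed UTF-8 is rejected, and well-formed
UTF-8 reaches phase 2 verbatim) can be sketched in C++ roughly as follows.
This is only an illustration under stated assumptions: `decode_utf8_strict` is
a hypothetical helper, not wording or code from P2295, and it ignores the
end-of-line handling that phase 1 also performs. It follows the usual rules
for well-formed UTF-8 (no overlong forms, no surrogates, nothing above
U+10FFFF) and reports failure instead of substituting U+FFFD.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical helper: decodes `bytes` as UTF-8 into UCS scalar values.
// Returns std::nullopt for any ill-formed sequence; nothing is substituted,
// modified or corrected. A leading U+FEFF is dropped afterwards.
std::optional<std::vector<char32_t>> decode_utf8_strict(std::string_view bytes) {
    std::vector<char32_t> out;
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const auto b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;   // total bytes in this sequence
        char32_t cp = 0;       // scalar value being assembled
        char32_t min = 0;      // smallest value allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0;        min = 0x0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
        else return std::nullopt;     // stray continuation or invalid lead byte

        if (n - i < len) return std::nullopt;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;  // not a continuation
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min) return std::nullopt;                     // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return std::nullopt; // surrogate
        if (cp > 0x10FFFF) return std::nullopt;                // out of range
        out.push_back(cp);
        i += len;
    }
    // Corresponds to the "U+FEFF BYTE ORDER MARK is deleted" step.
    if (!out.empty() && out.front() == 0xFEFF) out.erase(out.begin());
    return out;
}
```

A caller would treat `std::nullopt` as "the program is ill-formed" and would
otherwise hand the returned scalar values to phase 2 unchanged.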

