sg16: Re: [SG16] Wording for P2295 based on P2314

From: Corentin <corentin.jabot_at_[hidden]>
Date: Mon, 14 Jun 2021 14:22:20 +0200

Proposed wording

The encoding scheme of a physical source file is determined in an
implementation-defined manner. An implementation shall provide a mechanism to
determine the encoding of a source file that is independent of its content.

An implementation shall support the UTF-8 encoding scheme. The set of
additional encodings supported by an implementation is implementation-defined.

If the encoding scheme of a physical source file is determined to be UTF-8,
then the physical source file shall be a well-formed UTF-8 sequence
representing elements of the translation character set.
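
For illustration only (not part of the wording): a minimal sketch of the
well-formedness check this paragraph implies, assuming the usual Unicode
definition of well-formed UTF-8 (no overlong forms, no surrogates, nothing
above U+10FFFF). The name is_well_formed_utf8 is hypothetical and not taken
from any implementation.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: returns true if `bytes` is a well-formed UTF-8
// sequence (no overlong forms, no surrogates, nothing above U+10FFFF).
bool is_well_formed_utf8(std::string_view bytes) {
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        if (b0 < 0x80) { ++i; continue; }  // single-byte (ASCII) sequence

        std::size_t len = 0;   // total length of this encoded sequence
        char32_t cp = 0;       // scalar value being decoded
        char32_t min = 0;      // smallest value allowed for this length
        if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (n - i < len) return false;  // sequence truncated at end of file
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min) return false;                      // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode
        i += len;
    }
    return true;
}
```

Under the proposed wording, a source file that is determined to be UTF-8 but
fails a check like this one would violate the "shall" requirement, rather
than being silently repaired or re-interpreted.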

For any other encoding scheme supported by the implementation, physical
source file characters are mapped, in an implementation-defined manner, to the
translation character set (introducing new-line characters for end-of-line
indicators). The set of physical source file characters accepted is
implementation-defined.

An implementation may use any internal encoding, so long as an actual extended
character encountered in the source file, and the same extended character
expressed in the source file as a universal-character-name (e.g., using the
\uXXXX notation), are handled equivalently except where this replacement is
reverted in a raw string literal.
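
A small sketch of what "handled equivalently" means in practice, assuming the
compiler reads this snippet as UTF-8 (the default for GCC and Clang; MSVC
needs /utf-8). The variable names are hypothetical.

```cpp
#include <string_view>

// An extended character written directly in the source...
constexpr std::u32string_view literal_char = U"é";
// ...and the same character spelled as a universal-character-name.
constexpr std::u32string_view escaped_char = U"\u00E9";
static_assert(literal_char == escaped_char);

// Inside a raw string literal the universal-character-name is not
// interpreted: the six characters \ u 0 0 E 9 are kept as written.
constexpr std::string_view raw = R"(\u00E9)";
static_assert(raw.size() == 6);
```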

2. If the first character is U+FEFF BYTE ORDER MARK, it is deleted. [...]
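
As a sketch of that step, assuming phase 1 has already produced a sequence of
translation character set elements (modelled here as char32_t values), the
deletion of a leading U+FEFF could look like this. The name drop_leading_bom
is hypothetical.

```cpp
#include <string_view>

// Hypothetical helper: drops one leading U+FEFF, if present, from a sequence
// of translation character set elements. Later occurrences are untouched.
std::u32string_view drop_leading_bom(std::u32string_view chars) {
    if (!chars.empty() && chars.front() == U'\uFEFF') {
        chars.remove_prefix(1);
    }
    return chars;
}
```

Only the first character is examined, so a U+FEFF anywhere else in the file
is passed through to later phases unchanged.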

On Mon, Jun 14, 2021 at 12:53 PM Corentin <corentin.jabot_at_[hidden]> wrote:

>
>
> On Mon, Jun 14, 2021 at 12:44 PM Peter Brett <pbrett_at_[hidden]> wrote:
>
>> Hi Corentin,
>>
>>
>>
>> Thank you for all the helpful feedback!
>>
>>
>>
>> In the wording, I am attempting to draw a distinction between “what is
>> the encoding scheme associated with the file” and “what does the file
>> actually contain.” As an analogy, a C++ source file is still a C++ source
>> file even when it contains syntax errors that prevent it compiling; for
>> example, deleting the last ‘}’ in a C++ source file doesn’t stop it from
>> being a C++ source file. As another analogy, the encoding scheme associated
>> with a string literal is the literal encoding, but the string literal does
>> not actually have to be valid with respect to that encoding scheme.
>>
>>
>>
>> I think we need to explicitly state that there is a way for a user to
>> tell the compiler that a source file is UTF-8. This is to make sure that
>> implementations cannot have, “I’ll look at the file contents and guess,” as
>> the only mechanism for determining the encoding. Several SG-16 participants
>> have said that it is absolutely essential to have a way to tell the
>> compiler, “No, I am totally convinced that I am giving you UTF-8 and I want
>> you to produce an error if it isn’t.”
>>
>
> How about saying that then?
>
> The encoding scheme of a source file is determined in an
> implementation-defined manner. An implementation shall provide a mechanism
> to determine the encoding of a source file that is independent of its
> content.
>
>
>>
>> I’m going to tweak the wording to say that we ‘associate’ an encoding
>> with the source file.
>>
>>
>>
>> I’m then attempting to say that UTF-8 source files actually have to
>> contain UTF-8, and also that there is absolutely no “mapping” involved; the
>> contents of the source files are already ready for phase 2 (i.e. they are
>> *already* in the translation character set).
>>
>>
>>
>> Finally, I’ve left the wording w.r.t. “anything else” completely
>> unchanged, so that it remains clear that implementations don’t have to
>> change the EBCDIC/ISO-8859-1/Big5 path through phase 1 after this paper is
>> applied.
>>
>>
>>
>> I agree that this wording definitely contains more words than necessary
>> and could eventually go on a diet, but I’m currently trying to be very
>> clear rather than concise. I don’t mind using as much repetition and/or
>> redundancy as necessary in order to be unambiguous.
>>
>>
>>
>> Here’s a new proposed wording based on P2314, and I hope you think it is
>> an improvement:
>>
>>
>>
>> 1. An encoding scheme is associated with a physical source file in an
>> implementation-defined manner. An implementation shall support the UTF-8
>> encoding scheme. An implementation shall define a mechanism for specifying
>> that UTF-8 is the encoding scheme associated with a physical source file.
>>
>>
>>
>> If a physical source file’s associated encoding scheme is UTF-8, then it
>> shall be a well-formed sequence of translation character set elements
>> encoded as UTF-8 code units. [ *Note 1*: The result of phase 1 is the
>> exact sequence of UCS scalar values present in the file, with no
>> substitutions, modifications or corrections. — *end note*]
>>
>>
>>
>> If a physical source file’s associated encoding scheme is not UTF-8, then
>> physical source file characters are mapped, in an implementation-defined
>> manner, to the translation character set (introducing new-line characters
>> for end-of-line indicators). The set of physical source file characters
>> accepted is implementation-defined.
>>
>>
>>
>> 2. If the first character is U+FEFF BYTE ORDER MARK, it is deleted.
>> ...
>>
>>
>>
>> I’m not sure we can cut this down without introducing ambiguities or
>> removing important elements. If you’re still unhappy with this, then I
>> guess we’re stuck. Maybe someone else can have a go.
>>
>>
>>
>> Best wishes,
>>
>>
>>
>> Peter
>>
>>
>>
>>
>>
>>
>>
>> *From:* Corentin <corentin.jabot_at_[hidden]>
>> *Sent:* 14 June 2021 11:11
>> *To:* Peter Brett <pbrett_at_[hidden]>
>> *Cc:* SG16 <sg16_at_[hidden]>
>> *Subject:* Re: Wording for P2295 based on P2314
>>
>>
>>
>> On Tue, Jun 8, 2021 at 6:49 PM Peter Brett <pbrett_at_[hidden]> wrote:
>>
>> Hi Corentin,
>>
>> In our most recent meeting on 2021-05-26, you were asked to reword
>> your unpublished D2295R4 "Support for UTF-8 as a portable source file
>> encoding" based on the most recent revision of P2314 "Character sets and
>> encodings" (currently R2).
>>
>> [lex.phases] as modified by P2314:
>>
>> > 1. Physical source file characters are mapped, in an
>> > implementation-defined manner, to the translation character set
>> > (introducing new-line characters for end-of-line indicators). The
>> > set of physical source file characters accepted is
>> > implementation-defined.
>>
>> [lex.charset] as modified by P2314:
>>
>> > 1. The translation character set consists of the following elements:
>> >
>> > - each character named by ISO/IEC 10646, as identified by its unique
>> > UCS scalar value, and
>> > - a distinct character for each UCS scalar value where no named
>> > character is assigned
>>
>> As I understand it, the design intent for P2295 is as follows:
>>
>> - UTF-8 source files shall be supported
>>
>> - Users shall be able to specify that source files are to be assumed to
>> be UTF-8 encoded.
>>
>> - Files that were assumed to be UTF-8 encoded but contained some non-UTF-8
>> content shall be ill-formed.
>>
>> - The contents of UTF-8 source files shall be transmitted to phase 2 of
>> translation verbatim. There's no implementation freedom to mess with
>> it.
>>
>> My suggested approach for [lex.phases] is as follows. Let's take
>> advantage of the fact that P2314 defines the translation character set
>> as *exactly* the set of UCS scalar values to completely elide the
>> mapping step from phase 1 of translation when processing UTF-8 source
>> files.
>>
>> 1. The encoding scheme of a physical source file is determined in an
>> implementation-defined manner. An implementation shall support
>> the UTF-8 encoding scheme. An implementation shall define a
>> mechanism for specifying that UTF-8 is the encoding scheme for a
>> physical source file.
>>
>> If the encoding scheme of a physical source file is UTF-8, then
>> it shall be a well-formed sequence of translation character set
>> elements encoded as UTF-8 code units.
>>
>>
>>
>> At the very least this should be "If the encoding scheme of a physical
>> source file is *DETERMINED TO BE* UTF-8".
>>
>> Not sure the rest makes sense as it just redefines UTF-8.
>>
>> Thank you for not using the term character though :)
>>
>>
>>
>> I am still unclear as to whether this wording is sufficient to prevent an
>> implementation from doing a rewrite.
>>
>> I will trust you that it is.
>>
>>
>>
>>
>> If the encoding scheme of a physical source file is not UTF-8,
>> then physical source file characters are mapped, in an
>> implementation-defined manner, to the translation character set
>> (introducing new-line characters for end-of-line indicators).
>> The set of physical source file characters accepted is
>> implementation-defined.
>>
>>
>>
>> That last sentence doesn't mean anything.
>>
>> We need to keep something along the lines of "An implementation shall
>> support the UTF-8 encoding scheme. The set of additional encoding schemes
>> is implementation-defined."
>>
>> Or "The set of encoding schemes supported by the implementation is
>> implementation-defined, but shall contain UTF-8". Or something like that.
>>
>>
>>
>>
>> 2. If the first character is U+FEFF BYTE ORDER MARK, it is
>> deleted. ...
>>
>> What do you think?
>>
>> Best regards,
>>
>> Peter
>>
>>

Received on 2021-06-14 07:22:35