> The encoding scheme of a physical source file is determined in an implementation-defined manner. 
> An implementation shall define a mechanism for specifying that UTF-8 is the encoding scheme for a physical source file.

Can we possibly find a way to not say that twice?
An implementation-defined mechanism already implies that there is a documented mechanism, and I don't think we can protect against implementations for which that mechanism is arcane.

On Mon, Jun 14, 2021 at 12:11 PM Corentin <corentin.jabot@gmail.com> wrote:


On Tue, Jun 8, 2021 at 6:49 PM Peter Brett <pbrett@cadence.com> wrote:
Hi Corentin,

In our most recent meeting on 2021-05-26, you were asked to reword
your unpublished D2295R4 "Support for UTF-8 as a portable source file
encoding" based on the most recent revision of P2314 "Character sets and
encodings" (currently R2).

[lex.phases] as modified by P2314:

> 1. Physical source file characters are mapped, in an
>    implementation-defined manner, to the translation character set
>    (introducing new-line characters for end-of-line indicators).  The
>    set of physical source file characters accepted is
>    implementation-defined.

[lex.charset] as modified by P2314:

> 1. The translation character set consists of the following elements:
>
>    - each character named by ISO/IEC 10646, as identified by its unique
>      UCS scalar value, and
>    - a distinct character for each UCS scalar value where no named
>      character is assigned

As I understand it, the design intent for P2295 is as follows:

- UTF-8 source files shall be supported

- Users shall be able to specify that source files are to be assumed to
  be UTF-8 encoded.

- Files that were assumed to be UTF-8 encoded but contained some non-UTF-8
  content shall be ill-formed.

- The contents of UTF-8 source files shall be transmitted to phase 2 of
  translation verbatim.  There's no implementation freedom to mess with
  it.
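
To make the third bullet concrete, here is a minimal sketch of the check an
implementation might run on a file it has assumed to be UTF-8. The function
name is_well_formed_utf8 is hypothetical and is not taken from P2295 or from
any compiler; the byte ranges follow the Unicode Standard's table of
well-formed UTF-8 byte sequences. A false result would correspond to the
"ill-formed" case.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: true if `bytes` is a well-formed UTF-8 sequence.
// Rejects stray continuation bytes, truncated sequences, overlong forms,
// encoded surrogates (U+D800..U+DFFF) and anything above U+10FFFF.
bool is_well_formed_utf8(std::string_view bytes) {
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const auto b0 = static_cast<unsigned char>(bytes[i]);
        if (b0 <= 0x7F) { ++i; continue; }  // single-byte (ASCII) sequence

        std::size_t len = 0;                 // total length of this sequence
        unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the 2nd byte
        if (b0 >= 0xC2 && b0 <= 0xDF) { len = 2; }
        else if (b0 == 0xE0) { len = 3; lo = 0xA0; }  // excludes overlong forms
        else if (b0 >= 0xE1 && b0 <= 0xEF) {
            len = 3;
            if (b0 == 0xED) hi = 0x9F;                // excludes surrogates
        }
        else if (b0 == 0xF0) { len = 4; lo = 0x90; }  // excludes overlong forms
        else if (b0 >= 0xF1 && b0 <= 0xF4) {
            len = 4;
            if (b0 == 0xF4) hi = 0x8F;                // excludes > U+10FFFF
        }
        else return false;  // 0x80..0xC1 and 0xF5..0xFF never start a sequence

        if (n - i < len) return false;  // truncated sequence at end of input
        const auto b1 = static_cast<unsigned char>(bytes[i + 1]);
        if (b1 < lo || b1 > hi) return false;
        for (std::size_t k = 2; k < len; ++k) {
            const auto bk = static_cast<unsigned char>(bytes[i + k]);
            if (bk < 0x80 || bk > 0xBF) return false;
        }
        i += len;
    }
    return true;
}
```

Note that the sketch only accepts or rejects. It never alters the bytes,
which is in the spirit of the fourth bullet.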

My suggested approach for [lex.phases] is as follows.  Let's take
advantage of the fact that P2314 defines the translation character set
as *exactly* the set of UCS scalar values to completely elide the
mapping step from phase 1 of translation when processing UTF-8 source
files.
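
As an illustration of why the mapping step can be elided: if the translation
character set is exactly the set of UCS scalar values, then membership is a
simple range check on the decoded code point. The sketch below assumes that
reading of P2314; the name is_ucs_scalar_value is hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: a UCS scalar value is any code point in
// [0, 0x10FFFF] that is not a surrogate (0xD800..0xDFFF).
constexpr bool is_ucs_scalar_value(std::uint32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

static_assert(is_ucs_scalar_value(0x41));       // LATIN CAPITAL LETTER A
static_assert(!is_ucs_scalar_value(0xD800));    // surrogate, not a scalar value
static_assert(!is_ucs_scalar_value(0x110000));  // beyond the code space
```

Well-formed UTF-8 can only encode values for which this predicate holds, so
each decoded code point already is a translation character set element.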

    1. The encoding scheme of a physical source file is determined in an
       implementation-defined manner.  An implementation shall support
       the UTF-8 encoding scheme.  An implementation shall define a
       mechanism for specifying that UTF-8 is the encoding scheme for a
       physical source file.

       If the encoding scheme of a physical source file is UTF-8, then
       it shall be a well-formed sequence of translation character set
       elements encoded as UTF-8 code units.

At the very least this should be "If the encoding scheme of a physical source file is *DETERMINED TO BE* UTF-8".
Not sure the rest makes sense, as it just redefines UTF-8.
Thank you for not using the term character though :)

I am still unclear as to whether this wording is sufficient to prevent an implementation from rewriting the source.
I will trust you that it is.
 

       If the encoding scheme of a physical source file is not UTF-8,
       then physical source file characters are mapped, in an
       implementation-defined manner, to the translation character set
       (introducing new-line characters for end-of-line indicators).
       The set of physical source file characters accepted is
       implementation-defined.

That last sentence doesn't mean anything.
We need to keep something along the lines of "An implementation shall support the UTF-8 encoding scheme. The set of additional encoding schemes is implementation-defined."
Or "The set of encoding schemes supported by the implementation is implementation-defined, but shall contain UTF-8", or something like that.
 

    2. If the first character is U+FEFF BYTE ORDER MARK, it is
       deleted. ...
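
For a UTF-8 source file, U+FEFF is encoded as the three bytes EF BB BF, so
the deletion in the first sentence of step 2 could look roughly like the
sketch below. The name strip_utf8_bom is hypothetical, and the sketch covers
only that first sentence, not the elided remainder of the wording.

```cpp
#include <string_view>

// Hypothetical helper: drops one leading U+FEFF BYTE ORDER MARK
// (UTF-8 bytes EF BB BF) if present; otherwise returns the view unchanged.
inline std::string_view strip_utf8_bom(std::string_view src) {
    const std::string_view bom = "\xEF\xBB\xBF";
    if (src.substr(0, bom.size()) == bom) {
        src.remove_prefix(bom.size());
    }
    return src;
}
```

Only a BOM at the very start is removed. A U+FEFF appearing later in the file
is left alone.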

What do you think?

Best regards,

                        Peter