Hi Corentin,

Thank you for all the helpful feedback!

In the wording, I am attempting to draw a distinction between “what is the encoding scheme associated with the file” and “what does the file actually contain.” As an analogy, a C++ source file is still a C++ source file even when it contains syntax errors that prevent it compiling; for example, deleting the last ‘}’ in a C++ source file doesn’t stop it from being a C++ source file. As another analogy, the encoding scheme associated with a string literal is the literal encoding, but the string literal does not actually have to be valid with respect to that encoding scheme.

I think we need to explicitly state that there is a way for a user to tell the compiler that a source file is UTF-8. This is to make sure that implementations cannot have, “I’ll look at the file contents and guess,” as the only mechanism for determining the encoding. Several SG-16 participants have said that it is absolutely essential to have a way to tell the compiler, “No, I am totally convinced that I am giving you UTF-8 and I want you to produce an error if it isn’t.”
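To make the difference concrete, here is a minimal Python sketch contrasting a guess-only mechanism with an explicit one. It is purely illustrative and does not describe how any real compiler behaves; the function names `guess_and_decode` and `decode_declared_utf8`, and the choice of ISO-8859-1 as the fallback, are assumptions made up for this example.

```python
def guess_and_decode(data: bytes) -> str:
    """Hypothetical guess-only mechanism: try UTF-8, silently fall back."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: ISO-8859-1 is the fallback. Every byte sequence decodes
        # under it, so the user never finds out the file was not UTF-8.
        return data.decode("iso-8859-1")


def decode_declared_utf8(data: bytes) -> str:
    """Hypothetical explicit mechanism: the user said UTF-8, so reject anything else."""
    try:
        return data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as err:
        raise ValueError(f"source file is not well-formed UTF-8: {err}") from err


# In ISO-8859-1, U+00E9 is the single byte 0xE9. Followed by an ASCII quote,
# that byte is not a well-formed UTF-8 sequence.
latin1_source = "char c = '\u00e9';".encode("iso-8859-1")
```

With the guessing mechanism, `latin1_source` is accepted without any diagnostic; with the explicit mechanism, the same bytes produce an error, which is the behaviour the SG-16 participants were asking for.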

I’m going to tweak the wording to say that we ‘associate’ an encoding with the source file.

I’m then attempting to say that UTF-8 source files actually have to contain UTF-8, and also that there is absolutely no “mapping” involved; the contents of the source file are already ready for phase 2 (i.e. they are *already* in the translation character set).

Finally, I’ve left the wording w.r.t. “anything else” completely unchanged, so that it remains clear that implementations don’t have to change the EBCDIC/ISO-8859-1/Big5 path through phase 1 after this paper is applied.

I agree that this wording definitely contains more words than necessary and could eventually go on a diet, but I’m currently trying to be very clear rather than concise. I don’t mind using as much repetition and/or redundancy as necessary in order to be unambiguous.

Here’s a new proposed wording based on P2314, and I hope you think it is an improvement:

An encoding scheme is associated with a physical source file in an implementation-defined manner. An implementation shall support the UTF-8 encoding scheme. An implementation shall define a mechanism for specifying that UTF-8 is the encoding scheme associated with a physical source file.

If a physical source file’s associated encoding scheme is UTF-8, then it shall be a well-formed sequence of translation character set elements encoded as UTF-8 code units. [ Note 1: The result of phase 1 is the exact sequence of UCS scalar values present in the file, with no substitutions, modifications or corrections. — end note]

If a physical source file’s associated encoding scheme is not UTF-8, then physical source file characters are mapped, in an implementation-defined manner, to the translation character set (introducing new-line characters for end-of-line indicators). The set of physical source file characters accepted is implementation-defined.

If the first character is U+FEFF BYTE ORDER MARK, it is deleted. ...
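As a rough illustration of how I read the wording above, here is a minimal Python sketch of phase 1. It is a sketch under stated assumptions, not an implementation: the function name `phase1` and the `associated_encoding` parameter are hypothetical, Python's built-in codecs stand in for the implementation-defined mapping, end-of-line handling is omitted, and applying the U+FEFF deletion to both paths is my own reading.

```python
from typing import List

BOM = "\ufeff"  # U+FEFF BYTE ORDER MARK


def phase1(data: bytes, associated_encoding: str) -> List[int]:
    """Return the translation character set elements as UCS scalar values."""
    if associated_encoding.lower() in ("utf-8", "utf8"):
        # UTF-8 path: no mapping step. Strict decoding raises
        # UnicodeDecodeError on ill-formed input rather than substituting
        # or correcting anything.
        text = data.decode("utf-8", errors="strict")
    else:
        # Any other encoding scheme: implementation-defined mapping.
        # Assumption: a Python codec plays that role here.
        text = data.decode(associated_encoding)
    # If the first character is U+FEFF, it is deleted.
    if text.startswith(BOM):
        text = text[1:]
    return [ord(ch) for ch in text]
```

For example, `phase1(b"\xe9", "iso-8859-1")` yields the single scalar value U+00E9, whereas `phase1(b"\xe9", "utf-8")` raises an error because the lone byte 0xE9 is not well-formed UTF-8.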

I’m not sure we can cut this down without introducing ambiguities or removing important elements. If you’re still unhappy with this, then I guess we’re stuck. Maybe someone else can have a go.

Best wishes,

Peter

From: Corentin <corentin.jabot@gmail.com>
Sent: 14 June 2021 11:11
To: Peter Brett <pbrett@cadence.com>
Cc: SG16 <sg16@lists.isocpp.org>
Subject: Re: Wording for P2295 based on P2314

On Tue, Jun 8, 2021 at 6:49 PM Peter Brett <pbrett@cadence.com> wrote:

Hi Corentin,

In our most recent meeting on 2021-05-26, you were asked to reword
your unpublished D2295R4 "Support for UTF-8 as a portable source file
encoding" based on the most recent revision of P2314 "Character sets and
encodings" (currently R2).

[lex.phases] as modified by P2314:

> 1. Physical source file characters are mapped, in an
> implementation-defined manner, to the translation character set
> (introducing new-line characters for end-of-line indicators). The
> set of physical source file characters accepted is
> implementation-defined.

[lex.charset] as modified by P2314:

> 1. The translation character set consists of the following elements:
>
> - each character named by ISO/IEC 10646, as identified by its unique
> UCS scalar value, and
> - a distinct character for each UCS scalar value where no named
> character is assigned

As I understand it, the design intent for P2295 is as follows:

- UTF-8 source files shall be supported

- Users shall be able to specify that source files are to be assumed to
be UTF-8 encoded.

- Files that were assumed to be UTF-8 encoded but contained some non-UTF-8
content shall be ill-formed.

- The contents of UTF-8 source files shall be transmitted to phase 2 of
translation verbatim. There's no implementation freedom to mess with
it.

My suggested approach for [lex.phases] is as follows. Let's take
advantage of the fact that P2314 defines the translation character set
as *exactly* the set of UCS scalar values to completely elide the
mapping step from phase 1 of translation when processing UTF-8 source
files.

1. The encoding scheme of a physical source file is determined in an
implementation-defined manner. An implementation shall support
the UTF-8 encoding scheme. An implementation shall define a
mechanism for specifying that UTF-8 is the encoding scheme for a
physical source file.

If the encoding scheme of a physical source file is UTF-8, then
it shall be a well-formed sequence of translation character set
elements encoded as UTF-8 code units.

At the very least this should be "If the encoding scheme of a physical source file is *DETERMINED TO BE* UTF-8".

Not sure the rest makes sense as it just redefines UTF-8.

Thank you for not using the term character though :)

I am still unclear as to whether this wording is sufficient to prevent an implementation from rewriting the contents.

I will trust you that it is.

If the encoding scheme of a physical source file is not UTF-8,
then physical source file characters are mapped, in an
implementation-defined manner, to the translation character set
(introducing new-line characters for end-of-line indicators).
The set of physical source file characters accepted is
implementation-defined.

That last sentence doesn't mean anything.

We need to keep something along the lines of "An implementation shall support the UTF-8 encoding scheme. The set of additional encoding schemes is implementation-defined."

Or "The set of encoding schemes supported by the implementation is implementation-defined, but shall contain UTF-8". Or something like that.

2. If the first character is U+FEFF BYTE ORDER MARK, it is
deleted. ...

What do you think?

Best regards,

Peter