In our most recent meeting on 2021-05-26, you were asked to reword
his unpublished D2295R4 "Support for UTF-8 as a portable source file
encoding" based on the most recent revision of P2314 "Character sets and
encodings" (currently R2).
[lex.phases] as modified by P2314:
> 1. Physical source file characters are mapped, in an
> implementation-defined manner, to the translation character set
> (introducing new-line characters for end-of-line indicators). The
> set of physical source file characters accepted is
[lex.charset] as modified by P2314:
> 1. The translation character set consists of the following elements:
> - each character named by ISO/IEC 10646, as identified by its unique
> UCS scalar value, and
> - a distinct character for each UCS scalar value where no named
> character is assigned
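
For concreteness, here is a minimal C++17 sketch of membership in the translation character set as described above, assuming the usual definition of a UCS scalar value (any code point up to U+10FFFF except the surrogates U+D800 through U+DFFF). The name `is_ucs_scalar_value` is hypothetical and appears in neither paper.

```cpp
// Hypothetical helper, for illustration only: true if `c` is a UCS scalar
// value, i.e. a code point in [0, 0x10FFFF] that is not a surrogate.
constexpr bool is_ucs_scalar_value(char32_t c) {
    return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}

static_assert(is_ucs_scalar_value(U'A'));       // named character
static_assert(is_ucs_scalar_value(0x10FFFF));   // last scalar value
static_assert(!is_ucs_scalar_value(0xD800));    // surrogate, not a scalar value
static_assert(!is_ucs_scalar_value(0x110000));  // beyond the UCS code space
```

Named and unnamed scalar values are treated alike here, which matches the two bullets quoted above: both kinds are elements of the set.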
As I understand it, the design intent for P2295 is as follows:
- UTF-8 source files shall be supported.
- Users shall be able to specify that source files are to be assumed to
be UTF-8 encoded.
- Files that were assumed to be UTF-8 encoded but contained some non-UTF-8
content shall be ill-formed.
- The contents of UTF-8 source files shall be transmitted to phase 2 of
translation verbatim. There's no implementation freedom to mess with
My suggested approach for [lex.phases] is as follows. Let's take
advantage of the fact that P2314 defines the translation character set
as *exactly* the set of UCS scalar values to completely elide the
mapping step from phase 1 of translation when processing UTF-8 source
1. The encoding scheme of a physical source file is determined in an
implementation-defined manner. An implementation shall support
the UTF-8 encoding scheme. An implementation shall define a
mechanism for specifying that UTF-8 is the encoding scheme for a
physical source file.
If the encoding scheme of a physical source file is UTF-8, then
it shall be a well-formed sequence of translation character set
elements encoded as UTF-8 code units.
At the very least this should be "If the encoding scheme of a physical source file is *DETERMINED TO BE* UTF-8".
Not sure the rest makes sense as it just redefines UTF-8.
Thank you for not using the term character though :)
I am still unclear as to whether this wording is sufficient to prevent an implementation from doing a rewrite.
I will trust you that it is.
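
To make the "well-formed sequence of translation character set elements encoded as UTF-8 code units" requirement concrete, here is a minimal C++17 sketch of the check an implementation might perform on a file determined to be UTF-8. It is an illustration under assumptions, not proposed wording or any compiler's actual code. The name `is_well_formed_utf8` is hypothetical, and the rules it applies (shortest-form encoding, no surrogates, nothing above U+10FFFF) are the usual Unicode definition of well-formed UTF-8.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper, for illustration only: true if `bytes` is well-formed
// UTF-8, i.e. every sequence decodes to a UCS scalar value in shortest form.
constexpr bool is_well_formed_utf8(std::string_view bytes) {
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const auto b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;  // total code units in this sequence
        char32_t cp = 0;      // decoded value so far
        char32_t min = 0;     // smallest value allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0;        min = 0x0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
        else return false;              // stray continuation or invalid lead byte
        if (n - i < len) return false;  // truncated sequence at end of input
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min) return false;                      // overlong encoding
        if (cp > 0x10FFFF) return false;                 // outside the code space
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // encoded surrogate
        i += len;
    }
    return true;
}

static_assert(is_well_formed_utf8("int main() {}"));
static_assert(is_well_formed_utf8("\xC3\xA9"));       // U+00E9
static_assert(!is_well_formed_utf8("\xC0\x80"));      // overlong encoding of U+0000
static_assert(!is_well_formed_utf8("\xED\xA0\x80"));  // encoded surrogate U+D800
```

The sketch only accepts or rejects the input and never rewrites it, in line with the stated design intent that ill-formed content makes the file ill-formed.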
If the encoding scheme of a physical source file is not UTF-8,
then physical source file characters are mapped, in an
implementation-defined manner, to the translation character set
(introducing new-line characters for end-of-line indicators).
The set of physical source file characters accepted is
That last sentence doesn't mean anything.
We need to keep something along the lines of "An implementation shall support the UTF-8 encoding scheme. The set of additional encoding schemes is implementation-defined."
Or "The set of encoding schemes supported by the implementation is implementation-defined, but shall contain UTF-8". Or something like that.
2. If the first character is U+FEFF BYTE ORDER MARK, it is
What do you think?