Subject: Wording for P2295 based on P2314
From: Peter Brett (pbrett_at_[hidden])
Date: 2021-06-08 11:49:30
In our most recent meeting on 2021-05-26, I was asked to reword the
unpublished D2295R4 "Support for UTF-8 as a portable source file
encoding" based on the most recent revision of P2314 "Character sets and
encodings" (currently R2).
[lex.phases] as modified by P2314:
> 1. Physical source file characters are mapped, in an
> implementation-defined manner, to the translation character set
> (introducing new-line characters for end-of-line indicators). The
> set of physical source file characters accepted is
[lex.charset] as modified by P2314:
> 1. The translation character set consists of the following elements:
> - each character named by ISO/IEC 10646, as identified by its unique
> UCS scalar value, and
> - a distinct character for each UCS scalar value where no named
> character is assigned
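
To make the "UCS scalar value" notion concrete, here is a minimal Python
sketch. The helper name is hypothetical and does not come from P2314; it
only assumes the usual definition of a scalar value as any code point in
U+0000..U+10FFFF outside the surrogate range U+D800..U+DFFF.

```python
def is_ucs_scalar_value(code_point: int) -> bool:
    """Return True if code_point is a UCS scalar value.

    Hypothetical helper for illustration: scalar values are the code
    points 0..0x10FFFF, excluding the surrogates 0xD800..0xDFFF.
    """
    in_codespace = 0 <= code_point <= 0x10FFFF
    is_surrogate = 0xD800 <= code_point <= 0xDFFF
    return in_codespace and not is_surrogate
```

Under this reading, every value for which the helper returns True would
correspond to exactly one element of the translation character set,
whether or not ISO/IEC 10646 assigns it a name.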
As I understand it, the design intent for P2295 is as follows:
- UTF-8 source files shall be supported.
- Users shall be able to specify that source files are to be assumed to
be UTF-8 encoded.
- Files that were assumed to be UTF-8 encoded but contained some non-UTF-8
content shall be ill-formed.
- The contents of UTF-8 source files shall be transmitted to phase 2 of
translation verbatim. There's no implementation freedom to mess with
My suggested approach for [lex.phases] is as follows. Let's take
advantage of the fact that P2314 defines the translation character set
as *exactly* the set of UCS scalar values to completely elide the
mapping step from phase 1 of translation when processing UTF-8 source
1. The encoding scheme of a physical source file is determined in an
implementation-defined manner. An implementation shall support
the UTF-8 encoding scheme. An implementation shall define a
mechanism for specifying that UTF-8 is the encoding scheme for a
physical source file.
If the encoding scheme of a physical source file is UTF-8, then
it shall be a well-formed sequence of translation character set
elements encoded as UTF-8 code units.
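
As a rough illustration of the UTF-8 case only, the following Python
sketch decodes a source file strictly and treats any decoding failure as
ill-formed. The names `IllFormedSourceError` and `phase1_utf8` are
hypothetical, and the sketch relies on Python's strict UTF-8 codec, which
rejects truncated sequences, overlong encodings, and encoded surrogates.
It is not meant to model how any particular compiler works.

```python
from typing import List


class IllFormedSourceError(Exception):
    """Hypothetical diagnostic: the file is not well-formed UTF-8."""


def phase1_utf8(source_bytes: bytes) -> List[int]:
    """Decode a UTF-8 source file into a list of UCS scalar values.

    No mapping step is applied: each decoded scalar value is passed
    through unchanged. Any ill-formed code unit sequence is an error.
    """
    try:
        text = source_bytes.decode("utf-8", errors="strict")
    except UnicodeDecodeError as err:
        raise IllFormedSourceError(
            f"invalid UTF-8 at byte offset {err.start}"
        ) from err
    return [ord(ch) for ch in text]
```

For example, `phase1_utf8(b"\xc3\xa9")` yields the single scalar value
U+00E9, while `phase1_utf8(b"\xff")` raises the hypothetical
`IllFormedSourceError`.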
If the encoding scheme of a physical source file is not UTF-8,
then physical source file characters are mapped, in an
implementation-defined manner, to the translation character set
(introducing new-line characters for end-of-line indicators).
The set of physical source file characters accepted is
2. If the first character is U+FEFF BYTE ORDER MARK, it is
What do you think?