ISOCPP sg16 List: P2295 Support for UTF-8 as a portable source file encoding

From: Corentin <corentin.jabot_at_[hidden]>
Date: Thu, 9 Jun 2022 16:23:24 +0200

Hello folks,
We have not talked about P2295 for a while, but given that multiple people
have signaled to me they are interested in seeing progress,
I would like to see whether we can find a majority consensus on wording.
We have 2 options to choose from. I have a very strong preference for
option 1, which is a more direct description of reality ("a kind of source
file", as suggested by option 2, is a bit too vacuous for my taste).

The last sentence of both wordings is extracted from P2348 - Whitespaces
Wording Revamp, as this avoids having to retain a note about "end of line
indicator" for the non-UTF-8 case, and a note saying there is no such "end
of line indicator" for the UTF-8 case. The term "end of line indicator" was
never defined, and because the mapping is implementation-defined, it is a
given that implementations can introduce whatever characters they like.

I tweaked option 2 slightly from what was suggested by Mike/Huber to avoid
repetition of the definition of a UTF-8 source file.

It is important to me that, in addition to achieving the design goals of
P2295, the wording remains as clear as possible.

Let me know what you think.

Regards,

Corentin

*Option 1*
A source file is a sequence of integers with an associated encoding scheme
that is determined in an implementation-defined manner.
An implementation shall support the UTF-8 encoding scheme, and may support
an implementation-defined set of additional encoding schemes.
If encoding schemes other than UTF-8 are supported, an implementation shall
provide a means by which the UTF-8 encoding scheme can be specified,
independent of the content of that source file. [Note: In other words,
recognizing the U+FEFF Byte Order Mark is not sufficient. --end note]

If the encoding scheme of a source file is determined to be UTF-8, then the
source file shall be a well-formed UTF-8 code unit sequence. The source
file is decoded to produce a sequence of UCS scalar values that constitutes
the sequence of elements of the translation character set.

For any other encoding scheme supported by the implementation, source file
characters are mapped, in an implementation-defined manner, to a sequence
of translation character set elements.

*Option 2*

An implementation shall support UTF-8 source files. It may also support an
implementation-defined set of other kinds of source files, and, if so, it
shall provide an implementation-defined means of designating a file as a
UTF-8 source file, independent of the content of that source file. [Note:
In other words, recognizing the U+FEFF Byte Order Mark is not sufficient.
--end note]

If a source file is determined to be a UTF-8 source file, then it shall be
a well-formed UTF-8 code unit sequence and its content is decoded to
produce a sequence of UCS scalar values that constitutes the sequence of
elements of the translation character set.

For any other kind of source file, characters are mapped, in an
implementation-defined manner, to a sequence of translation character set
elements.
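
For illustration only (this is not part of either proposed wording): both
options require a UTF-8 source file to be a well-formed UTF-8 code unit
sequence that is decoded into a sequence of UCS scalar values. A minimal
sketch of what that check and decoding step could look like is below. The
function name decode_utf8 and its interface are hypothetical, and the sketch
does not attempt to model how an implementation treats a leading U+FEFF or
line endings.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical helper (not from P2295): decodes a UTF-8 code unit sequence
// into Unicode scalar values, or returns std::nullopt if it is ill-formed.
std::optional<std::vector<char32_t>> decode_utf8(std::string_view bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const auto b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;  // number of code units in this sequence
        char32_t cp = 0;      // value accumulated so far
        char32_t min = 0;     // smallest value allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0;        min = 0x0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
        else return std::nullopt;  // stray continuation or invalid lead byte

        if (bytes.size() - i < len) return std::nullopt;  // truncated sequence

        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;  // not a continuation
            cp = (cp << 6) | (b & 0x3F);
        }

        if (cp < min) return std::nullopt;                      // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return std::nullopt;  // surrogate
        if (cp > 0x10FFFF) return std::nullopt;                 // out of range

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Under this sketch, a std::nullopt result corresponds to a source file that
violates the "shall be a well-formed UTF-8 code unit sequence" requirement,
and a successful result is the sequence of scalar values that would become
the translation character set elements.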

Received on 2022-06-09 14:23:36