Re: [isocpp-core] P2295 Support for UTF-8 as a portable source file encoding

From: William M. (Mike) Miller
Date: Thu, 9 Jun 2022 11:30:46 -0400
On Thu, Jun 9, 2022 at 10:23 AM Corentin via Core <core_at_[hidden]>
wrote:

>
> Hello folks,
> We have not talked about P2295 for a while, but given that multiple people
> have signaled to me they are interested in seeing progress,
> I would like to see whether we can find a majority consensus on wording.
> We have 2 options to choose from, I have a very strong preference for
> option 1 which is a more direct description of reality ("a kind of source
> file" as suggested by option 2 is a bit too vacuous for my taste).
>
> The last sentence of both wordings is extracted from P2348 - Whitespaces
> Wording Revamp, as this avoids having to retain a note about "end of line
> indicator" for the non-UTF-8 case, and a note saying there are no such "end
> of line indicator" for the UTF-8 case. The term "end of line indicator" was
> never defined, and because the mapping is implementation defined, it is a
> given that implementations can introduce whatever characters they like.
>
> I tweaked option 2 slightly from what was suggested by Mike/Hubert to avoid
> repetition of the definition of a UTF-8 source file.
>
> It is important to me that, in addition to achieving the design goals of
> P2295, the wording remains as clear as possible.
>
> Let me know what you think.
>
> Regards,
>
> Corentin
>
>
> *Option 1*
> A source file is a sequence of integers with an associated encoding scheme
> that is determined in an implementation-defined manner.
> An implementation shall support the UTF-8 encoding scheme, and may support
> an implementation-defined set of additional encoding schemes.
> If encoding schemes other than UTF-8 are supported, an implementation
> shall provide a means by which the UTF-8 encoding scheme can be specified,
> independent of the content of that source file. [Note: In other words,
> recognizing the U+FEFF Byte Order Mark is not sufficient. --end note]
>
> If the encoding scheme of a source file is determined to be UTF-8, then
> the source file shall be a well-formed UTF-8 code unit sequence. The source
> file is decoded to produce a sequence of UCS scalar values that constitutes
> the sequence of elements of the translation character set.
>
> For any other encoding scheme supported by the implementation, source file
> characters are mapped, in an implementation-defined manner, to a sequence
> of translation character set elements.
>
> *Option 2*
>
> An implementation shall support UTF-8 source files. It may also support an
> implementation-defined set of other kinds of source files, and, if so, it
> shall provide an implementation-defined means of designating a file as a
> UTF-8 source file, independent of the content of that source file. [Note:
> In other words, recognizing the U+FEFF Byte Order Mark is not sufficient.
> --end note]
>
> If a source file is determined to be a UTF-8 source file, then it shall be
> a well-formed UTF-8 code unit sequence and its content is decoded to
> produce a sequence of UCS scalar values that constitutes the sequence of
> elements of the translation character set.
>
> For any other kind of source file, characters are mapped, in an
> implementation-defined manner, to a sequence of translation character set
> elements.
>

I prefer option 2, but restoring some of the wording your tweak deleted
from Hubert's suggestion <http://lists.isocpp.org/core/2022/03/12140.php>:

    An implementation shall support physical source files that are a
    sequence of UTF-8 code units.

If you want to be able to refer to such files as "UTF-8 source files," you
could add that term as a definition in that sentence:

    ...UTF-8 code units, called a *UTF-8 source file*.

I still feel that option 1 is overly specific in requiring all input files
to be sequences of integers. It's not something we need to specify, so we
shouldn't.
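
As an illustration only (not part of either proposed wording), the
requirement shared by both options, that a UTF-8 source file "shall be a
well-formed UTF-8 code unit sequence" that is "decoded to produce a sequence
of UCS scalar values", could be sketched roughly as below. The function name
decode_utf8 and its interface are hypothetical, and a real implementation
would report a diagnostic rather than return an empty optional.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical helper, not taken from P2295: treats `bytes` as UTF-8 code
// units and returns the decoded scalar values, or std::nullopt if the
// sequence is not well-formed UTF-8.
std::optional<std::vector<char32_t>> decode_utf8(std::string_view bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const auto b0 = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;    // scalar value being assembled
        std::size_t len = 0; // total code units in this sequence
        char32_t min = 0;    // smallest value allowed for this length
        if (b0 < 0x80)                { cp = b0;        len = 1; min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min = 0x10000; }
        else return std::nullopt; // stray continuation or invalid lead byte

        if (bytes.size() - i < len) return std::nullopt; // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt; // not a continuation
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min) return std::nullopt;                     // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return std::nullopt; // surrogate
        if (cp > 0x10FFFF) return std::nullopt;                // out of range
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Note that in this sketch a leading U+FEFF is decoded like any other scalar
value; whether the file is treated as UTF-8 at all is decided elsewhere, in
line with the note that recognizing the Byte Order Mark is not sufficient.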

-- 
William M. (Mike) Miller | Edison Design Group
william.m.miller_at_[hidden]

Received on 2022-06-09 15:30:58