Re: [isocpp-core] P2295 Support for UTF-8 as a portable source file encoding

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Thu, 9 Jun 2022 19:07:19 -0400

I also prefer the direction of Option 2. I share Mike's concerns that
"UTF-8 source file" comes out of nowhere in Option 2 as presented.

Additionally, I previously gave feedback that there was no requirement out
of SG16 for there to be an ability to individually designate files as UTF-8
source files (as opposed to having a mode where all source files are
considered UTF-8 source files).

Corentin, if you have concrete objections to the following, please express
them:
An implementation shall support physical source files that are a sequence
of UTF-8 code units. It may also support an implementation-defined set of
other kinds of source files, and, if so, the kind of a source file is
determined in an implementation-defined manner which includes a means of
causing the determination to interpret files as sequences of UTF-8 code
units, independent of the contents of the source files. [Note: In other
words, recognizing the U+FEFF Byte Order Mark is not sufficient. --end note]

If a physical source file is determined to consist of a sequence of UTF-8
code units, then it shall be a well-formed UTF-8 code unit sequence. The
source file is decoded to produce a sequence of UCS scalar values that
constitutes the sequence of elements of the translation character set.
[Note: There are no end-of-line indicators apart from the content of the
UTF-8 code unit sequence. --end note]

For any other kind of physical source file supported by the implementation,
characters are mapped, in an implementation-defined manner, to a sequence
of translation character set elements.
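
To make the intent of the wording above concrete, here is a minimal sketch, assuming C++17. The names `SourceKind`, `determine_kind`, and `decode_utf8` are hypothetical and exist only for illustration; they are not part of P2295 or of any implementation. The first function shows a determination rule in which an explicit setting wins regardless of file contents, with the BOM as only a hint. The second rejects ill-formed UTF-8 and otherwise yields the sequence of UCS scalar values.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

enum class SourceKind { Utf8, Other };

// Hypothetical determination rule: an explicit user setting (for example a
// command-line option) selects UTF-8 independent of the file contents.
// Without it, this sketch falls back to looking for an encoded U+FEFF.
SourceKind determine_kind(bool force_utf8, std::string_view bytes) {
    if (force_utf8) return SourceKind::Utf8;
    bool has_bom = bytes.substr(0, 3) == "\xEF\xBB\xBF";
    return has_bom ? SourceKind::Utf8 : SourceKind::Other;
}

// Decodes `bytes` as UTF-8. Returns the scalar values, or std::nullopt if
// the input is not a well-formed UTF-8 code unit sequence.
std::optional<std::vector<char32_t>> decode_utf8(std::string_view bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        char32_t min_cp = 0;  // smallest value allowed for this length
        std::size_t len = 0;
        if (b0 < 0x80) {
            cp = b0; len = 1; min_cp = 0;
        } else if (b0 >= 0xC2 && b0 <= 0xDF) {
            cp = b0 & 0x1F; len = 2; min_cp = 0x80;
        } else if ((b0 & 0xF0) == 0xE0) {
            cp = b0 & 0x0F; len = 3; min_cp = 0x800;
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            cp = b0 & 0x07; len = 4; min_cp = 0x10000;
        } else {
            return std::nullopt;  // stray continuation byte, C0, C1, or F5..FF
        }
        if (bytes.size() - i < len) return std::nullopt;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;  // not a continuation
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp) return std::nullopt;                     // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return std::nullopt;    // surrogate
        if (cp > 0x10FFFF) return std::nullopt;                   // out of range
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

In this sketch, CR and LF come out as ordinary scalar values like any others, which is the point of the note about end-of-line indicators. A leading U+FEFF is likewise just decoded here; whether it is then dropped is outside what the sketch covers.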

On Thu, Jun 9, 2022 at 10:23 AM Corentin via Core <core_at_[hidden]>
wrote:

>
> Hello folks,
> We have not talked about P2295 for a while, but given that multiple people
> have signaled to me they are interested in seeing progress,
> I would like to see whether we can find a majority consensus on wording.
> We have 2 options to choose from; I have a very strong preference for
> option 1, which is a more direct description of reality ("a kind of source
> file" as suggested by option 2 is a bit too vacuous for my taste).
>
> The last sentence of both wordings is extracted from P2348 - Whitespaces
> Wording Revamp, as this avoids having to retain a note about "end of line
> indicator" for the non-UTF-8 case, and a note saying there is no such "end
> of line indicator" for the UTF-8 case. The term "end of line indicator" was
> never defined, and because the mapping is implementation-defined, it is a
> given that implementations can introduce whatever characters they like.
>
> I tweaked option 2 slightly from what was suggested by Mike/Hubert to avoid
> repetition of the definition of a UTF-8 source file.
>
> It is important to me that, in addition to achieving the design goals of
> P2295, the wording remains as clear as possible.
>
> Let me know what you think.
>
> Regards,
>
> Corentin
>
>
> *Option 1*
> A source file is a sequence of integers with an associated encoding scheme
> that is determined in an implementation-defined manner.
> An implementation shall support the UTF-8 encoding scheme, and may support
> an implementation-defined set of additional encoding schemes.
> If encoding schemes other than UTF-8 are supported, an implementation
> shall provide a means by which the UTF-8 encoding scheme can be specified,
> independent of the content of that source file. [Note: In other words,
> recognizing the U+FEFF Byte Order Mark is not sufficient. --end note]
>
> If the encoding scheme of a source file is determined to be UTF-8, then
> the source file shall be a well-formed UTF-8 code unit sequence. The source
> file is decoded to produce a sequence of UCS scalar values that constitutes
> the sequence of elements of the translation character set.
>
> For any other encoding scheme supported by the implementation, source file
> characters are mapped, in an implementation-defined manner, to a sequence
> of translation character set elements.
>
> *Option 2*
>
> An implementation shall support UTF-8 source files. It may also support an
> implementation-defined set of other kinds of source files, and, if so, it
> shall provide an implementation-defined means of designating a file as a
> UTF-8 source file, independent of the content of that source file. [Note:
> In other words, recognizing the U+FEFF Byte Order Mark is not sufficient.
> --end note]
>
> If a source file is determined to be a UTF-8 source file, then it shall be
> a well-formed UTF-8 code unit sequence and its content is decoded to
> produce a sequence of UCS scalar values that constitutes the sequence of
> elements of the translation character set.
>
> For any other kind of source file, characters are mapped, in an
> implementation-defined manner, to a sequence of translation character set
> elements.
> _______________________________________________
> Core mailing list
> Core_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
> Link to this post: http://lists.isocpp.org/core/2022/06/12669.php
>

Received on 2022-06-09 23:07:48