Re: [isocpp-core] P2295 Support for UTF-8 as a portable source file encoding

From: William M. (Mike) Miller <william.m.miller_at_[hidden]>
Date: Thu, 9 Jun 2022 11:35:05 -0400
On Thu, Jun 9, 2022 at 11:30 AM William M. (Mike) Miller <
william.m.miller_at_[hidden]> wrote:

> On Thu, Jun 9, 2022 at 10:23 AM Corentin via Core <core_at_[hidden]>
> wrote:
>
>>
>> Hello folks,
>> We have not talked about P2295 for a while, but given that multiple
>> people have signaled to me they are interested in seeing progress,
>> I would like to see whether we can find a majority consensus on wording.
>> We have 2 options to choose from. I have a very strong preference for
>> option 1, which is a more direct description of reality ("a kind of source
>> file" as suggested by option 2 is a bit too vacuous for my taste).
>>
>> The last sentence of both wordings is extracted from P2348 - Whitespaces
>> Wording Revamp, as this avoids having to retain a note about "end of line
>> indicator" for the non-UTF-8 case, and a note saying there is no such "end
>> of line indicator" for the UTF-8 case. The term "end of line indicator" was
>> never defined, and because the mapping is implementation-defined, it is a
>> given that implementations can introduce whatever characters they like.
>>
>> I tweaked option 2 slightly from what was suggested by Mike/Hubert to
>> avoid repetition of the definition of a UTF-8 source file.
>>
>> It is important to me that, in addition to achieving the design goals of
>> P2295, the wording remains as clear as possible.
>>
>> Let me know what you think.
>>
>> Regards,
>>
>> Corentin
>>
>>
>> *Option 1*
>> A source file is a sequence of integers with an associated encoding
>> scheme that is determined in an implementation-defined manner.
>> An implementation shall support the UTF-8 encoding scheme, and may
>> support an implementation-defined set of additional encoding schemes.
>> If encoding schemes other than UTF-8 are supported, an implementation
>> shall provide a means by which the UTF-8 encoding scheme can be specified,
>> independent of the content of that source file. [Note: In other words,
>> recognizing the U+FEFF Byte Order Mark is not sufficient. --end note]
>>
>> If the encoding scheme of a source file is determined to be UTF-8, then
>> the source file shall be a well-formed UTF-8 code unit sequence. The source
>> file is decoded to produce a sequence of UCS scalar values that constitutes
>> the sequence of elements of the translation character set.
>>
>> For any other encoding scheme supported by the implementation, source
>> file characters are mapped, in an implementation-defined manner, to a
>> sequence of translation character set elements.
>>
>> *Option 2*
>>
>> An implementation shall support UTF-8 source files. It may also support
>> an implementation-defined set of other kinds of source files, and, if so,
>> it shall provide an implementation-defined means of designating a file as a
>> UTF-8 source file, independent of the content of that source file. [Note:
>> In other words, recognizing the U+FEFF Byte Order Mark is not sufficient.
>> --end note]
>>
>> If a source file is determined to be a UTF-8 source file, then it shall
>> be a well-formed UTF-8 code unit sequence and its content is decoded to
>> produce a sequence of UCS scalar values that constitutes the sequence of
>> elements of the translation character set.
>>
>> For any other kind of source file, characters are mapped, in an
>> implementation-defined manner, to a sequence of translation character set
>> elements.
>>
>
> I prefer option 2, but restoring some of the wording your tweak deleted
> from Hubert's suggestion <http://lists.isocpp.org/core/2022/03/12140.php>:
>

Sorry, I picked up the link from Hubert's citation rather than the link to
the message itself. The correct reference is:

    https://lists.isocpp.org/core/2022/03/12143.php
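
For readers following the quoted wording: both options require a UTF-8 source file to be "a well-formed UTF-8 code unit sequence" that is "decoded to produce a sequence of UCS scalar values". The following is only a rough sketch of what that requirement amounts to, not wording and not any implementation's actual behavior. The function name decode_utf8 and its interface are hypothetical, chosen for illustration. It rejects overlong forms, surrogates, and values above U+10FFFF, and it gives U+FEFF no special treatment, since the encoding is assumed to have been designated by some means other than the file's content.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical helper (not from P2295): decode a byte sequence as UTF-8.
// Returns the decoded scalar values, or std::nullopt if the input is not
// a well-formed UTF-8 code unit sequence.
std::optional<std::vector<char32_t>> decode_utf8(std::string_view bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const auto b0 = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;       // value being assembled
        std::size_t len = 0;   // number of code units in this sequence
        char32_t min = 0;      // smallest value allowed for this length
        if (b0 < 0x80)                { cp = b0;        len = 1; min = 0x0;     }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min = 0x80;    }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min = 0x800;   }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min = 0x10000; }
        else return std::nullopt;     // stray continuation or invalid lead byte

        if (bytes.size() - i < len) return std::nullopt;  // truncated sequence

        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;  // not a continuation
            cp = (cp << 6) | (b & 0x3F);
        }

        if (cp < min) return std::nullopt;                    // overlong encoding
        if (cp > 0x10FFFF) return std::nullopt;               // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return std::nullopt; // surrogate, not a scalar value

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Under either option, a file for which this kind of check fails would simply not satisfy the "shall be a well-formed UTF-8 code unit sequence" requirement; how that is diagnosed is not something this sketch tries to model.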

-- 
William M. (Mike) Miller | Edison Design Group
william.m.miller_at_[hidden]

Received on 2022-06-09 15:35:17