sg16

Re: [isocpp-core] P2295 Support for UTF-8 as a portable source file encoding

From: Daniela Engert <dani_at_[hidden]>
Date: Fri, 10 Jun 2022 07:03:23 +0200
On 09.06.2022 at 18:46, Jens Maurer via Core wrote:
> It seems that, with either option, a "UTF-8 source file" must use LF
> line endings (because that's what a "new-line" character is, arguably).
>
> Is that common for Windows UTF-8 environments (as opposed to using CR/LF)?
To answer your question: no, it is not. By default (i.e. as written out
by editors), ends of lines are designated by a CR+LF pair, just as they
always have been. The whole end-of-line business is unaffected by the
mode an environment is told to operate in (and you can have many
different environments at the same time).
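
To illustrate the kind of end-of-line mapping under discussion, here is a minimal sketch of what an implementation might do before treating the input as a sequence of lines. The function name `normalize_newlines` is hypothetical, and mapping a lone CR to LF is an assumption of this sketch, not something the proposed wording requires.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not from P2295): maps each CR+LF pair to a single LF.
// Assumption of this sketch: a lone CR is also mapped to LF.
std::string normalize_newlines(std::string_view in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\r') {
            // Consume the LF of a CR+LF pair so the pair yields one new-line.
            if (i + 1 < in.size() && in[i + 1] == '\n') {
                ++i;
            }
            out.push_back('\n');
        } else {
            out.push_back(in[i]);
        }
    }
    return out;
}
```

Because CR and LF are single code units in UTF-8, a mapping like this works on the raw bytes and does not depend on the rest of the file's content.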
>
> Otherwise, I think we need to keep some permission for mapping end-of-line
> indicators even in the UTF-8 case.
>
> Jens
>
>
>
> On 09/06/2022 16.23, Corentin via Core wrote:
>> Hello folks,
>> We have not talked about P2295 for a while, but given that multiple people have signaled to me they are interested in seeing progress,
>> I would like to see whether we can find a majority consensus on wording.
>> We have two options to choose from. I have a very strong preference for option 1, which is a more direct description of reality ("a kind of source file", as suggested by option 2, is a bit too vacuous for my taste).
>>
>> The last sentence of both wordings is extracted from P2348 - Whitespaces Wording Revamp, as this avoids having to retain a note about "end of line indicator" for the non-UTF-8 case, and a note saying there is no such "end of line indicator" for the UTF-8 case. The term "end of line indicator" was never defined, and because the mapping is implementation-defined, it is a given that implementations can introduce whatever characters they like.
>>
>> I tweaked option 2 slightly from what was suggested by Mike/Hubert to avoid repetition of the definition of a UTF-8 source file.
>>
>> It is important to me that, in addition to achieving the design goals of P2295, the wording remains as clear as possible.
>>
>> Let me know what you think.
>>
>> Regards,
>>
>> Corentin
>>
>> _Option 1:_
>> A source file is a sequence of integers with an associated encoding scheme that is determined in an implementation-defined manner.
>> An implementation shall support the UTF-8 encoding scheme, and may support an implementation-defined set of additional encoding schemes.
>> If encoding schemes other than UTF-8 are supported, an implementation shall provide a means by which the UTF-8 encoding scheme can be specified, independent of the content of that source file. [Note: In other words, recognizing the U+FEFF Byte Order Mark is not sufficient. --end note]
>>
>> If the encoding scheme of a source file is determined to be UTF-8, then the source file shall be a well-formed UTF-8 code unit sequence. The source file is decoded to produce a sequence of UCS scalar values that constitutes the sequence of elements of the translation character set.
>>
>> For any other encoding scheme supported by the implementation, source file characters are mapped, in an implementation-defined manner, to a sequence of translation character set elements.
>>
>> _Option 2:_
>>
>> An implementation shall support UTF-8 source files. It may also support an implementation-defined set of other kinds of source files, and, if so, it shall provide an implementation-defined means of designating a file as a UTF-8 source file, independent of the content of that source file. [Note: In other words, recognizing the U+FEFF Byte Order Mark is not sufficient. --end note].
>>
>> If a source file is determined to be a UTF-8 source file, then it shall be a well-formed UTF-8 code unit sequence and its content is decoded to produce a sequence of UCS scalar values that constitutes the sequence of elements of the translation character set.
>>
>> For any other kind of source file, characters are mapped, in an implementation-defined manner, to a sequence of translation character set elements.
>>
>> _______________________________________________
>> Core mailing list
>> Core_at_[hidden]
>> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
>> Link to this post: http://lists.isocpp.org/core/2022/06/12669.php
> _______________________________________________
> Link to this post: http://lists.isocpp.org/core/2022/06/12677.php
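
Both quoted options require that a UTF-8 source file be a well-formed UTF-8 code unit sequence, decoded to a sequence of UCS scalar values. As a rough sketch of what that check involves (the name `decode_utf8` and the choice of `std::optional<std::u32string>` as the result type are assumptions for illustration, not part of either wording):

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: returns the decoded scalar values, or std::nullopt
// if the input is not a well-formed UTF-8 code unit sequence.
std::optional<std::u32string> decode_utf8(std::string_view bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const auto b0 = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;   // scalar value being assembled
        std::size_t len = 0; // total code units in this sequence
        char32_t min = 0;  // smallest value allowed for this length
        if (b0 < 0x80)                { cp = b0;        len = 1; min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min = 0x10000; }
        else return std::nullopt; // stray continuation byte or invalid lead byte

        if (bytes.size() - i < len) return std::nullopt; // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt; // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min) return std::nullopt;                     // overlong encoding
        if (cp > 0x10FFFF) return std::nullopt;                // beyond the code space
        if (cp >= 0xD800 && cp <= 0xDFFF) return std::nullopt; // surrogate, not a scalar value
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

This sketch does not treat a leading U+FEFF specially and performs no end-of-line mapping. It only shows the well-formedness check and the decoding step.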

Received on 2022-06-10 05:03:28