Re: [isocpp-core] P2295 Support for UTF-8 as a portable source file encoding

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Thu, 9 Jun 2022 18:46:15 +0200
It seems that, with either option, a "UTF-8 source file" must use LF
line endings (because that's what a "new-line" character is, arguably).

Is that common for Windows UTF-8 environments (as opposed to using CR/LF)?

Otherwise, I think we need to keep some permission for mapping end-of-line
indicators even in the UTF-8 case.
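
To make the concern concrete, here is a minimal sketch of the kind of end-of-line mapping an implementation might apply before (or while) decoding. The function name `map_line_endings` is hypothetical, and the choice to fold only CR LF pairs into LF (leaving a lone CR untouched) is an assumption for illustration, not something either proposed wording specifies.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: folds each CR LF pair into a single LF.
// A lone CR is passed through unchanged (an assumption of this sketch).
std::string map_line_endings(std::string_view in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\r' && i + 1 < in.size() && in[i + 1] == '\n') {
            continue;  // drop the CR; the following LF is copied next iteration
        }
        out.push_back(in[i]);
    }
    return out;
}
```

Without some such permission, a CR LF file decoded strictly as UTF-8 would carry a U+000D before every new-line into the translation character set.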

Jens



On 09/06/2022 16.23, Corentin via Core wrote:
>
> Hello folks,
> We have not talked about P2295 for a while, but given that multiple people have signaled to me they are interested in seeing progress,
> I would like to see whether we can find a majority consensus on wording.
> We have two options to choose from. I have a very strong preference for option 1, which is a more direct description of reality ("a kind of source file", as suggested by option 2, is a bit too vacuous for my taste).
>
> The last sentence of both wordings is extracted from P2348 - Whitespaces Wording Revamp, as this avoids having to retain a note about "end of line indicator" for the non-UTF-8 case, and a note saying there is no such "end of line indicator" for the UTF-8 case. The term "end of line indicator" was never defined, and because the mapping is implementation-defined, it is a given that implementations can introduce whatever characters they like.
>
> I tweaked option 2 slightly from what was suggested by Mike/Hubert to avoid repetition of the definition of a UTF-8 source file.
>
> It is important to me that, in addition to achieving the design goals of P2295, the wording remains as clear as possible.
>
> Let me know what you think.
>
> Regards,
>
> Corentin
>
> _Option 1:_
>
> A source file is a sequence of integers with an associated encoding scheme that is determined in an implementation-defined manner.
> An implementation shall support the UTF-8 encoding scheme, and may support an implementation-defined set of additional encoding schemes.
> If encoding schemes other than UTF-8 are supported, an implementation shall provide a means by which the UTF-8 encoding scheme can be specified, independent of the content of that source file. [Note: In other words, recognizing the U+FEFF Byte Order Mark is not sufficient. --end note]
>
> If the encoding scheme of a source file is determined to be UTF-8, then the source file shall be a well-formed UTF-8 code unit sequence. The source file is decoded to produce a sequence of UCS scalar values that constitutes the sequence of elements of the translation character set.
>
> For any other encoding scheme supported by the implementation, source file characters are mapped, in an implementation-defined manner, to a sequence of translation character set elements.
>
> _Option 2:_
>
> An implementation shall support UTF-8 source files. It may also support an implementation-defined set of other kinds of source files, and, if so, it shall provide an implementation-defined means of designating a file as a UTF-8 source file, independent of the content of that source file. [Note: In other words, recognizing the U+FEFF Byte Order Mark is not sufficient. --end note]
>
> If a source file is determined to be a UTF-8 source file, then it shall be a well-formed UTF-8 code unit sequence and its content is decoded to produce a sequence of UCS scalar values that constitutes the sequence of elements of the translation character set.
>
> For any other kind of source file, characters are mapped, in an implementation-defined manner, to a sequence of translation character set elements.
>
> _______________________________________________
> Core mailing list
> Core_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
> Link to this post: http://lists.isocpp.org/core/2022/06/12669.php
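
Both options in the quoted message require a UTF-8 source file to be a well-formed UTF-8 code unit sequence that is decoded into UCS scalar values. A minimal sketch of that step follows, assuming the file has already been read into memory. The name `decode_utf8` and the `std::optional` return convention are hypothetical, and BOM handling and line-ending mapping are deliberately left out.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical helper: returns the decoded scalar values, or std::nullopt
// if the input is not a well-formed UTF-8 code unit sequence.
std::optional<std::vector<char32_t>> decode_utf8(std::string_view bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const auto b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        char32_t min = 0;  // smallest value allowed for this sequence length
        if (b0 < 0x80)                { len = 1; cp = b0;        min = 0x0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
        else return std::nullopt;  // stray continuation byte or invalid lead byte

        if (bytes.size() - i < len) return std::nullopt;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min) return std::nullopt;                    // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return std::nullopt; // surrogate, not a scalar value
        if (cp > 0x10FFFF) return std::nullopt;               // beyond the Unicode range
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

In this sketch an ill-formed file yields no result at all, mirroring the "shall be a well-formed UTF-8 code unit sequence" requirement; how a real implementation diagnoses the failure is outside its scope.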

Received on 2022-06-09 16:46:25