Re: [isocpp-core] P2295 Support for UTF-8 as a portable source file encoding

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Wed, 22 Jun 2022 21:21:32 -0400
On Wed, Jun 22, 2022 at 8:12 AM William M. (Mike) Miller via Core <
core_at_[hidden]> wrote:

> On Wed, Jun 22, 2022 at 4:12 AM Corentin <corentin.jabot_at_[hidden]> wrote:
>> Hey folks.
>> Updated paper here: https://isocpp.org/files/papers/D2295R6.pdf
>> I applied the changes requested and would like to take a vote at the next
>> meeting on this wording.
>> I will be honest, I hesitated over whether to drop the paper, as I find it
>> unfortunate to conflate the encoding of text with the way the bytes are
>> stored physically, and I really do not think the standard
>> should be explicit about the specifics of any one storage method.
>> That being said, /physical source/input/ is a nice improvement. Win some,
>> lose some.
>> And ultimately, the important thing is that the intent of the paper be
>> standardized even if we can't completely agree on wording.
> I'm reasonably happy with the new wording. I'd still prefer changing
> "introducing" to "representing", or adding "if necessary" in the sentence
> about new-lines; if the input file already delimits lines with a new-line
> character, there's no "introducing" taking place. The existing wording
> appears to require that the "implementation-defined manner" must include
> deleting new-lines that appear in the input and "introducing" new-lines to
> replace them in the resulting logical source.
> With regard to the editor's note in the paper, I'm sympathetic to the
> concerns about record-oriented UTF-8 files. However, it seems to me that
> normatively supporting that category would require some changes in the
> specification of the UTF-8 case, since the current wording for UTF-8 files
> implies that there is an exact one-to-one correspondence between the code
> units of the input and the sequence of elements of the translation
> character set, which would likely not be the case for the record-oriented
> UTF-8 (because there would presumably not be new-line code units but only
> record boundaries in the input). Maybe it would be sufficient to change
> "are a sequence" to "contain a sequence" in the first paragraph and to
> change the second paragraph to "...sequence of UCS scalar values
> (introducing new-line characters, if necessary, for end-of-line indicators)
> that constitutes the sequence..."?

Implementations are free to support record-oriented source files containing
UTF-8 without treating them as "UTF-8 source files". Changing the definition
of a UTF-8 source file would harm the portability of source files, possibly
by encouraging non-trivial transfer mechanisms.
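
For illustration only, the distinction discussed in this thread can be sketched in code. This is not taken from P2295 or its wording: the type names `Utf8Stream` and `Utf8Records` and both functions are hypothetical, and decoding code units into UCS scalar values is left out to keep the sketch short. It assumes a stream-oriented file already carries its line ends as new-line code units, while a record-oriented file marks line ends only by record boundaries.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model: a stream-oriented file is one run of UTF-8 code units
// in which each end of line is already encoded as U+000A.
using Utf8Stream = std::string;

// Hypothetical model: a record-oriented file is a list of records; a record
// boundary, not a code unit, marks each end of line.
using Utf8Records = std::vector<std::string>;

// Stream case: the input code units pass through unchanged, so nothing is
// "introduced" and the correspondence with the input is one-to-one.
std::string logical_source_from_stream(const Utf8Stream& input) {
    return input;
}

// Record case: a new-line is introduced for every record boundary, so the
// result contains code units that were never present in the input.
std::string logical_source_from_records(const Utf8Records& records) {
    std::string out;
    for (const std::string& record : records) {
        out += record;
        out += '\n';
    }
    return out;
}

// Usage example: both inputs yield the same logical source, but only the
// stream input is identical to it code unit for code unit.
void check_examples() {
    const Utf8Stream stream = "int x;\nint y;\n";
    const Utf8Records records = {"int x;", "int y;"};
    assert(logical_source_from_stream(stream) == stream);
    assert(logical_source_from_records(records) == stream);
}
```

Under these assumptions, the record-oriented input cannot be described as being the sequence of code units of the logical source, which is the mismatch the quoted message points out.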

> _______________________________________________
> Core mailing list
> Core_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
> Link to this post: http://lists.isocpp.org/core/2022/06/12839.php

Received on 2022-06-23 01:22:02