sg16

Re: [isocpp-core] P2295 Support for UTF-8 as a portable source file encoding

From: William M. (Mike) Miller
Date: Wed, 22 Jun 2022 08:11:44 -0400
On Wed, Jun 22, 2022 at 4:12 AM Corentin <corentin.jabot_at_[hidden]> wrote:

> Hey folks.
>
> Updated paper here: https://isocpp.org/files/papers/D2295R6.pdf
> I applied the changes requested and would like to take a vote at the next
> meeting on this wording.
>
> I will be honest, I hesitated over whether to drop the paper, as I find it
> unfortunate to conflate the encoding of text with the way the bytes are
> physically stored, and I really do not think the standard
> should be explicit about the specifics of any one storage method.
>
> That being said, /physical source/input/ is a nice improvement. Win some,
> lose some.
> And ultimately, the important thing is that the intent of the paper be
> standardized even if we can't completely agree on wording.
>

I'm reasonably happy with the new wording. I'd still prefer changing
"introducing" to "representing", or adding "if necessary" in the sentence
about new-lines; if the input file already delimits lines with a new-line
character, there's no "introducing" taking place. The existing wording
appears to require the "implementation-defined manner" to include
deleting new-lines that appear in the input and "introducing" new-lines to
replace them in the resulting logical source.

With regard to the editor's note in the paper, I'm sympathetic to the
concerns about record-oriented UTF-8 files. However, it seems to me that
normatively supporting that category would require some changes in the
specification of the UTF-8 case, since the current wording for UTF-8 files
implies that there is an exact one-to-one correspondence between the code
units of the input and the sequence of elements of the translation
character set, which would likely not be the case for record-oriented UTF-8
files (because there would presumably be no new-line code units but only
record boundaries in the input). Maybe it would be sufficient to change
"are a sequence" to "contain a sequence" in the first paragraph and to
change the second paragraph to "...sequence of UCS scalar values
(introducing new-line characters, if necessary, for end-of-line indicators)
that constitutes the sequence..."?
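
To make the distinction concrete, here is a rough sketch of the two cases.
It is illustrative only: the names append_scalars, from_stream and
from_records are hypothetical and appear neither in the paper nor in any
implementation, the decoder assumes well-formed UTF-8 and does no
validation, and modelling a record-oriented file as a vector of strings
(with a new-line introduced after every record, including the last) is an
assumption made for the example.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: decode UTF-8 code units into UCS scalar values.
// Assumes well-formed input; a real implementation would diagnose errors.
inline void append_scalars(std::string_view bytes, std::u32string& out) {
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        // Sequence length is determined by the lead code unit.
        std::size_t len = b0 < 0x80 ? 1 : b0 < 0xE0 ? 2 : b0 < 0xF0 ? 3 : 4;
        // Keep only the payload bits of the lead code unit.
        char32_t cp = len == 1 ? b0 : b0 & (0xFF >> (len + 1));
        for (std::size_t k = 1; k < len && i + k < bytes.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
}

// Stream-oriented input: new-lines are already code units in the file, so
// the scalar values correspond one-to-one with the decoded input and
// nothing is "introduced".
inline std::u32string from_stream(std::string_view bytes) {
    std::u32string out;
    append_scalars(bytes, out);
    return out;
}

// Record-oriented input (assumed model): the file has record boundaries
// but no new-line code units, so a new-line is introduced per record.
inline std::u32string from_records(const std::vector<std::string>& records) {
    std::u32string out;
    for (const std::string& record : records) {
        append_scalars(record, out);
        out.push_back(U'\n');  // end-of-line indicator becomes a new-line
    }
    return out;
}
```

Under these assumptions, from_stream("a\nb\n") and from_records({"a", "b"})
yield the same sequence of scalar values, but only the second one has to
introduce anything, which is why "if necessary" seems like the right
qualifier.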

Received on 2022-06-22 12:11:56