Re: [isocpp-core] P2295 Support for UTF-8 as a portable source file encoding

From: William M. (Mike) Miller
Date: Wed, 22 Jun 2022 08:11:44 -0400
On Wed, Jun 22, 2022 at 4:12 AM Corentin <corentin.jabot_at_[hidden]> wrote:

> Hey folks.
> Updated paper here: https://isocpp.org/files/papers/D2295R6.pdf
> I applied the changes requested and would like to take a vote at the next
> meeting on this wording.
> I will be honest, I hesitated dropping the paper, as I find it unfortunate
> to conflate the encoding of text with the way the bytes are stored
> physically, and I really do not think the standard
> should be explicit about any one storage method specificity.
> That being said /physical source/input/ is a nice improvement. win some,
> lose some.
> And ultimately, the important thing is that the intent of the paper be
> standardized even if we can't completely agree on wording.

I'm reasonably happy with the new wording. I'd still prefer changing
"introducing" to "representing", or adding "if necessary" in the sentence
about new-lines; if the input file already delimits lines with a new-line
character, there's no "introducing" taking place. The existing wording
appears to require that the "implementation-defined manner" must include
deleting new-lines that appear in the input and "introducing" new-lines to
replace them in the resulting logical source.

With regard to the editor's note in the paper, I'm sympathetic to the
concerns about record-oriented UTF-8 files. However, it seems to me that
normatively supporting that category would require some changes in the
specification of the UTF-8 case, since the current wording for UTF-8 files
implies that there is an exact one-to-one correspondence between the code
units of the input and the sequence of elements of the translation
character set, which would likely not be the case for the record-oriented
UTF-8 (because there would presumably not be new-line code units but only
record boundaries in the input). Maybe it would be sufficient to change
"are a sequence" to "contain a sequence" in the first paragraph and to
change the second paragraph to "...sequence of UCS scalar values
(introducing new-line characters, if necessary, for end-of-line indicators)
that constitutes the sequence..."?
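
To make the distinction concrete, here is a minimal sketch of the two input shapes, not proposed wording or any implementation's actual behavior. The names decode_utf8, from_stream, and from_records are hypothetical, the decoder assumes well-formed UTF-8 and does no validation, and modelling a record-oriented file as a vector of records with no new-line code units is my assumption about how such a file would be presented.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decode UTF-8 code units into UCS scalar values.
// Assumes well-formed input; a real implementation must validate.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    for (std::size_t i = 0; i < bytes.size();) {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        // Sequence length is determined by the lead code unit.
        std::size_t len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
        // Keep only the payload bits of the lead code unit.
        char32_t cp = len == 1 ? b : b & (0xFF >> (len + 1));
        for (std::size_t k = 1; k < len && i + k < bytes.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Stream-oriented file: new-lines are already code units in the input,
// so the input "is" the sequence and nothing is introduced.
std::u32string from_stream(const std::string& bytes) {
    return decode_utf8(bytes);
}

// Record-oriented file (assumed model): each record is one line with no
// new-line code unit, so the input only "contains" the sequence and a
// new-line must be introduced at every record boundary.
std::u32string from_records(const std::vector<std::string>& records) {
    std::u32string out;
    for (const std::string& record : records) {
        out += decode_utf8(record);
        out.push_back(U'\n');  // introduced for the end-of-line indicator
    }
    return out;
}
```

Under these assumptions, from_stream("a\nb\n") and from_records({"a", "b"}) yield the same sequence of translation character set elements, but only the second call introduces anything. That is the case the "if necessary" qualifier is meant to cover.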

Received on 2022-06-22 12:11:56