Re: [isocpp-core] P2295 Support for UTF-8 as a portable source file encoding

From: William M. (Mike) Miller
Date: Wed, 22 Jun 2022 21:36:19 -0400
On Wed, Jun 22, 2022 at 9:22 PM Hubert Tong <
hubert.reinterpretcast_at_[hidden]> wrote:

> On Wed, Jun 22, 2022 at 8:12 AM William M. (Mike) Miller via Core <
> core_at_[hidden]> wrote:
>> On Wed, Jun 22, 2022 at 4:12 AM Corentin <corentin.jabot_at_[hidden]>
>> wrote:
>>> Hey folks.
>>> Updated paper here: https://isocpp.org/files/papers/D2295R6.pdf
>>> I applied the changes requested and would like to take a vote at the
>>> next meeting on this wording.
>>> I will be honest, I hesitated dropping the paper, as I find it
>>> unfortunate to conflate the encoding of text with the way the bytes are
>>> stored physically, and I really do not think the standard
>>> should be explicit about any one storage method specificity.
>>> That being said /physical source/input/ is a nice improvement. win some,
>>> lose some.
>>> And ultimately, the important thing is that the intent of the paper be
>>> standardized even if we can't completely agree on wording.
>> I'm reasonably happy with the new wording. I'd still prefer changing
>> "introducing" to "representing", or adding "if necessary" in the sentence
>> about new-lines; if the input file already delimits lines with a new-line
>> character, there's no "introducing" taking place. The existing wording
>> appears to require that the "implementation-defined manner" must include
>> deleting new-lines that appear in the input and "introducing" new-lines to
>> replace them in the resulting logical source.
>> With regard to the editor's note in the paper, I'm sympathetic to the
>> concerns about record-oriented UTF-8 files. However, it seems to me that
>> normatively supporting that category would require some changes in the
>> specification of the UTF-8 case, since the current wording for UTF-8 files
>> implies that there is an exact one-to-one correspondence between the code
>> units of the input and the sequence of elements of the translation
>> character set, which would likely not be the case for the record-oriented
>> UTF-8 (because there would presumably not be new-line code units but only
>> record boundaries in the input). Maybe it would be sufficient to change
>> "are a sequence" to "contain a sequence" in the first paragraph and to
>> change the second paragraph to "...sequence of UCS scalar values
>> (introducing new-line characters, if necessary, for end-of-line indicators)
>> that constitutes the sequence..."?
> Implementations are free to support record-oriented source files
> containing UTF-8 as being non-"UTF-8 source file"s. Changing the definition
> of UTF-8 source file will harm portability of source files by possibly
> encouraging non-trivial transfer mechanisms.
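
To make the distinction in the quoted discussion concrete, here is a minimal sketch of the two cases. It assumes a toy model in which a stream-oriented UTF-8 input is a byte string that already contains its new-line code units, and a record-oriented input is a list of records with no new-line code units at all. The names from_stream and from_records are hypothetical, chosen only for illustration; nothing here is taken from the paper's wording or from any implementation.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: a stream-oriented UTF-8 input already carries its
// new-line code units, so the resulting sequence corresponds one-to-one
// with the input. Nothing is "introduced".
std::string from_stream(const std::string& bytes) {
    return bytes;
}

// Hypothetical helper: a record-oriented input has only record boundaries,
// so a new-line character is introduced at the end of each record.
// (Assumption: each record holds the UTF-8 code units of one source line.)
std::string from_records(const std::vector<std::string>& records) {
    std::string out;
    for (const auto& record : records) {
        out += record;
        out += '\n';  // introduced: no such code unit exists in the input
    }
    return out;
}
```

Under these assumptions, from_records({"int x;", "int y;"}) and from_stream("int x;\nint y;\n") produce the same sequence, but only the stream case has a one-to-one correspondence between input code units and output elements.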

I'm somewhat sympathetic to Corentin's concern that lumping non-stream
UTF-8-encoded files into the "implementation-defined" category circumvents
some of the things we'd like to say about UTF-8-encoded content, but I
don't feel strongly about it and am happy enough with the proposal as it is.

William M. (Mike) Miller | Edison Design Group

Received on 2022-06-23 01:36:31