sg16

Re: [isocpp-core] P2295 Support for UTF-8 as a portable source file encoding

From: William M. (Mike) Miller
Date: Fri, 10 Jun 2022 09:04:28 -0400

On Fri, Jun 10, 2022 at 4:02 AM Corentin via Core <core_at_[hidden]>
wrote:

> I'm concerned that this approach will be hard to understand by people who
> have not followed the discussions, on top of preexisting obfuscations (the
> translation set indirection).
>

I don't see what would be hard to understand here. If anything, I think
it's easier to understand than introducing the concept that a source file
consists of a sequence of integers; most people think of files as a
sequence of characters, and it requires a mental reset to introduce the
concept that "characters" are actually numbers so you can talk about
encoding.


> It's also very repetitive but maybe we can massage that a bit.
> Lastly, I really don't like the sentence "There are no end-of-line
> indicators apart from the content of the UTF-8 code unit sequence", which
> is more confusing than enlightening.
> It's also unfortunate that the utf-8-ness is tied to a medium rather than
> the content, and that we can't agree that source code is text, or that any
> textual data consumed by an implementation has an associated encoding.
>

It's not that we can't agree on those things; it's more that we can't agree
that the standard should require those things. We can leave those details
to implementation-defined behavior. As I see it, the intent of this
change is to require implementations to support input that is 1) a physical
source file that is 2) encoded as UTF-8. Requiring anything about input
that does not satisfy those two criteria is unnecessary.


> I'd also prefer using "input" in lieu of "physical source files", as we
> established physical source files may not be files nor be physical.
>
> That being said, as this wording seems to have more consensus, maybe we
> can go with some form of it; it achieves the intent of the paper.
>
> ---
> An implementation shall support
>

Need "physical" here, both to express what I think are the essential
criteria I mentioned and to match the restriction to "physical" source
files in the next paragraph.


> source files that are a sequence of UTF-8 code units (UTF-8 source files).
> It may also support an implementation-defined set of
> other kinds of source files, and, if so, the kind of a source file is
> determined in an implementation-defined manner which includes a means of
> designating a file as a UTF-8 source file, independent of the contents of
> the source files. [Note: In other words, recognizing the U+FEFF Byte Order
> Mark is not sufficient. --end note]
>
> If a physical source file is designated or otherwise determined to be a
> UTF-8 source file, then it shall be a well-formed UTF-8 code unit sequence
> and it is decoded to produce a sequence of UCS scalar values that
> constitutes the sequence of elements of the translation character set.
> For any other kind of physical source file supported by the
> implementation, characters are mapped, in an implementation-defined manner,
> to a sequence of translation character set elements.
> ---
>

Apart from that change, I'm fine with this formulation.
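
To make the second quoted paragraph concrete, here is a minimal sketch of what "shall be a well-formed UTF-8 code unit sequence and ... is decoded to produce a sequence of UCS scalar values" could look like in code. The function name decode_utf8 is hypothetical and is not taken from P2295 or from any implementation. The sketch assumes the input has already been designated as a UTF-8 source file by some other means, so it does not sniff for a byte order mark; a leading U+FEFF is simply decoded like any other scalar value.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical helper (not from the paper): decode a UTF-8 code unit
// sequence into Unicode scalar values. Returns std::nullopt if the
// sequence is ill-formed.
std::optional<std::vector<char32_t>> decode_utf8(std::string_view bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const auto b0 = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;       // scalar value being assembled
        std::size_t len = 0;   // number of code units in this sequence
        char32_t min = 0;      // smallest value allowed for this length
        if (b0 < 0x80) {
            cp = b0; len = 1; min = 0;
        } else if ((b0 & 0xE0) == 0xC0) {
            cp = b0 & 0x1F; len = 2; min = 0x80;
        } else if ((b0 & 0xF0) == 0xE0) {
            cp = b0 & 0x0F; len = 3; min = 0x800;
        } else if ((b0 & 0xF8) == 0xF0) {
            cp = b0 & 0x07; len = 4; min = 0x10000;
        } else {
            return std::nullopt;  // stray continuation or invalid lead byte
        }
        if (bytes.size() - i < len) {
            return std::nullopt;  // truncated sequence
        }
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) {
                return std::nullopt;  // expected a continuation byte
            }
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min) {
            return std::nullopt;  // overlong encoding
        }
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return std::nullopt;  // surrogates are not scalar values
        }
        if (cp > 0x10FFFF) {
            return std::nullopt;  // beyond the Unicode code space
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

With this sketch, decode_utf8("\xC3\xA9") yields the single scalar value U+00E9, while decode_utf8("\xC0\x80") is rejected as an overlong encoding. Other kinds of source files would take the separate implementation-defined mapping path described in the last quoted paragraph, which is not modeled here.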

Received on 2022-06-10 13:04:40