On Fri, Jun 10, 2022 at 4:02 AM Corentin via Core <core@lists.isocpp.org> wrote:

I'm concerned that this approach will be hard to understand for people who have not followed the discussions, on top of preexisting obfuscations (the translation set indirection).

I don't see what would be hard to understand here. If anything, I think it's easier to understand than introducing the concept that a source file consists of a sequence of integers; most people think of files as a sequence of characters, and it requires a mental reset to introduce the concept that "characters" are actually numbers so you can talk about encoding.

It's also very repetitive but maybe we can massage that a bit.
Lastly, I really don't like the "There are no end-of-line indicators apart from the content of the UTF-8 code unit sequence", which is more confusing than enlightening.
It's also unfortunate that the UTF-8-ness is tied to a medium rather than the content, and that we can't agree that source code is text, or that any textual data consumed by an implementation has an associated encoding.

It's not that we can't agree on those things; it's more that we can't agree that the standard should require those things. We can leave those details to the latitude that implementation-defined behavior already allows. As I see it, the intent of this change is to require implementations to support input that is 1) a physical source file that is 2) encoded as UTF-8. Requiring anything about input that does not satisfy those two criteria is unnecessary.

I'd also prefer using "input" in lieu of "physical source files", as we established that physical source files may be neither files nor physical.

That being said, as this wording seems to have more consensus, maybe we can go with some form of it; it achieves the intent of the paper.

---
An implementation shall support

Need "physical" here, both to express what I think are the essential criteria I mentioned and to match the restriction to "physical" source files in the next paragraph.

source files that are a sequence of UTF-8 code units (UTF-8 source files). It may also support an implementation-defined set of
other kinds of source files, and, if so, the kind of a source file is determined in an implementation-defined manner which includes a means of designating a file as a UTF-8 source file, independent of the contents of the source files. [Note: In other words, recognizing the U+FEFF Byte Order Mark is not sufficient. --end note]

If a physical source file is designated or otherwise determined to be a UTF-8 source file, then it shall be a well-formed UTF-8 code unit sequence and it is decoded to produce a sequence of UCS scalar values that constitutes the sequence of elements of the translation character set.
For any other kind of physical source file supported by the implementation, characters are mapped, in an implementation-defined manner, to a sequence of translation character set elements.
---

Apart from that change, I'm fine with this formulation.
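
For readers who have not followed the discussion, here is a minimal sketch of what the decoding step in the second paragraph of the wording amounts to: a byte sequence either decodes to UCS scalar values or is rejected as ill-formed. The function name decode_utf8 and its signature are hypothetical, chosen only for illustration; nothing like it appears in the paper, and an implementation is free to structure this however it likes. The sketch does not treat U+FEFF specially and does not show how a file gets designated as UTF-8 in the first place.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper (not from the paper): decode a UTF-8 code unit
// sequence into UCS scalar values. Returns std::nullopt when the input
// is not a well-formed UTF-8 code unit sequence.
std::optional<std::u32string> decode_utf8(std::string_view in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;   // scalar value being assembled
        char32_t min = 0;  // smallest value allowed for this sequence length
        std::size_t len = 0;

        if (b0 < 0x80)                { cp = b0;        len = 1; min = 0x0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min = 0x10000; }
        else return std::nullopt;  // stray continuation byte or invalid lead byte

        if (in.size() - i < len) return std::nullopt;  // truncated sequence

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }

        if (cp < min) return std::nullopt;                      // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return std::nullopt;  // surrogate, not a scalar value
        if (cp > 0x10FFFF) return std::nullopt;                 // beyond the UCS range

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

The point of the sketch is only that "well-formed" is a property of the code unit sequence itself, so the "shall be well-formed" requirement can be checked without knowing anything else about where the input came from.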