I've merged the suggestions (add "physical", use the parenthetical for the non-UTF-8 case, use plural form for designating, have wider-scope implementation-defined wording for non-UTF-8 case that encompasses the permission from the parenthetical):

An implementation shall support physical source files that are a sequence of UTF-8 code units (UTF-8 source files). It may also support an implementation-defined set of other kinds of physical source files, and, if so, the kind of a physical source file is determined in an implementation-defined manner, which includes a means of designating physical source files as UTF-8 source files, independent of their content. [Note: In other words, recognizing the U+FEFF Byte Order Mark is not sufficient. --end note]

If a physical source file is designated or otherwise determined to be a UTF-8 source file, then it shall be a well-formed UTF-8 code unit sequence and it is decoded to produce a sequence of UCS scalar values that constitutes the sequence of elements of the translation character set. For any other kind of physical source file supported by the implementation, characters are mapped, in an implementation-defined manner, to a sequence of translation character set elements (introducing new-line characters for end-of-line indicators).
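For anyone who wants to see the UTF-8 branch of the second paragraph concretely, here is a rough sketch of what "shall be a well-formed UTF-8 code unit sequence and it is decoded to produce a sequence of UCS scalar values" could look like. The function name decode_utf8 and its interface are hypothetical, made up for illustration; this is not how any particular implementation does it, and it deliberately says nothing about the implementation-defined handling of other kinds of files.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical helper (not from the proposed wording): decodes `bytes` as
// UTF-8 and returns the UCS scalar values, or std::nullopt if the input is
// not a well-formed UTF-8 code unit sequence.
std::optional<std::vector<char32_t>> decode_utf8(std::string_view bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const auto b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;  // number of code units in this sequence
        char32_t cp = 0;      // scalar value being accumulated
        char32_t min = 0;     // smallest value allowed for this length
        if (b0 < 0x80) {
            len = 1; cp = b0; min = 0x0;
        } else if (b0 >= 0xC2 && b0 <= 0xDF) {
            len = 2; cp = static_cast<char32_t>(b0 & 0x1F); min = 0x80;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            len = 3; cp = static_cast<char32_t>(b0 & 0x0F); min = 0x800;
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            len = 4; cp = static_cast<char32_t>(b0 & 0x07); min = 0x10000;
        } else {
            return std::nullopt;  // stray continuation byte or invalid lead byte
        }
        if (bytes.size() - i < len) return std::nullopt;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;  // not a continuation byte
            cp = static_cast<char32_t>((cp << 6) | (b & 0x3F));
        }
        if (cp < min) return std::nullopt;                      // overlong encoding
        if (cp > 0x10FFFF) return std::nullopt;                 // beyond the UCS range
        if (cp >= 0xD800 && cp <= 0xDFFF) return std::nullopt;  // surrogate, not a scalar value
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Note that the sketch never looks at line structure: a new-line shows up only if the corresponding code units are in the input, which is the point of defining a UTF-8 source file purely as the sequence.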

On Fri, Jun 10, 2022 at 10:42 AM Hubert Tong <hubert.reinterpretcast@gmail.com> wrote:

On Fri, Jun 10, 2022 at 4:02 AM Corentin <corentin.jabot@gmail.com> wrote:
I'm concerned that this approach will be hard for people who have not followed the discussions to understand, on top of preexisting obfuscations (the translation character set indirection).
It's also very repetitive but maybe we can massage that a bit.
Lastly, I really don't like the "There are no end-of-line indicators apart from the content of the UTF-8 code unit sequence" sentence, which is more confusing than enlightening.

This is extremely relevant if you consider that "text" being a sequence of characters without structure is not the only way you can look at text.

It's also unfortunate that the UTF-8-ness is tied to a medium rather than the content, and that we can't agree that source code is text, or that any textual data consumed by an implementation has an associated encoding.

What we do not seem to agree on is whether or not "text" can be taken as structured by lines and such.

I truly am trying to convey the intent of the paper through to places where certain assumptions about the nature of text files do not match the native ones. If the wording does not include hooks to point out that certain paradigms are not meant to extend into the world of portable, UTF-8 source code, then we'll likely end up with "UTF-8 source code" that isn't portable. It would not be caused by "hostility" from any party, merely a failure of the wording to clarify the intent.

I guess the wording does cover that intent now. A UTF-8 source file is defined purely to be the sequence.