On Sun, Jul 11, 2021 at 9:09 PM Hubert Tong <hubert.reinterpretcast@gmail.com> wrote:
On Sun, Jul 11, 2021 at 12:56 PM Corentin Jabot <corentinjabot@gmail.com> wrote:

In the third paragraph of phase 1:
[ ... ], then the physical source file shall be a well-formed UTF-8 sequence.
Each UCS scalar value encoded in the UTF-8 sequence is mapped to the corresponding element of the translation character set.
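
To make the two sentences above concrete, here is a minimal sketch of what they ask for: first check that the bytes form a well-formed UTF-8 sequence, then yield one UCS scalar value per encoded sequence. The name decode_utf8 and its return type are hypothetical, chosen for illustration only; they come neither from P2295 nor from any implementation. The sketch does nothing special with a byte order mark and treats it like any other scalar value.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical helper (an assumption for illustration, not proposed wording).
// Returns the UCS scalar values encoded in `bytes`, in order, or std::nullopt
// if `bytes` is not a well-formed UTF-8 sequence.
std::optional<std::vector<char32_t>> decode_utf8(std::string_view bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const auto b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;      // total bytes in this encoded sequence
        char32_t cp = 0;          // scalar value accumulated so far
        char32_t min_value = 0;   // smallest value allowed for this length
        if (b0 < 0x80) {
            len = 1; cp = b0; min_value = 0;
        } else if ((b0 & 0xE0) == 0xC0) {
            len = 2; cp = static_cast<char32_t>(b0 & 0x1F); min_value = 0x80;
        } else if ((b0 & 0xF0) == 0xE0) {
            len = 3; cp = static_cast<char32_t>(b0 & 0x0F); min_value = 0x800;
        } else if ((b0 & 0xF8) == 0xF0) {
            len = 4; cp = static_cast<char32_t>(b0 & 0x07); min_value = 0x10000;
        } else {
            return std::nullopt;  // stray continuation byte or invalid lead byte
        }
        if (bytes.size() - i < len) return std::nullopt;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;  // not a continuation byte
            cp = static_cast<char32_t>((cp << 6) | static_cast<char32_t>(b & 0x3F));
        }
        if (cp < min_value) return std::nullopt;                // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return std::nullopt;  // surrogate code point
        if (cp > 0x10FFFF) return std::nullopt;                 // beyond the UCS range
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

In this sketch, the std::nullopt result corresponds to the "shall be well-formed" requirement (the case that needs a diagnostic), and the returned vector corresponds to the processing sentence. Each element stands directly for an element of the translation character set, with no table in between.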

Just to clarify: I am suggesting the above for the wording (it was not merely a quote providing context for the later comment). This version separates the diagnostic requirement from the description of the processing.

I purposefully avoided the term "mapping" here: because the set of source characters and the translation character set are the same, there is no need to specify a mapping.


I'm not sure what to make of the situation around end-of-line indicators yet. P2348, "Whitespaces Wording Revamp", is also floating in the mix.

Indeed. P2348 is motivated by P2295.
I believe it's not dramatic to leave things partially hanging (there can be line feeds in UTF-8 files, and we do say that line feed is new-line in P2314), but I hope we will talk about P2348 at some point.

For the UTF-8 case, I think a note to the effect that "there are no end-of-line indicators apart from the content of the UTF-8 sequence" could help (at least to further the discussion). For P2348, I suggest that "out-of-band" end-of-line indicators should remain accepted.

Suggestion for P2348:
The physical source file is mapped, in an implementation-defined manner, to a sequence of basic source character set elements.