Mike

> I prefer option 2, but restoring some of the wording your tweak deleted from Hubert's suggestion:

I removed that sentence because we specify in the next paragraph "If a source file is determined to be a UTF-8 source file, then it shall be a well-formed UTF-8 code unit sequence" - I'd rather not repeat things multiple times.

Davis:

> What we want to mean by that is that the input file is the sequence of bytes returned from open(2) and read(2),

This is definitely not the intent. Any piece of textual data can be modeled as a sequence of integers, regardless of system specifics or underlying storage mechanism. Memory, files, network resources, and even any non-computer scenario we might consider can be modeled as such. We do not want to specify how such a sequence is produced: that happens before phase 1, and we neither care nor need to care how.

It may be "toothless" in a very hostile implementation, but we already established that. We should not spend so much time thinking of the evil ways an implementation could abuse the phase 1 wording, as that is not a game we can win. Nor can we win it today: an implementation can replace the content of the source file with nothing and claim conformance. Yet they don't.

There needs to be a reasonable starting point; otherwise we will find ourselves specifying drive firmware.
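
To make the model concrete, here is a minimal sketch in Python. It is an illustration only, not proposed wording, and `decode_utf8_source` is a name made up for this example. The input is just a sequence of integers, and where those integers came from is outside the function's concern. Ill-formed UTF-8 is rejected, and well-formed input is turned into a sequence of scalar values.

```python
from typing import Iterable, List


def decode_utf8_source(code_units: Iterable[int]) -> List[int]:
    """Decode UTF-8 code units into a list of Unicode scalar values.

    Hypothetical helper for illustration. Raises ValueError
    (UnicodeDecodeError is a subclass) if the input is not a
    well-formed UTF-8 code unit sequence.
    """
    data = bytes(code_units)  # ValueError if an element is outside 0..255
    text = data.decode("utf-8", errors="strict")  # strict: no silent repair
    return [ord(ch) for ch in text]


# The same integers, obtained in two different ways.
from_memory = [0x69, 0x6E, 0x74, 0x20, 0xC3, 0xA9, 0x3B]  # "int é;"
from_encoder = "int é;".encode("utf-8")

# How the sequence was produced makes no difference to the result.
assert decode_utf8_source(from_memory) == decode_utf8_source(from_encoder)
```

Whether the integers were read from a file, received over a network, or typed into a list literal, the decoding step is the same, and that is all the wording needs to describe.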

Jens

> It seems that, with either option, a "UTF-8 source file" must use LF line endings (because that's what a "new-line" character is, arguably).

It's unclear in the current wording, but that inconsistency is known, hence P2348.

Jens

> We use "designate" in one place and "determined" in another when
> talking about UTF-8 source files.

Well, designation is one method of determination, but I'm happy to make that tweak.

On Thu, Jun 9, 2022 at 6:50 PM Jens Maurer <Jens.Maurer@gmx.net> wrote:

On 09/06/2022 16.23, Corentin via Core wrote:
> _Option 2: _
>
> An implementation shall support UTF-8 source files. It may also support an implementation-defined set of other kinds of source files, and, if so, it shall provide an implementation-defined means of designating a file as a UTF-8 source file, independent of the content of that source file. [Note: In other words, recognizing the U+FEFF Byte Order Mark is not sufficient. --end note].
>
> If a source file is determined to be a UTF-8 source file, then it shall be a well-formed UTF-8 code unit sequence and its content is decoded to produce a sequence of UCS scalar values that constitutes the sequence of elements of the translation character set.

We use "designate" in one place and "determined" in another when
talking about UTF-8 source files.

What's the operative difference between those words?
If there is some, I'd appreciate making the difference
clearer, e.g. by saying

"is designated or otherwise determined to be a UTF-8 source file..."

Jens