Mike:
> I prefer option 2, but restoring some of the wording your tweak deleted from Hubert's suggestion:
I removed that sentence because the next paragraph already specifies "If a source file is determined to be a UTF-8 source file, then it shall be a well-formed UTF-8 code unit sequence". I'd rather not repeat things multiple times.
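For readers who want to see what "well-formed UTF-8 code unit sequence" amounts to in practice, here is a rough sketch that follows the byte ranges in Table 3-7 of the Unicode Standard. The function name `is_well_formed_utf8` and the use of `std::vector<std::uint8_t>` as the input type are assumptions made for this example only; the proposed wording does not prescribe any particular check or interface.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative only: returns true if `units` is a well-formed UTF-8 code
// unit sequence per the Unicode Standard, Table 3-7. The name and the
// signature are hypothetical, not taken from any proposal.
bool is_well_formed_utf8(const std::vector<std::uint8_t>& units) {
    const std::size_t n = units.size();
    std::size_t i = 0;
    while (i < n) {
        const std::uint8_t lead = units[i];
        if (lead <= 0x7F) {  // single code unit (ASCII range)
            ++i;
            continue;
        }

        std::size_t trailing = 0;           // number of continuation units
        std::uint8_t lo = 0x80, hi = 0xBF;  // allowed range for the 2nd unit

        if (lead >= 0xC2 && lead <= 0xDF) {
            trailing = 1;
        } else if (lead == 0xE0) {
            trailing = 2; lo = 0xA0;        // rejects overlong 3-unit forms
        } else if (lead >= 0xE1 && lead <= 0xEF) {
            trailing = 2;
            if (lead == 0xED) hi = 0x9F;    // rejects surrogates U+D800..U+DFFF
        } else if (lead == 0xF0) {
            trailing = 3; lo = 0x90;        // rejects overlong 4-unit forms
        } else if (lead >= 0xF1 && lead <= 0xF3) {
            trailing = 3;
        } else if (lead == 0xF4) {
            trailing = 3; hi = 0x8F;        // rejects values above U+10FFFF
        } else {
            return false;                   // 0x80..0xC1 and 0xF5..0xFF
        }

        if (n - i <= trailing) return false;  // sequence is truncated
        if (units[i + 1] < lo || units[i + 1] > hi) return false;
        for (std::size_t k = 2; k <= trailing; ++k) {
            if (units[i + k] < 0x80 || units[i + k] > 0xBF) return false;
        }
        i += trailing + 1;
    }
    return true;
}
```

Note that the check operates on a sequence of integers and says nothing about where that sequence came from.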
Davis:
> What we want to mean by that is that the input file is the sequence of bytes returned from open(2) and read(2),
This is definitely not the intent. Any piece of textual data can be modeled as a sequence of integers, regardless of system specifics or the underlying storage mechanism. Memory, files, network resources, and even non-computer scenarios can be modeled this way. We do not want to specify how such a sequence is produced: that happens before phase 1, and we neither care nor need to care how.
It may be "toothless" against a very hostile implementation, but we already established that. We should not spend so much time thinking of the evil ways an implementation could abuse the phase 1 wording, as that is not a game we can win. Nor can we win it today: an implementation can already replace the content of the source file with nothing and claim conformance. Yet they don't.
There needs to be a reasonable starting point; otherwise we will find ourselves specifying drive firmware.
Jens:
> It seems that, with either option, a "UTF-8 source file" must use LF line endings (because that's what a "new-line" character is, arguably).
The current wording is unclear on this, but the inconsistency is known, hence P2348.
Jens:
> We use "designate" in one place and "determined" in another when talking about UTF-8 source files.
Well, designation is one method of determination, but I'm happy to make that tweak.