sg16

Re: [isocpp-core] P2295 Support for UTF-8 as a portable source file encoding

From: Corentin <corentin.jabot_at_[hidden]>
Date: Thu, 9 Jun 2022 20:02:35 +0200
Mike:
> I prefer option 2, but restoring some of the wording your tweak deleted
> from Hubert's suggestion <http://lists.isocpp.org/core/2022/03/12140.php>:

I removed that sentence because we specify in the next paragraph "If a
source file is determined to be a UTF-8 source file, then it shall be a
well-formed UTF-8 code unit sequence" - I'd rather not repeat things
multiple times.
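
To make the quoted requirement concrete, here is a minimal sketch of what
"well-formed UTF-8 code unit sequence ... decoded to produce a sequence of
UCS scalar values" could look like in code. The function name decode_utf8
and its interface are hypothetical, chosen for illustration only; this is
not proposed wording and not how any particular implementation does it.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical helper: decodes `bytes` as UTF-8 and returns the scalar
// values, or std::nullopt if the input is not well-formed UTF-8.
std::optional<std::vector<char32_t>> decode_utf8(std::string_view bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const auto b0 = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;   // scalar value being assembled
        std::size_t len = 0;
        char32_t min = 0;  // smallest value allowed for this length
        if (b0 < 0x80)                { cp = b0;        len = 1; min = 0x0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min = 0x10000; }
        else return std::nullopt;  // stray continuation or invalid lead byte

        if (bytes.size() - i < len) return std::nullopt;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;  // not a continuation
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min) return std::nullopt;                         // overlong
        if (cp >= 0xD800 && cp <= 0xDFFF) return std::nullopt;     // surrogate
        if (cp > 0x10FFFF) return std::nullopt;                    // out of range
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Under this sketch an ill-formed input simply yields no result, which
mirrors the "shall be well-formed" requirement: there is no replacement
character or best-effort recovery.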

Davis:
> What we want to mean by that is that the input file is the sequence of
> bytes returned from open(2) and read(2),

This is definitely not the intent. Any piece of textual data can be
modeled as a sequence of integers, regardless of system specifics or
underlying storage mechanism. Memory, files, network resources, and even
non-computer scenarios we might consider can be modeled as such. We do
not want to specify how such a sequence is produced; that happens before
phase 1, and we neither care nor need to care how.
It may be "toothless" in a very hostile implementation, but we already
established that. We should not spend so much time thinking of the evil
ways an implementation could abuse phase 1 wording, as it is not a game we
can win. Nor can we win it today: an implementation can replace the content
of the source file with nothing and claim conformance. Yet they don't.
There needs to be a reasonable starting point, otherwise we will find
ourselves specifying drive firmware.

Jens:
> It seems that, with either option, a "UTF-8 source file" must use LF line
> endings (because that's what a "new-line" character is, arguably).

It's unclear in the current wording but that inconsistency is known,
hence P2348.

Jens:
> We use "designate" in one place and "determined" in another when
> talking about UTF-8 source files.

Well, the designation is one method of determination, but I'm happy to make
that tweak.
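
As a rough illustration of "designation is one method of determination",
consider the sketch below. Everything in it is hypothetical: the Options
struct stands in for whatever implementation-defined means exists (a
command-line switch, say), and the precedence and fallback shown are
assumptions of this sketch, not something the wording requires.

```cpp
#include <optional>
#include <string_view>

enum class SourceKind { Utf8, Other };

// Hypothetical stand-in for an implementation-defined designation
// mechanism, such as a command-line switch.
struct Options {
    std::optional<SourceKind> designated;
};

// Content-based check: does the input start with the UTF-8 encoding of
// U+FEFF (EF BB BF)?
bool starts_with_utf8_bom(std::string_view bytes) {
    return bytes.substr(0, 3) == "\xEF\xBB\xBF";
}

SourceKind determine_source_kind(const Options& opts, std::string_view bytes) {
    // Designation: decided without looking at the content of the file.
    if (opts.designated) return *opts.designated;
    // Another method of determination: sniffing the content. On its own
    // this would not satisfy the "independent of the content" requirement.
    if (starts_with_utf8_bom(bytes)) return SourceKind::Utf8;
    // Assumed fallback for this sketch; a real default is implementation-defined.
    return SourceKind::Other;
}
```

In this sketch a designated file is always "determined" to be UTF-8, but a
file can also be determined to be UTF-8 without having been designated.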


On Thu, Jun 9, 2022 at 6:50 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:

> On 09/06/2022 16.23, Corentin via Core wrote:
> > _Option 2: _
> >
> > An implementation shall support UTF-8 source files. It may also support
> > an implementation-defined set of other kinds of source files, and, if so,
> > it shall provide an implementation-defined means of designating a file as a
> > UTF-8 source file, independent of the content of that source file. [Note:
> > In other words, recognizing the U+FEFF Byte Order Mark is not sufficient.
> > --end note].
> >
> > If a source file is determined to be a UTF-8 source file, then it shall
> > be a well-formed UTF-8 code unit sequence and its content is decoded to
> > produce a sequence of UCS scalar values that constitutes the sequence of
> > elements of the translation character set.
>
> We use "designate" in one place and "determined" in another when
> talking about UTF-8 source files.
>
> What's the operative difference between those words?
> If there is some, I'd appreciate making the difference
> clearer, e.g. by saying
>
> "is designated or otherwise determined to be a UTF-8 source file..."
>
> Jens

Received on 2022-06-09 18:02:46