sg16

Re: [isocpp-core] P2295 Support for UTF-8 as a portable source file encoding

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Fri, 10 Jun 2022 15:56:32 +0200
On 10/06/2022 15.29, Corentin wrote:
>
>
> On Fri, Jun 10, 2022 at 11:08 AM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>
> On 10/06/2022 10.02, Corentin via SG16 wrote:
> > I'm concerned that this approach will be hard to understand for people who have not followed the discussions, on top of preexisting obfuscations (the translation set indirection).
>
> What exactly do you think is hard to understand?
>
> Personally, I think clearly separating the "input" side from the
> compiler-internal side (translation character set) is helpful
> in getting the right mental model here. There is a mapping
> stage in phase 1; it might be very thin for UTF-8 input, but
> it's possibly thicker for non-UTF-8 input, so we should not
> try to hide that mapping stage.
>
>
> There is no disagreement here, but encodings and files are orthogonal concerns.
> Imagine an implementation that can read source from disk, the network, a database, etc. These are all different kinds of input, yet each of them may or may not be UTF-8.
> Pretending that UTF-8-ness is related to the kind of medium the source code comes from makes very little sense to me.

Nobody is pretending that. We just pretend everything is a "physical source file"
(nobody knows what that means, btw), and then we say there are different "kinds".

There is no need to dissect the encoding from other properties of a specific kind,
as long as there is at least one kind that uses the desired encoding (UTF-8).

> I'm fine with removing the note, but I would like to see
> the parenthetical
>
> "(introducing new-line characters for end-of-line indicators)"
>
> restored for the "any other kind" case.
> (Omitting the parenthetical feels like a regression.)
>
>
> It's not, as we reformulated that sentence.

I'm not seeing a material reformulation that would make the
parenthetical more superfluous.

Status quo:

Physical source file
characters are mapped, in an implementation-defined manner,
to the translation character set (5.3)
(introducing new-line characters for end-of-line indicators).

New text (Hubert's proposal):

For any other kind of physical source file supported by the implementation,
characters are mapped, in an implementation-defined manner,
to a sequence of translation character set elements.
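
To make concrete what the dropped parenthetical covers, here is a minimal
sketch of "introducing new-line characters for end-of-line indicators".
It assumes that CR LF, a lone CR, and a lone LF are the end-of-line
indicators; that choice is implementation-defined, and the function name
normalize_line_endings is hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: replaces each end-of-line indicator with a single
// new-line character. Treating CR LF, lone CR, and lone LF as the
// indicators is an assumption, not something the wording prescribes.
std::string normalize_line_endings(std::string_view input) {
    std::string out;
    out.reserve(input.size());
    for (std::size_t i = 0; i < input.size(); ++i) {
        if (input[i] == '\r') {
            // Consume the LF of a CR LF pair so the pair yields one new-line.
            if (i + 1 < input.size() && input[i + 1] == '\n') {
                ++i;
            }
            out.push_back('\n');
        } else {
            // LF and all other characters pass through unchanged.
            out.push_back(input[i]);
        }
    }
    return out;
}
```

An implementation with record-based files would do something quite
different here, which is why the step is left implementation-defined.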

> I'm happy leaving the parenthetical here

Good.

> as long as we remove it as part of P2348

Making the processing of the present paper somehow dependent on the processing
of a future paper is ... not a good approach. We'll review P2348 based on its
own merits.

> > That being said, as this wording seems to have more consensus, maybe we can go with some form of it; it achieves the intent of the paper.
> >
> > ---
> > An implementation shall support source files that are a sequence of UTF-8 code units (UTF-8 source files). It may also support an implementation-defined set of
> > other kinds of source files, and, if so, the kind of a source file is determined in an implementation-defined manner which includes a means of designating a file as a UTF-8 source file, independent of the contents of the source files. [Note: In other words, recognizing the U+FEFF Byte Order Mark is not sufficient. --end note]
> >
> > If a physical source file is designated or otherwise determined to be a UTF-8 source file, then it shall be a well-formed UTF-8 code unit sequence and it is decoded to produce a sequence of UCS scalar values that constitutes the sequence of elements of the translation character set.
> > For any other kind of physical source file supported by the implementation, characters are mapped, in an implementation-defined manner, to a sequence of translation character set elements.
> > ---
>
> I think Hubert's formulation addresses the concern that we don't
> want to require that a single source file can be separately designated
> as UTF-8 (and others are different). "designating a file" sounds
> dangerously close to that.
>
>
> "which includes a means of designating source files as UTF-8 source files" then. I'm not a fan of "which includes a means of causing the determination to interpret".

Yeah, that feels a bit over-the-top.

> ===
> An implementation shall support physical source files that are a sequence of UTF-8 code units (UTF-8 source files). It may also support an implementation-defined set of
> other kinds of physical source files, and, if so, the kind of a physical source file is determined in an implementation-defined manner which includes a means of designating physical source files as UTF-8 source files, independent of their content. [Note: In other words, recognizing the U+FEFF Byte Order Mark is not sufficient. --end note]
>
> If a physical source file is designated or otherwise determined to be a UTF-8 source file, then it shall be a well-formed UTF-8 code unit sequence and it is decoded to produce a sequence of UCS scalar values that constitutes the sequence of elements of the translation character set.
> For any other kind of physical source file supported by the implementation, characters are mapped to the translation character set (introducing new-line characters for end-of-line indicators).
> ===
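
As a rough illustration of the UTF-8 path in the wording above ("shall be
a well-formed UTF-8 code unit sequence and it is decoded to produce a
sequence of UCS scalar values"), here is a minimal sketch. The function
name decode_utf8 and the choice to report ill-formed input as an empty
optional are assumptions for illustration only; a real implementation
would issue a diagnostic. Handling of a leading U+FEFF and of end-of-line
indicators is deliberately left out.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical sketch: decodes a UTF-8 code unit sequence into UCS scalar
// values, or returns std::nullopt if the sequence is not well-formed.
std::optional<std::vector<char32_t>> decode_utf8(std::string_view bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const auto b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;   // number of code units in this sequence
        char32_t cp = 0;       // scalar value being assembled
        char32_t min = 0;      // smallest value allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0;        min = 0x0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
        else return std::nullopt;     // stray continuation or invalid lead byte

        if (bytes.size() - i < len) return std::nullopt;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;  // not a continuation
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min) return std::nullopt;                     // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return std::nullopt; // surrogate
        if (cp > 0x10FFFF) return std::nullopt;                // out of range
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Note that nothing in this sketch looks at the content to decide whether
the file is UTF-8; that determination happens beforehand, by whatever
implementation-defined means designates the file's kind.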

Jens

Received on 2022-06-10 13:56:36