On Fri, Jun 10, 2022 at 4:20 PM Hubert Tong <hubert.reinterpretcast@gmail.com> wrote:

On Fri, Jun 10, 2022 at 9:56 AM Jens Maurer <Jens.Maurer@gmx.net> wrote:
On 10/06/2022 15.29, Corentin wrote:
>
>
> On Fri, Jun 10, 2022 at 11:08 AM Jens Maurer <Jens.Maurer@gmx.net> wrote:
>
> On 10/06/2022 10.02, Corentin via SG16 wrote:
> > I'm concerned that this approach will be hard to understand by people who have not followed the discussions, on top of preexisting obfuscations (the translation set indirection).
>
> What exactly do you think is hard to understand?
>
> Personally, I think clearly separating the "input" side from the
> compiler-internal side (translation character set) is helpful
> in getting the right mental model here. There is a mapping
> stage in phase 1; it might be very thin for UTF-8 input, but
> it's possibly thicker for non-UTF-8 input, so we should not
> try to hide that mapping stage.
>
>
> There is no disagreement here, but encodings and files are orthogonal concerns.
> Imagine an implementation that can read source from disk, a network, a database, etc. These are all different kinds of input, yet each of them may or may not be UTF-8.
> Pretending that UTF-8-ness is related to the kind of medium the source code comes from makes very little sense to me.

Nobody is pretending that. We just pretend everything is a "physical source file"
(nobody knows what that means, btw), and then we say there are different "kinds".

There is no need to dissect the encoding from other properties of a specific kind,
as long as there is at least one kind that uses the desired encoding (UTF-8).

> I'm fine with removing the note, but I would like to see
> the parenthetical
>
> "(introducing new-line characters for end-of-line indicators)"
>
> restored for the "any other kind" case.
> (Omitting the parenthetical feels like a regression.)
>
>
> It's not, as we reformulated that sentence.

I'm not seeing a material reformulation that would make the
parenthetical more superfluous.

Status quo:

Physical source file
characters are mapped, in an implementation-defined manner,
to the translation character set (5.3)
(introducing new-line characters for end-of-line indicators).

New text (Hubert's proposal):

For any other kind of physical source file supported by the implementation,
characters are mapped, in an implementation-defined manner,
to a sequence of translation character set elements.

@Corentin @Jens Maurer, the key reason why my version of the text does not need the parenthetical is that the implementation-definedness encompasses the creation of the sequence from the aggregate input "characters" (which can include not only what each character is, but also where it is). Versions of the wording that skew towards considering only the characters and what they are would be undesirable without the parenthetical.
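
To make the "where the character is" point concrete, here is a minimal sketch of one possible implementation-defined mapping, assuming a record-oriented input in which line ends are implied by record boundaries rather than stored as characters. The function name, the record representation, and the choice of the cp037 (EBCDIC) codec are all assumptions made for illustration; nothing here is taken from the proposed wording.

```python
# Hypothetical sketch: map record-oriented input to a single sequence of
# translation character set elements (represented here as code points).

def map_records_to_sequence(records, encoding):
    """Decode each record and append U+000A for its implied line end.

    `records` is a list of byte strings, one per source line. The end of
    each line is positional information (a record boundary), not a
    character stored in the input.
    """
    sequence = []
    for record in records:
        sequence.extend(ord(ch) for ch in record.decode(encoding))
        sequence.append(0x0A)  # new-line introduced for the end-of-line indicator
    return sequence


# Two records in an assumed EBCDIC encoding (cp037), no stored line ends.
records = ["int x;".encode("cp037"), "int y;".encode("cp037")]
sequence = map_records_to_sequence(records, "cp037")
```

In this sketch the new-line elements come from the structure of the input, which is why a mapping described purely character by character would need the parenthetical to cover them.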

> I'm happy leaving the parenthetical here

Good.

> as long as we remove it as part of P2348

Making the processing of the present paper somehow dependent on the processing
of a future paper is ... not a good approach. We'll review P2348 based on its
own merits.

> > That being said, as this wording seems to have more consensus, maybe we can go with some form of it, it achieves the intent of the paper.
> >
> > ---
> > An implementation shall support source files that are a sequence of UTF-8 code units (UTF-8 source files). It may also support an implementation-defined set of
> > other kinds of source files, and, if so, the kind of a source file is determined in an implementation-defined manner which includes a means of designating a file as a UTF-8 source file, independent of the contents of the source files. [Note: In other words, recognizing the U+FEFF Byte Order Mark is not sufficient. --end note]
> >
> > If a physical source file is designated or otherwise determined to be a UTF-8 source file, then it shall be a well-formed UTF-8 code unit sequence and it is decoded to produce a sequence of UCS scalar values that constitutes the sequence of elements of the translation character set.
> > For any other kind of physical source file supported by the implementation, characters are mapped, in an implementation-defined manner, to a sequence of translation character set elements.
> > ---
>
> I think Hubert's formulation addresses the concern that we don't
> want to require that a single source file can be separately designated
> as UTF-8 (and others are different). "designating a file" sounds
> dangerously close to that.
>
>
> "which includes a means of designating source files as UTF-8 source files" then. I'm not a fan of "which includes a means of causing the determination to interpret"

Yeah, that feels a bit over-the-top.

> ===
> An implementation shall support physical source files that are a sequence of UTF-8 code units (UTF-8 source files). It may also support an implementation-defined set of
> other kinds of physical source files, and, if so, the kind of a physical source file is determined in an implementation-defined manner which includes a means of designating physical source files as UTF-8 source files, independent of their content. [Note: In other words, recognizing the U+FEFF Byte Order Mark is not sufficient. --end note]
>
> If a physical source file is designated or otherwise determined to be a UTF-8 source file, then it shall be a well-formed UTF-8 code unit sequence and it is decoded to produce a sequence of UCS scalar values that constitutes the sequence of elements of the translation character set.
> For any other kind of physical source file supported by the implementation, characters are mapped to the translation character set (introducing new-line characters for end-of-line indicators).
> ===
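
For readers who have not followed the thread, the two paragraphs above can be sketched roughly as follows. This is an illustration under stated assumptions, not what any compiler does: `read_source`, the `designated_utf8` flag (standing in for something like a command-line option), and the latin-1 fallback are all hypothetical. The point is only that the designation comes from outside the file contents and that a designated UTF-8 file must be well-formed.

```python
# Hypothetical sketch of phase 1 for the wording above.

def decode_utf8_source(data):
    """Decode a designated UTF-8 source file to Unicode scalar values.

    Strict decoding raises UnicodeDecodeError for ill-formed input,
    mirroring "it shall be a well-formed UTF-8 code unit sequence".
    """
    return [ord(ch) for ch in data.decode("utf-8", errors="strict")]


def read_source(data, designated_utf8, other_encoding="latin-1"):
    """Map source bytes to a sequence of code points.

    `designated_utf8` is set independently of the file contents; the
    function never inspects `data` (for example, for a byte order mark)
    to decide which kind of source file it is.
    """
    if designated_utf8:
        return decode_utf8_source(data)
    # Any other kind: the mapping is implementation-defined. A fixed
    # single-byte encoding is assumed here purely for illustration.
    return [ord(ch) for ch in data.decode(other_encoding)]


data = "é".encode("utf-8")            # bytes C3 A9
as_utf8 = read_source(data, True)     # one scalar value, U+00E9
as_other = read_source(data, False)   # two elements, U+00C3 and U+00A9
```

The same bytes yield different sequences depending on the designation, which is why the designation cannot be left to guesswork based on content.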

Jens