Date: Fri, 10 Jun 2022 15:56:32 +0200
On 10/06/2022 15.29, Corentin wrote:
>
>
> On Fri, Jun 10, 2022 at 11:08 AM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>
> On 10/06/2022 10.02, Corentin via SG16 wrote:
> > I'm concerned that this approach will be hard to understand for people who have not followed the discussions, on top of preexisting obfuscations (the translation character set indirection).
>
> What exactly do you think is hard to understand?
>
> Personally, I think clearly separating the "input" side from the
> compiler-internal side (translation character set) is helpful
> in getting the right mental model here. There is a mapping
> stage in phase 1; it might be very thin for UTF-8 input, but
> it's possibly thicker for non-UTF-8 input, so we should not
> try to hide that mapping stage.
>
>
> There is no disagreement here, but encodings and files are orthogonal concerns.
> Imagine an implementation that can read source from disk, the network, a database, etc. These are all different kinds of input, yet any of them may or may not be UTF-8.
> Pretending that UTF-8-ness is related to the kind of medium the source code comes from makes very little sense to me.
Nobody is pretending that. We just pretend everything is a "physical source file"
(nobody knows what that means, btw), and then we say there are different "kinds".
There is no need to dissect the encoding from other properties of a specific kind,
as long as there is at least one kind that uses the desired encoding (UTF-8).
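To make the "kinds" model concrete, here is a minimal Python sketch of phase 1, not wording and not any real compiler's behavior. The function name `phase1` and the kind labels are hypothetical; the one point it illustrates is that the kind is supplied from outside (say, by a command-line option) and is never guessed from the file's content.

```python
# Hypothetical sketch: the "kind" of a physical source file is designated
# externally and is never inferred from the bytes (e.g. from a leading BOM).

def phase1(data: bytes, kind: str) -> list[int]:
    """Map a physical source file to a sequence of translation character
    set elements, represented here as integer code points."""
    if kind == "utf-8":
        # A designated UTF-8 source file must be well-formed: strict
        # decoding raises UnicodeDecodeError on any ill-formed sequence.
        text = data.decode("utf-8", errors="strict")
    elif kind == "latin-1":
        # Stand-in for "any other kind": an implementation-defined mapping.
        text = data.decode("latin-1")
    else:
        raise ValueError(f"unsupported kind of source file: {kind}")
    return [ord(ch) for ch in text]


if __name__ == "__main__":
    same_bytes = b"\xc3\xa9"
    print(phase1(same_bytes, "utf-8"))    # one element
    print(phase1(same_bytes, "latin-1"))  # two elements
```

The same bytes yield different element sequences depending only on the designation, which is why content sniffing alone cannot settle the kind.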
> I'm fine with removing the note, but I would like to see
> the parenthetical
>
> "(introducing new-line characters for end-of-line indicators)"
>
> restored for the "any other kind" case.
> (Omitting the parenthetical feels like a regression.)
>
>
> It's not, as we reformulated that sentence.
I'm not seeing a material reformulation that would make the
parenthetical more superfluous.
Status quo:
Physical source file
characters are mapped, in an implementation-defined manner,
to the translation character set (5.3)
(introducing new-line characters for end-of-line indicators).
New text (Hubert's proposal):
For any other kind of physical source file supported by the implementation,
characters are mapped, in an implementation-defined manner,
to a sequence of translation character set elements.
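As an illustration of what the parenthetical covers, here is a small Python sketch under stated assumptions: the helper names `map_text_stream` and `map_records` are hypothetical, and the encodings are picked only for the example. The second function models a record-oriented file, where the end of a line is a record boundary rather than a character, so the new-line has to be introduced by the mapping itself.

```python
# Hypothetical sketch of "introducing new-line characters for end-of-line
# indicators" for two made-up kinds of non-UTF-8 source file.

def map_text_stream(data: bytes, encoding: str = "latin-1") -> str:
    """Byte-stream file: CR LF and lone CR are end-of-line indicators."""
    text = data.decode(encoding)
    # Replace CR LF first so that it does not become two new-lines.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def map_records(records: list[bytes], encoding: str = "cp037") -> str:
    """Record-oriented file: the record boundary is the end-of-line
    indicator; no character in the file represents it."""
    return "".join(record.decode(encoding) + "\n" for record in records)


if __name__ == "__main__":
    print(repr(map_text_stream(b"int x;\r\nint y;\r")))
    ebcdic_records = ["int x;".encode("cp037"), "int y;".encode("cp037")]
    print(repr(map_records(ebcdic_records)))
```

In the record case there are no file "characters" that map to the new-lines, which is the situation a purely character-to-element mapping does not obviously describe.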
> I'm happy leaving the parenthetical here
Good.
> as long as we remove it as part of P2348
Making the processing of the present paper somehow dependent on the processing
of a future paper is ... not a good approach. We'll review P2348 based on its
own merits.
> > That being said, as this wording seems to have more consensus, maybe we can go with some form of it; it achieves the intent of the paper.
> >
> > ---
> > An implementation shall support source files that are a sequence of UTF-8 code units (UTF-8 source files). It may also support an implementation-defined set of
> > other kinds of source files, and, if so, the kind of a source file is determined in an implementation-defined manner which includes a means of designating a file as a UTF-8 source file, independent of the contents of the source files. [Note: In other words, recognizing the U+FEFF Byte Order Mark is not sufficient. --end note]
> >
> > If a physical source file is designated or otherwise determined to be a UTF-8 source file, then it shall be a well-formed UTF-8 code unit sequence and it is decoded to produce a sequence of UCS scalar values that constitutes the sequence of elements of the translation character set.
> > For any other kind of physical source file supported by the implementation, characters are mapped, in an implementation-defined manner, to a sequence of translation character set elements.
> > ---
>
> I think Hubert's formulation addresses the concern that we don't
> want to require that a single source file can be separately designated
> as UTF-8 (and others are different). "designating a file" sounds
> dangerously close to that.
>
>
> "which includes a means of designating source files as UTF-8 source files" then. I'm not a fan of "which includes a means of causing the determination to interpret".
Yeah, that feels a bit over-the-top.
> ===
> An implementation shall support physical source files that are a sequence of UTF-8 code units (UTF-8 source files). It may also support an implementation-defined set of
> other kinds of physical source files, and, if so, the kind of a physical source file is determined in an implementation-defined manner which includes a means of designating physical source files as UTF-8 source files, independent of their content. [Note: In other words, recognizing the U+FEFF Byte Order Mark is not sufficient. --end note]
>
> If a physical source file is designated or otherwise determined to be a UTF-8 source file, then it shall be a well-formed UTF-8 code unit sequence and it is decoded to produce a sequence of UCS scalar values that constitutes the sequence of elements of the translation character set.
> For any other kind of physical source file supported by the implementation, characters are mapped to the translation character set (introducing new-line characters for end-of-line indicators).
> ===
Jens
Received on 2022-06-10 13:56:36