Re: [isocpp-core] P2295 Support for UTF-8 as a portable source file encoding

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Fri, 10 Jun 2022 10:20:17 -0400
On Fri, Jun 10, 2022 at 9:56 AM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:

> On 10/06/2022 15.29, Corentin wrote:
> >
> >
> > On Fri, Jun 10, 2022 at 11:08 AM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
> >
> > On 10/06/2022 10.02, Corentin via SG16 wrote:
> > > I'm concerned that this approach will be hard to understand by
> > > people who have not followed the discussions, on top of preexisting
> > > obfuscations (the translation set indirection).
> >
> > What exactly do you think is hard to understand?
> >
> > Personally, I think clearly separating the "input" side from the
> > compiler-internal side (translation characters set) is helpful
> > in getting the right mental model here. There is a mapping
> > stage in phase 1; it might be very thin for UTF-8 input, but
> > it's possibly thicker for non-UTF-8 input, so we should not
> > try to hide that mapping stage.
> >
> >
> > There is no disagreement here, but encodings and files are orthogonal
> > concerns.
> > Imagine an implementation that can read files from disk, networks,
> > databases, etc. These are all different kinds of inputs, yet each of
> > them may or may not be UTF-8.
> > Pretending UTF-8-ness is related to the kind of medium the source code
> > comes from makes very little sense to me.
>
> Nobody is pretending that. We just pretend everything is a "physical
> source file" (nobody knows what that means, btw), and then we say there
> are different "kinds".
>
> There is no need to dissect the encoding from other properties of a
> specific kind, as long as there is at least one kind that uses the
> desired encoding (UTF-8).
>
> > I'm fine with removing the note, but I would like to see
> > the parenthetical
> >
> > "(introducing new-line characters for end-of-line indicators)"
> >
> > restored for the "any other kind" case.
> > (Omitting the parenthetical feels like a regression.)
> >
> >
> > It's not, as we reformulated that sentence.
>
> I'm not seeing a material reformulation that would make the
> parenthetical more superfluous.
>
> Status quo:
>
> Physical source file
> characters are mapped, in an implementation-defined manner,
> to the translation character set (5.3)
> (introducing new-line characters for end-of-line indicators).
>
> New text (Hubert's proposal):
>
> For any other kind of physical source file supported by the implementation,
> characters are mapped, in an implementation-defined manner,
> to a sequence of translation character set elements.
>

@Corentin <corentin.jabot_at_[hidden]> @Jens Maurer <Jens.Maurer_at_[hidden]>,
the key reason why my version of the text does not need the parenthetical
is that the implementation-definedness encompasses the creation of the
sequence from the aggregate input "characters" (which can include not only
what each character is, but also where it is). Wording that skews towards
considering only the characters and what they are is undesirable without
the parenthetical.
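
To illustrate the "where the character is" point, here is a minimal sketch,
not taken from the paper or from any implementation. It assumes a
hypothetical kind of physical source file made of fixed-length,
space-padded records of ASCII bytes, where no byte marks the end of a line
and the line boundary is implied by position alone. The function name
map_fixed_records and the record format are assumptions made for
illustration.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical mapping for a fixed-length-record source file.
// Assumptions: every byte is ASCII, records are padded with spaces, and
// the end of each record acts as the end-of-line indicator.
std::u32string map_fixed_records(std::string_view bytes,
                                 std::size_t record_length) {
    std::u32string out;
    if (record_length == 0) {
        return out;  // nothing sensible to map
    }
    for (std::size_t pos = 0; pos < bytes.size(); pos += record_length) {
        std::string_view record = bytes.substr(pos, record_length);

        // Drop the trailing padding of this record.
        std::size_t last = record.find_last_not_of(' ');
        std::size_t len = (last == std::string_view::npos) ? 0 : last + 1;

        // "What the character is": each ASCII byte maps to the element
        // with the same value.
        for (std::size_t i = 0; i < len; ++i) {
            out.push_back(static_cast<char32_t>(
                static_cast<unsigned char>(record[i])));
        }

        // "Where the character is": the record boundary produces a
        // new-line element that corresponds to no input byte.
        out.push_back(U'\n');
    }
    return out;
}
```

In this sketch the new-line elements fall out of building the sequence from
the whole input, which is the sense in which the mapping to "a sequence of
translation character set elements" can cover end-of-line indicators
without a separate parenthetical.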


>
> > I'm happy leaving the parenthetical here
>
> Good.
>
> > as long as we remove it as part of P2348
>
> Making the processing of the present paper somehow dependent on the
> processing of a future paper is ... not a good approach. We'll review
> P2348 based on its own merits.
>
> > > That being said, as this wording seems to have more consensus,
> > > maybe we can go with some form of it; it achieves the intent of the paper.
> > >
> > > ---
> > > An implementation shall support source files that are a sequence
> > > of UTF-8 code units (UTF-8 source files). It may also support an
> > > implementation-defined set of other kinds of source files, and, if so,
> > > the kind of a source file is determined in an implementation-defined
> > > manner which includes a means of designating a file as a UTF-8 source
> > > file, independent of the contents of the source files. [Note: In other
> > > words, recognizing the U+FEFF Byte Order Mark is not sufficient.
> > > --end note]
> > >
> > > If a physical source file is designated or otherwise determined to
> > > be a UTF-8 source file, then it shall be a well-formed UTF-8 code unit
> > > sequence and it is decoded to produce a sequence of UCS scalar values
> > > that constitutes the sequence of elements of the translation character
> > > set.
> > > For any other kind of physical source file supported by the
> > > implementation, characters are mapped, in an implementation-defined
> > > manner, to a sequence of translation character set elements.
> > > ---
> >
> > I think Hubert's formulation addresses the concern that we don't
> > want to require that a single source file can be separately designated
> > as UTF-8 (and others are different). "designating a file" sounds
> > dangerously close to that.
> >
> >
> > "which includes a means of designating source files as UTF-8 source
> > files" then. I'm not a fan of "which includes a means of causing the
> > determination to interpret"
>
> Yeah, that feels a bit over-the-top.
>
> > ===
> > An implementation shall support physical source files that are a
> > sequence of UTF-8 code units (UTF-8 source files). It may also support
> > an implementation-defined set of other kinds of physical source files,
> > and, if so, the kind of a physical source file is determined in an
> > implementation-defined manner which includes a means of designating
> > physical source files as UTF-8 source files, independent of their
> > content. [Note: In other words, recognizing the U+FEFF Byte Order Mark
> > is not sufficient. --end note]
> >
> > If a physical source file is designated or otherwise determined to be
> > a UTF-8 source file, then it shall be a well-formed UTF-8 code unit
> > sequence and it is decoded to produce a sequence of UCS scalar values
> > that constitutes the sequence of elements of the translation character
> > set.
> > For any other kind of physical source file supported by the
> > implementation, characters are mapped to the translation character set
> > (introducing new-line characters for end-of-line indicators).
> > ===
>
> Jens
>
>
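
For readers who want the UTF-8 branch of the quoted wording in concrete
terms, the following is a rough sketch of "shall be a well-formed UTF-8
code unit sequence and it is decoded to produce a sequence of UCS scalar
values". It is not from the paper or any compiler; the function name
decode_utf8_source is an assumption made for illustration. The caller is
assumed to have already designated the input as UTF-8 by some means outside
the file contents, and the sketch gives a leading U+FEFF no special
treatment.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Decodes bytes already designated as a UTF-8 source file.
// Returns std::nullopt if the input is not well-formed UTF-8.
std::optional<std::u32string> decode_utf8_source(std::string_view bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;      // scalar value being assembled
        std::size_t n = 0;    // number of continuation bytes expected
        char32_t min = 0;     // smallest value allowed for this length

        if (b0 < 0x80)                { cp = b0;        n = 0; min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; n = 1; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; n = 2; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; n = 3; min = 0x10000; }
        else return std::nullopt;     // stray continuation or invalid lead

        if (bytes.size() - i - 1 < n) {
            return std::nullopt;      // truncated sequence
        }
        for (std::size_t k = 1; k <= n; ++k) {
            unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) {
                return std::nullopt;  // not a continuation byte
            }
            cp = (cp << 6) | (b & 0x3F);
        }

        // Reject overlong forms, values past U+10FFFF, and surrogates,
        // none of which are UCS scalar values in well-formed UTF-8.
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return std::nullopt;
        }
        out.push_back(cp);
        i += n + 1;
    }
    return out;
}
```

Nothing in the sketch inspects the contents to decide whether the file is
UTF-8. That decision is left to the caller, in line with the note that
recognizing the byte order mark is not sufficient.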

Received on 2022-06-10 14:20:46