sg16

Re: [isocpp-core] P2295 Support for UTF-8 as a portable source file encoding

From: Corentin <corentin.jabot_at_[hidden]>
Date: Fri, 10 Jun 2022 17:07:42 +0200
I know. I can make that change either in this paper or in the whitespace paper.
It doesn't make sense to reword in terms of a sequence and also keep the
parenthetical.
Pick one; I don't mind, as we will get there _eventually_.

On Fri, Jun 10, 2022 at 4:20 PM Hubert Tong <
hubert.reinterpretcast_at_[hidden]> wrote:

> On Fri, Jun 10, 2022 at 9:56 AM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>
>> On 10/06/2022 15.29, Corentin wrote:
>> >
>> >
>> > On Fri, Jun 10, 2022 at 11:08 AM Jens Maurer <Jens.Maurer_at_[hidden]
>> <mailto:Jens.Maurer_at_[hidden]>> wrote:
>> >
>> > On 10/06/2022 10.02, Corentin via SG16 wrote:
>> > > I'm concerned that this approach will be hard to understand by
>> people who have not followed the discussions, on top of preexisting
>> obfuscations (the translation set indirection).
>> >
>> > What exactly do you think is hard to understand?
>> >
>> > Personally, I think clearly separating the "input" side from the
>> >     compiler-internal side (translation character set) is helpful
>> > in getting the right mental model here. There is a mapping
>> > stage in phase 1; it might be very thin for UTF-8 input, but
>> > it's possibly thicker for non-UTF-8 input, so we should not
>> > try to hide that mapping stage.
>> >
>> >
>> > There is no disagreement here, but encodings and files are orthogonal
>> concerns.
>> > Imagine an implementation that can read source from disk, network,
>> database, etc. All of these are different kinds of input, yet any of them
>> may or may not be UTF-8.
>> > Pretending that UTF-8-ness is related to the kind of medium the source
>> code comes from makes very little sense to me.
>>
>> Nobody is pretending that. We just pretend everything is a "physical
>> source file"
>> (nobody knows what that means, btw), and then we say there are different
>> "kinds".
>>
>> There is no need to dissect the encoding from other properties of a
>> specific kind,
>> as long as there is at least one kind that uses the desired encoding
>> (UTF-8).
>>
>> > I'm fine with removing the note, but I would like to see
>> > the parenthetical
>> >
>> > "(introducing new-line characters for end-of-line indicators)"
>> >
>> > restored for the "any other kind" case.
>> > (Omitting the parenthetical feels like a regression.)
>> >
>> >
>> > It's not, as we reformulated that sentence.
>>
>> I'm not seeing a material reformulation that would make the
>> parenthetical more superfluous.
>>
>> Status quo:
>>
>> Physical source file
>> characters are mapped, in an implementation-defined manner,
>> to the translation character set (5.3)
>> (introducing new-line characters for end-of-line indicators).
>>
>> New text (Hubert's proposal):
>>
>> For any other kind of physical source file supported by the
>> implementation,
>> characters are mapped, in an implementation-defined manner,
>> to a sequence of translation character set elements.
>>
>
> @Corentin <corentin.jabot_at_[hidden]> @Jens Maurer <Jens.Maurer_at_[hidden]>,
> the key reason why my version of the text does not need the parenthetical
> is that the implementation-definedness encompasses the creation of the
> sequence from the aggregate input "characters" (which can include not only
> what each character is, but also where it is). Versions of the wording that
> skew towards considering only the characters and what they are would be
> undesirable without the parenthetical.
>
>
>>
>> > I'm happy leaving the parenthetical here
>>
>> Good.
>>
>> > as long as we remove it as part of P2348
>>
>> Making the processing of the present paper somehow dependent on the
>> processing
>> of a future paper is ... not a good approach. We'll review P2348 based
>> on its
>> own merits.
>>
>> > > That being said, as this wording seems to have more consensus,
>> maybe we can go with some form of it, it achieves the intent of the paper.
>> > >
>> > > ---
>> > > An implementation shall support source files that are a sequence
>> of UTF-8 code units (UTF-8 source files). It may also support an
>> implementation-defined set of
>> > > other kinds of source files, and, if so, the kind of a source
>> file is determined in an implementation-defined manner which includes a
>> means of designating a file as a UTF-8 source file, independent of the
>> contents of the source files. [Note: In other words, recognizing the U+FEFF
>> Byte Order Mark is not sufficient. --end note]
>> > >
>> > > If a physical source file is designated or otherwise determined
>> to be a UTF-8 source file, then it shall be a well-formed UTF-8 code unit
>> sequence and it is decoded to produce a sequence of UCS scalar values that
>> constitutes the sequence of elements of the translation character set.
>> > > For any other kind of physical source file supported by the
>> implementation, characters are mapped, in an implementation-defined manner,
>> to a sequence of translation character set elements.
>> > > ---
>> >
>> > I think Hubert's formulation addresses the concern that we don't
>> > want to require that a single source file can be separately
>> designated
>> > as UTF-8 (and others are different). "designating a file" sounds
>> > dangerously close to that.
>> >
>> >
>> > "which includes a means of designating source files as UTF-8 source
>> files" then. I'm not a fan of "which includes a means of causing the
>> determination to interpret"
>>
>> Yeah, that feels a bit over-the-top.
>>
>> > ===
>> > An implementation shall support physical source files that are a
>> sequence of UTF-8 code units (UTF-8 source files). It may also support an
>> implementation-defined set of
>> > other kinds of physical source files, and, if so, the kind of a
>> physical source file is determined in an implementation-defined manner
>> which includes a means of designating physical source files as UTF-8 source
>> files, independent of their content. [Note: In other words, recognizing the
>> U+FEFF Byte Order Mark is not sufficient. --end note]
>> >
>> > If a physical source file is designated or otherwise determined to be a
>> UTF-8 source file, then it shall be a well-formed UTF-8 code unit sequence
>> and it is decoded to produce a sequence of UCS scalar values that
>> constitutes the sequence of elements of the translation character set.
>> > For any other kind of physical source file supported by the
>> implementation, characters are mapped to the translation character set
>> (introducing new-line characters for end-of-line indicators).
>> > ===
>>
>> Jens
>>
>>
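
For readers following the wording above, here is a rough sketch of what "shall be a well-formed UTF-8 code unit sequence and it is decoded to produce a sequence of UCS scalar values" could look like in practice. The function name `decode_utf8` and its interface are hypothetical and are not taken from P2295 or from any implementation. It shows only the decoding step for a UTF-8 source file. How a file is designated as UTF-8, and how other kinds of source file are mapped, stay implementation-defined and are not modelled.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical helper (an assumption for illustration, not proposal text):
// decodes a UTF-8 code unit sequence into UCS scalar values, or returns
// std::nullopt if the sequence is not well-formed.
std::optional<std::vector<char32_t>> decode_utf8(std::string_view bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const auto b0 = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;       // scalar value being assembled
        std::size_t len = 0;   // number of code units in this sequence
        char32_t min = 0;      // smallest value allowed for this length

        if (b0 < 0x80)                { cp = b0;        len = 1; min = 0x0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min = 0x10000; }
        else return std::nullopt;  // stray continuation or invalid lead byte

        if (bytes.size() - i < len) return std::nullopt;  // truncated sequence

        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;  // not a continuation
            cp = (cp << 6) | (b & 0x3F);
        }

        if (cp < min) return std::nullopt;                      // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return std::nullopt;  // surrogate
        if (cp > 0x10FFFF) return std::nullopt;                 // out of range

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

The sketch rejects ill-formed input outright rather than substituting U+FFFD, which is one reading of "shall be a well-formed UTF-8 code unit sequence". A real implementation would issue a diagnostic at that point.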

Received on 2022-06-10 15:07:54