Re: [isocpp-core] P2295 Support for UTF-8 as a portable source file encoding

From: Corentin <corentin.jabot_at_[hidden]>
Date: Fri, 10 Jun 2022 15:29:02 +0200
On Fri, Jun 10, 2022 at 11:08 AM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:

> On 10/06/2022 10.02, Corentin via SG16 wrote:
> > I'm concerned that this approach will be hard to understand by people
> who have not followed the discussions, on top of preexisting obfuscations
> (the translation set indirection).
>
> What exactly do you think is hard to understand?
>
> Personally, I think clearly separating the "input" side from the
> compiler-internal side (translation characters set) is helpful
> in getting the right mental model here. There is a mapping
> stage in phase 1; it might be very thin for UTF-8 input, but
> it's possibly thicker for non-UTF-8 input, so we should not
> try to hide that mapping stage.
>

There is no disagreement here, but encodings and files are orthogonal
concerns.
Imagine an implementation that can read source from disk, the network, a
database, etc. These are all different kinds of input, yet each of them may
or may not be UTF-8.
Pretending that UTF-8-ness is related to the kind of medium the source code
comes from makes very little sense to me.



>
> > It's also very repetitive but maybe we can massage that a bit.
>
> I'm not seeing serious repetition if you take phrases such
> as "UTF-8 code units" as words of power.
>
> > Lastly, I really don't like the " There are no end-of-line indicators
> apart from the content of the UTF-8 code unit sequence" which is more
> confusing than enlightening.
>
> I'm fine with removing the note, but I would like to see
> the parenthetical
>
> "(introducing new-line characters for end-of-line indicators)"
>
> restored for the "any other kind" case.
> (Omitting the parenthetical feels like a regression.)
>

It's not, as we reformulated that sentence. I'm happy to leave the
parenthetical here as long as we remove it as part of P2348.

>
> > It's also unfortunate that the utf-8-ness is tied to a medium rather
> than the content,
>
> I don't follow. We can't rely on "content" alone, because we want to
> diagnose
> ill-formed UTF-8 code units. If we relied on "content" alone, an
> ill-formed
> UTF-8 code unit would, by definition, make the source file "not UTF-8",
> and we'd
> lose the diagnostic.
>

See above.


>
> > and that we can't agree that source code is text, or that any textual
> data consumed by an implementation has an associated encoding.
>
> We've spent quite a bit of time in CWG teleconferences
> discussing this aspect, and there was no tangible progress,
> so I'm not sure it's useful to reiterate that point.
>
> > I'd also prefer using "input" in lieu of "physical source files", as we
> established physical source files may not be files nor be physical.
>
> The existing text in the standard uses "physical source file",
> so it seems less risky overall to leave that phrase as-is and
> let those that feel uneasy about the term make a separate
> proposal.
>

Will do.


>
> > That being said, as this wording seems to have more consensus, maybe we
> can go with some form of it, it achieves the intent of the paper.
> >
> > ---
> > An implementation shall support source files that are a sequence of
> UTF-8 code units (UTF-8 source files). It may also support an
> implementation-defined set of
> > other kinds of source files, and, if so, the kind of a source file is
> determined in an implementation-defined manner which includes a means of
> designating a file as a UTF-8 source file, independent of the contents of
> the source files. [Note: In other words, recognizing the U+FEFF Byte Order
> Mark is not sufficient. --end note]
> >
> > If a physical source file is designated or otherwise determined to be a
> UTF-8 source file, then it shall be a well-formed UTF-8 code unit sequence
> and it is decoded to produce a sequence of UCS scalar values that
> constitutes the sequence of elements of the translation character set.
> > For any other kind of physical source file supported by the
> implementation, characters are mapped, in an implementation-defined manner,
> to a sequence of translation character set elements.
> > ---
>
> I think Hubert's formulation addresses the concern that we don't
> want to require that a single source file can be separately designated
> as UTF-8 (and others are different). "designating a file" sounds
> dangerously close to that.
>

"which includes a means of designating source files as UTF-8 source files"
then. I'm not a fan of "which includes a means of causing the determination
to interpret"


===
An implementation shall support physical source files that are a sequence
of UTF-8 code units (UTF-8 source files). It may also support an
implementation-defined set of
other kinds of physical source files, and, if so, the kind of a physical
source file is determined in an implementation-defined manner which
includes a means of designating physical source files as UTF-8 source
files, independent of their content. [Note: In other words, recognizing the
U+FEFF Byte Order Mark is not sufficient. --end note]

If a physical source file is designated or otherwise determined to be a
UTF-8 source file, then it shall be a well-formed UTF-8 code unit sequence
and it is decoded to produce a sequence of UCS scalar values that
constitutes the sequence of elements of the translation character set.
For any other kind of physical source file supported by the implementation,
characters are mapped to the translation character set (introducing
new-line characters for end-of-line indicators).
===

Knowing that:
- "(introducing new-line characters for end-of-line indicators)" is removed
by P2348R0.
- I will file a core issue, NB comment, or similar to replace the term
"physical source file" with "input".
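
To make the intent of the two paragraphs above concrete, here is a rough
C++ sketch of what phase 1 could look like under this wording. It is only an
illustration, not how any particular compiler does it: the names
decode_utf8, determine_kind, Options and force_utf8 are all hypothetical,
and the BOM check merely stands in for whatever implementation-defined
detection an implementation might use when nothing is designated.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

enum class SourceKind { Utf8, Other };

// Hypothetical user-facing switch: "treat the input as UTF-8".
struct Options { bool force_utf8 = false; };

// The kind is determined in an implementation-defined manner, but there must
// be a way to designate UTF-8 that does not depend on the content.
SourceKind determine_kind(const Options& opts, std::string_view bytes) {
    if (opts.force_utf8) return SourceKind::Utf8;  // content is not consulted
    // Assumed stand-in for implementation-defined detection (BOM sniffing).
    // On its own this would not satisfy the wording, per the note.
    if (bytes.substr(0, 3) == "\xEF\xBB\xBF") return SourceKind::Utf8;
    return SourceKind::Other;
}

// Decodes a UTF-8 code unit sequence into UCS scalar values. Returns
// std::nullopt for an ill-formed sequence, which is the point at which an
// implementation would issue a diagnostic.
std::optional<std::vector<char32_t>> decode_utf8(std::string_view bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const auto b0 = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        char32_t min = 0;   // smallest value allowed for this length
        std::size_t len = 0;
        if (b0 < 0x80)                { cp = b0;        len = 1; min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min = 0x10000; }
        else return std::nullopt;     // stray continuation or invalid lead byte
        if (i + len > bytes.size()) return std::nullopt;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;  // not a continuation
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min) return std::nullopt;                      // overlong form
        if (cp > 0x10FFFF) return std::nullopt;                 // out of range
        if (cp >= 0xD800 && cp <= 0xDFFF) return std::nullopt;  // surrogate
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

The point of the sketch is that once an input has been designated as UTF-8,
ill-formed content does not turn it into "some other kind" of input; it is
an error. Where the bytes came from (disk, network, database) never enters
the picture.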



>
> I also like in Hubert's suggestion that "shall be well-formed"
> is a separate sentence, clearly separating normative diagnostics
> from the follow-on processing.
> What I would like to see in Hubert's drafting is a global replacement
> of "source file" with "physical source file", for extra clarity
> that we're talking about the same thing in the entire paragraph,
> and to defuse any suspicion that "source file" and "physical source file"
> might mean different things.
>
> Jens
>
>
> > On Fri, Jun 10, 2022 at 1:07 AM Hubert Tong
> > <hubert.reinterpretcast_at_[hidden]> wrote:
> >
> > I also prefer the direction of Option 2. I share Mike's concerns
> that "UTF-8 source file" comes out of nowhere in Option 2 as presented.
> >
> > Additionally, I previously gave feedback that there was no
> requirement out of SG16 for there to be an ability to individually
> designate files as UTF-8 source files (as opposed to having a mode where
> all source files are considered UTF-8 source files).
> >
> > Corentin, if you have concrete objections to the following, please
> express them:
> > An implementation shall support physical source files that are a
> sequence of UTF-8 code units. It may also support an implementation-defined
> set of
> > other kinds of source files, and, if so, the kind of a source file
> is determined in an implementation-defined manner which includes a means of
> causing the determination to interpret files as sequences of UTF-8 code
> units, independent of the contents of the source files. [Note: In other
> words, recognizing the U+FEFF Byte Order Mark is not sufficient. --end note]
> >
> > If a physical source file is determined to consist of a sequence of
> UTF-8 code units, then it shall be a well-formed UTF-8 code unit sequence
> and its content is decoded to produce a sequence of UCS scalar values that
> constitutes the sequence of elements of the translation character set. [
> Note: There are no end-of-line indicators apart from the content of the
> UTF-8 code unit sequence. — end note ]
> >
> > For any other kind of physical source file supported by the
> implementation, characters are mapped, in an implementation-defined manner,
> to a sequence of translation character set elements.
> >
> > On Thu, Jun 9, 2022 at 10:23 AM Corentin via Core
> > <core_at_[hidden]> wrote:
> >
> >
> > Hello folks,
> > We have not talked about P2295 for a while, but given that
> multiple people have signaled to me they are interested in seeing progress,
> > I would like to see whether we can find a majority consensus on
> wording.
> > We have 2 options to choose from, I have a very strong
> preference for option 1 which is a more direct description of reality ("a
> kind of source file" as suggested by option 2 is a bit too vacuous for my
> taste).
> >
> > The last sentence of both wordings is extracted from P2348
> - Whitespaces Wording Revamp, as this avoids having to retain a note about
> "end of line indicator" for the non utf-8 case, and a note saying there are
> no such "end of line indicator" for the utf-8 case. The term "end of line
> indicator" was never defined, and because the mapping is implementation
> defined, it is a given that implementations can introduce whatever
> characters they like.
> >
> > I tweaked option 2 slightly from what was suggested by
> Mike/Hubert to avoid repetition of the definition of a UTF-8 source file.
> >
> > It is important to me that, in addition to achieving the design
> goals of P2295, the wording remains as clear as possible.
> >
> > Let me know what you think.
> >
> > Regards,
> >
> > Corentin
> >
> > _Option 1:_
> > A source file is a sequence of integers with an associated
> encoding scheme that is determined in an implementation-defined manner.
> > An implementation shall support the UTF-8 encoding scheme, and
> may support an implementation-defined set of additional encoding schemes.
> > If encoding schemes other than UTF-8 are supported, an
> implementation shall provide a means by which the UTF-8 encoding scheme can
> be specified, independent of the content of that source file. [Note: In
> other words, recognizing the U+FEFF Byte Order Mark is not sufficient.
> --end note]
> >
> > If the encoding scheme of a source file is determined to be
> UTF-8, then the source file shall be a well-formed UTF-8 code unit
> sequence. The source file is decoded to produce a sequence of UCS scalar
> values that constitutes the sequence of elements of the translation
> character set.
> >
> > For any other encoding scheme supported by the implementation,
> source file characters are mapped, in an implementation-defined manner, to
> a sequence of translation character set elements.
> >
> > _Option 2:_
> >
> > An implementation shall support UTF-8 source files. It may also
> support an implementation-defined set of other kinds of source files, and,
> if so, it shall provide an implementation-defined means of designating a
> file as a UTF-8 source file, independent of the content of that source
> file. [Note: In other words, recognizing the U+FEFF Byte Order Mark is not
> sufficient. --end note].
> >
> > If a source file is determined to be a UTF-8 source file, then
> it shall be a well-formed UTF-8 code unit sequence and its content is
> decoded to produce a sequence of UCS scalar values that constitutes the
> sequence of elements of the translation character set.
> >
> > For any other kind of source file, characters are mapped, in an
> implementation-defined manner, to a sequence of translation character set
> elements.
> > _______________________________________________
> > Core mailing list
> > Core_at_[hidden]
> > Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
> > Link to this post: http://lists.isocpp.org/core/2022/06/12669.php
> >
> >
>
>

Received on 2022-06-10 13:29:13