Subject: Re: Conversion of grapheme clusters to (wide) execution encoding
From: Tom Honermann (tom_at_[hidden])
Date: 2020-06-01 12:23:08


On 6/1/20 10:47 AM, Corentin via SG16 wrote:
>
>
> On Mon, Jun 1, 2020, 16:38 Peter Brett <pbrett_at_[hidden]> wrote:
>
> Hi Hubert,
>
> As I understand it (I’m probably wrong), u, U and u8 literals are
> an explicit request for a UTF-encoded literal to appear in the
> executable.
>
> Corentin is raising the issue that a valid ‘normal’ string literal
> might not be able to be losslessly encoded to the
> implementation-defined execution encoding.  I don’t think this
> concern affects u, U or u8 string literals.
>
> In the past, SG16 discussed u, U or u8 literals that do not decode
> successfully. These can exist in validly-encoded source files by
> using \x escape sequences in a literal (u8"\xff"). I’m not sure
> what the outcome of phase 1 of translation is for literals like
> this, but they would certainly not be able to be losslessly
> encoded in a UTF encoding. I don’t recall what the outcome of that
> discussion was. I’m not sure whether or not this is the same
> problem Corentin is describing.
>
>
> It's not. Only \u escapes are transformed in phase 1.
> Other escape sequences are converted as part of phase 5.
> Tom has wording that states \ooo and \xxx represent the value of a
> single code unit in the execution character set; I have wording to
> make those ill-formed if the value is greater than the maximum value
> representable for the code unit type (it is already ill-formed for u,
> U, and u8 literals).

The relevant wording indicating that this is currently
implementation-defined behavior for ordinary/wide literals is in
[lex.ccon]p7 <http://eel.is/c++draft/lex.ccon#7>.

Tom.
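
For reference, a minimal sketch of the distinction drawn above between \x and \u escapes, using the u8"\xff" literal from Peter's message. The helper names code_units and first_byte are hypothetical and exist only for this illustration. The sketch assumes a compiler that accepts \xff in a u8 literal, and it checks only the u8 case, not the implementation-defined ordinary/wide case.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

// Hypothetical helper: value of the first code unit of a u8 literal.
// Works whether the element type is char (C++17) or char8_t (C++20).
template <typename CharT, std::size_t N>
constexpr unsigned first_byte(const CharT (&s)[N]) {
    return static_cast<unsigned char>(s[0]);
}

// \xff names one code unit with value 0xFF; the result is not valid UTF-8.
static_assert(code_units(u8"\xff") == 1, "one raw code unit");
static_assert(first_byte(u8"\xff") == 0xFF, "value taken verbatim");

// \u00ff names the code point U+00FF, which is then encoded as UTF-8 (C3 BF).
static_assert(code_units(u8"\u00ff") == 2, "two UTF-8 code units");
static_assert(first_byte(u8"\u00ff") == 0xC3, "UTF-8 lead byte");
```

The static_asserts make the point at compile time: the \x form stores the value as a code unit, while the \u form goes through the literal's encoding.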

>                 Peter
>
> *From:* Hubert Tong <hubert.reinterpretcast_at_[hidden]>
> *Sent:* 01 June 2020 15:05
> *To:* SG16 <sg16_at_[hidden]>
> *Cc:* Peter Brett <pbrett_at_[hidden]>; Corentin <corentin.jabot_at_[hidden]>
> *Subject:* Re: [SG16] Conversion of grapheme clusters to (wide)
> execution encoding
>
> On Mon, Jun 1, 2020 at 5:22 AM Corentin via SG16 <sg16_at_[hidden]> wrote:
>
> On Mon, 1 Jun 2020 at 09:48, Peter Brett <pbrett_at_[hidden]> wrote:
>
> Out of the 5 options laid out, I feel it would be best to
> make it ill-formed.  The source code author's intent is
> (at best) ambiguous, and the onus should not be placed on
> the compiler to try to make it work.  The status quo
> allows for programs to be broken in surprising ways.
>
> I think you might be right.
>
> Especially given that making it work adds implementation burden for
> the sake of legacy encodings, and there is a somewhat easy fix:
> NFC-normalize your sources.
>
> Wait, does this apply to u, U, and u8 strings? Users can't have
> non-NFC-normalized strings?
>
>
> Best regards,
>
>                      Peter
>
> > -----Original Message-----
> > From: SG16 <sg16-bounces_at_[hidden]> On Behalf Of Jens Maurer via SG16
> > Sent: 01 June 2020 07:10
> > To: sg16_at_[hidden]
> > Cc: Jens Maurer <Jens.Maurer_at_[hidden]>; Corentin <corentin.jabot_at_[hidden]>
> > Subject: Re: [SG16] Conversion of grapheme clusters to (wide) execution
> > encoding
> >
> >
> > On 01/06/2020 00.21, Corentin via SG16 wrote:
> > > Hello
> > >
> > > Consider a string literal "e\u00B4" (LATIN SMALL LETTER E, ACUTE
> > > ACCENT).
> > >
> > > There is some consensus in SG16 that this should not be normalized
> > > in phase 1, or in phase 5 if the execution encoding of that string
> > > literal encodes the Unicode character set.
> > > However, what should happen if the execution character set is
> > > Latin-1, for example?
> > >
> > > ACUTE ACCENT does not have a representation in Latin-1, but the
> > > grapheme cluster LATIN SMALL LETTER E, ACUTE ACCENT does, as LATIN
> > > SMALL LETTER E WITH ACUTE has a representation in the Latin-1
> > > character set.
> > >
> > > This is currently implementation-defined ("e?" in MSVC, ill-formed
> > > in GCC and Clang), but the wording is specific about the conversion
> > > happening independently for each code point.
> > >
> > > I think we have several options:
> > >
> > >  1. Status quo
> > >  2. Making the conversion ill-formed, as per P1854R0 "Conversion to
> > >     execution encoding should not lead to loss of meaning"
> > >     https://wg21.link/p1854r
> > >  3. Allowing an implementation to transform each abstract character
> > >     to another abstract character represented by more or fewer code
> > >     points
> > >  4. Forcing an implementation to transform each abstract character
> > >     to another abstract character represented by more or fewer code
> > >     points
> > >  5. Conversion to NFC(K?) before conversion to a non-Unicode
> > >     character set, but that may introduce further issues and adds
> > >     burden on implementations
> > >
> > >
> > > Option 4 seems hardly implementable in all cases.
> > > Options 2 and 5 offer the most consistency across implementations.
> > > Options 3, 4, and 5 may be a behavior change.
> > >
> > > I think I have a preference for 3.
> > >
> > > What do you think?
> >
> > String literals also have an inherent length.  I'm mildly opposed to
> > normatively specifying a required alteration of the
> > "source-code-apparent" length for types whose encodings are not
> > variable-width to start with (u8, u16).  That leaves 1 and 2 for me.
> >
> > Jens
> >
> > --
> > SG16 mailing list
> > SG16_at_[hidden]
> > https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
>
>
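
As an illustration of the options discussed in this thread, here is a rough sketch of converting a sequence of code points to Latin-1 under three policies: per-code-point substitution (the "e?" result reported for MSVC), rejection (option 2), and composing before conversion (in the spirit of options 3 and 5). Everything here is hypothetical: the Policy enum, to_latin1, and the tiny compose_acute table are invented for the example and do not reflect any compiler's implementation. The sketch assumes the second code point is the combining mark U+0301; U+00B4 itself is a spacing character that Latin-1 does contain, at 0xB4.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical policies loosely mirroring the options in the thread.
enum class Policy { Substitute, Reject, ComposeFirst };

constexpr char32_t kCombiningAcute = 0x0301;

// Toy composition table: a few base letters followed by U+0301.
// A real implementation would use the full Unicode composition data.
inline char32_t compose_acute(char32_t base) {
    switch (base) {
        case U'a': return 0x00E1;
        case U'e': return 0x00E9;
        case U'i': return 0x00ED;
        case U'o': return 0x00F3;
        case U'u': return 0x00FA;
        default:   return 0;  // no composition known to this sketch
    }
}

// Converts code points to Latin-1 bytes in `out`.
// Returns false when a code point cannot be represented and the
// policy is not Substitute (Reject, or ComposeFirst after composing).
inline bool to_latin1(const std::u32string& in, Policy p, std::string& out) {
    out.clear();
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t c = in[i];
        if (p == Policy::ComposeFirst && i + 1 < in.size() &&
            in[i + 1] == kCombiningAcute) {
            const char32_t composed = compose_acute(c);
            if (composed != 0) {
                c = composed;
                ++i;  // the combining mark has been consumed
            }
        }
        if (c <= 0xFF) {
            out.push_back(static_cast<char>(static_cast<unsigned char>(c)));
        } else if (p == Policy::Substitute) {
            out.push_back('?');
        } else {
            return false;
        }
    }
    return true;
}
```

With the input U"e\u0301", Substitute yields the two bytes "e?", Reject fails, and ComposeFirst yields the single byte 0xE9. The last case also shows the length change Jens is concerned about: two source code points become one code unit.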


