
SG16


Subject: Re: Conversion of grapheme clusters to (wide) execution encoding
From: Corentin (corentin.jabot_at_[hidden])
Date: 2020-06-01 04:22:05


On Mon, 1 Jun 2020 at 09:48, Peter Brett <pbrett_at_[hidden]> wrote:

> Out of the 5 options laid out, I feel it would be best to make it
> ill-formed. The source code author's intent is (at best) ambiguous, and
> the onus should not be placed on the compiler to try to make it work. The
> status quo allows for programs to be broken in surprising ways.
>

I think you might be right.
Especially given that making it work adds implementation burden for the sake
of legacy encodings, and there is a fairly easy fix: NFC-normalize your
sources.
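
To make the trade-off concrete, here is a minimal sketch of per-code-point conversion to Latin-1 versus composing first. It assumes the accent in question is the combining form U+0301 (the spacing ACUTE ACCENT U+00B4 is itself present in Latin-1). The names `to_latin1`, `convert_per_code_point` and `compose_toy` are hypothetical, and `compose_toy` handles only this one pair; it is not an NFC implementation.

```cpp
#include <optional>
#include <string>

// Convert one code point to Latin-1 (ISO 8859-1). Latin-1 maps
// U+0000..U+00FF to the identical byte values; nothing else fits.
std::optional<unsigned char> to_latin1(char32_t cp) {
    if (cp <= 0xFF) return static_cast<unsigned char>(cp);
    return std::nullopt;
}

// Per-code-point conversion: gives up (nullopt) as soon as one code
// point has no Latin-1 representation. This models rejecting the literal.
std::optional<std::string> convert_per_code_point(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        auto byte = to_latin1(cp);
        if (!byte) return std::nullopt;
        out.push_back(static_cast<char>(*byte));
    }
    return out;
}

// Hypothetical toy composition: only 'e' + U+0301 -> U+00E9.
// Real NFC needs the full Unicode composition data.
std::u32string compose_toy(const std::u32string& in) {
    std::u32string out;
    for (char32_t cp : in) {
        if (cp == 0x0301 && !out.empty() && out.back() == U'e')
            out.back() = 0x00E9;  // LATIN SMALL LETTER E WITH ACUTE
        else
            out.push_back(cp);
    }
    return out;
}
```

With these, `convert_per_code_point(U"e\u0301")` yields no result, while `convert_per_code_point(compose_toy(U"e\u0301"))` yields the single byte 0xE9. Normalizing the source up front gets the second behavior without asking anything of the compiler.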

>
> Best regards,
>
> Peter
>
> > -----Original Message-----
> > From: SG16 <sg16-bounces_at_[hidden]> On Behalf Of Jens Maurer via
> > SG16
> > Sent: 01 June 2020 07:10
> > To: sg16_at_[hidden]
> > Cc: Jens Maurer <Jens.Maurer_at_[hidden]>; Corentin <corentin.jabot_at_[hidden]>
> > Subject: Re: [SG16] Conversion of grapheme clusters to (wide) execution
> > encoding
> >
> >
> > On 01/06/2020 00.21, Corentin via SG16 wrote:
> > > Hello
> > >
> > > Consider a string literal "e\u00B4" (LATIN SMALL LETTER E,
> > > ACUTE ACCENT).
> > >
> > > There is some consensus in SG16 that this should not be normalized in
> > > phase 1, or in phase 5 if the execution encoding of that string literal
> > > encodes the Unicode character set.
> > > However, what should happen if the execution character set is Latin-1,
> > > for example?
> > >
> > > ACUTE ACCENT does not have a representation in Latin-1, but the
> > > grapheme cluster LATIN SMALL LETTER E, ACUTE ACCENT does, as LATIN SMALL
> > > LETTER E WITH ACUTE has a representation in the Latin-1 character set.
> > >
> > > This is currently implementation-defined ("e?" in MSVC, ill-formed in
> > > GCC and Clang), but the wording is specific about the conversion
> > > happening independently for each code point.
> > >
> > > I think we have several options:
> > >
> > > 1. Status quo.
> > > 2. Making the conversion ill-formed as per P1854R0, "Conversion to
> > > execution encoding should not lead to loss of meaning":
> > > https://wg21.link/p1854r
> > > 3. Allowing an implementation to transform each abstract character to
> > > another abstract character represented by more or fewer code points.
> > > 4. Forcing an implementation to transform each abstract character to
> > > another abstract character represented by more or fewer code points.
> > > 5. Conversion to NFC(K?) before conversion to a non-Unicode character
> > > set, but that may introduce further issues and adds burden on
> > > implementations.
> > >
> > >
> > > Option 4 seems hardly implementable in all cases.
> > > Options 2 and 5 offer the most consistency across implementations.
> > > Options 3, 4, and 5 may be a behavior change.
> > >
> > > I think I have a preference for 3.
> > >
> > > What do you think?
> >
> > String literals also have an inherent length. I'm mildly opposed to
> > normatively specifying a required alteration of the
> > "source-code-apparent" length for types whose encodings are not
> > variable-width to start with (u8, u16). That leaves 1 and 2 for me.
> >
> > Jens
> >
> > --
> > SG16 mailing list
> > SG16_at_[hidden]
> > https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
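
On Jens's point about length, the following sketch shows how the decomposed and precomposed spellings differ in code-unit count for the Unicode literal types, which is what a required composition step would silently change. It assumes the combining accent U+0301; `code_units` is a hypothetical helper, not a standard facility.

```cpp
#include <cstddef>

// Number of code units in a string literal, excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

// UTF-8: 'e' is 1 code unit, U+0301 is 2, U+00E9 is 2.
static_assert(code_units(u8"e\u0301") == 3, "decomposed, UTF-8");
static_assert(code_units(u8"\u00E9") == 2, "precomposed, UTF-8");

// UTF-16: every code point here is in the BMP, so 1 code unit each.
static_assert(code_units(u"e\u0301") == 2, "decomposed, UTF-16");
static_assert(code_units(u"\u00E9") == 1, "precomposed, UTF-16");

// UTF-32: always 1 code unit per code point.
static_assert(code_units(U"e\u0301") == 2, "decomposed, UTF-32");
static_assert(code_units(U"\u00E9") == 1, "precomposed, UTF-32");
```

Because the literals are written with universal-character-names, the counts do not depend on the encoding of the source file.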



SG16 list run by sg16-owner@lists.isocpp.org