sg16: Re: [SG16] Conversion of grapheme clusters to (wide) execution encoding

From: Peter Brett <pbrett_at_[hidden]>
Date: Mon, 1 Jun 2020 07:48:33 +0000

Out of the 5 options laid out, I feel it would be best to make it ill-formed. The source code author's intent is (at best) ambiguous, and the onus should not be placed on the compiler to try to make it work. The status quo allows for programs to be broken in surprising ways.
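
To make the difference between the options concrete, here is a minimal sketch. It is written in Python rather than C++ only because Python's standard library ships Unicode normalization; it is not how any compiler implements phase 5. It assumes the accent in question is the combining form U+0301 COMBINING ACUTE ACCENT (U+00B4 is a spacing character that Latin-1 can encode directly), and the helper name to_latin1 is made up for illustration.

```python
import unicodedata


def to_latin1(text, normalize=False):
    """Encode text as Latin-1, one code point at a time.

    Returns None when some code point has no Latin-1 representation.
    When normalize is True, the text is composed to NFC first.
    """
    if normalize:
        text = unicodedata.normalize("NFC", text)
    try:
        return text.encode("latin-1")
    except UnicodeEncodeError:
        return None


# LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT (assumed input).
decomposed = "e\u0301"

# Strict per-code-point conversion: U+0301 has no Latin-1 mapping, so the
# conversion fails (comparable to treating the literal as ill-formed).
per_code_point = to_latin1(decomposed)

# Per-code-point conversion with substitution, comparable to the "e?"
# result reported in the quoted message.
replaced = decomposed.encode("latin-1", errors="replace")

# Compose to NFC first (the idea behind option 5): the pair becomes
# U+00E9, which Latin-1 can represent as a single byte.
composed_first = to_latin1(decomposed, normalize=True)

print(per_code_point, replaced, composed_first)
```

The strict per-code-point conversion fails, substitution yields "e?", and composing first yields the single byte 0xE9. Only that last route changes the number of characters relative to what was written in the source.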

Best regards,

Peter

> -----Original Message-----
> From: SG16 <sg16-bounces_at_[hidden]> On Behalf Of Jens Maurer via SG16
> Sent: 01 June 2020 07:10
> To: sg16_at_[hidden]
> Cc: Jens Maurer <Jens.Maurer_at_gmx.net>; Corentin <corentin.jabot_at_gmail.com>
> Subject: Re: [SG16] Conversion of grapheme clusters to (wide) execution encoding
>
> On 01/06/2020 00.21, Corentin via SG16 wrote:
> > Hello
> >
> > Consider a string literal "e\u0301" (LATIN SMALL LETTER E, COMBINING ACUTE ACCENT).
> >
> > There is some consensus in SG16 that this should not be normalized in phase 1, or in phase 5 if the execution encoding of that string literal encodes the Unicode character set.
> > However, what should happen if the execution character set is Latin-1, for example?
> >
> > COMBINING ACUTE ACCENT does not have a representation in Latin-1, but the grapheme cluster LATIN SMALL LETTER E, COMBINING ACUTE ACCENT does, as LATIN SMALL LETTER E WITH ACUTE has a representation in the Latin-1 character set.
> > This is currently implementation-defined ("e?" in MSVC, ill-formed in GCC and Clang), but the wording is specific about the conversion happening independently for each code point.
> >
> > I think we have several options:
> >
> > 1. Status quo
> > 2. Making the conversion ill-formed, as per P1854R0 "Conversion to execution encoding should not lead to loss of meaning", https://wg21.link/p1854r
> > 3. Allowing an implementation to transform each abstract character to another abstract character represented by more or fewer code points
> > 4. Forcing an implementation to transform each abstract character to another abstract character represented by more or fewer code points
> > 5. Conversion to NFC(K?) before conversion to a non-Unicode character set, but that may introduce further issues and adds a burden on implementations
> >
> >
> > Option 4 seems hardly implementable in all cases.
> > Options 2 and 5 offer the most consistency across implementations.
> > Options 3, 4, and 5 may be a behavior change.
> >
> > I think I have a preference for 3.
> >
> > What do you think?
>
> String literals also have an inherent length. I'm mildly opposed to normatively
> specifying a required alteration of the "source-code-apparent" length for types
> whose encodings are not variable-width to start with (u8, u16). That leaves
> 1 and 2 for me.
>
> Jens
>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16

Received on 2020-06-01 02:51:44