SG16

Subject: Re: Conversion of grapheme clusters to (wide) execution encoding
From: Peter Brett (pbrett_at_[hidden])
Date: 2020-06-01 09:38:49


Hi Hubert,
As I understand it (I’m probably wrong), u, U and u8 literals are an explicit request for a UTF-encoded literal to appear in the executable.

Corentin is raising the issue that a valid ‘normal’ string literal might not be able to be losslessly encoded to the implementation-defined execution encoding. I don’t think this concern affects u, U or u8 string literals.
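
To make the concern concrete, here is a minimal C++17 sketch of what "converted independently for each code point" means when the target is Latin-1. It uses "e" followed by U+0301 COMBINING ACUTE ACCENT as the input. The helper names `to_latin1_per_code_point` and `compose_e_acute` are hypothetical and exist only for illustration; they are not part of any compiler or library, and `compose_e_acute` handles a single pair rather than implementing NFC.

```cpp
#include <cassert>
#include <optional>
#include <string>

// U+0301 COMBINING ACUTE ACCENT and U+00E9 LATIN SMALL LETTER E WITH ACUTE.
constexpr char32_t kCombiningAcute = 0x0301;
constexpr char32_t kEWithAcute = 0x00E9;

// Hypothetical helper (not a compiler API): converts each code point to
// Latin-1 independently of its neighbours.
// Returns std::nullopt when some code point has no Latin-1 representation.
std::optional<std::string> to_latin1_per_code_point(const std::u32string& s) {
    std::string out;
    for (char32_t c : s) {
        if (c > 0xFF) return std::nullopt;  // outside the Latin-1 repertoire
        out.push_back(static_cast<char>(static_cast<unsigned char>(c)));
    }
    return out;
}

// Hypothetical helper: composes only the pair <e, U+0301> into U+00E9.
// This stands in for NFC; it is not a full normalization implementation.
std::u32string compose_e_acute(const std::u32string& s) {
    std::u32string out;
    for (char32_t c : s) {
        if (c == kCombiningAcute && !out.empty() && out.back() == U'e') {
            out.back() = kEWithAcute;
        } else {
            out.push_back(c);
        }
    }
    return out;
}
```

Under these assumptions, the decomposed sequence cannot be converted code point by code point, while composing first yields the single Latin-1 byte 0xE9. The composed form is also one element shorter, so the length of the resulting literal differs from what the source text suggests.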

In the past, SG16 discussed u, U or u8 literals that do not decode successfully. These can exist in validly-encoded source files by using \x escape sequences in a literal (u8"\xff"). I’m not sure what the outcome of phase 1 of translation is for literals like this, but they would certainly not be able to be losslessly encoded in a UTF encoding. I don’t recall what the outcome of that discussion was. I’m not sure whether or not this is the same problem Corentin is describing.
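
As an illustration of why u8"\xff" cannot be well-formed UTF-8, the following sketch checks the code units of a literal against the well-formed byte sequences in the Unicode Standard. The function names `is_well_formed_utf8` and `literal_is_well_formed_utf8` are hypothetical, chosen for this example only. The template accepts both `char` (C++17) and `char8_t` (C++20) literals, since the type of a u8 literal differs between those revisions.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: returns true if the n bytes at p are well-formed UTF-8
// (no stray lead or continuation bytes, no overlong forms, no surrogates,
// nothing above U+10FFFF).
bool is_well_formed_utf8(const unsigned char* p, std::size_t n) {
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = p[i];
        std::size_t len = 0;
        unsigned long cp = 0;
        if (b < 0x80) { ++i; continue; }                              // ASCII
        else if (b >= 0xC2 && b <= 0xDF) { len = 2; cp = b & 0x1Fu; }
        else if (b >= 0xE0 && b <= 0xEF) { len = 3; cp = b & 0x0Fu; }
        else if (b >= 0xF0 && b <= 0xF4) { len = 4; cp = b & 0x07u; }
        else return false;  // 0x80..0xC1 and 0xF5..0xFF never start a sequence
        if (n - i < len) return false;                                // truncated
        for (std::size_t k = 1; k < len; ++k) {
            if ((p[i + k] & 0xC0u) != 0x80u) return false;            // not a continuation byte
            cp = (cp << 6) | (p[i + k] & 0x3Fu);
        }
        if (len == 3 && cp < 0x800) return false;                     // overlong
        if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF)) return false;
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;               // surrogate
        i += len;
    }
    return true;
}

// Hypothetical convenience wrapper for string literals; N includes the
// terminating null, which is excluded from the check.
template <class CharT, std::size_t N>
bool literal_is_well_formed_utf8(const CharT (&lit)[N]) {
    static_assert(sizeof(CharT) == 1, "expects a char or char8_t literal");
    return is_well_formed_utf8(reinterpret_cast<const unsigned char*>(lit), N - 1);
}
```

With this check, `literal_is_well_formed_utf8(u8"\xff")` is false because 0xFF can never appear in UTF-8, while a literal written with a universal-character-name such as u8"\u00e9" passes.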

                Peter

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Sent: 01 June 2020 15:05
To: SG16 <sg16_at_[hidden]>
Cc: Peter Brett <pbrett_at_[hidden]>; Corentin <corentin.jabot_at_[hidden]>
Subject: Re: [SG16] Conversion of grapheme clusters to (wide) execution encoding

On Mon, Jun 1, 2020 at 5:22 AM Corentin via SG16 <sg16_at_[hidden]> wrote:


On Mon, 1 Jun 2020 at 09:48, Peter Brett <pbrett_at_[hidden]> wrote:
Out of the 5 options laid out, I feel it would be best to make it ill-formed. The source code author's intent is (at best) ambiguous, and the onus should not be placed on the compiler to try to make it work. The status quo allows for programs to be broken in surprising ways.

I think you might be right.
Especially given that making it work adds implementation burden for the sake of legacy encodings, and there is a somewhat easy fix: NFC-normalize your sources.
Wait, does this apply to u, U, and u8 strings? Users can't have non-NFC-normalized strings?



Best regards,

                     Peter

> -----Original Message-----
> From: SG16 <sg16-bounces_at_[hidden]> On Behalf Of Jens Maurer via SG16
> Sent: 01 June 2020 07:10
> To: sg16_at_[hidden]
> Cc: Jens Maurer <Jens.Maurer_at_[hidden]>; Corentin <corentin.jabot_at_[hidden]>
> Subject: Re: [SG16] Conversion of grapheme clusters to (wide) execution encoding
>
>
> On 01/06/2020 00.21, Corentin via SG16 wrote:
> > Hello
> >
> > Consider a string literal "e\u0301" (LATIN SMALL LETTER E, COMBINING ACUTE ACCENT).
> >
> > There is some consensus in SG16 that this should not be normalized in
> > phase 1, or in phase 5 if the execution encoding of that string literal
> > encodes the Unicode character set.
> > However, what should happen if the execution character set is Latin-1,
> > for example?
> >
> > The accent on its own does not have a representation in Latin-1, but the
> > grapheme cluster formed by LATIN SMALL LETTER E and the accent does, because
> > LATIN SMALL LETTER E WITH ACUTE has a representation in Latin-1.
> >
> > This is currently implementation-defined ("e?" in MSVC, ill-formed in
> > GCC and Clang), but the wording is specific about the conversion happening
> > independently for each code point.
> >
> > I think we have several options:
> >
> > 1. Status quo
> > 2. Making the conversion ill-formed, as per P1854R0, "Conversion to
> > execution encoding should not lead to loss of meaning": https://wg21.link/p1854r
> > 3. Allowing an implementation to transform each abstract character to
> > another abstract character represented by more or fewer code points
> > 4. Forcing an implementation to transform each abstract character to
> > another abstract character represented by more or fewer code points
> > 5. Conversion to NFC(K?) before conversion to a non-Unicode character
> > set, but that may introduce further issues and adds burden on
> > implementations
> >
> >
> > Option 4 seems hardly implementable in all cases.
> > Options 2 and 5 offer the most consistency across implementations.
> > Options 3, 4, and 5 may be a behavior change.
> >
> > I think I have a preference for 3.
> >
> > What do you think?
>
> String literals also have an inherent length. I'm mildly opposed to
> normatively specifying a required alteration of the "source-code-apparent"
> length for types whose encodings are not variable-width to start with
> (u8, u16). That leaves 1 and 2 for me.
>
> Jens
>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16



SG16 list run by sg16-owner@lists.isocpp.org