On Mon, Jun 1, 2020 at 10:13 AM Corentin <corentin.jabot@gmail.com> wrote:


On Mon, Jun 1, 2020, 16:04 Hubert Tong <hubert.reinterpretcast@gmail.com> wrote:
On Mon, Jun 1, 2020 at 5:22 AM Corentin via SG16 <sg16@lists.isocpp.org> wrote:


On Mon, 1 Jun 2020 at 09:48, Peter Brett <pbrett@cadence.com> wrote:
Out of the 5 options laid out, I feel it would be best to make it ill-formed.  The source code author's intent is (at best) ambiguous, and the onus should not be placed on the compiler to try to make it work.  The status quo allows for programs to be broken in surprising ways.

I think you might be right.
Especially given that making it work adds implementation burden for the sake of legacy encodings, and there is a somewhat easy fix: NFC-normalize your sources.
Wait, does this apply to u, U, and u8 strings? Users can't have non-NFC-normalized strings?

Well, that was the original question:
If the source is NFD-normalized, and the execution encoding is not a Unicode encoding, what should be done about combining characters?
Okay. I guess I got the answer. For "plain" strings, there is some desire to make some problematic sequences go away. The perceived badness goes away with NFC normalization, but users who apply a broad NFC normalization to their Unicode source may lose the integrity of their Unicode literals.
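
To make the "problematic sequences" concrete, here is a minimal sketch of a conversion that handles each code point independently, with no normalization. It assumes the combining mark in question is U+0301 COMBINING ACUTE ACCENT and that Latin-1 maps exactly U+0000..U+00FF; the helper to_latin1 is hypothetical and is not how any particular compiler implements phase 5.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: convert UCS scalar values to Latin-1 one code point
// at a time, without normalizing first. Returns std::nullopt as soon as a
// code point has no Latin-1 representation.
std::optional<std::string> to_latin1(std::u32string_view in) {
    std::string out;
    for (char32_t c : in) {
        if (c > 0xFF) return std::nullopt;  // Latin-1 covers U+0000..U+00FF only
        out.push_back(static_cast<char>(static_cast<unsigned char>(c)));
    }
    return out;
}

void check_examples() {
    // NFD spelling: 'e' followed by U+0301, which Latin-1 cannot represent.
    assert(!to_latin1(U"e\u0301").has_value());
    // NFC spelling: U+00E9 maps to the single Latin-1 byte 0xE9.
    assert(to_latin1(U"\u00E9") == std::string("\xE9"));
}
```

Under these assumptions the NFD spelling fails and the NFC spelling succeeds, which is why NFC-normalizing the source makes the problem disappear for "plain" strings.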

The proposed ill-formedness depends on how "characters" in source code are identified, though. Given the intent not to impose any specific normalization (and therefore to preserve non-normalized text) so that Unicode literals keep their integrity through the phases of translation, by "characters" we mean (for UTF-8 source) UCS scalar values.
 

And yes, normalizing the source does not preserve Unicode literals.
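
As an illustration of why normalization changes a Unicode literal, the sketch below spells the same user-perceived character both ways with universal-character-names, so it does not depend on the source file's encoding. It assumes C++17 or later; the variable names are made up for the example.

```cpp
#include <cstddef>
#include <string_view>

// "é" in NFC: one scalar value, LATIN SMALL LETTER E WITH ACUTE.
constexpr std::u32string_view nfc = U"\u00E9";
// "é" in NFD: 'e' followed by COMBINING ACUTE ACCENT.
constexpr std::u32string_view nfd = U"e\u0301";

static_assert(nfc.size() == 1);
static_assert(nfd.size() == 2);
static_assert(nfc != nfd);  // different sequences of scalar values

// UTF-8 code unit counts, excluding the terminating null.
constexpr std::size_t nfc_utf8_len = sizeof(u8"\u00E9") - 1;  // C3 A9
constexpr std::size_t nfd_utf8_len = sizeof(u8"e\u0301") - 1; // 65 CC 81
static_assert(nfc_utf8_len == 2);
static_assert(nfd_utf8_len == 3);
```

A tool that NFC-normalizes the whole source file would silently turn the second spelling into the first if it were written with literal characters rather than escapes, changing both the contents and the length of the literal.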

The question is: is that a reasonable workaround?

As for normalization, I am writing wording to make sure normalization is preserved in phase 1 when applicable.



 
 

Best regards,

                     Peter

> -----Original Message-----
> From: SG16 <sg16-bounces@lists.isocpp.org> On Behalf Of Jens Maurer via
> SG16
> Sent: 01 June 2020 07:10
> To: sg16@lists.isocpp.org
> Cc: Jens Maurer <Jens.Maurer@gmx.net>; Corentin <corentin.jabot@gmail.com>
> Subject: Re: [SG16] Conversion of grapheme clusters to (wide) execution
> encoding
>
>
> On 01/06/2020 00.21, Corentin via SG16 wrote:
> > Hello
> >
> > Consider a string literal "e\u0301" (LATIN SMALL LETTER E, COMBINING
> > ACUTE ACCENT).
> >
> > There is some consensus in SG16 that this should not be normalized in
> > phase 1, or in phase 5 if the execution encoding of that string literal
> > encodes the Unicode character set.
> > However, what should happen if the execution character set is Latin-1,
> > for example?
> >
> > COMBINING ACUTE ACCENT does not have a representation in Latin-1, but
> > the grapheme cluster LATIN SMALL LETTER E, COMBINING ACUTE ACCENT does,
> > as LATIN SMALL LETTER E WITH ACUTE has a representation in the Latin-1
> > character set.
> >
> > This is currently implementation-defined ("e?" in MSVC, ill-formed in
> > GCC and Clang), but the wording is specific about the conversion
> > happening independently for each code point.
> >
> > I think we have several options:
> >
> >  1. Status quo
> >  2. Making the conversion ill-formed, as per P1854R0 "Conversion to
> >     execution encoding should not lead to loss of meaning"
> >     https://wg21.link/p1854r
> >  3. Allowing an implementation to transform each abstract character to
> >     another abstract character represented by more or fewer code points.
> >  4. Forcing an implementation to transform each abstract character to
> >     another abstract character represented by more or fewer code points.
> >  5. Conversion to NFC(K?) before conversion to a non-Unicode character
> >     set, but that may introduce further issues and adds burden on
> >     implementations.
> >
> >
> > Option 4 seems hardly implementable in all cases.
> > Options 2 and 5 offer the most consistency across implementations.
> > Options 3, 4, and 5 may be a behavior change.
> >
> > I think I have a preference for 3.
> >
> > What do you think?
>
> String literals also have an inherent length.  I'm mildly opposed to
> normatively specifying a required alteration of the "source-code-apparent"
> length for types whose encodings are not variable-width to start with
> (u8, u16).  That leaves 1 and 2 for me.
>
> Jens
>
> --
> SG16 mailing list
> SG16@lists.isocpp.org
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16