On Mon, 1 Jun 2020 at 08:10, Jens Maurer <Jens.Maurer@gmx.net> wrote:

On 01/06/2020 00.21, Corentin via SG16 wrote:
> Hello
>
> Consider a string literal "e\u0301" (LATIN SMALL LETTER E, COMBINING ACUTE ACCENT).
>
> There is some consensus in SG16 that this should not be normalized in phase 1, or in phase 5 if the execution encoding of that string literal encodes the Unicode character set.
> However, what should happen if the execution character set is Latin-1, for example?
>
> COMBINING ACUTE ACCENT does not have a representation in Latin-1, but the grapheme cluster LATIN SMALL LETTER E, COMBINING ACUTE ACCENT does, since LATIN SMALL LETTER E WITH ACUTE has a representation in Latin-1.
>
> This is currently implementation-defined ("e?" in MSVC, ill-formed in GCC and Clang), but the wording is specific about the conversion happening independently for each code point.
>
> I think we have several options:
>
> 1. Status quo
> 2. Making the conversion ill-formed, as per P1854R0, "Conversion to execution encoding should not lead to loss of meaning" (https://wg21.link/p1854r)
> 3. Allowing an implementation to transform each abstract character to another abstract character represented by more or fewer code points
> 4. Forcing an implementation to transform each abstract character to another abstract character represented by more or fewer code points
> 5. Conversion to NFC(K?) before conversion to a non-Unicode character set, but that may introduce further issues and adds a burden on implementations
>
>
> Option 4 seems hardly implementable in all cases.
> Options 2 and 5 offer the most consistency across implementations.
> Options 3, 4, and 5 may be behavior changes.
>
> I think I have a preference for 3.
>
> What do you think?
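
To make the options above concrete, here is a minimal sketch, taking the accent to be the combining mark U+0301 (a spacing U+00B4 would already be representable in Latin-1). All function names are hypothetical, the input is a sequence of UTF-32 code points, and the "composition" step is a one-entry toy table rather than real Unicode normalization data.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Latin-1 (ISO/IEC 8859-1) covers exactly U+0000..U+00FF.
bool fits_latin1(char32_t cp) { return cp <= 0xFF; }

char latin1_unit(char32_t cp) {
    return static_cast<char>(static_cast<unsigned char>(cp));
}

constexpr char32_t combining_acute = 0x0301; // COMBINING ACUTE ACCENT

// Status-quo flavor seen in MSVC: each code point is converted
// independently, with '?' substituted for unrepresentable ones.
std::string per_code_point(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) out += fits_latin1(cp) ? latin1_unit(cp) : '?';
    return out;
}

// Option 2 flavor: report failure instead of substituting.
// A compiler would diagnose the literal as ill-formed here.
bool strict(const std::u32string& in, std::string& out) {
    out.clear();
    for (char32_t cp : in) {
        if (!fits_latin1(cp)) return false;
        out += latin1_unit(cp);
    }
    return true;
}

// Option 3 flavor: first try to map a sequence to a single precomposed
// character. Only 'e' + U+0301 -> U+00E9 is handled (toy table).
std::string with_composition(const std::u32string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == U'e' && i + 1 < in.size() && in[i + 1] == combining_acute) {
            out += '\xE9'; // LATIN SMALL LETTER E WITH ACUTE in Latin-1
            ++i;           // the combining mark has been consumed
            continue;
        }
        out += fits_latin1(in[i]) ? latin1_unit(in[i]) : '?';
    }
    return out;
}

// Usage: the same two code points give three different outcomes.
void demo() {
    const std::u32string input = {U'e', combining_acute};
    std::string tmp;
    assert(per_code_point(input) == "e?");
    assert(!strict(input, tmp));
    assert(with_composition(input) == "\xE9");
}
```

Note that the third variant changes the number of code units relative to a per-code-point conversion (one instead of two), which is the length concern raised in the reply.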

String literals also have an inherent length. I'm mildly opposed to normatively
specifying a required alteration of the "source-code-apparent" length for types
whose encodings are not variable-width to start with (u8, u16). That leaves
1 and 2 for me.

The length of "©" will differ between UTF-8 and Latin-1, for example. It should be defined as the number of code units in the execution encoding, independently of the issue at hand.

https://godbolt.org/z/BXSdRG
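
A minimal sketch of the same point that does not depend on compiler flags: U+00A9 COPYRIGHT SIGN is spelled out as explicit code units for each encoding, and the hypothetical helper `code_units` just counts them. This is an illustration, not the content of the link above.

```cpp
#include <cstddef>

// Number of code units in a narrow string literal,
// excluding the terminating null character.
template <std::size_t N>
constexpr std::size_t code_units(const char (&)[N]) { return N - 1; }

// U+00A9 written as raw code units, so the result does not depend on
// the source or execution character set chosen by the compiler.
constexpr std::size_t utf8_len   = code_units("\xC2\xA9"); // UTF-8
constexpr std::size_t latin1_len = code_units("\xA9");     // Latin-1

static_assert(utf8_len == 2, "U+00A9 takes two code units in UTF-8");
static_assert(latin1_len == 1, "U+00A9 takes one code unit in Latin-1");
```

The same abstract character has a different length depending on the execution encoding, before any question of normalization or composition arises.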

Jens