On Mon, 1 Jun 2020 at 19:26, Tom Honermann <tom@honermann.net> wrote:

On 5/31/20 6:21 PM, Corentin via SG16 wrote:

Hello

Consider a string literal "e\u00B4" (LATIN SMALL LETTER E, ACUTE ACCENT).

I think you meant "e\u0301" here. U+00B4 is not a combining acute accent; U+0301 is.

There is some consensus in SG-16 that this should not be normalized in phase 1, or in phase 5 if the execution encoding of that string literal encodes the Unicode character set.

However, what should happen if the execution character set is Latin 1, for example?

ACUTE ACCENT does not have a representation in Latin 1, but the grapheme cluster LATIN SMALL LETTER E, ACUTE ACCENT does, as LATIN SMALL LETTER E WITH ACUTE has a representation in the Latin 1 character set.

And I think you mean COMBINING ACUTE ACCENT here (U+0301).

This is currently implementation-defined ("e?" in MSVC, ill-formed in GCC and Clang), but the wording is specific about the conversion happening independently for each code point.
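
To make the per-code-point behavior concrete, here is a minimal sketch of what converting each code point independently to Latin 1 looks like. The function name `to_latin1_per_code_point` is hypothetical, and the '?' substitution is only an assumption chosen to mirror the MSVC result quoted above; it is not how any particular compiler is implemented.

```cpp
#include <string>

// Hypothetical helper: converts each code point on its own to Latin 1.
// Code points above U+00FF have no Latin 1 representation; this sketch
// substitutes '?' for them (an assumption mirroring the "e?" result).
std::string to_latin1_per_code_point(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        if (cp <= 0xFF) {
            // Latin 1 code units coincide with the first 256 code points.
            out.push_back(static_cast<char>(static_cast<unsigned char>(cp)));
        } else {
            out.push_back('?');
        }
    }
    return out;
}
```

With this sketch, the sequence U+0065 U+0301 yields "e?" because U+0301 is looked at in isolation, while the precomposed U+00E9 converts to the single byte 0xE9.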

I think we have several options:

1. Status quo

2. Making the conversion ill-formed, as per P1854R0, "Conversion to execution encoding should not lead to loss of meaning" https://wg21.link/p1854r

3. Allowing an implementation to transform each abstract character to another abstract character represented by more or fewer code points

4. Forcing an implementation to transform each abstract character to another abstract character represented by more or fewer code points

5. Conversion to NFC(K?) before conversion to a non-Unicode character set, but that may introduce further issues and adds a burden on implementations
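
As a rough illustration of composing before converting to a non-Unicode character set, the sketch below composes a base letter followed by U+0301 and only then converts to Latin 1. Everything in it is an assumption made for illustration: the names `compose_with_acute`, `compose_acute`, and `to_latin1_composed` are hypothetical, and the five-entry table is a toy stand-in for NFC, which a real implementation would take from full Unicode data (for example via a library such as ICU).

```cpp
#include <string>

constexpr char32_t kCombiningAcute = 0x0301;

// Toy composition table (NOT real NFC): lowercase vowels + U+0301 only.
// Returns 0 when this sketch knows no precomposed form.
char32_t compose_with_acute(char32_t base) {
    switch (base) {
        case U'a': return 0x00E1;
        case U'e': return 0x00E9;
        case U'i': return 0x00ED;
        case U'o': return 0x00F3;
        case U'u': return 0x00FA;
        default:   return 0;
    }
}

// Replaces <base, U+0301> with the precomposed code point where known.
std::u32string compose_acute(const std::u32string& text) {
    std::u32string out;
    for (char32_t cp : text) {
        if (cp == kCombiningAcute && !out.empty()) {
            char32_t composed = compose_with_acute(out.back());
            if (composed != 0) {
                out.back() = composed;
                continue;
            }
        }
        out.push_back(cp);
    }
    return out;
}

// Composes first, then converts. Returns false (modelling "ill-formed")
// if any remaining code point has no Latin 1 representation.
bool to_latin1_composed(const std::u32string& text, std::string& out) {
    out.clear();
    for (char32_t cp : compose_acute(text)) {
        if (cp > 0xFF) {
            return false;
        }
        out.push_back(static_cast<char>(static_cast<unsigned char>(cp)));
    }
    return true;
}
```

Under these assumptions, U+0065 U+0301 converts to the single byte 0xE9, whereas a sequence such as U+0078 U+0301, which has no precomposed form, is still rejected.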

Option 4 seems hardly implementable in all cases.

Options 2 and 5 offer the most consistency across implementations.

Options 3, 4, and 5 may be a behavior change.

I think I have a preference for 3.

What do you think?

My answer depends on whether you really intended U+00B4 as opposed to U+0301.

I did mean U+0301, sorry about that (or any valid code point sequence that constitutes a single grapheme or abstract character with a known representation in the literal's encoding).

Tom.