Re: [SG16] Conversion of grapheme clusters to (wide) execution encoding

From: Corentin <corentin.jabot_at_[hidden]>
Date: Mon, 1 Jun 2020 19:30:26 +0200
On Mon, 1 Jun 2020 at 19:26, Tom Honermann <tom_at_[hidden]> wrote:

> On 5/31/20 6:21 PM, Corentin via SG16 wrote:
>
> Hello
>
> Consider a string literal "e\u00B4" (LATIN SMALL LETTER E, ACUTE ACCENT).
>
> I think you meant "e\u0301" here. U+00B4 is not a combining acute accent;
> U+0301 is.
>
>
> There is some consensus in SG-16 that this should not be normalized in
> phase 1, or in phase 5 if the execution encoding of that string literal
> encodes the Unicode character set.
> However, what should happen if the execution character set is Latin 1, for
> example?
>
> ACUTE ACCENT does not have a representation in Latin 1, but the grapheme
> cluster LATIN SMALL LETTER E, ACUTE ACCENT does, as LATIN SMALL LETTER E
> WITH ACUTE has a representation in the Latin 1 character set.
>
> And I think you mean COMBINING ACUTE ACCENT here (U+0301).
>
>
> This is currently implementation-defined ("e?" in MSVC, ill-formed in GCC
> and Clang), but the wording is specific about the conversion happening
> independently for each code point.
>
> I think we have several options:
>
> 1. Status quo
> 2. Making the conversion ill-formed as per P1854R0 Conversion to
> execution encoding should not lead to loss of meaning
> https://wg21.link/p1854r
> 3. Allowing an implementation to transform each abstract character to
> another abstract character represented by more or fewer code points
> 4. Forcing an implementation to transform each abstract character to
> another abstract character represented by more or fewer code points
> 5. Conversion to NFC(K?) before conversion to a non-Unicode character
> set, but that may introduce further issues and adds a burden on
> implementations
>
>
> Option 4 seems hardly implementable in all cases.
> Options 2 and 5 offer the most consistency across implementations.
> Options 3, 4, and 5 may be a behavior change.
>
> I think I have a preference for 3.
>
> What do you think?
>
> My answer depends on whether you really intended U+00B4 as opposed to U+0301.
>
I did mean U+0301, sorry about that (or any valid code point sequence that
constitutes a single grapheme or abstract character with a known
representation in the literal's encoding).
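
To make the difference concrete, here is a rough sketch (not taken from any
compiler) that contrasts converting each code point independently with
composing first, roughly in the spirit of option 3. The helper names
to_latin1, compose, convert_per_code_point and convert_with_composition are
made up for illustration, the composition table deliberately knows only the
pair e + U+0301, and an empty optional stands in for "ill-formed".

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: Latin 1 covers exactly U+0000..U+00FF.
std::optional<char> to_latin1(char32_t cp) {
    if (cp <= 0xFF) return static_cast<char>(static_cast<unsigned char>(cp));
    return std::nullopt;
}

// Hypothetical, deliberately tiny composition table:
// only LATIN SMALL LETTER E + COMBINING ACUTE ACCENT -> U+00E9.
std::optional<char32_t> compose(char32_t base, char32_t mark) {
    if (base == U'e' && mark == 0x0301) return char32_t{0x00E9};
    return std::nullopt;
}

// Each code point is converted on its own; an unrepresentable code point
// makes the whole conversion fail (modelled here as an empty optional).
std::optional<std::string> convert_per_code_point(std::u32string_view in) {
    std::string out;
    for (char32_t cp : in) {
        auto byte = to_latin1(cp);
        if (!byte) return std::nullopt;
        out.push_back(*byte);
    }
    return out;
}

// Adjacent pairs are composed when the table allows it, then each
// resulting code point is converted as above.
std::optional<std::string> convert_with_composition(std::u32string_view in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (i + 1 < in.size()) {
            if (auto composed = compose(cp, in[i + 1])) {
                cp = *composed;
                ++i;  // the combining mark has been consumed
            }
        }
        auto byte = to_latin1(cp);
        if (!byte) return std::nullopt;
        out.push_back(*byte);
    }
    return out;
}
```

With this sketch, U"e\u0301" fails in convert_per_code_point because U+0301
has no Latin 1 representation, while convert_with_composition yields the
single byte 0xE9. A real implementation would need the full Unicode
composition data rather than a one-entry table.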

> Tom.
>

Received on 2020-06-01 12:33:44