Re: [SG16] Conversion of grapheme clusters to (wide) execution encoding

From: Corentin <corentin.jabot_at_[hidden]>
Date: Mon, 1 Jun 2020 11:08:41 +0200
On Mon, 1 Jun 2020 at 08:10, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:

> On 01/06/2020 00.21, Corentin via SG16 wrote:
> > Hello
> >
> > Consider a string literal "e\u0301" (LATIN SMALL LETTER E, COMBINING
> > ACUTE ACCENT).
> >
> > There is some consensus in SG-16 that this should not be normalized in
> > phase 1, or in phase 5 if the execution encoding of that string literal
> > encodes the Unicode character set.
> > However, what should happen if the execution character set is Latin-1,
> > for example?
> >
> > COMBINING ACUTE ACCENT does not have a representation in Latin-1, but the
> > grapheme cluster LATIN SMALL LETTER E, COMBINING ACUTE ACCENT does, as
> > LATIN SMALL LETTER E WITH ACUTE has a representation in the Latin-1
> > character set.
> >
> > This is currently implementation-defined ("e?" in MSVC, ill-formed in
> > GCC and Clang), but the wording is specific about the conversion happening
> > independently for each code point.
> >
> > I think we have several options:
> >
> > 1. Status quo.
> > 2. Making the conversion ill-formed, as per P1854R0 "Conversion to
> >    execution encoding should not lead to loss of meaning"
> >    https://wg21.link/p1854r
> > 3. Allowing an implementation to transform each abstract character to
> >    another abstract character represented by more or fewer code points.
> > 4. Forcing an implementation to transform each abstract character to
> >    another abstract character represented by more or fewer code points.
> > 5. Conversion to NFC(K?) before conversion to a non-Unicode character
> >    set, but that may introduce further issues and adds a burden on
> >    implementations.
> >
> >
> > Option 4 seems hardly implementable in all cases.
> > Options 2 and 5 offer the most consistency across implementations.
> > Options 3, 4 and 5 may be a behavior change.
> >
> > I think I have a preference for 3.
> >
> > What do you think?
>
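
To make the difference between per-code-point conversion and option 3 concrete, here is a minimal C++17 sketch. All names (`in_latin1`, `compose`, `to_latin1_per_code_point`, `to_latin1_composing`) are hypothetical, the composition table is a toy that knows a single pair, and reporting failure via `std::nullopt` is only one way to model the "ill-formed" outcome. It is not how any compiler actually implements this.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Assumption for this sketch: Latin-1 covers exactly U+0000..U+00FF.
constexpr bool in_latin1(char32_t c) { return c <= 0xFF; }

// Toy composition table: knows only 'e' + U+0301 -> U+00E9.
// A real implementation would consult the Unicode composition data.
constexpr char32_t compose(char32_t base, char32_t mark) {
    if (base == U'e' && mark == 0x0301) return 0x00E9;
    return 0;  // no composition known
}

// Each code point is converted independently; the first code point
// without a Latin-1 representation makes the whole conversion fail.
std::optional<std::string> to_latin1_per_code_point(std::u32string_view in) {
    std::string out;
    for (char32_t c : in) {
        if (!in_latin1(c)) return std::nullopt;
        out.push_back(static_cast<char>(c));
    }
    return out;
}

// Option 3 flavour: compose adjacent pairs first, then convert.
std::optional<std::string> to_latin1_composing(std::u32string_view in) {
    std::u32string composed;
    for (char32_t c : in) {
        if (!composed.empty()) {
            if (char32_t pre = compose(composed.back(), c)) {
                composed.back() = pre;  // replace base with precomposed form
                continue;
            }
        }
        composed.push_back(c);
    }
    return to_latin1_per_code_point(composed);
}
```

With the input `U"e\u0301"`, the first function fails because U+0301 is outside Latin-1, while the second yields the single code unit `0xE9`. That also shows how the resulting length can differ from what the source spelling suggests.
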
> String literals also have an inherent length. I'm mildly opposed to
> normatively specifying a required alteration of the "source-code-apparent"
> length for types whose encodings are not variable-width to start with
> (u8, u16). That leaves 1 and 2 for me.
>

The length of "©" will be different in UTF-8 and Latin-1, for example. It
should be defined as the number of code units in the execution encoding,
independently of the issue at hand.

https://godbolt.org/z/BXSdRG
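
As a rough illustration of the same point (not necessarily what the link above shows), the sketch below spells out the code units of "©" (U+00A9) with hex escapes, so the result does not depend on the compiler's execution character set. The names `copyright_utf8`, `copyright_latin1` and `code_units` are made up for this example.

```cpp
#include <cstddef>

// Code units of U+00A9 in each encoding, written as hex escapes so the
// arrays are the same regardless of compiler flags.
constexpr char copyright_utf8[]   = "\xC2\xA9";  // UTF-8: two code units
constexpr char copyright_latin1[] = "\xA9";      // Latin-1: one code unit

// Hypothetical helper: length in code units, excluding the terminating null.
template <std::size_t N>
constexpr std::size_t code_units(const char (&)[N]) { return N - 1; }

static_assert(code_units(copyright_utf8) == 2, "UTF-8 needs two code units");
static_assert(code_units(copyright_latin1) == 1, "Latin-1 needs one code unit");
```

The same abstract character therefore already has a different length depending on the execution encoding, before any question of composition arises.
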




>
> Jens
>
>

Received on 2020-06-01 04:11:59