SG16

Subject: Re: Conversion of grapheme clusters to (wide) execution encoding
From: Corentin (corentin.jabot_at_[hidden])
Date: 2020-06-01 09:13:43


On Mon, Jun 1, 2020, 16:04 Hubert Tong <hubert.reinterpretcast_at_[hidden]>
wrote:

> On Mon, Jun 1, 2020 at 5:22 AM Corentin via SG16 <sg16_at_[hidden]>
> wrote:
>
>>
>>
>> On Mon, 1 Jun 2020 at 09:48, Peter Brett <pbrett_at_[hidden]> wrote:
>>
>>> Out of the 5 options laid out, I feel it would be best to make it
>>> ill-formed. The source code author's intent is (at best) ambiguous, and
>>> the onus should not be placed on the compiler to try to make it work. The
>>> status quo allows for programs to be broken in surprising ways.
>>>
>>
>> I think you might be right.
>> Especially given that making it work adds implementation burden for the
>> sake of legacy encodings, and there is a somewhat easy fix: NFC-normalize
>> your sources.
>>
> Wait, does this apply to u, U, and u8 strings? Users can't have
> non-NFC-normalized strings?
>

Well, that was the original question:
If the source is NFD-normalized, and the execution encoding is not a
Unicode encoding, what should be done about combining characters?
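
To make the question concrete, here is a minimal sketch of a conversion
that treats each code point independently. It assumes Latin-1 as the
execution encoding and U+0301 COMBINING ACUTE ACCENT as the combining
character. The names to_latin1_strict and to_latin1_lossy are hypothetical
and do not describe what any compiler actually does internally.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: convert UTF-32 to Latin-1 one code point at a time.
// Latin-1 covers exactly U+0000..U+00FF, so anything above that has no
// representation. Returns std::nullopt in that case, which roughly models
// the "ill-formed" outcome.
std::optional<std::string> to_latin1_strict(std::u32string_view in) {
    std::string out;
    for (char32_t c : in) {
        if (c > 0xFF) return std::nullopt;
        out.push_back(static_cast<char>(static_cast<unsigned char>(c)));
    }
    return out;
}

// Hypothetical helper: same per-code-point conversion, but substitutes '?'
// for unrepresentable code points, which roughly models the "e?" outcome.
std::string to_latin1_lossy(std::u32string_view in) {
    std::string out;
    for (char32_t c : in) {
        out.push_back(c <= 0xFF
                          ? static_cast<char>(static_cast<unsigned char>(c))
                          : '?');
    }
    return out;
}
```

With these, to_latin1_strict(U"e\u0301") fails and
to_latin1_lossy(U"e\u0301") yields "e?", whereas the NFC form U"\u00E9"
converts to a single Latin-1 byte with either helper.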

And yes, normalizing the source does not preserve Unicode literals.
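
As a rough illustration of why normalizing the source changes Unicode
literals, the sketch below compares the number of code units for the NFC
and NFD spellings of the same grapheme. It assumes the decomposed form is
"e" followed by U+0301 COMBINING ACUTE ACCENT; code_units is a
hypothetical helper, and universal-character-names are used so the
example itself does not depend on how the file is normalized.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

// NFC: one code point, U+00E9 LATIN SMALL LETTER E WITH ACUTE.
constexpr std::size_t nfc_utf8  = code_units(u8"\u00E9");
constexpr std::size_t nfc_utf16 = code_units(u"\u00E9");
constexpr std::size_t nfc_utf32 = code_units(U"\u00E9");

// NFD: two code points, U+0065 followed by U+0301.
constexpr std::size_t nfd_utf8  = code_units(u8"e\u0301");
constexpr std::size_t nfd_utf16 = code_units(u"e\u0301");
constexpr std::size_t nfd_utf32 = code_units(U"e\u0301");

// The two spellings render the same but are different literals.
static_assert(nfc_utf8 == 2 && nfd_utf8 == 3, "UTF-8 lengths differ");
static_assert(nfc_utf16 == 1 && nfd_utf16 == 2, "UTF-16 lengths differ");
static_assert(nfc_utf32 == 1 && nfd_utf32 == 2, "UTF-32 lengths differ");
```

A tool that NFC-normalizes the source file would turn the second
spelling into the first wherever it is written with literal characters
rather than escapes, silently changing the size and contents of the
u8, u, and U literals.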

The question is: is that a reasonable workaround?

As for normalization, I am writing wording to make sure normalization is
preserved in phase 1 when applicable.

>
>>
>>
>>>
>>> Best regards,
>>>
>>> Peter
>>>
>>> > -----Original Message-----
>>> > From: SG16 <sg16-bounces_at_[hidden]> On Behalf Of Jens Maurer via SG16
>>> > Sent: 01 June 2020 07:10
>>> > To: sg16_at_[hidden]
>>> > Cc: Jens Maurer <Jens.Maurer_at_[hidden]>; Corentin <corentin.jabot_at_[hidden]>
>>> > Subject: Re: [SG16] Conversion of grapheme clusters to (wide) execution encoding
>>> >
>>> >
>>> > On 01/06/2020 00.21, Corentin via SG16 wrote:
>>> > > Hello
>>> > >
>>> > > Consider a string literal "e\u0301" (LATIN SMALL LETTER E, COMBINING
>>> > > ACUTE ACCENT).
>>> > >
>>> > > There is some consensus in SG16 that this should not be normalized in
>>> > > phase 1, or in phase 5 if the execution encoding of that string literal
>>> > > encodes the Unicode character set.
>>> > > However, what should happen if the execution character set is Latin-1,
>>> > > for example?
>>> > >
>>> > > COMBINING ACUTE ACCENT does not have a representation in Latin-1, but the
>>> > > grapheme cluster LATIN SMALL LETTER E, COMBINING ACUTE ACCENT does, as
>>> > > LATIN SMALL LETTER E WITH ACUTE has a representation in the Latin-1
>>> > > character set.
>>> > >
>>> > > This is currently implementation-defined ("e?" in MSVC, ill-formed in
>>> > > GCC and Clang), but the wording is specific about the conversion
>>> > > happening independently for each code point.
>>> > >
>>> > > I think we have several options:
>>> > >
>>> > > 1. Status quo
>>> > > 2. Making the conversion ill-formed, as per P1854R0 "Conversion to
>>> > >    execution encoding should not lead to loss of meaning"
>>> > >    https://wg21.link/p1854r
>>> > > 3. Allowing an implementation to transform each abstract character to
>>> > >    another abstract character represented by more or fewer code points
>>> > > 4. Forcing an implementation to transform each abstract character to
>>> > >    another abstract character represented by more or fewer code points
>>> > > 5. Conversion to NFC(K?) before conversion to a non-Unicode character
>>> > >    set, but that may introduce further issues and adds burden on
>>> > >    implementations
>>> > >
>>> > >
>>> > > Option 4 seems hardly implementable in all cases.
>>> > > Options 2 and 5 offer the most consistency across implementations.
>>> > > Options 3, 4, and 5 may be a behavior change.
>>> > >
>>> > > I think I have a preference for 3.
>>> > >
>>> > > What do you think?
>>> >
>>> > String literals also have an inherent length. I'm mildly opposed to
>>> > normatively specifying a required alteration of the
>>> > "source-code-apparent" length for types whose encodings are not
>>> > variable-width to start with (u8, u16). That leaves 1 and 2 for me.
>>> >
>>> > Jens
>>> >
>>> > --
>>> > SG16 mailing list
>>> > SG16_at_[hidden]
>>> >
>>> > https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>
>>
>



SG16 list run by sg16-owner@lists.isocpp.org