SG16

Subject: Re: Conversion of grapheme clusters to (wide) execution encoding
From: Hubert Tong (hubert.reinterpretcast_at_[hidden])
Date: 2020-06-01 10:47:40


On Mon, Jun 1, 2020 at 10:13 AM Corentin <corentin.jabot_at_[hidden]> wrote:

>
>
> On Mon, Jun 1, 2020, 16:04 Hubert Tong <hubert.reinterpretcast_at_[hidden]>
> wrote:
>
>> On Mon, Jun 1, 2020 at 5:22 AM Corentin via SG16 <sg16_at_[hidden]>
>> wrote:
>>
>>>
>>>
>>> On Mon, 1 Jun 2020 at 09:48, Peter Brett <pbrett_at_[hidden]> wrote:
>>>
>>>> Out of the 5 options laid out, I feel it would be best to make it
>>>> ill-formed. The source code author's intent is (at best) ambiguous, and
>>>> the onus should not be placed on the compiler to try to make it work. The
>>>> status quo allows for programs to be broken in surprising ways.
>>>>
>>>
>>> I think you might be right.
>>> Especially given that making it work adds implementation burden for the
>>> sake of legacy encodings, and there is a somewhat easy fix: NFC-normalize
>>> your sources.
>>>
>> Wait, does this apply to u, U, and u8 strings? Users can't have
>> non-NFC-normalized strings?
>>
>
> Well, that was the original question:
> If the source is NFD-normalized, and the execution encoding is not a
> Unicode encoding, what should be done about combining characters?
>
Okay. I guess I got the answer. For "plain" strings, there is some desire
to make some problematic sequences go away. The perceived badness goes away
with NFC normalization, but users who apply a broad NFC normalization to
their Unicode source may lose the integrity of their Unicode literals.
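
To make the "plain" string case concrete, here is a minimal sketch of a
conversion that treats each UCS scalar value independently. It assumes the
narrow execution encoding is ISO/IEC 8859-1 (Latin-1) and takes the combining
mark in the example to be U+0301 COMBINING ACUTE ACCENT. The helper name
to_latin1_per_scalar is made up for illustration; this is not how any
particular compiler implements phase 5.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper (not from any compiler): converts each UCS scalar
// value to Latin-1 independently, with no normalization or composition.
// Returns std::nullopt as soon as one scalar value is not representable.
std::optional<std::string> to_latin1_per_scalar(std::u32string_view in) {
    std::string out;
    for (char32_t c : in) {
        // ISO/IEC 8859-1 code points map one-to-one onto U+0000..U+00FF.
        if (c > 0xFF) {
            return std::nullopt;
        }
        out.push_back(static_cast<char>(static_cast<unsigned char>(c)));
    }
    return out;
}
```

With this sketch, U"\u00E9" (the NFC form) converts to the single byte 0xE9,
while U"e\u0301" (the NFD form) is rejected because U+0301 on its own has no
Latin-1 representation.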

The proposed ill-formedness is rather dependent on how "characters" in
source code are identified, though. Given the intent to preserve
normalization (of no specific form), and therefore also non-normalization,
in order to maintain the integrity of Unicode literals during the phases of
translation, by "characters" we do mean (for UTF-8 source) UCS scalar values.

>
> And yes, normalizing the source does not preserve Unicode literals.
>
> The question is, is that a reasonable workaround?
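
As a small illustration of why normalizing the source is not transparent for
Unicode literals, the following sketch counts code units in the NFC and NFD
spellings of the same user-perceived character. The helper code_units is
hypothetical, and U+0301 COMBINING ACUTE ACCENT is my assumption for the
combining mark; the counts themselves follow from the UTF-8 and UTF-32
encoding forms.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// NFC spelling: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
// NFD spelling: U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT.
static_assert(code_units(u8"\u00E9") == 2, "NFC: two UTF-8 code units");
static_assert(code_units(u8"e\u0301") == 3, "NFD: three UTF-8 code units");
static_assert(code_units(U"\u00E9") == 1, "NFC: one UTF-32 code unit");
static_assert(code_units(U"e\u0301") == 2, "NFD: two UTF-32 code units");
```

So a tool that rewrites the source file to NFC changes both the length and the
contents of a u8 or U literal that was written in decomposed form.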
>

> As for normalization, I am writing wording to make sure normalization is
> preserved in phase 1 when applicable.
>>>>
>>>> Best regards,
>>>>
>>>> Peter
>>>>
>>>> > -----Original Message-----
>>>> > From: SG16 <sg16-bounces_at_[hidden]> On Behalf Of Jens Maurer via SG16
>>>> > Sent: 01 June 2020 07:10
>>>> > To: sg16_at_[hidden]
>>>> > Cc: Jens Maurer <Jens.Maurer_at_[hidden]>; Corentin <corentin.jabot_at_[hidden]>
>>>> > Subject: Re: [SG16] Conversion of grapheme clusters to (wide) execution encoding
>>>> >
>>>> > On 01/06/2020 00.21, Corentin via SG16 wrote:
>>>> > > Hello
>>>> > >
>>>> > > Consider a string literal "e\u0301" (LATIN SMALL LETTER E,
>>>> > > COMBINING ACUTE ACCENT).
>>>> > >
>>>> > > There is some consensus in SG16 that this should not be normalized
>>>> > > in phase 1, or in phase 5 if the execution encoding of that string
>>>> > > literal encodes the Unicode character set.
>>>> > > However, what should happen if the execution character set is
>>>> > > Latin-1, for example?
>>>> > >
>>>> > > COMBINING ACUTE ACCENT does not have a representation in Latin-1,
>>>> > > but the grapheme cluster LATIN SMALL LETTER E, COMBINING ACUTE
>>>> > > ACCENT does, as LATIN SMALL LETTER E WITH ACUTE has a
>>>> > > representation in the Latin-1 character set.
>>>> > >
>>>> > > This is currently implementation-defined ("e?" in MSVC, ill-formed
>>>> > > in GCC and Clang), but the wording is specific about the conversion
>>>> > > happening independently for each code point.
>>>> > >
>>>> > > I think we have several options:
>>>> > >
>>>> > > 1. Status quo.
>>>> > > 2. Making the conversion ill-formed, as per P1854R0 "Conversion to
>>>> > >    execution encoding should not lead to loss of meaning",
>>>> > >    https://wg21.link/p1854r
>>>> > > 3. Allowing an implementation to transform each abstract character
>>>> > >    to another abstract character represented by more or fewer code
>>>> > >    points.
>>>> > > 4. Forcing an implementation to transform each abstract character
>>>> > >    to another abstract character represented by more or fewer code
>>>> > >    points.
>>>> > > 5. Conversion to NFC(K?) before conversion to a non-Unicode
>>>> > >    character set, but that may introduce further issues and adds
>>>> > >    burden on implementations.
>>>> > >
>>>> > >
>>>> > > Option 4 seems hardly implementable in all cases.
>>>> > > Options 2 and 5 offer the most consistency across implementations.
>>>> > > Options 3, 4, and 5 may be a behavior change.
>>>> > >
>>>> > > I think I have a preference for 3.
>>>> > >
>>>> > > What do you think?
>>>> >
>>>> > String literals also have an inherent length. I'm mildly opposed to
>>>> > normatively specifying a required alteration of the
>>>> > "source-code-apparent" length for types whose encodings are not
>>>> > variable-width to start with (u8, u16). That leaves 1 and 2 for me.
>>>> >
>>>> > Jens
>>>> >


