[SG16] Conversion of grapheme clusters to (wide) execution encoding

From: Corentin <corentin.jabot_at_[hidden]>
Date: Mon, 1 Jun 2020 00:21:46 +0200

Hello,

Consider a string literal "e\u0301" (LATIN SMALL LETTER E, COMBINING ACUTE
ACCENT).

There is some consensus in SG16 that this should not be normalized in
phase 1, or in phase 5 if the execution encoding of that string literal
encodes the Unicode character set.
However, what should happen if the execution character set is Latin-1, for
example?

The accent by itself does not have a representation in Latin-1, but the
grapheme cluster formed by the letter and the accent does, as LATIN SMALL
LETTER E WITH ACUTE has a representation in the Latin-1 character set.

This is currently implementation-defined ("e?" in MSVC, ill-formed in GCC
and Clang), but the wording is specific about the conversion happening
independently for each code point.
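
To make the per-code-point behavior concrete, here is a minimal sketch, not
taken from any compiler. It assumes the accent in the example is U+0301
COMBINING ACUTE ACCENT and that the execution encoding is ISO-8859-1, whose
byte values match code points U+0000 to U+00FF. The function names
to_latin1_strict and to_latin1_lossy are hypothetical.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: converts each code point on its own and gives up
// on the first one that Latin-1 cannot represent (roughly the "ill-formed"
// outcome described above).
std::optional<std::string> to_latin1_strict(std::u32string_view in) {
    std::string out;
    for (char32_t cp : in) {
        if (cp > 0xFF) return std::nullopt;  // no Latin-1 byte for this code point
        out.push_back(static_cast<char>(static_cast<unsigned char>(cp)));
    }
    return out;
}

// Hypothetical helper: same per-code-point walk, but unrepresentable code
// points are replaced by '?' (roughly the "e?" outcome described above).
std::string to_latin1_lossy(std::u32string_view in) {
    std::string out;
    for (char32_t cp : in) {
        out.push_back(cp <= 0xFF
                          ? static_cast<char>(static_cast<unsigned char>(cp))
                          : '?');
    }
    return out;
}
```

With the input U"e\u0301", the strict variant returns no value and the lossy
variant returns "e?", because neither ever looks at the letter and the accent
together.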

I think we have several options:

   1. Status quo
   2. Making the conversion ill-formed as per P1854R0 "Conversion to
   execution encoding should not lead to loss of meaning"
   https://wg21.link/p1854r
   3. Allowing an implementation to transform each abstract character to
   another abstract character represented by more or fewer code points
   4. Forcing an implementation to transform each abstract character to
   another abstract character represented by more or fewer code points
   5. Conversion to NFC(K?) before conversion to a non-Unicode character
   set, but that may introduce further issues and adds a burden on
   implementations


Option 4 seems hardly implementable in all cases.
Options 2 and 5 offer the most consistency across implementations.
Options 3, 4, and 5 may be a behavior change.
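
As a rough illustration of what options 3 and 5 could look like, the sketch
below composes a base letter with a following combining mark before the
Latin-1 conversion. It assumes the accent is U+0301 COMBINING ACUTE ACCENT.
Both compose_pair and compose_then_to_latin1 are hypothetical names, and the
table covers only a handful of pairs; a real implementation would need full
Unicode normalization data.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical, deliberately tiny composition table:
// (base letter, combining mark) -> precomposed code point.
inline std::optional<char32_t> compose_pair(char32_t base, char32_t mark) {
    if (mark != 0x0301) return std::nullopt;  // only COMBINING ACUTE ACCENT
    switch (base) {
        case U'a': return char32_t{0x00E1};   // LATIN SMALL LETTER A WITH ACUTE
        case U'e': return char32_t{0x00E9};   // LATIN SMALL LETTER E WITH ACUTE
        case U'i': return char32_t{0x00ED};   // LATIN SMALL LETTER I WITH ACUTE
        case U'o': return char32_t{0x00F3};   // LATIN SMALL LETTER O WITH ACUTE
        case U'u': return char32_t{0x00FA};   // LATIN SMALL LETTER U WITH ACUTE
        default:   return std::nullopt;
    }
}

// Composes adjacent pairs where the table allows it, then converts to
// Latin-1. Returns no value if something still cannot be represented.
std::optional<std::string> compose_then_to_latin1(std::u32string_view in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (i + 1 < in.size()) {
            if (auto composed = compose_pair(cp, in[i + 1])) {
                cp = *composed;
                ++i;  // the combining mark has been consumed
            }
        }
        if (cp > 0xFF) return std::nullopt;   // still not representable
        out.push_back(static_cast<char>(static_cast<unsigned char>(cp)));
    }
    return out;
}
```

Here U"e\u0301" converts to the single byte 0xE9, while a sequence such as
U"x\u0301" still fails, which is one reason forcing the transformation in
every case looks hard.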

I think I have a preference for 3.

What do you think?

Received on 2020-05-31 17:25:08