On 3/2/21 4:35 AM, Corentin via SG16 wrote:


On Mon, Mar 1, 2021 at 10:32 PM Hubert Tong <hubert.reinterpretcast@gmail.com> wrote:
On Mon, Mar 1, 2021 at 10:24 AM Corentin via SG16 <sg16@lists.isocpp.org> wrote:
Hey folks!
Last meeting we talked about the relation between the literal & execution encoding.

I think there is pressure to solve this issue (encoding names, std::print, other features).
In P2297, I suggested that we say the execution character set is a superset of the literal character set, such that any character in the literal character set results in the same code unit sequence
whether it is encoded in the literal encoding or execution encoding.

Hubert was concerned this was too restrictive because some EBCDIC and ISO 646 variants have code points reserved for "national symbols".
Even Shift-JIS is not 100% ASCII-compatible (yen sign instead of backslash, overline instead of tilde).

I've been thinking about that over the past few days. I think the solution is to put requirements not on the literal character set but on the literals themselves.

If the execution encoding is UTF-8, "ABC" is interpreted identically whether its encoding is ASCII, ISO 646-IT, or Shift-JIS.

However, "C:\\" would be interpreted as "C:\\", "C:ç", and "C:¥" respectively.

So we only need to put requirements on the content of individual literals rather than on the entire literal character set (which, P1885 notwithstanding, is not observable during execution anyhow).


A way to word that:

The execution encoding is the locale-specific encoding used to interpret character and NTMBS parameters in character functions, multibyte character functions, and other locale-specific functions.
I think we can start from something like this. I am guessing that the parallel treatment for wide strings is intended?

Indeed!
That said, do we know of platforms where the wide literal encoding and the wide execution encoding would differ?

We do.  From https://www.ibm.com/support/knowledgecenter/ssw_aix_71/globalization/globalization_pdf.pdf, AIX globalization, Code sets for multicultural support, Data representation, Wide character data representation (page 45):

On the AIX operating system, the wchar_t data type is 32-bit in the 64-bit environment and 16-bit in the
32-bit environment. The locale methods are standardized such that in most locales, the value that is
stored in the wchar_t for a particular character is always its Unicode data value. For applications that are
intended to run only on AIX, it allows certain applications to handle the wchar_t data type in a consistent
fashion, even if the underlying code set is unknown. All locales use Unicode for their wide character code
values (process code), except the IBM-eucTW code set. The IBM-eucTW code set (LANG =zh_TW)
contains many characters that are not contained in the Unicode standard. As a result, it is impossible to
represent these characters with a Unicode-wide character value. Applications that are required to have
Unicode-based wchar_t data for Traditional Chinese must use the Zh_TW locale (big5 code set) instead.
Tom.