sg16: Re: [SG16] Towards a better description of the execution encoding

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 2 Mar 2021 10:34:09 -0500

On 3/2/21 4:35 AM, Corentin via SG16 wrote:
>
>
> On Mon, Mar 1, 2021 at 10:32 PM Hubert Tong
> <hubert.reinterpretcast_at_[hidden]> wrote:
>
> On Mon, Mar 1, 2021 at 10:24 AM Corentin via SG16
> <sg16_at_[hidden]> wrote:
>
> Hey folks!
> Last meeting we talked about the relation between the literal
> & execution encoding.
>
> I think there is pressure to solve this issue (encoding names,
> std::print, other features).
> In P2297, I suggested that we say the execution character set
> is a superset of the literal character set, such that any
> character in the literal character set results in the same
> code unit sequence
> whether it is encoded in the literal encoding or execution
> encoding.
>
> Hubert was concerned this was too restrictive because some
> EBCDIC & ISO 646 variants have code points reserved for "national symbols".
> Even Shift-JIS is not 100% ASCII compatible (Yen instead of
> backslash, overline instead of tilde).
>
> I've been thinking about that over the past few days, and I think
> the solution is to not have requirements on the literal
> character set but rather on the literals themselves.
>
> If the execution encoding is UTF-8, "ABC" is interpreted
> identically whether its encoding is ASCII, ISO 646-IT, or
> Shift-JIS.
>
> However, "C:\\" would be interpreted as "C:\\", "C:ç" and
> "C:¥" respectively.
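The point about "ABC" versus "C:\\" can be illustrated with a minimal sketch. It assumes the literal contains only 7-bit bytes and that byte 0x5C is the only position, among those used, where the three encodings differ. The names `LiteralEncoding` and `to_utf8` are hypothetical and exist only for this illustration; this is not a general transcoder.

```cpp
#include <string>
#include <string_view>

// Hypothetical labels for the three literal encodings named above.
enum class LiteralEncoding { Ascii, Iso646It, ShiftJis };

// Re-encode the bytes of a narrow literal as UTF-8.
// Assumption: only byte 0x5C needs special handling; every other byte
// in the input is taken to mean the same character as in ASCII
// (true for "ABC" and "C:", not true for every 7-bit byte).
std::string to_utf8(std::string_view bytes, LiteralEncoding enc) {
    std::string out;
    for (unsigned char b : bytes) {
        if (b == 0x5C && enc == LiteralEncoding::Iso646It) {
            out += "\xC3\xA7";  // U+00E7 LATIN SMALL LETTER C WITH CEDILLA
        } else if (b == 0x5C && enc == LiteralEncoding::ShiftJis) {
            out += "\xC2\xA5";  // U+00A5 YEN SIGN
        } else {
            out += static_cast<char>(b);  // same meaning as in ASCII
        }
    }
    return out;
}
```

Under these assumptions, `to_utf8("ABC", enc)` yields the same bytes for all three values of `enc`, while `to_utf8("C:\\", enc)` yields three different results, which is why a requirement on individual literals can hold where a requirement on the whole character set cannot.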
>
> So we need to only put requirements on the content of
> individual literals rather than on the entire literal character set
> (which, P1885 notwithstanding, is not observable during
> execution anyhow).
>
>
> *A way to word that:*
>
> The execution encoding is the locale-specific encoding used to
> interpret character and NTMBS parameters in character
> functions, multibyte character functions and other
> locale-specific functions.
>
> I think we can start from something like this. I am guessing that
> the parallel treatment for wide strings is intended?
>
>
> Indeed!
> Although, do we know of platforms where the literal and execution wide
> encoding would be different?

We do. From
https://www.ibm.com/support/knowledgecenter/ssw_aix_71/globalization/globalization_pdf.pdf,
AIX globalization, Code sets for multicultural support, Data
representation, Wide character data representation (page 45):

> On the AIX operating system, the *wchar_t* data type is 32–bit in the
> 64–bit environment and 16–bit in the
> 32–bit environment. The locale methods are standardized such that in
> most locales, the value that is
> stored in the *wchar_t* for a particular character is always its
> Unicode data value. For applications that are
> intended to run only on AIX, it allows certain applications to handle
> the *wchar_t* data type in a consistent
> fashion, even if the underlying code set is unknown. All locales use
> Unicode for their wide character code
> values (process code), except the IBM-eucTW code set. The IBM-eucTW
> code set (LANG =*zh_TW*)
> contains many characters that are not contained in the Unicode
> standard. As a result, it is impossible to
> represent these characters with a Unicode-wide character value.
> Applications that are required to have
> Unicode-based *wchar_t* data for Traditional Chinese must use the
> *Zh_TW* locale (big5 code set) instead.
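A minimal sketch of how a program could observe such a difference: convert a multibyte character with `std::mbrtowc`, which uses the current locale's wide execution encoding, and compare the result with the value of the corresponding wide literal. The function name `wide_literal_matches_execution` is hypothetical, and the sketch makes no claim about what any particular AIX locale returns.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Returns true if the first multibyte character of `mb`, converted under
// the current locale, has the same wchar_t value as `literal_value`
// (the value the compiler chose for the wide literal).
bool wide_literal_matches_execution(const char* mb, wchar_t literal_value) {
    std::mbstate_t state{};  // initial conversion state
    wchar_t converted = 0;
    std::size_t n = std::mbrtowc(&converted, mb, std::strlen(mb), &state);
    if (n == static_cast<std::size_t>(-1) ||
        n == static_cast<std::size_t>(-2)) {
        return false;  // invalid or incomplete multibyte sequence
    }
    return converted == literal_value;
}
```

A caller would first select the locale of interest with `std::setlocale` and then compare, for example, the locale's multibyte bytes for a character against its `L'...'` literal; a `false` result indicates that the wide literal encoding and the wide execution encoding disagree for that character.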

Tom.

Received on 2021-03-02 09:34:13