Subject: Re: Towards a better description of the execution encoding
From: Tom Honermann (tom_at_[hidden])
Date: 2021-03-02 09:34:09
On 3/2/21 4:35 AM, Corentin via SG16 wrote:
> On Mon, Mar 1, 2021 at 10:32 PM Hubert Tong
> <mailto:hubert.reinterpretcast_at_[hidden]>> wrote:
> On Mon, Mar 1, 2021 at 10:24 AM Corentin via SG16
> <sg16_at_[hidden] <mailto:sg16_at_[hidden]>> wrote:
> Hey folks!
> Last meeting we talked about the relation between the literal
> & execution encoding.
> I think there is pressure to solve this issue (encoding names,
> std::print, other features).
> In P2297, I suggested that we say the execution character set
> is a superset of the literal character set, such that any
> character in the literal character set results in the same
> code unit sequence
> whether it is encoded in the literal encoding or execution
> Hubert was concerned this was too restrictive because some
> EBCDIC and ISO 646 variants have code points reserved for
> "national symbols".
> Even Shift-JIS is not 100% ASCII compatible (Yen instead of
> backslash, overline instead of tilde)
> I've been thinking about that over the past few days, and I think
> the solution is to not have requirements on the literal
> character set but rather on the literals themselves.
> If the execution encoding is UTF8, "ABC" is interpreted
> identically whether its encoding is ASCII, ISO 646-IT, or
> However, "C:\\" would be interpreted as "C:\\", "C:ç" and
> "C:¥" respectively.
> So we only need to put requirements on the content of
> individual literals rather than on the entire literal set
> (which, P1885 notwithstanding, is not observable during
> execution anyhow)
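To make the "C:\\" example above concrete, here is a minimal sketch, not a real transcoder. It assumes that only the byte 0x5C differs between the encodings, which is a simplification (real ISO 646 variants and Shift-JIS differ in more positions). The names SourceEncoding, decode_byte, decode and demo are hypothetical, and YenAt5C stands in for an encoding such as Shift-JIS that places the Yen sign at 0x5C.

```cpp
#include <cassert>
#include <string>

// Hypothetical labels for the three literal encodings in the example above.
enum class SourceEncoding { Ascii, Iso646It, YenAt5C };

// Decode one byte to a Unicode code point. Only 0x5C is treated as
// variant in this sketch; every other byte is passed through unchanged.
char32_t decode_byte(unsigned char b, SourceEncoding enc) {
    if (b == 0x5C) {
        switch (enc) {
            case SourceEncoding::Ascii:    return 0x005C;  // REVERSE SOLIDUS
            case SourceEncoding::Iso646It: return 0x00E7;  // LATIN SMALL LETTER C WITH CEDILLA
            case SourceEncoding::YenAt5C:  return 0x00A5;  // YEN SIGN
        }
    }
    return b;
}

// Decode a whole byte string under the assumed literal encoding.
std::u32string decode(const std::string& bytes, SourceEncoding enc) {
    std::u32string out;
    for (unsigned char b : bytes) out.push_back(decode_byte(b, enc));
    return out;
}

// Usage sketch: "ABC" is invariant, the path separator is not.
void demo() {
    assert(decode("ABC", SourceEncoding::Ascii) == decode("ABC", SourceEncoding::Iso646It));
    assert(decode("C:\\", SourceEncoding::Ascii) != decode("C:\\", SourceEncoding::YenAt5C));
}
```

Under these assumptions a requirement stated per literal holds for "ABC" in all three encodings, while a requirement on the whole character set fails because of the single position 0x5C.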
> *A way to word that:*
> The execution encoding is the locale-specific encoding used to
> interpret character and NTMBS parameters in character
> functions, multibyte character functions and other
> locale-specific functions.
> I think we can start from something like this. I am guessing that
> the parallel treatment for wide strings is intended?
> Although, do we know of platforms where the literal and execution wide
> encoding would be different?
We do. From
AIX globalization, Code sets for multicultural support, Data
representation, Wide character data representation (page 45):
> On the AIX operating system, the *wchar_t* data type is 32-bit in the
> 64-bit environment and 16-bit in the
> 32-bit environment. The locale methods are standardized such that in
> most locales, the value that is
> stored in the *wchar_t* for a particular character is always its
> Unicode data value. For applications that are
> intended to run only on AIX, it allows certain applications to handle
> the *wchar_t* data type in a consistent
> fashion, even if the underlying code set is unknown. All locales use
> Unicode for their wide character code
> values (process code), except the IBM-eucTW code set. The IBM-eucTW
> code set (LANG =*zh_TW*)
> contains many characters that are not contained in the Unicode
> standard. As a result, it is impossible to
> represent these characters with a Unicode-wide character value.
> Applications that are required to have
> Unicode-based *wchar_t* data for Traditional Chinese must use the
> *Zh_TW* locale (big5 code set) instead.
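One way to probe this on a given platform is a minimal sketch like the one below. It reports the width of wchar_t and checks whether a wide literal (encoded at compile time) has the same value as the same character converted at run time in the current locale. The helper names wchar_bits and literal_matches_runtime are hypothetical, and the check only handles characters that occupy a single byte in the narrow execution encoding.

```cpp
#include <climits>
#include <cstdlib>

// Width of wchar_t in bits on the current platform.
constexpr int wchar_bits() {
    return static_cast<int>(sizeof(wchar_t) * CHAR_BIT);
}

// Hypothetical helper: convert the single-byte character `narrow` using the
// current locale's execution encoding and report whether the result equals
// the compile-time wide literal `wide_literal`.
bool literal_matches_runtime(char narrow, wchar_t wide_literal) {
    std::mbtowc(nullptr, nullptr, 0);  // reset the internal conversion state
    wchar_t converted = 0;
    const int n = std::mbtowc(&converted, &narrow, 1);
    return n == 1 && converted == wide_literal;
}
```

For example, literal_matches_runtime('A', L'A') would be expected to hold in most locales. A locale whose wide character values are not Unicode, such as the IBM-eucTW case described above, is where the literal and run-time values could diverge for some characters.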