sg16: [SG16-Unicode] What is the proper term for the locale dependent run-time character set/encoding used for the character classification and conversion functions?

From: Tom Honermann <tom_at_[hidden]>
Date: Mon, 12 Aug 2019 12:09:28 -0400

I (and SG16 in general) have been using the terms "execution character
set" and "execution encoding" to refer to two different things: the
encoding known at compile-time that is used to encode character and
string literals, and the locale dependent encoding specified by the
LC_CTYPE locale category that is used at run-time by the character
classification and conversion functions. When necessary to avoid
confusion, I've been referring to the former as the "presumed execution
encoding" and the latter as simply the "run-time execution encoding".
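
To make the distinction concrete, here is a minimal sketch (not taken
from either standard). The names Alpha, is_alpha_in_locale, and
literal_byte are hypothetical and exist only for illustration. The byte
stored for the literal is fixed when the program is translated, while
the answer from std::isalpha for that same byte depends on the LC_CTYPE
locale in effect at run-time.

```cpp
#include <cassert>
#include <cctype>
#include <clocale>
#include <string>

// Hypothetical tri-state result: the requested locale may not be installed.
enum class Alpha { unavailable, no, yes };

// Classify one byte under the named LC_CTYPE locale, then restore the
// previously active locale.
Alpha is_alpha_in_locale(const char* locale_name, unsigned char byte) {
    assert(locale_name != nullptr);  // a null name would only query the locale
    // Copy the current name; setlocale may reuse the buffer it returns.
    const char* current = std::setlocale(LC_CTYPE, nullptr);
    const std::string saved = current ? current : "C";
    if (std::setlocale(LC_CTYPE, locale_name) == nullptr) {
        return Alpha::unavailable;   // request not honored, locale unchanged
    }
    const bool alpha = std::isalpha(byte) != 0;  // run-time, locale dependent
    std::setlocale(LC_CTYPE, saved.c_str());
    return alpha ? Alpha::yes : Alpha::no;
}

// The byte for this literal was chosen at translation time. It is written
// as a hex escape so the sketch does not depend on the source file encoding.
const unsigned char literal_byte = static_cast<unsigned char>("\xE9"[0]);
```

In the "C" locale, is_alpha_in_locale("C", literal_byte) yields
Alpha::no. Under a Latin-1 locale (assuming one is installed; locale
names vary by platform) the same byte could be classified as a letter,
even though the literal itself has not changed.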

A discussion
<https://www.reddit.com/r/cpp/comments/bfyp6x/overview_of_stdfilesystem_my_talk/>
[1] with user 'alfps' on an r/cpp Reddit thread alerted me to the
possibility that I/we have been using this term incorrectly. I spent
some time looking at both the C and C++ standards and there does appear
to be evidence that "execution character set" (encoding) refers solely
to the encoding known at compile-time that is used to encode literals.
But there doesn't seem to be a clear term defined for the locale
dependent run-time encoding that governs the behavior of the character
classification and conversion functions. There is some evidence for
this encoding being referred to using the term "native".

From the C++ standard:

1. [fs.path.type.cvt]p1 <http://eel.is/c++draft/fs.path.type.cvt#1>:
    (though the definition provided here appears to be specific to path
    names).
    "The /native encoding/ of an ordinary character string is the
    operating system dependent current encoding for path names. The
    /native encoding/ for wide character strings is the
    implementation-defined execution wide-character set encoding."
2. [fs.path.type.cvt]p2.1
    <http://eel.is/c++draft/fs.path.type.cvt#2.1>: (This paragraph, the
    next one, and p8 (not listed here) constitute the only uses of
    "native (ordinary|wide) encoding" in the C++ standard).
    "char: The encoding is the native ordinary encoding. ..."
3. [fs.path.type.cvt]p2.2 <http://eel.is/c++draft/fs.path.type.cvt#2.2>:
    "wchar_t: The encoding is the native wide encoding. ..."
4. [locale.codecvt]p3 <http://eel.is/c++draft/locale.codecvt#3>:
    "The specializations required in Table 101 ([locale.category])
    convert the implementation-defined native character set. ...
    codecvt<wchar_t, char, mbstate_t> converts between the native
    character sets for ordinary and wide characters. ..."
5. [locale.ctype]p2 <http://eel.is/c++draft/category.ctype#locale.ctype-2>:
    "The specializations required in Table 101 ([locale.category]),
    namely ctype<char> and ctype<wchar_t>, implement character classing
    appropriate to the implementation's native character set."
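
For reference, the two facets quoted in items 4 and 5 can be exercised
as in the following sketch. The helper names widen_with_locale and
is_alpha_with_locale are hypothetical. Which "native character set" the
facets actually implement is determined by the std::locale object that
is passed in, which is the ambiguity being asked about; the sketch only
shows where that dependency enters.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Convert a narrow string to a wide string with the locale's
// codecvt<wchar_t, char, mbstate_t> facet.
std::wstring widen_with_locale(const std::string& narrow, const std::locale& loc) {
    typedef std::codecvt<wchar_t, char, std::mbstate_t> Facet;
    const Facet& facet = std::use_facet<Facet>(loc);

    // One wide character per input byte is an upper bound on the output size.
    std::wstring wide(narrow.size(), L'\0');
    if (narrow.empty()) {
        return wide;
    }

    std::mbstate_t state = std::mbstate_t();  // initial conversion state
    const char* from_next = nullptr;
    wchar_t* to_next = nullptr;
    const Facet::result r = facet.in(state,
                                     narrow.data(), narrow.data() + narrow.size(), from_next,
                                     &wide[0], &wide[0] + wide.size(), to_next);
    if (r != Facet::ok) {
        throw std::runtime_error("narrow-to-wide conversion failed");
    }
    assert(to_next >= &wide[0] && to_next <= &wide[0] + wide.size());
    wide.resize(static_cast<std::size_t>(to_next - &wide[0]));
    return wide;
}

// Classify a character with the locale's ctype<char> facet.
bool is_alpha_with_locale(char c, const std::locale& loc) {
    return std::use_facet<std::ctype<char> >(loc).is(std::ctype_base::alpha, c);
}
```

With std::locale::classic() both helpers behave as in the "C" locale;
passing std::locale("") instead would select the user's preferred
locale, if the implementation can construct it.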

As far as I can tell, none of the highlighted terms above appear in the
C17 standard, but "native environment" appears in related wording:

  * 7.11.1.1p3 "The setlocale function":
    "A value of "C" for locale specifies the minimal environment for C
    translation; a value of "" for locale specifies the locale-specific
    native environment. Other implementation-defined strings may be
    passed as the second argument to setlocale."
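
The two locale values in that paragraph can be compared with a small
sketch like the one below. The helper name resolved_ctype_locale is
hypothetical. The string returned for "" is whatever name the
implementation gives the native environment, so no particular value
should be expected.

```cpp
#include <cassert>
#include <clocale>
#include <string>

// Switch LC_CTYPE to `request`, report the name the implementation gives
// the resulting locale, then restore the previous locale. Returns an empty
// string if the request could not be honored.
std::string resolved_ctype_locale(const char* request) {
    assert(request != nullptr);  // a null request would query, not select
    // Copy the current name; setlocale may reuse the buffer it returns.
    const char* current = std::setlocale(LC_CTYPE, nullptr);
    const std::string saved = current ? current : "C";
    const char* selected = std::setlocale(LC_CTYPE, request);
    const std::string name = selected ? selected : "";
    std::setlocale(LC_CTYPE, saved.c_str());
    return name;
}
```

resolved_ctype_locale("C") names the minimal environment, while
resolved_ctype_locale("") names the locale-specific native environment,
which is typically derived from the user's settings.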

C17 suggests that "extended character set" may also be the right term:

  * 7.22p3 "General utilities <stdlib.h>":
    "... that is the maximum number of bytes in a multibyte character
    for the extended character set specified by the current locale
    (category LC_CTYPE), which is never greater than MB_LEN_MAX."
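
That sentence describes MB_CUR_MAX, which is evaluated at run-time
against the current LC_CTYPE locale. A minimal sketch of reading it for
a given locale follows; the helper name max_bytes_per_char is
hypothetical.

```cpp
#include <cassert>
#include <climits>
#include <clocale>
#include <cstddef>
#include <cstdlib>
#include <string>

// Report MB_CUR_MAX under the named LC_CTYPE locale, or 0 if that locale
// is not available. The previous locale is restored before returning.
std::size_t max_bytes_per_char(const char* locale_name) {
    assert(locale_name != nullptr);  // a null name would only query the locale
    // Copy the current name; setlocale may reuse the buffer it returns.
    const char* current = std::setlocale(LC_CTYPE, nullptr);
    const std::string saved = current ? current : "C";
    if (std::setlocale(LC_CTYPE, locale_name) == nullptr) {
        return 0;  // request not honored, locale unchanged
    }
    // MB_CUR_MAX is not a compile-time constant; it reflects the extended
    // character set of the locale that is active right now.
    const std::size_t limit = MB_CUR_MAX;
    std::setlocale(LC_CTYPE, saved.c_str());
    return limit;
}
```

Whatever locale is selected, a non-zero result is never greater than
MB_LEN_MAX. A UTF-8 locale (assuming one is installed) would be
expected to report a larger value than a single-byte locale.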

However, the C++ standard states (non-normatively) that the "extended
character set" extends the basic source character set and (normatively)
that it applies to both the source and execution character sets:

  * [defns.multibyte] <http://eel.is/c++draft/intro.defs#defns.multibyte>:
    "[ Note: The extended character set is a superset of the basic
    character set ([lex.charset]). — end note ]"
  * [lex.phases]p1.1 <http://eel.is/c++draft/lex.phases#1.1>:
    "... An implementation may use any internal encoding, so long as an
    actual extended character encountered in the source file, and the
    same extended character expressed in the source file as a
    universal-character-name (e.g., using the \uXXXX notation), are
    handled equivalently except where this replacement is reverted
    ([lex.pptoken]) in a raw string literal."
  * [basic.fundamental]p8 <http://eel.is/c++draft/basic.fundamental#8>:
    "... The values of type wchar_t can represent distinct codes for
    all members of the largest extended character set specified among
    the supported locales ([locale])."

So, what term should we be using here? Perhaps a core issue should be
opened for this? A brief search didn't reveal an existing one.

(Note: you may need to click "continue this thread" when reading the
Reddit thread to see all relevant comments.)

Tom.

[1]:
https://www.reddit.com/r/cpp/comments/bfyp6x/overview_of_stdfilesystem_my_talk/

Received on 2019-08-12 18:09:34