Subject: Re: [SG16-Unicode] [isocpp-core] What is the proper term for the locale dependent run-time character set/encoding used for the character classification and conversion functions?
From: Steve Downey (sdowney_at_[hidden])
Date: 2019-08-12 13:01:57
I believe the wording in filesystem is a red herring. It's there to deal
with the fact that actual file systems, even on a single OS, will have
different notions of the encoding of paths. It's more related to a cooked
vs uncooked distinction. I certainly don't think there was any intention for
the wording there to apply outside that part of the filesystem components
in the library.
I also believe that "execution character set" is used in opposition to the
"source character set", and it is applied to the translation of string
literals because that's when it comes up. On the other hand, this may be
pre-locale wording that has survived, at least partly because no one wants
to touch locale.
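
The compile-time versus run-time distinction can be sketched in C++ roughly
as follows. This is only an illustration under stated assumptions, not wording
from either standard: literal_bytes and is_alpha_in are hypothetical helper
names, and which locale names setlocale accepts is implementation-defined.

```cpp
#include <cctype>
#include <clocale>
#include <string>

// Bytes of an ordinary string literal. They are chosen by the compiler
// during translation, so a later setlocale call cannot change them.
inline std::string literal_bytes() {
    return "abc";
}

// Hypothetical helper: classify one byte under the named C locale, then
// restore the previous LC_CTYPE setting. Returns false if the requested
// locale cannot be selected.
inline bool is_alpha_in(const char* locale_name, unsigned char byte) {
    const char* current = std::setlocale(LC_CTYPE, nullptr);
    const std::string saved = current ? current : "C";
    if (std::setlocale(LC_CTYPE, locale_name) == nullptr) {
        return false;
    }
    // std::isalpha consults the current LC_CTYPE category at run time.
    const bool result = std::isalpha(byte) != 0;
    std::setlocale(LC_CTYPE, saved.c_str());
    return result;
}
```

The literal yields the same bytes whatever locale is active, whereas the
result of is_alpha_in for a byte outside the basic character set (0xE9, say)
may differ between the "C" locale and, for example, an ISO 8859-1 locale.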
On Mon, Aug 12, 2019 at 12:09 PM Tom Honermann via Core <
> I (and SG16 in general) have been using the terms "execution character set"
> and "execution encoding" to refer to both the encoding known at
> compile-time that is used to encode character and string literals and the
> locale dependent encoding specified by the LC_CTYPE locale category that is
> used at run-time by the character classification and conversion functions.
> When necessary to avoid confusion, I've been referring to the former as the
> "presumed execution encoding" and the latter as simply the "run-time
> execution encoding".
> A discussion with user 'alfps' on an r/cpp Reddit thread alerted me to the
> possibility that I/we have been using this term incorrectly. I spent some
> time looking at both the C and C++ standards and there does appear to be
> evidence that "execution character set" (encoding) refers solely to the
> encoding known at compile-time that is used to encode literals. But there
> doesn't seem to be a clear term defined for the locale dependent run-time
> encoding that governs the behavior of the character classification and
> conversion functions. There is some evidence for this encoding being
> referred to using the term "native".
> From the C++ standard:
> 1. [fs.path.type.cvt]p1 <http://eel.is/c++draft/fs.path.type.cvt#1>:
> (though the definition provided here appears to be specific to path names).
> "The *native encoding* of an ordinary character string is the
> operating system dependent current encoding for path names. The *native
> encoding* for wide character strings is the implementation-defined
> execution wide-character set encoding."
> 2. [fs.path.type.cvt]p2.1 <http://eel.is/c++draft/fs.path.type.cvt#2.1>:
> (This paragraph, the next one, and p8 (not listed here) constitute the only
> uses of "native (ordinary|wide) encoding" in the C++ standard).
> "char: The encoding is the native ordinary encoding. ..."
> 3. [fs.path.type.cvt]p2.2 <http://eel.is/c++draft/fs.path.type.cvt#2.2>
> "wchar_t: The encoding is the native wide encoding. ..."
> 4. [locale.codecvt]p3 <http://eel.is/c++draft/locale.codecvt#3>:
> "The specializations required in Table 101 ([locale.category]) convert
> the implementation-defined native character set. ... codecvt<wchar_t,
> char, mbstate_t> converts between the native character sets for
> ordinary and wide characters. ..."
> 5. [locale.ctype]p2
> "The specializations required in Table 101 ([locale.category]), namely
> ctype<char> and ctype<wchar_t>, implement character classing
> appropriate to the implementation's native character set."
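
A minimal sketch of how the two facets quoted above are reached through
std::locale, assuming nothing beyond the standard <locale> interface. The
helper names facet_is_alpha and to_wide are hypothetical, and treating every
result other than "ok" as an error is a simplification made for the sketch.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Classify a char through the ctype<char> facet of the given locale.
inline bool facet_is_alpha(const std::locale& loc, char c) {
    return std::use_facet<std::ctype<char>>(loc).is(std::ctype_base::alpha, c);
}

// Convert an ordinary string to a wide string through the
// codecvt<wchar_t, char, mbstate_t> facet of the given locale.
inline std::wstring to_wide(const std::locale& loc, const std::string& in) {
    using cvt_t = std::codecvt<wchar_t, char, std::mbstate_t>;
    if (in.empty()) {
        return std::wstring();
    }
    const cvt_t& cvt = std::use_facet<cvt_t>(loc);
    std::mbstate_t state{};
    // Each wide character consumes at least one byte of input, so
    // in.size() wide characters are always enough room for the output.
    std::wstring out(in.size(), L'\0');
    const char* from_next = nullptr;
    wchar_t* to_next = nullptr;
    const std::codecvt_base::result r =
        cvt.in(state, in.data(), in.data() + in.size(), from_next,
               &out[0], &out[0] + out.size(), to_next);
    if (r != std::codecvt_base::ok) {
        throw std::runtime_error("conversion failed or was incomplete");
    }
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

Passing std::locale::classic() selects the "C" locale facets; passing
std::locale("") selects whatever the environment names, which is where the
locale dependent run-time encoding enters.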
> As far as I can tell, none of the highlighted terms above appear in the
> C17 standard, but "native environment" appears in related wording:
> - 7.11.1.1p3 "The setlocale function":
> "A value of "C" for locale specifies the minimal environment for C
> translation; a value of "" for locale specifies the locale-specific native
> environment. Other implementation-defined strings may be passed as the
> second argument to setlocale."
> C17 suggests that "extended character set" may also be the right term:
> - 7.22p3 "General utilities <stdlib.h>":
> "... that is the maximum number of bytes in a multibyte character for
> the extended character set specified by the current locale (category
> LC_CTYPE), which is never greater than MB_LEN_MAX."
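
The locale dependence of MB_CUR_MAX described in that paragraph can be
observed with a short sketch. The helper name mb_cur_max_in is hypothetical,
and returning 0 when the locale cannot be selected is a convention chosen
here for illustration, not anything taken from the standard.

```cpp
#include <climits>
#include <clocale>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper: report MB_CUR_MAX under the named locale, then
// restore the previous LC_CTYPE setting. Returns 0 if the locale cannot
// be selected.
inline std::size_t mb_cur_max_in(const char* locale_name) {
    const char* current = std::setlocale(LC_CTYPE, nullptr);
    const std::string saved = current ? current : "C";
    if (std::setlocale(LC_CTYPE, locale_name) == nullptr) {
        return 0;
    }
    // MB_CUR_MAX is evaluated at run time and depends on LC_CTYPE.
    const std::size_t value = MB_CUR_MAX;
    std::setlocale(LC_CTYPE, saved.c_str());
    return value;
}
```

Calling mb_cur_max_in("C") and mb_cur_max_in("") on the same system may give
different values, for instance when the environment selects a UTF-8 locale,
but neither result exceeds MB_LEN_MAX.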
> However, the C++ standard states (non-normatively) that the "extended
> character set" extends the basic source character set and (normatively)
> that it applies to both the source and execution character sets:
> - [defns.multibyte] <http://eel.is/c++draft/intro.defs#defns.multibyte>
> "[ Note: The extended character set is a superset of the basic
> character set ([lex.charset]). -- end note ]"
> - [lex.phases]p1 <http://eel.is/c++draft/lex.phases#1.1>:
> "... An implementation may use any internal encoding, so long as an
> actual extended character encountered in the source file, and the same extended
> character expressed in the source file as a universal-character-name
> (e.g., using the \uXXXX notation), are handled equivalently except
> where this replacement is reverted ([lex.pptoken]) in a raw string literal."
> - [basic.fundamental]p8 <http://eel.is/c++draft/basic.fundamental#8>:
> "... The values of type wchar_t can represent distinct codes for all
> members of the largest extended character set specified among the
> supported locales ([locale])."
> So, what term should we be using here? Perhaps a core issue should be
> opened for this? A brief search didn't reveal an existing one.
> (note: you may need to click "continue this thread" when reading the
> Reddit thread to see all relevant comments).
> Core mailing list
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
> Link to this post: http://lists.isocpp.org/core/2019/08/7026.php