Subject: Re: [SG16-Unicode] What is the proper term for the locale dependent run-time character set/encoding used for the character classification and conversion functions?
From: Tom Honermann (tom_at_[hidden])
Date: 2019-08-13 21:39:23
On 8/13/19 8:35 AM, Corentin Jabot wrote:
> Chiming in with my favorite solution:
> * Forbid lossy source -> presumed execution encoding conversion (already
> ill-formed in gcc but not msvc)
I think this may be reasonable.
> * Forbid u8/u16/u32 literals in non-Unicode-encoded files
I don't understand this at all. u8/u16/u32 specify the encoding to be
used at run-time. The source file encoding isn't relevant at all (as
Steve noted, source file characters are converted to internal encoding).
> * Expose the "presumed execution encoding" (= "narrow/wide character
> literal encoding") as a consteval function returning the name as
> specified by IANA
This may be useful, but needs more justification (preferably in the form
of a paper).
> I would expect changing the encoding of char would break everything...
> I'd leave char and wchar_t mostly alone and start clean on char8_t.
I agree, but I don't think that will be sufficient. Not all projects
are going to adopt char8_t. A substantial portion, especially on
Linux/UNIX systems, will choose to continue using UTF-8 with char. I
think we're going to have to provide Unicode support for char and
char8_t (and char16_t, and perhaps char32_t).
> Anyhow, I agree with Tom that the names are not indicative
> How about: "narrow/wide character literal encoding" ?
"execution encoding" has a long history in both WG14 and WG21 (though
not POSIX I think) and that makes me reluctant to try and challenge it.Â
In Slack, discussion, I think Steve Downey probably hit on the right
approach; provide a formal definition of it.Â I think we *might* be
successful in using "execution encoding" to apply to both the
compile-time and run-time encodings by extending the term with specific
qualifiers; e.g., "presumed execution encoding" and
"run-time/system/native execution encoding".
> On Tue, 13 Aug 2019 at 10:39, Niall Douglas <s_sourceforge_at_[hidden]
> <mailto:s_sourceforge_at_[hidden]>> wrote:
> Before progressing with a solution, can I ask the question:
> Is it politically feasible for C++ 23 and C 2x to require
> implementations to default to interpreting source files as either (i) 7
> bit ASCII or (ii) UTF-8? To be specific, char literals would thus be
> either 7 bit ASCII or UTF-8.
> (The reason for the 7 bit ASCII is that it is a perfect subset of UTF-8,
> and that C very much wants to retain the language being implementable in
> a small code base i.e. without UTF-8 support. Note the qualifier
> "default" as well)
> An answer to the above would determine how best to solve your issue Tom,
> I think. As much as we all expect IBM et al to veto such a proposal, one
> never gets anywhere without asking first.
> On 13/08/2019 03:25, Tom Honermann wrote:
> > I agree with this (mostly), but would prefer not to discuss further in
> > this thread. The only reason I included the filesystem references is
> > because the wording there uses "native" for an encoding that is related
> > to (though distinct from) the encodings referenced in the codecvt and
> > ctype wording, where "native" is also used. This suggests that "native"
> > serves (or should serve) a role in naming these run-time encodings, or
> > is a source of conflation (or both).
> > Tom.
> > On 8/12/19 5:08 PM, Niall Douglas wrote:
> >>> 1. [fs.path.type.cvt]p1
> >>>    (though the definition provided here appears to be specific to
> >>>    path names).
> >>>    "The /native encoding/ of an ordinary character string is the
> >>>    operating system dependent current encoding for path names. The
> >>>    /native encoding/ for wide character strings is the
> >>>    implementation-defined execution wide-character set
> >> We discussed the problems with the choice of normative wording in
> >> http://eel.is/c++draft/fs.class.path#fs.path.cvt, if you remember,
> >> during SG16's discussion of filesystem::path_view.
> >> The problem is that filesystem paths have different encoding and
> >> interpretation per-path-component i.e. for a path
> >> /A/B/C/D
> >> ... A, B, C and D may each have its own, individual, encoding and
> >> interpretation depending on the mount points and filesystems
> >> on the current system. This is not what is suggested by the current
> >> normative wording, which appears to think that some mapping exists
> >> between C++ paths and OS kernel paths.
> >> There *is* a mapping, but it is 100% C++-side. The OS kernel
> >> consumes arrays of bytes.
> >> A more correct normative wording would more clearly separate these two
> >> kinds of path representation. OS kernel paths are arrays of `byte`, but
> >> with certain implementation-defined byte sequences not permitted. C++
> >> paths can be in char, wchar_t, char8_t, char16_t, char32_t etc, and
> >> there are well defined conversions between those C++ paths and the
> >> array of bytes supplied to the OS kernel. The standard can say nothing
> >> useful about how the OS kernel may interpret the byte array C++
> >> supplies to it.
> >> If path_view starts the standards track, I'll need to propose fixing
> >> up http://eel.is/c++draft/fs.class.path#fs.path.cvt in any case.
> >> But to come back to your original question, I think that you ought to
> >> split off filesystem paths from everything else, consider them
> >> separately, and then I think you'll find it much easier to make the
> >> non-path normative wording more consistent.
> >> Niall
> >> _______________________________________________
> >> SG16 Unicode mailing list
> >> Unicode_at_[hidden] <mailto:Unicode_at_[hidden]>
> >> http://www.open-std.org/mailman/listinfo/unicode