sg16: Re: [SG16-Unicode] What is the proper term for the locale dependent run-time character set/encoding used for the character classification and conversion functions?

From: Niall Douglas <s_sourceforge_at_[hidden]>
Date: Tue, 13 Aug 2019 09:38:48 +0100

Before progressing with a solution, can I ask the question:

Is it politically feasible for C++23 and C2x to require
implementations to default to interpreting source files as either (i)
7-bit ASCII or (ii) UTF-8? To be specific, char literals would thus be
either 7-bit ASCII or UTF-8.

(The reason for including 7-bit ASCII is that it is a strict subset of
UTF-8, and that C very much wants the language to remain implementable
in a small code base, i.e. without UTF-8 support. Note the qualifier
"default" as well.)

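To make the subset claim concrete, here is a minimal sketch. The helper
name is_seven_bit_ascii is hypothetical and not taken from any standard
or proposal. It only checks that every byte is below 0x80; any string
that passes is, byte for byte, also well-formed UTF-8, which is why an
implementation limited to 7-bit ASCII would need no UTF-8 decoder to
accept such source text.

```cpp
#include <string_view>

// Hypothetical helper: true when every byte is in the 7-bit ASCII range.
// Such a string is also well-formed UTF-8 with the same code points,
// because UTF-8 encodes U+0000..U+007F as the single bytes 0x00..0x7F.
constexpr bool is_seven_bit_ascii(std::string_view s) noexcept {
    for (unsigned char c : s) {
        if (c >= 0x80) return false;  // lead or continuation byte of a multi-byte sequence
    }
    return true;
}
```

The reverse does not hold: a UTF-8 string containing any non-ASCII
character has at least one byte with the high bit set, so the check
above rejects it.
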
An answer to the above would determine how best to solve your issue,
Tom, I think. As much as we all expect IBM et al. to veto such a
proposal, one never gets anywhere without asking first.

Niall

On 13/08/2019 03:25, Tom Honermann wrote:
> I agree with this (mostly), but would prefer not to discuss further in
> this thread. The only reason I included the filesystem references is
> because the wording there uses "native" for an encoding that is related
> to (though distinct from) the encodings referenced in the codecvt and ctype
> wording, where "native" is also used. This suggests that "native"
> serves (or should serve) a role in naming these run-time encodings, or
> is a source of conflation (or both).
>
> Tom.
>
> On 8/12/19 5:08 PM, Niall Douglas wrote:
>>> 1. [fs.path.type.cvt]p1 <http://eel.is/c++draft/fs.path.type.cvt#1>:
>>> (though the definition provided here appears to be specific to path
>>> names).
>>> "The /native encoding/ of an ordinary character string is the
>>> operating system dependent current encoding for path names. The
>>> /native encoding/ for wide character strings is the
>>> implementation-defined execution wide-character set encoding."
>> We discussed the problems with the choice of normative wording in
>> http://eel.is/c++draft/fs.class.path#fs.path.cvt, if you remember,
>> during SG16's discussion of filesystem::path_view.
>>
>> The problem is that filesystem paths can have a different encoding and
>> interpretation per path component, i.e. for a path
>>
>> /A/B/C/D
>>
>> ... A, B, C and D may each have its own, individual, encoding and
>> interpretation depending on the mount points and filesystems configured
>> on the current system. This is not what is suggested by the current
>> normative wording, which appears to think that some mapping exists
>> between C++ paths and OS kernel paths.
>>
>> There *is* a mapping, but it is 100% C++-side. The OS kernel generally
>> consumes arrays of bytes.
>>
>> A more correct normative wording would more clearly separate these two
>> kinds of path representation. OS kernel paths are arrays of `byte`, but
>> with certain implementation-defined byte sequences not permitted. C++
>> paths can be in char, wchar_t, char8_t, char16_t, char32_t etc, and
>> there are well defined conversions between those C++ paths and the array
>> of bytes supplied to the OS kernel. The standard can say nothing useful
>> about how the OS kernel may interpret the byte array C++ supplies to it.
>>
>> If path_view starts the standards track, I'll need to propose a document
>> fixing up http://eel.is/c++draft/fs.class.path#fs.path.cvt in any case.
>> But to come back to your original question, I think that you ought to
>> split off filesystem paths from everything else, consider them separate,
>> and then I think you'll find it much easier to make the non-path
>> normative wording more consistent.
>>
>> Niall
>
>
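
The separation argued for in the quoted message, between C++-side paths
and whatever the OS kernel does with the bytes it receives, can be
sketched with std::filesystem::path. This is only an illustration under
stated assumptions: count_components and to_narrow are hypothetical
helper names, and nothing here is proposed wording. The point is that
every conversion shown happens inside the C++ library; the program has
no view of how each component is later interpreted by a mount point or
filesystem.

```cpp
#include <cstddef>
#include <filesystem>
#include <string>

// Hypothetical helper: counts the components of a C++-side path.
// Each component is read through native(), whose character type is
// implementation-defined (char on POSIX, wchar_t on Windows).
inline std::size_t count_components(const std::filesystem::path& p) {
    std::size_t n = 0;
    for (const auto& component : p) {
        const std::filesystem::path::string_type& raw = component.native();
        if (!raw.empty()) ++n;
    }
    return n;
}

// Hypothetical helper: converts a C++-side path to a narrow string.
// Any conversion is performed by the library, not by the OS kernel.
inline std::string to_narrow(const std::filesystem::path& p) {
    return p.string();
}
```

For a relative path such as A/B/C/D, count_components sees four
components, but it cannot tell whether A, B, C and D live on filesystems
with different encodings or interpretations.
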

Received on 2019-08-13 10:38:59