On 8/13/19 8:35 AM, Corentin Jabot wrote:

Chiming in with my favorite solution:
  • Forbid lossy source -> presumed execution encoding conversion (already ill-formed in GCC, but not in MSVC)
I think this may be reasonable.
  • Forbid u8/u16/u32 literals in non-Unicode-encoded files
I don't understand this at all.  u8/u16/u32 specify the encoding to be used at run-time.  The source file encoding isn't relevant at all (as Steve noted, source file characters are converted to internal encoding).
This may be useful, but needs more justification (preferably in the form of a paper).
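
For illustration, a minimal sketch of what those prefixes pin down (the character is arbitrary; char8_t as the u8 literal element type assumes C++20):

    // Regardless of how the source file itself is encoded, the compiler
    // converts the character to the encoding named by the prefix:
    const char8_t*  s8  = u8"\u00E9";  // UTF-8 code units:  0xC3 0xA9
    const char16_t* s16 = u"\u00E9";   // UTF-16 code unit:  0x00E9
    const char32_t* s32 = U"\u00E9";   // UTF-32 code unit:  0x000000E9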

I would expect changing the encoding of char would break everything... I'd leave char and wchar_t mostly alone and start clean on char8_t.
I agree, but I don't think that will be sufficient.  Not all projects are going to adopt char8_t.  A substantial portion, especially on Linux/UNIX systems, will choose to continue using UTF-8 with char.  I think we're going to have to provide Unicode support for char and char8_t (and char16_t, and perhaps char32_t).
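
A sketch of the two styles I expect to coexist (contents illustrative):

    // UTF-8 through char: the encoding is known only by convention,
    // not by the type.
    const char* legacy = "\xC3\xA9";    // "é" spelled as raw UTF-8 bytes
    // UTF-8 through char8_t: the type itself guarantees UTF-8 (C++20).
    const char8_t* typed = u8"\u00E9";  // the same "é"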

Anyhow, I agree with Tom that the names are not indicative
How about: "narrow/wide character literal encoding"?

"execution encoding" has a long history in both WG14 and WG21 (though not POSIX I think) and that makes me reluctant to try and challenge it.  In Slack, discussion, I think Steve Downey probably hit on the right approach; provide a formal definition of it.  I think we *might* be successful in using "execution encoding" to apply to both the compile-time and run-time encodings by extending the term with specific qualifiers; e.g., "presumed execution encoding" and "run-time/system/native execution encoding".

Tom.

On Tue, 13 Aug 2019 at 10:39, Niall Douglas <s_sourceforge@nedprod.com> wrote:
Before progressing with a solution, can I ask the question:

Is it politically feasible for C++23 and C2x to require
implementations to default to interpreting source files as either
(i) 7-bit ASCII or (ii) UTF-8? To be specific, char literals would thus
be either 7-bit ASCII or UTF-8.

(The reason for the 7-bit ASCII option is that it is a perfect subset of
UTF-8, and that C very much wants to keep the language implementable in
a small code base, i.e., without UTF-8 support. Note the qualifier
"default" as well.)

An answer to the above would determine how best to solve your issue,
Tom, I think. As much as we all expect IBM et al. to veto such a
proposal, one never gets anywhere without asking first.

Niall

On 13/08/2019 03:25, Tom Honermann wrote:
> I agree with this (mostly), but would prefer not to discuss it further in
> this thread.  The only reason I included the filesystem references is
> that the wording there uses "native" for an encoding that is related
> to (though distinct from) the encodings referenced in the codecvt and
> ctype wording, where "native" is also used.  This suggests that "native"
> serves (or should serve) a role in naming these run-time encodings, or
> is a source of conflation (or both).
>
> Tom.
>
> On 8/12/19 5:08 PM, Niall Douglas wrote:
>>>   1. [fs.path.type.cvt]p1 <http://eel.is/c++draft/fs.path.type.cvt#1>:
>>>      (though the definition provided here appears to be specific to path
>>>      names).
>>>      "The /native encoding/ of an ordinary character string is the
>>>      operating system dependent current encoding for path names.  The
>>>      /native encoding/ for wide character strings is the
>>>      implementation-defined execution wide-character set encoding."
>> We discussed the problems with the choice of normative wording in
>> http://eel.is/c++draft/fs.class.path#fs.path.cvt, if you remember,
>> during SG16's discussion of filesystem::path_view.
>>
>> The problem is that filesystem paths have different encoding and
>> interpretation per path component, i.e., for a path
>>
>> /A/B/C/D
>>
>> ... A, B, C, and D may each have its own encoding and
>> interpretation, depending on the mount points and filesystems configured
>> on the current system. This is not what the current normative
>> wording suggests: it appears to assume that some mapping exists
>> between C++ paths and OS kernel paths.
>>
>> There *is* a mapping, but it is 100% C++-side. The OS kernel generally
>> consumes arrays of bytes.
>>
>> A more correct normative wording would more clearly separate these two
>> kinds of path representation. OS kernel paths are arrays of `byte`, but
>> with certain implementation-defined byte sequences not permitted. C++
>> paths can be in char, wchar_t, char8_t, char16_t, char32_t, etc., and
>> there are well-defined conversions between those C++ paths and the array
>> of bytes supplied to the OS kernel. The standard can say nothing useful
>> about how the OS kernel may interpret the byte array C++ supplies to it.
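>>
>> A sketch of that separation (assumes C++20's char8_t-aware path
>> constructor; the path chosen is illustrative):
>>
>>     #include <filesystem>
>>     namespace fs = std::filesystem;
>>
>>     fs::path p = u8"/A/B/C/D";  // C++-side: the conversion from UTF-8
>>                                 // is well defined by the standard
>>     auto n = p.native();        // the implementation's native format,
>>                                 // handed to the OS as an array of bytes
>>     // How the kernel interprets those bytes, per component, is outside
>>     // the standard's remit.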
>>
>> If path_view starts on the standards track, I'll need to propose a document
>> fixing up http://eel.is/c++draft/fs.class.path#fs.path.cvt in any case.
>> But to come back to your original question: I think you ought to
>> split filesystem paths off from everything else, consider them separately,
>> and then I think you'll find it much easier to make the non-path
>> normative wording more consistent.
>>
>> Niall

_______________________________________________
SG16 Unicode mailing list
Unicode@isocpp.open-std.org
http://www.open-std.org/mailman/listinfo/unicode