Date: Tue, 13 Aug 2019 14:35:03 +0200
Chiming in with my favorite solution:
- Forbid lossy source -> presumed execution encoding conversion (already
ill-formed in GCC but not MSVC)
- Forbid u8/u16/u32 literals in non-Unicode-encoded files
- Expose the "presumed execution encoding" (= "narrow/wide character
literal encoding") as a consteval function returning the name as specified
by IANA
https://www.iana.org/assignments/character-sets/character-sets.txt
I would expect changing the encoding of char would break everything... I'd
leave char and wchar_t mostly alone and start clean on char8_t.
Anyhow, I agree with Tom that the names are not indicative.
How about "narrow/wide character literal encoding"?
On Tue, 13 Aug 2019 at 10:39, Niall Douglas <s_sourceforge_at_[hidden]>
wrote:
> Before progressing with a solution, can I ask the question:
>
> Is it politically feasible for C++ 23 and C 2x to require
> implementations to default to interpreting source files as either (i) 7
> bit ASCII or (ii) UTF-8? To be specific, char literals would thus be
> either 7 bit ASCII or UTF-8.
>
> (The reason for the 7 bit ASCII is that it is a perfect subset of UTF-8,
> and that C very much wants to retain the language being implementable in
> a small code base i.e. without UTF-8 support. Note the qualifier
> "default" as well)
>
> An answer to the above would determine how best to solve your issue Tom,
> I think. As much as we all expect IBM et al to veto such a proposal, one
> never gets anywhere without asking first.
>
> Niall
>
> On 13/08/2019 03:25, Tom Honermann wrote:
> > I agree with this (mostly), but would prefer not to discuss further in
> > this thread. The only reason I included the filesystem references is
> > because the wording there uses "native" for an encoding that is related
> > (though distinct) from the encodings referenced in the codecvt and ctype
> > wording, where "native" is also used. This suggests that "native"
> > serves (or should serve) a role in naming these run-time encodings, or
> > is a source of conflation (or both).
> >
> > Tom.
> >
> > On 8/12/19 5:08 PM, Niall Douglas wrote:
> >>> 1. [fs.path.type.cvt]p1 <http://eel.is/c++draft/fs.path.type.cvt#1>:
> >>> (though the definition provided here appears to be specific to path
> >>> names).
> >>> "The /native encoding/ of an ordinary character string is the
> >>> operating system dependent current encoding for path names. The
> >>> /native encoding/ for wide character strings is the
> >>> implementation-defined execution wide-character set encoding."
> >> We discussed the problems with the choice of normative wording in
> >> http://eel.is/c++draft/fs.class.path#fs.path.cvt, if you remember,
> >> during SG16's discussion of filesystem::path_view.
> >>
> >> The problem is that filesystem paths have different encoding and
> >> interpretation per-path-component i.e. for a path
> >>
> >> /A/B/C/D
> >>
> >> ... A, B, C and D may each have its own, individual, encoding and
> >> interpretation depending on the mount points and filesystems configured
> >> on the current system. This is not what is suggested by the current
> >> normative wording, which appears to think that some mapping exists
> >> between C++ paths and OS kernel paths.
> >>
> >> There *is* a mapping, but it is 100% C++-side. The OS kernel generally
> >> consumes arrays of bytes.
> >>
> >> A more correct normative wording would more clearly separate these two
> >> kinds of path representation. OS kernel paths are arrays of `byte`, but
> >> with certain implementation-defined byte sequences not permitted. C++
> >> paths can be in char, wchar_t, char8_t, char16_t, char32_t etc, and
> >> there are well defined conversions between those C++ paths and the array
> >> of bytes supplied to the OS kernel. The standard can say nothing useful
> >> about how the OS kernel may interpret the byte array C++ supplies to it.
> >>
> >> If path_view starts the standards track, I'll need to propose a document
> >> fixing up http://eel.is/c++draft/fs.class.path#fs.path.cvt in any case.
> >> But to come back to your original question, I think that you ought to
> >> split off filesystem paths from everything else, consider them separate,
> >> and then I think you'll find it much easier to make the non-path
> >> normative wording more consistent.
> >>
> >> Niall
> >> _______________________________________________
> >> SG16 Unicode mailing list
> >> Unicode_at_[hidden]
> >> http://www.open-std.org/mailman/listinfo/unicode
Received on 2019-08-13 14:35:17