Date: Wed, 14 Aug 2019 13:07:13 +0200
On Wed, Aug 14, 2019, 12:39 PM Niall Douglas <s_sourceforge_at_[hidden]>
wrote:
> Removed CC to Core, as per Tom's request.
>
> > I agree with you that reinterpreting all existing code overnight as
> > utf-8 would hinder the adoption of future c++ version enough that we
> > should probably avoid to do that, but maybe a slight encouragement to
> > use utf8 would be beneficial to everyone.
>
> I don't personally think it's a big ask for people to convert their
> source files into UTF-8 when they flip the compiler language standard
> version into C++ 23, *if they don't tell the compiler to interpret the
> source code in a different way*. As I mentioned in a previous post, even
> very complex multi-encoded legacy codebases can be upgraded via Python.
> Just invest the effort, upgrade your code, clear the tech debt. Same as
> everyone must do with every C++ standard version upgrade.
>
> Far more importantly, if the committee can assume unicode-clean source
> code going forth, that makes far more tractable lots of other problems
> such as how char string literals ought to be interpreted.
>
> Right now there is conflation in this discussion between two types of
> char string:
>
I don't think people (at least SG16) are confused. The standard does
conflate everything; I think that's why Tom asked about the names of these
things to begin with.
> 1. char strings which come from the runtime environment e.g. from
> argv[], which can be ANY arbitrary encoding, including arbitrary bits.
>
> 2. char strings which come from the compile time environment with
> compiler-imposed expectations of encoding e.g. from __FILE__
>
> 3. char strings which come from the compile time environment with
> arbitrary encoding and bits e.g. escaped characters inside string literals.
>
2 and 3 will have the same encoding. (Which will utterly fail when we try
to introduce Unicode identifiers and reflection.)
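
To make the three categories concrete, a minimal sketch (assuming a typical
hosted implementation; the comments map to Niall's numbering):

    #include <cstdio>

    int main(int argc, char* argv[])
    {
        // (1) runtime environment: argv[] can carry any bytes in any
        //     encoding the host chose; nothing is known at compile time
        const char* from_runtime = argc > 1 ? argv[1] : "";

        // (2) compile-time environment with compiler-imposed expectations
        //     of encoding: __FILE__ is produced by the compiler
        const char* from_compiler = __FILE__;

        // (3) compile-time environment, arbitrary bits: escapes let a
        //     literal hold bytes that need not be valid in any encoding
        const char* arbitrary_bits = "\xFF\xFE not necessarily valid UTF-8";

        std::printf("%s | %s | %s\n", from_runtime, from_compiler,
                    arbitrary_bits);
    }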
>
> This conflation is not helping the discussion get anywhere useful
> quickly. For example, one obvious solution to the above is that string
> literals gain a type of char8_maybe_t if they don't contain anything
> UTF-8 unsafe, and char8_maybe_t can implicitly convert to char8_t or to
> char.
>
Maybe we already have enough literal types.
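For reference, a rough library-level sketch of the idea; char8_maybe_t is
Niall's hypothetical name, nothing like it exists today, and the real thing
would have to be a core-language change to the type of UTF-8-safe literals:

    // Requires C++20 for char8_t. Purely illustrative.
    struct char8_maybe_t {
        char value;
        constexpr operator char() const { return value; }              // usable as ordinary char
        constexpr operator char8_t() const { return char8_t(value); }  // or as a UTF-8 code unit
    };

    constexpr char8_maybe_t c{'A'};
    constexpr char    as_char = c;   // implicit conversion to char
    constexpr char8_t as_utf8 = c;   // implicit conversion to char8_t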
>
> Various people have objected to my proposal on strawman grounds e.g. "my
> code would break". Firstly, if that is the case, your code is probably
> *already* broken, and "just happens" to work on your particular
> toolchain version. It won't be portable, in any case.
>
Agreed. But when I say these kinds of things, people make funny faces and
get annoyed at having the brokenness of their code/the standard pointed out.
So this option seems out, especially on Windows, where the system encoding
is not UTF-8.
>
> Secondly, as Tom suggested, some sort of #pragma to indicate encoding is
> probably unavoidable in the long run in any case, because the
> preprocessor also needs to know encoding. Anybody who has wrestled with
> files #including files of differing encoding, but insufficiently
> different that the compiler can't auto-detect the disparate encoding,
> will know what I mean. Far worse happens again when macros with content
> from one encoding are expanded into files with different encoding.
>
I don't see how the preprocessor factors into that; the mapping to the
internal encoding is done before that.
Also, a pragma doesn't help you when mixing EBCDIC and ASCII supersets.
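
For the record, the scenario being discussed looks roughly like this; the
#pragma spelling is entirely made up, no such directive is standardised:

    // latin1_legacy.h -- stored as ISO-8859-1 on disk
    #pragma encoding("ISO-8859-1")     // hypothetical per-file annotation
    #define LEGACY_GREETING "naïve"    // 'ï' is the single byte 0xEF here

    // main.cpp -- stored as UTF-8 on disk
    #pragma encoding("UTF-8")          // hypothetical
    #include "latin1_legacy.h"
    // Without per-file encoding information the implementation cannot know
    // that the macro's bytes were Latin-1 rather than (malformed) UTF-8;
    // with it, phase 1 could transcode each file before macro expansion.
    const char* s = LEGACY_GREETING;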
>
> The current situation of letting everybody do what they want is a mess.
>
Strongly agree.
That's what standardisation is for: imposition of order upon chaos.
>
> Just make the entire lot UTF-8! And let individual files opt-out if they
> want, or whole TUs if the user asks the compiler to do so, with the
> standard making it very clear that anything other than UTF-8 =
> implementation defined behaviour for C++ 23 onwards.
>
That is the pragmatic long-term solution, but not the pragmatic short-term
one; WG21 seems to favour the latter.
I would support such a thing. All other languages went there and it works
great for them. Python, for example, assumes UTF-8 in the absence of an
explicit encoding declaration.
>
> Niall
> _______________________________________________
> SG16 Unicode mailing list
> Unicode_at_[hidden]
> http://www.open-std.org/mailman/listinfo/unicode
>
Received on 2019-08-14 13:07:27