On Wed, Aug 14, 2019, 12:39 PM Niall Douglas <s_sourceforge@nedprod.com> wrote:
Removed CC to Core, as per Tom's request.

>> I agree with you that reinterpreting all existing code overnight as
>> UTF-8 would hinder the adoption of future C++ versions enough that we
>> should probably avoid doing that, but maybe a slight encouragement to
>> use UTF-8 would be beneficial to everyone.

> I don't personally think it's a big ask for people to convert their
> source files into UTF-8 when they flip the compiler language standard
> version to C++ 23, *if they don't tell the compiler to interpret the
> source code in a different way*. As I mentioned in a previous post, even
> very complex multi-encoded legacy codebases can be upgraded via Python.
> Just invest the effort, upgrade your code, clear the tech debt. Same as
> everyone must do with every C++ standard version upgrade.

> Far more importantly, if the committee can assume Unicode-clean source
> code going forward, lots of other problems become far more tractable,
> such as how char string literals ought to be interpreted.

> Right now there is conflation in this discussion between three types of
> char string:

I don't think people (at least SG16) are confused. The standard does conflate everything. I think that's why Tom asked about the names of these things to begin with.

> 1. char strings which come from the runtime environment, e.g. from
> argv[], which can be ANY arbitrary encoding, including arbitrary bits.

> 2. char strings which come from the compile time environment with
> compiler-imposed expectations of encoding, e.g. from __FILE__.

> 3. char strings which come from the compile time environment with
> arbitrary encoding and bits, e.g. escaped characters inside string literals.

2 and 3 will have the same encoding. (Which will utterly fail when we try to introduce Unicode identifiers and reflection.)
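For concreteness, the three categories side by side (a minimal sketch, standard C++ only; the escape byte is just an illustration):

    #include <cstdio>

    int main(int argc, char* argv[]) {
        // 1. Runtime environment: argv[] can carry any bytes in any
        //    encoding; nothing about it is knowable at compile time.
        const char* from_runtime = argc > 1 ? argv[1] : "";

        // 2. Compile-time environment, compiler-imposed encoding: the
        //    compiler decides how this path is spelled.
        const char* from_compiler = __FILE__;

        // 3. Compile-time environment, arbitrary encoding and bits: an
        //    escape can put a sequence into a literal that is not valid
        //    UTF-8 at all.
        const char* arbitrary_bits = "caf\xE9"; // lone 0xE9: Latin-1 e-acute

        std::puts(from_runtime);
        std::puts(from_compiler);
        std::puts(arbitrary_bits);
    }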

> This conflation is not helping the discussion get anywhere useful
> quickly. For example, one obvious solution to the above is that string
> literals gain a type of char8_maybe_t if they don't contain anything
> UTF-8 unsafe, and char8_maybe_t can implicitly convert to char8_t or to

Maybe we have enough literal types already.
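A rough sketch of the shape of that idea, for the record (char8_maybe_t is hypothetical, and a real version would need core-language support, since no library type can change the type of a literal):

    // Hypothetical: the element type a literal might get when it
    // contains nothing UTF-8-unsafe. Implicitly usable as either
    // char8_t or char. Needs C++20 for char8_t.
    struct char8_maybe_t {
        unsigned char v;
        constexpr operator char8_t() const { return static_cast<char8_t>(v); }
        constexpr operator char() const { return static_cast<char>(v); }
    };

    // The intent: "hello" would have type const char8_maybe_t[6] and
    // decay towards whichever of const char8_t* / const char* the
    // context asks for -- the part only the core language can deliver.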

> Various people have objected to my proposal on strawman grounds, e.g. "my
> code would break". Firstly, if that is the case, your code is probably
> *already* broken, and "just happens" to work on your particular
> toolchain version. It won't be portable, in any case.

Agreed. But when I say these kinds of things, people make funny faces and get annoyed at having the brokenness of their code (or of the standard) pointed out. So this option seems out, especially on Windows, where the system encoding is not UTF-8.

> Secondly, as Tom suggested, some sort of #pragma to indicate encoding is
> probably unavoidable in the long run in any case, because the
> preprocessor also needs to know the encoding. Anybody who has wrestled
> with files #including files of a differing encoding, where the encodings
> are not different enough for the compiler to auto-detect the disparity,
> will know what I mean. Far worse happens again when macros with content
> from one encoding are expanded into files with a different encoding.

I don't see how the preprocessor factors into that; the mapping to the internal encoding is done before preprocessing.

Also, a pragma doesn't help you when mixing EBCDIC and ASCII supersets.
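To make the macro-expansion scenario concrete (file names are made up, and the pragma spelling at the end is invented, not an existing feature):

    // legacy.h -- saved as Windows-1252: the e-acute below is the
    // single byte 0xE9
    #define GREETING "café"

    // main.cpp -- saved as UTF-8: the e-acute below is 0xC3 0xA9
    #include "legacy.h"
    const char* a = GREETING; // byte 0xE9 expanded into a UTF-8 file
    const char* b = "café";   // bytes 0xC3 0xA9: same source text, different bits

    // Roughly what a per-file marker could look like:
    //   #pragma encoding("windows-1252")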

> The current situation of letting everybody do what they want is a mess.

Strongly agree.

> That's what standardisation is for: imposition of order upon chaos.

> Just make the entire lot UTF-8! And let individual files opt out if they
> want, or whole TUs if the user asks the compiler to do so, with the
> standard making it very clear that anything other than UTF-8 =
> implementation-defined behaviour for C++ 23 onwards.

That is the pragmatic long-term solution, but not the pragmatic short-term one. WG21 favors the latter, it seems.

I would support such a thing. All other languages went there and it works great for them. Python, for example, will assume UTF-8 in the absence of a pragma.
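For what it's worth, the per-TU opt-out already exists in today's compilers, e.g. GCC's -finput-charset= and MSVC's /source-charset: (file name made up):

    // legacy.cpp -- deliberately left in Windows-1252, so the e-acute
    // below is the single byte 0xE9
    const char* s = "café";

    // Under a UTF-8-by-default C++ 23, this TU would opt out the same
    // way it already can today:
    //   g++ -finput-charset=windows-1252 legacy.cpp   (GCC)
    //   cl  /source-charset:windows-1252 legacy.cpp   (MSVC)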
