On 8/14/19 6:39 AM, Niall Douglas wrote:
Removed CC to Core, as per Tom's request.

I agree with you that reinterpreting all existing code overnight as
UTF-8 would hinder the adoption of future C++ versions enough that we
should probably avoid doing that, but maybe a slight encouragement to
use UTF-8 would be beneficial to everyone.
I don't personally think it's a big ask for people to convert their
source files into UTF-8 when they flip the compiler language standard
version into C++ 23, *if they don't tell the compiler to interpret the
source code in a different way*. As I mentioned in a previous post, even
very complex multi-encoded legacy codebases can be upgraded via Python.
Just invest the effort, upgrade your code, clear the tech debt. Same as
everyone must do with every C++ standard version upgrade.

I strongly disagree.  Having to convert system headers, for example, would be problematic.  Granted, on ASCII-based systems, use of characters outside ASCII in system headers is rare, but on EBCDIC-based systems, this is a complete non-starter.

Far more importantly, if the committee can assume Unicode-clean source
code going forward, that makes many other problems far more tractable,
such as how char string literals ought to be interpreted.

I agree that having everyone do things the same way makes some things easier.

However, I think you may be confused here.  Source file encoding has nothing to do with how string literals are interpreted.  If the compiler's assumption of source file encoding is wrong, then mojibake ensues and the problems are not limited to character and string literals.

Compilers operate by (often under the as-if rule) converting source files from source file encoding to an internal encoding (likely UTF-8) before parsing the code.  The standard describes this in [lex.phases]p1. This is why "é" and "\u00E9" are indistinguishable (except in a raw string literal ([lex.pptoken]p3.1) or when source file encoding assumptions are wrong).

Right now there is conflation in this discussion between three types of
char string:

1. char strings which come from the runtime environment e.g. from
argv[], which can be ANY arbitrary encoding, including arbitrary bits.

2. char strings which come from the compile time environment with
compiler-imposed expectations of encoding e.g. from __FILE__

3. char strings which come from the compile time environment with
arbitrary encoding and bits, e.g. escaped characters inside string literals.

This existing conflation is the motivation for this discussion: we don't seem to have clear terms for distinguishing these.

This conflation is not helping the discussion get anywhere useful
quickly. For example, one obvious solution to the above is that string
literals gain a type of char8_maybe_t if they don't contain anything
UTF-8 unsafe, and char8_maybe_t can implicitly convert to char8_t or to
char.

This discussion is not about the well-formedness of encoded data.
Various people have objected to my proposal on strawman grounds e.g. "my
code would break". Firstly, if that is the case, your code is probably
*already* broken, and "just happens" to work on your particular
toolchain version. It won't be portable, in any case.
I strongly disagree.  Not all code is intended to be portable, and relying on implementation-defined behavior does not make code broken.

Secondly, as Tom suggested, some sort of #pragma to indicate encoding is
probably unavoidable in the long run in any case, because the
preprocessor also needs to know encoding. Anybody who has wrestled with
files #including files of differing encoding, where the difference is
too subtle for the compiler to auto-detect the disparate encoding,
will know what I mean. Far worse happens again when macros with content
from one encoding are expanded into files with different encoding.

The current situation of letting everybody do what they want is a mess.
That's what standardisation is for: imposition of order upon chaos.
I agree with this.

Just make the entire lot UTF-8!
I don't think this is realistic or feasible, at least not for all platforms.
And let individual files opt out if they
want, or whole TUs if the user asks the compiler to do so, with the
standard making it very clear that anything other than UTF-8 =
implementation-defined behaviour for C++ 23 onwards.

This is already the status quo.  If something is implementation defined, then it is implementation defined.


SG16 Unicode mailing list