liaison: Re: [wg14/wg21 liaison] [isocpp-core] Source file encoding (was: What is the proper term for the locale dependent run-time character set/encoding used for the character classification and conversion functions?)

From: Niall Douglas <s_sourceforge_at_[hidden]>
Date: Wed, 14 Aug 2019 11:39:42 +0100

Removed CC to Core, as per Tom's request.

> I agree with you that reinterpreting all existing code overnight as
> UTF-8 would hinder the adoption of future C++ versions enough that we
> should probably avoid doing that, but maybe a slight encouragement to
> use UTF-8 would be beneficial to everyone.

I don't personally think it's a big ask for people to convert their
source files to UTF-8 when they flip the compiler language standard
version to C++ 23, *if they don't tell the compiler to interpret the
source code in a different way*. As I mentioned in a previous post, even
very complex multi-encoded legacy codebases can be converted with Python
scripts. Just invest the effort, upgrade your code, clear the tech debt.
Same as everyone must do with every C++ standard version upgrade.
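
To give a feel for how small the per-file conversion step is, here is a
minimal sketch. It is written in C++ rather than Python only to match
the language under discussion, and it assumes the legacy file is
ISO-8859-1 (Latin-1), which is the simplest case. The helper name
latin1_to_utf8 is hypothetical; a real codebase would need one such
mapping per legacy encoding.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: transcode ISO-8859-1 (Latin-1) bytes to UTF-8.
// Assumption: every input byte is a Latin-1 code point (U+0000..U+00FF).
inline std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (char ch : in) {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (c < 0x80) {
            // ASCII is encoded identically in Latin-1 and UTF-8.
            out.push_back(static_cast<char>(c));
        } else {
            // U+0080..U+00FF becomes a two-byte UTF-8 sequence.
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

For example, the Latin-1 bytes "caf\xE9" become "caf\xC3\xA9".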

Far more importantly, if the committee can assume Unicode-clean source
code going forward, lots of other problems become far more tractable,
such as how char string literals ought to be interpreted.

Right now there is conflation in this discussion between three types of
char string:

1. char strings which come from the runtime environment e.g. from
argv[], which can be ANY arbitrary encoding, including arbitrary bits.

2. char strings which come from the compile time environment with
compiler-imposed expectations of encoding e.g. from __FILE__

3. char strings which come from the compile time environment with
arbitrary encoding and bits e.g. escaped characters inside string literals.
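
To make the three cases concrete, here is a small sketch. All three
function names are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstring>

// Type 2: the compiler supplies the bytes and decides their encoding.
inline const char* compile_time_name() { return __FILE__; }

// Type 3: escapes let a literal carry arbitrary bytes. The bytes 0xFF
// and 0xFE never occur in well-formed UTF-8.
inline const char* escaped_bytes() { return "\xFF\xFE"; }

// Type 1: runtime strings (e.g. argv[i]) can only be inspected at run
// time. This check reports whether every byte is 7-bit ASCII.
inline bool is_plain_ascii(const char* s) {
    for (; *s != '\0'; ++s) {
        if (static_cast<unsigned char>(*s) >= 0x80) return false;
    }
    return true;
}
```

Only type 2 comes with any encoding expectation the compiler can
enforce; the other two have to be treated as raw bytes.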

This conflation is not helping the discussion get anywhere useful
quickly. For example, one obvious solution to the above is that string
literals gain a type of char8_maybe_t if they don't contain anything
UTF-8 unsafe, and char8_maybe_t can implicitly convert to char8_t or to
char.
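
As a rough sketch of what "UTF-8 unsafe" could mean, the check below
accepts a byte string only if it is well-formed UTF-8. The name
is_valid_utf8 is hypothetical, and this is a run-time function, whereas
a compiler would apply the equivalent test to each literal during
translation. It does not model char8_maybe_t itself, only the test that
would decide whether a literal qualifies.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: true if the bytes of s are well-formed UTF-8.
inline bool is_valid_utf8(const std::string& s) {
    // Smallest code point that legitimately needs a sequence of length 2, 3, 4.
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;              // stray continuation or invalid lead byte
        if (i + len > n) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (len > 1 && cp < min_cp[len]) return false;   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        i += len;
    }
    return true;
}
```

Under this test "hello" and "\xC3\xA9" would qualify for char8_maybe_t,
while a lone "\xE9" (Latin-1 e-acute) would remain plain char.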

Various people have objected to my proposal on strawman grounds e.g. "my
code would break". Firstly, if that is the case, your code is probably
*already* broken, and "just happens" to work on your particular
toolchain version. It won't be portable, in any case.

Secondly, as Tom suggested, some sort of #pragma to indicate encoding is
probably unavoidable in the long run in any case, because the
preprocessor also needs to know the encoding. Anybody who has wrestled
with files #including files of a differing encoding, where the encodings
are too similar for the compiler to auto-detect the difference, will
know what I mean. Far worse happens when macros with content from one
encoding are expanded into files with a different encoding.

The current situation of letting everybody do what they want is a mess.
That's what standardisation is for: imposition of order upon chaos.

Just make the entire lot UTF-8! And let individual files opt-out if they
want, or whole TUs if the user asks the compiler to do so, with the
standard making it very clear that anything other than UTF-8 =
implementation defined behaviour for C++ 23 onwards.

Niall

Received on 2019-08-14 05:41:48