
Re: [SG16-Unicode] [isocpp-core] Source file encoding

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 14 Aug 2019 09:37:30 -0400
On 8/14/19 6:39 AM, Niall Douglas wrote:
> Removed CC to Core, as per Tom's request.
>> I agree with you that reinterpreting all existing code overnight as
>> utf-8 would hinder the adoption of future c++ version enough that we
>> should probably avoid to do that, but maybe a slight encouragement to
>> use utf8 would be beneficial to everyone.
> I don't personally think it's a big ask for people to convert their
> source files into UTF-8 when they flip the compiler language standard
> version into C++ 23, *if they don't tell the compiler to interpret the
> source code in a different way*. As I mentioned in a previous post, even
> very complex multi-encoded legacy codebases can be upgraded via Python.
> Just invest the effort, upgrade your code, clear the tech debt. Same as
> everyone must do with every C++ standard version upgrade.

I strongly disagree. Having to convert system headers, for example, would
be problematic. Granted, on ASCII-based systems, use of characters
outside ASCII in system headers is rare, but on EBCDIC-based systems,
this is a complete non-starter.

> Far more importantly, if the committee can assume unicode-clean source
> code going forth, that makes far more tractable lots of other problems
> such as how char string literals ought to be interpreted.

I agree that having everyone do things the same way makes some things easier.

However, I think you may be confused here. Source file encoding has
nothing to do with how string literals are interpreted. If the
compiler's assumption of source file encoding is wrong, then mojibake
ensues and the problems are not limited to character and string literals.

Compilers operate (often under the as-if rule) by converting source
files from the source file encoding to an internal encoding (likely UTF-8)
before parsing the code. The standard describes this in [lex.phases]p1
<http://eel.is/c++draft/lex.phases#1.1>. This is why "é" and "\u00E9"
are indistinguishable (except in a raw string literal ([lex.pptoken]p3.1
<http://eel.is/c++draft/lex.pptoken#3.1>) or when source file encoding
assumptions are wrong).

> Right now there is conflation in this discussion between two types of
> char string:
> 1. char strings which come from the runtime environment e.g. from
> argv[], which can be ANY arbitrary encoding, including arbitrary bits.
> 2. char strings which come from the compile time environment with
> compiler-imposed expectations of encoding e.g. from __FILE__
> 3. char strings which come from the compiler time environment with
> arbitrary encoding and bits e.g. escaped characters inside string literals.
This existing conflation is the motivation for this discussion: we
don't seem to have clear terms for distinguishing these.
> This conflation is not helping the discussion get anywhere useful
> quickly. For example, one obvious solution to the above is that string
> literals gain a type of char8_maybe_t if they don't contain anything
> UTF-8 unsafe, and char8_maybe_t can implicitly convert to char8_t or to
> char.
This discussion is not about the well-formedness of encoded data.
> Various people have objected to my proposal on strawman grounds e.g. "my
> code would break". Firstly, if that is the case, your code is probably
> *already* broken, and "just happens" to work on your particular
> toolchain version. It won't be portable, in any case.
I strongly disagree. Not all code is intended to be portable, and
relying on implementation-defined behavior does not make code broken.
> Secondly, as Tom suggested, some sort of #pragma to indicate encoding is
> probably unavoidable in the long run in any case, because the
> preprocessor also needs to know encoding. Anybody who has wrestled with
> files #including files of differing encoding, but insufficiently
> different that the compiler can't auto-detect the disparate encoding,
> will know what I mean. Far worse happens again when macros with content
> from one encoding are expanded into files with different encoding.
> The current situation of letting everybody do what they want is a mess.
> That's what standardisation is for: imposition of order upon chaos.
I agree with this.
> Just make the entire lot UTF-8!
I don't think this is realistic or feasible, at least not for all platforms.
> And let individual files opt-out if they
> want, or whole TUs if the user asks the compiler to do so, with the
> standard making it very clear that anything other than UTF-8 =
> implementation defined behaviour for C++ 23 onwards.

This is already the status quo. If something is implementation-defined,
then it is implementation-defined.


> Niall
> _______________________________________________
> SG16 Unicode mailing list
> Unicode_at_[hidden]
> http://www.open-std.org/mailman/listinfo/unicode

Received on 2019-08-14 15:46:47