liaison: Re: [wg14/wg21 liaison] [isocpp-core] [SG16-Unicode] What is the proper term for the locale dependent run-time character set/encoding used for the character classification and conversion functions?

From: Niall Douglas <s_sourceforge_at_[hidden]>
Date: Tue, 13 Aug 2019 17:03:32 +0100

On 13/08/2019 15:27, Herring, Davis via Core wrote:
>> Is it politically feasible for C++ 23 and C 2x to require
>> implementations to default to interpreting source files as either (i) 7
>> bit ASCII or (ii) UTF-8? To be specific, char literals would thus be
>> either 7 bit ASCII or UTF-8.
>
> We could specify the source file directly as a sequence of ISO 10646 abstract characters, or even as a sequence of UTF-8 code units, but the implementation could choose to interpret the disk file to contain KOI-7 N1 with some sort of escape sequences for other characters. You might say "That's not UTF-8 on disk!", to which the implementation replies "That's how my operating system natively stores UTF-8." and the standard replies "What's a disk?".

I think that's an unproductive way of looking at the situation.

I'd prefer to look at it this way:

1. How much existing code gets broken if, when recompiled as C++ 23,
the default is now to assume UTF-8 input unless the input is obviously
not that?

(My guess: a fair bit of older code will break, but almost all of it
will never be compiled as C++ 23)

2. How much do we care if code containing non-UTF8 high bit characters
in its string literals breaks when the compiler language version is set
to C++ 23 or higher?

(My opinion: people using non-ASCII in string literals without an
accompanying unit test to verify the compiler is doing what they assumed
deserve to experience breakage)

3. What is the benefit to the ecosystem if the committee standardises
Unicode source files moving forwards?

(My opinion: people consistently underestimate the benefit if they live
in North America and work only with North American source code. I've had
contracts in the past where a full six weeks of my life went on
attempting mostly lossless up-conversions of source files in multiple
legacy encodings into UTF-8. Consider that most, but not all, use of
high bit characters in string literals is for testing that i18n code
works right in various borked character encodings, so yes, fun few
weeks. And by the way, there is an *amazing* Python module full of
machine learning heuristics for losslessly up-converting legacy
encodings to UTF-8; it saved me a ton of work)
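
To make points 1 and 2 concrete, here is a minimal sketch of both ideas. The first half shows one way to decide that input is "obviously not" UTF-8, and the second half is the kind of compile-time unit test I mean. The names `is_valid_utf8`, `raw` and `expected` are made up for illustration and are not taken from any standard or compiler. The test assumes the file holding it is saved as UTF-8 and that the expected narrow literal encoding is UTF-8.

```cpp
#include <cstddef>

// Returns true if the n bytes at s are well-formed UTF-8. Rejects stray
// continuation bytes, truncated sequences, overlong forms, surrogates
// and values above U+10FFFF. Needs C++14 or later (loops in constexpr).
constexpr bool is_valid_utf8(const char* s, std::size_t n) {
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;              // continuation or invalid lead byte
        if (i + len > n) return false;  // sequence runs past the end
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char bk = static_cast<unsigned char>(s[i + k]);
            if ((bk & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (bk & 0x3F);
        }
        if (len == 2 && cp < 0x80) return false;         // overlong
        if (len == 3 && cp < 0x800) return false;        // overlong
        if (len == 4 && cp < 0x10000) return false;      // overlong
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // out of range
        i += len;
    }
    return true;
}

// A literal typed with a raw non-ASCII character. Its bytes depend on how
// the compiler decodes this file and on the execution character set.
constexpr char raw[] = "é";
// The same character spelled as explicit UTF-8 bytes (U+00E9 is C3 A9).
constexpr char expected[] = "\xC3\xA9";

// The "unit test": compilation fails if the compiler did not produce the
// UTF-8 bytes the author assumed.
static_assert(sizeof(raw) == sizeof(expected),
              "literal is not two UTF-8 code units plus the terminator");
static_assert(raw[0] == expected[0] && raw[1] == expected[1],
              "literal bytes are not the UTF-8 encoding of U+00E9");
static_assert(is_valid_utf8(raw, sizeof(raw) - 1),
              "literal is not well-formed UTF-8");
```

A file saved as Latin-1 would hold the single byte 0xE9 for that character, which `is_valid_utf8` rejects, so a check like this is one plausible reading of "obviously not UTF-8". It is only a heuristic in the other direction: pure ASCII and some legacy-encoded byte sequences also pass as well-formed UTF-8.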

But all the above said:

4. Is this a productive use of committee time, when it would displace
other items?

(My opinion: No, probably not, we have much more important stuff before
WG21 for C++ 23. However, I wouldn't say the same for WG14; personally,
I think there is a much bigger bang for the buck over there. Hence I ask
here for objections; if there are none, I'll ask WG14 what they think of
the idea)

Niall

Received on 2019-08-13 11:05:39