sg16: Re: [SG16-Unicode] [isocpp-core] Source file encoding (was: What is the proper term for the locale dependent run-time character set/encoding used for the character classification and conversion functions?)

From: Corentin <corentin.jabot_at_[hidden]>
Date: Wed, 14 Aug 2019 11:00:16 +0200

On Wed, Aug 14, 2019, 4:17 AM Tom Honermann via Core <core_at_[hidden]>
wrote:

> Niall, this is again off topic for this thread. But now that you put
> this out there, I feel obligated to respond. But please start a new
> thread with a different set of mailing lists if you wish to continue
> this any further; this is not a CWG issue.
>
> On 8/13/19 12:03 PM, Niall Douglas via Liaison wrote:
> > On 13/08/2019 15:27, Herring, Davis via Core wrote:
> >>> Is it politically feasible for C++ 23 and C 2x to require
> >>> implementations to default to interpreting source files as either (i) 7
> >>> bit ASCII or (ii) UTF-8? To be specific, char literals would thus be
> >>> either 7 bit ASCII or UTF-8.
> >> We could specify the source file directly as a sequence of ISO 10646
> abstract characters, or even as a sequence of UTF-8 code units, but the
> implementation could choose to interpret the disk file to contain KOI-7 N1
> with some sort of escape sequences for other characters. You might say
> "That's not UTF-8 on disk!", to which the implementation replies "That's
> how my operating system natively stores UTF-8." and the standard replies
> "What's a disk?".
> > I think that's an unproductive way of looking at the situation.
> >
> > I'd prefer to look at it this way:
> >
> >
> > 1. How much existing code gets broken if when recompiled as C++ 23, the
> > default is now to assume UTF-8 input unless input is obviously not that?
> *All* code built on non-ASCII platforms, some amount of code (primarily
> in regions outside the US) that is currently built with the Microsoft
> compiler and encoded according to the Windows Active Code Page for that
> region, and source code encoded in Shift-JIS or GB18030.
> >
> > (My guess: a fair bit of older code will break, but almost all of it
> > will never be compiled as C++ 23)
>
> I think you'll need to find a way to measure the breakage if you want to
> pursue such a change.
>
> Personally, I don't think this is the right approach as adding more
> assumptions about encodings seems likely to lead to even more problems.
> My preference is to focus on explicit solutions like adding an encoding
> pragma similarly to what is done in Python and HTML and is existing
> practice for IBM's xlC compiler
> (
> https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.3.0/com.ibm.zos.v2r3.cbclx01/zos_pragma_filetag.htm
> ).
>

Except all cross platform (windows, Linux, Mac) code ever written - which
includes all of GitHub, etc, would use ASCII or utf8 already.
Most internal code would avoid non basic character set characters already.
Because they know it's not portable

So while I find the idea of pragma interesting, I question whether it is
the right default. I do not want to have to do that to 100% of the I have
or will ever write.

It doesn't mean a pragma is not helpful for people working on an old code
base so they can transition away from codepage encoding if they are ie, a
windows shop only. I think it would very much be.

I think it would also be useful to encourage utf8 by default even if that
would have no impact whatsoever on existing toolchains.

But at the same time it seems it would be beneficial to restrict the set of
features that require Unicode to be limited to Unicode source files,
including literals and identifiers outside of the basic character sets.
The intent is that making a program ill-formed (ndr) encourages a warning
which I really want to have when the compiler is not interpreting my utf-8
source as utf-8.

You could argue that people on windows
can just compile with /source-charset: utf-8, which yes they can and should
(it's standard practice in Qt, vcpkg, etc), but avoid potentially lossy
encoding due to wrong presumption of how a text file was encoded would help
people write portable code with the assurance that the compiler would not
miss interpret their intent silently.

I agree with you that reinterpreting all existing code overnight as utf-8
would hinder the adoption of future c++ version enough that we should
probably avoid to do that, but maybe a slight encouragement to use utf8
would be beneficial to everyone.

I agree with Niall, people in NA/Europe underestimate the extent of the
issue with source encoding.

> >
> >
> > 2. How much do we care if code containing non-UTF8 high bit characters
> > in its string literals breaks when the compiler language version is set
> > to C++ 23 or higher?
> >
> > (My opinion: people using non-ASCII in string literals without an
> > accompanying unit test to verify the compiler is doing what you assumed
> > deserve to experience breakage)
>
> Instead of non-ASCII, I think you mean characters outside the basic
> source character set.
>
> Testing practices have varied widely over time and across projects. I
> don't think it is acceptable to think it ok for other people's code to
> break because it wasn't developed to your standards.
>
> >
> >
> > 3. What is the benefit to the ecosystem if the committee standardises
> > Unicode source files moving forwards?
> >
> > (My opinion: people consistently underestimate the benefit if they live
> > in North America and work only with North American source code. I've had
> > contracts in the past where a full six weeks of my life went on
> > attempting mostly lossless up-conversions from multiple legacy encoded
> > source files into UTF-8 source files. Consider that most, but not all,
> > use of high bit characters in string literals is typically for testing
> > that i18n code works right in various borked character encodings, so
> > yes, fun few weeks. And by the way, there is an *amazing* Python module
> > full of machine learning heuristics for lossless upconverting legacy
> > encodings to UTF-8, it saved me a ton of work)
> I agree we need to provide better means for handling source file
> encodings. But this all-or-nothing approach strikes me as very costly.
> Many applications are composed from multiple projects. Improving support
> for UTF-8 encoded source files will require means to adopt them
> gradually. That means that there will be scenarios where a single TU is
> built from differently encoded source files. We need a more fine grained
> solution.
> >
> >
> > But all the above said:
> >
> > 4. Is this a productive use of committee time, when it would displace
> > other items?
> >
> > (My opinion: No, probably not, we have much more important stuff before
> > WG21 for C++ 23. However I wouldn't say the same for WG14, personally, I
> > think there is a much bigger bang for the buck over there. Hence I ask
> > here for objections, if none, I'll ask WG14 what they think of the idea)
>
> I think this is a productive use of SG16's time. I don't think it is a
> productive use of the rest of the committee's time until we have a
> proposal to offer.
>
> Tom.
>
> >
> >
> > Niall
> > _______________________________________________
> > Liaison mailing list
> > Liaison_at_[hidden]
> > Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/liaison
> > Link to this post: http://lists.isocpp.org/liaison/2019/08/0009.php
>
>
> _______________________________________________
> Core mailing list
> Core_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
> Link to this post: http://lists.isocpp.org/core/2019/08/7045.php
>

Received on 2019-08-14 11:17:26