liaison: Re: [wg14/wg21 liaison] [isocpp-core] Source file encoding

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 14 Aug 2019 09:37:35 -0400

On 8/14/19 5:00 AM, Corentin wrote:
>
>
> On Wed, Aug 14, 2019, 4:17 AM Tom Honermann via Core
> <core_at_[hidden]> wrote:
>
> Niall, this is again off topic for this thread. But now that you put
> this out there, I feel obligated to respond. But please start a new
> thread with a different set of mailing lists if you wish to continue
> this any further; this is not a CWG issue.
>
> On 8/13/19 12:03 PM, Niall Douglas via Liaison wrote:
> > On 13/08/2019 15:27, Herring, Davis via Core wrote:
> >>> Is it politically feasible for C++ 23 and C 2x to require
> >>> implementations to default to interpreting source files as
> >>> either (i) 7 bit ASCII or (ii) UTF-8? To be specific, char
> >>> literals would thus be either 7 bit ASCII or UTF-8.
> >> We could specify the source file directly as a sequence of ISO
> >> 10646 abstract characters, or even as a sequence of UTF-8 code
> >> units, but the implementation could choose to interpret the disk
> >> file to contain KOI-7 N1 with some sort of escape sequences for
> >> other characters. You might say "That's not UTF-8 on disk!", to
> >> which the implementation replies "That's how my operating system
> >> natively stores UTF-8." and the standard replies "What's a disk?".
> > I think that's an unproductive way of looking at the situation.
> >
> > I'd prefer to look at it this way:
> >
> >
> > 1. How much existing code gets broken if, when recompiled as C++ 23,
> > the default is now to assume UTF-8 input unless input is obviously
> > not that?
> *All* code built on non-ASCII platforms, some amount of code
> (primarily in regions outside the US) that is currently built with
> the Microsoft compiler and encoded according to the Windows Active
> Code Page for that region, and source code encoded in Shift-JIS or
> GB18030.
> >
> > (My guess: a fair bit of older code will break, but almost all of it
> > will never be compiled as C++ 23)
>
> I think you'll need to find a way to measure the breakage if you
> want to pursue such a change.
>
> Personally, I don't think this is the right approach, as adding more
> assumptions about encodings seems likely to lead to even more
> problems. My preference is to focus on explicit solutions like adding
> an encoding pragma, similar to what is done in Python and HTML and is
> existing practice for IBM's xlC compiler
> (https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.3.0/com.ibm.zos.v2r3.cbclx01/zos_pragma_filetag.htm).
>
>
>
> Except that all cross-platform (Windows, Linux, Mac) code ever
> written, which includes all of GitHub etc., would use ASCII or UTF-8
> already. Most internal code would already avoid characters outside
> the basic character set, because its authors know they are not
> portable.
I lack confidence that this is true, so citation needed please. I know
that Shift-JIS (for example) is still in use and we hear that from
Microsoft representatives. Regardless, I think it is a mistake to
assume that cross-platform code is more important than code that is
written for specific platforms.
>
> So while I find the idea of a pragma interesting, I question whether
> it is the right default. I do not want to have to do that to 100% of
> the code I have written or will ever write.

It would certainly be the wrong default if we were doing a clean room
design. But we are evolving a language that has been around for several
decades and that inherits from a language that was around for
considerably longer.

>
> It doesn't mean a pragma is not helpful for people working on an old
> code base so they can transition away from code page encodings if
> they are, e.g., a Windows-only shop. I think it very much would be.
>
> I think it would also be useful to encourage utf8 by default even if
> that would have no impact whatsoever on existing toolchains.
I agree. I strongly think the right approach is:

1. Keep source file encoding implementation defined.
2. Introduce the pragma option to explicitly specify per-source-file
    encoding.
3. Encourage implementors to provide options to default the assumed
    source file encoding to UTF-8 (in practice, most already provide this).
4. Encourage projects to pass /source-file-encoding-is-utf-8 (however
    spelled) to their compiler invocations.

That approach approximates the "right" default fairly closely if (4) is
followed (which may be an existing trend).
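
As a hedged illustration of (4): a project that expects its sources to
be read as UTF-8 can ask the compiler to confirm it. The sketch below is
not part of any proposal; source_decoded_as_utf8 is a made-up name, the
file is assumed to be saved as UTF-8, and the check covers only the one
literal it inspects. It relies on a u8 literal always being encoded as
UTF-8, so its size and bytes reveal how the source bytes were decoded.

```cpp
// Assumption: this file is stored on disk as UTF-8, so the character
// between the quotes below occupies the two bytes C3 A9.

// Hypothetical helper (not from the thread): true if the compiler decoded
// those two source bytes as the single character U+00E9.
constexpr bool source_decoded_as_utf8() {
    // Decoded as UTF-8: the literal holds C3 A9 plus the terminating NUL.
    // Decoded through a single-byte code page such as Windows-1252: each
    // source byte becomes its own character and the literal gets longer.
    return sizeof(u8"é") == 3
        && static_cast<unsigned char>(u8"é"[0]) == 0xC3
        && static_cast<unsigned char>(u8"é"[1]) == 0xA9;
}

static_assert(source_decoded_as_utf8(),
              "source file was not decoded as UTF-8; check the compiler's "
              "source encoding option");
```

With the Microsoft compiler the relevant option is /source-charset:utf-8
(or /utf-8); with GCC it is -finput-charset=UTF-8. A check like this
catches only a mismatch that changes this particular literal, so it is a
tripwire rather than a general diagnostic.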

>
>
> But at the same time it seems it would be beneficial to restrict the
> set of features that require Unicode, including literals and
> identifiers outside of the basic character sets, to Unicode source
> files.
> The intent is that making a program ill-formed (NDR) encourages a
> warning, which I really want to have when the compiler is not
> interpreting my UTF-8 source as UTF-8.
I strongly disagree with this. I think you are conflating two distinct
things (source file encoding and support for Unicode) as a proxy to get
a diagnostic that, in practice, would not be reliable.
>
>
> You could argue that people on Windows
> can just compile with /source-charset:utf-8, which yes they can and
> should (it's standard practice in Qt, vcpkg, etc), but avoiding
> potentially lossy encoding due to a wrong presumption of how a text
> file was encoded would help people write portable code with the
> assurance that the compiler would not silently misinterpret their
> intent.
>
> I agree with you that reinterpreting all existing code overnight as
> UTF-8 would hinder the adoption of future C++ versions enough that we
> should probably avoid doing that, but maybe a slight encouragement to
> use UTF-8 would be beneficial to everyone.
>
> I agree with Niall, people in NA/Europe underestimate the extent of
> the issue with source encoding.

I agree with this. But I think there is a reverse underestimation as
well - that being the extent to which people outside English speaking
regions use non-UTF-8 encodings. IBM/Windows code pages and the ISO-8859
series of character sets have a long history. I think there is good
reason to believe they are still in use, particularly in older code bases.
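
One rough way to put numbers on this for a given code base is to count
the source files that are not well-formed UTF-8. The function below is
only a sketch of such a check; is_well_formed_utf8 is a hypothetical
name, and a file that passes is not thereby proven to have been written
as UTF-8 (pure ASCII passes, and other encodings can pass by accident).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `bytes` is a well-formed UTF-8
// sequence (no stray or missing continuation bytes, no overlong forms,
// no surrogates, nothing above U+10FFFF).
bool is_well_formed_utf8(const std::string& bytes) {
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;       // total bytes in this sequence
        unsigned long cp = 0;      // code point being assembled
        unsigned long min_cp = 0;  // smallest code point allowed for len
        if (b0 < 0x80)                { len = 1; cp = b0;        min_cp = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
        else return false;  // continuation byte or invalid byte in lead position

        if (len > n - i) return false;  // sequence is cut off by end of input

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }

        if (cp < min_cp) return false;                   // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate code point
        i += len;
    }
    return true;
}
```

For example, the ISO-8859-1 bytes "caf\xE9" are rejected because 0xE9
announces a three-byte sequence that never arrives, while the UTF-8
bytes "caf\xC3\xA9" are accepted.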

Tom.

>
>
>
>
> >
> >
> > 2. How much do we care if code containing non-UTF8 high bit
> > characters in its string literals breaks when the compiler language
> > version is set to C++ 23 or higher?
> >
> > (My opinion: people using non-ASCII in string literals without an
> > accompanying unit test to verify the compiler is doing what you
> > assumed deserve to experience breakage)
>
> Instead of non-ASCII, I think you mean characters outside the basic
> source character set.
>
> Testing practices have varied widely over time and across projects.
> I don't think it is acceptable for other people's code to break
> because it wasn't developed to your standards.
>
> >
> >
> > 3. What is the benefit to the ecosystem if the committee
> > standardises Unicode source files moving forwards?
> >
> > (My opinion: people consistently underestimate the benefit if they
> > live in North America and work only with North American source code.
> > I've had contracts in the past where a full six weeks of my life went
> > on attempting mostly lossless up-conversions from multiple legacy
> > encoded source files into UTF-8 source files. Consider that most, but
> > not all, use of high bit characters in string literals is typically
> > for testing that i18n code works right in various borked character
> > encodings, so yes, fun few weeks. And by the way, there is an
> > *amazing* Python module full of machine learning heuristics for
> > lossless upconverting legacy encodings to UTF-8, it saved me a ton
> > of work)
> I agree we need to provide better means for handling source file
> encodings. But this all-or-nothing approach strikes me as very
> costly. Many applications are composed from multiple projects.
> Improving support for UTF-8 encoded source files will require means to
> adopt them gradually. That means that there will be scenarios where a
> single TU is built from differently encoded source files. We need a
> more fine grained solution.
> >
> >
> > But all the above said:
> >
> > 4. Is this a productive use of committee time, when it would
> > displace other items?
> >
> > (My opinion: No, probably not, we have much more important stuff
> > before WG21 for C++ 23. However I wouldn't say the same for WG14,
> > personally, I think there is a much bigger bang for the buck over
> > there. Hence I ask here for objections, if none, I'll ask WG14 what
> > they think of the idea)
>
> I think this is a productive use of SG16's time. I don't think it is
> a productive use of the rest of the committee's time until we have a
> proposal to offer.
>
> Tom.
>
> >
> >
> > Niall

Received on 2019-08-14 08:39:37