Subject: Re: [SG16-Unicode] [isocpp-core] Source file encoding
From: Corentin (corentin.jabot_at_[hidden])
Date: 2019-08-14 08:48:20

On Wed, Aug 14, 2019, 3:37 PM Tom Honermann <tom_at_[hidden]> wrote:

> On 8/14/19 5:00 AM, Corentin wrote:
> On Wed, Aug 14, 2019, 4:17 AM Tom Honermann via Core <
> core_at_[hidden]> wrote:
>> Niall, this is again off topic for this thread. But now that you put
>> this out there, I feel obligated to respond. But please start a new
>> thread with a different set of mailing lists if you wish to continue
>> this any further; this is not a CWG issue.
>> On 8/13/19 12:03 PM, Niall Douglas via Liaison wrote:
>> > On 13/08/2019 15:27, Herring, Davis via Core wrote:
>> >>> Is it politically feasible for C++ 23 and C 2x to require
>> >>> implementations to default to interpreting source files as either (i)
>> >>> 7 bit ASCII or (ii) UTF-8? To be specific, char literals would thus be
>> >>> either 7 bit ASCII or UTF-8.
>> >> We could specify the source file directly as a sequence of ISO 10646
>> >> abstract characters, or even as a sequence of UTF-8 code units, but the
>> >> implementation could choose to interpret the disk file to contain KOI-7 N1
>> >> with some sort of escape sequences for other characters. You might say
>> >> "That's not UTF-8 on disk!", to which the implementation replies "That's
>> >> how my operating system natively stores UTF-8." and the standard replies
>> >> "What's a disk?".
>> > I think that's an unproductive way of looking at the situation.
>> >
>> > I'd prefer to look at it this way:
>> >
>> >
>> > 1. How much existing code gets broken if, when recompiled as C++ 23, the
>> > default is now to assume UTF-8 input unless input is obviously not that?
>> *All* code built on non-ASCII platforms, some amount of code (primarily
>> in regions outside the US) that is currently built with the Microsoft
>> compiler and encoded according to the Windows Active Code Page for that
>> region, and source code encoded in Shift-JIS or GB18030.
>> >
>> > (My guess: a fair bit of older code will break, but almost all of it
>> > will never be compiled as C++ 23)
>> I think you'll need to find a way to measure the breakage if you want to
>> pursue such a change.
>> Personally, I don't think this is the right approach as adding more
>> assumptions about encodings seems likely to lead to even more problems.
>> My preference is to focus on explicit solutions like adding an encoding
>> pragma similarly to what is done in Python and HTML and is existing
>> practice for IBM's xlC compiler
>> (
>> https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.3.0/com.ibm.zos.v2r3.cbclx01/zos_pragma_filetag.htm
>> ).
> Except all cross-platform (Windows, Linux, Mac) code ever written, which
> includes all of GitHub, etc., would use ASCII or UTF-8 already.
> Most internal code would already avoid characters outside the basic
> character set, because its authors know they are not portable.
> I lack confidence that this is true, so citation needed please. I know
> that Shift-JIS (for example) is still in use and we hear that from
> Microsoft representatives. Regardless, I think it is a mistake to assume
> that cross-platform code is more important than code that is written for
> specific platforms.
> So while I find the idea of a pragma interesting, I question whether it is
> the right default. I do not want to have to do that to 100% of the code I
> have written or will ever write.
> It would certainly be the wrong default if we were doing a clean room
> design. But we are evolving a language that has been around for several
> decades and that inherits from a language that was around for considerably
> longer.
> It doesn't mean a pragma is not helpful for people working on an old code
> base, so they can transition away from code page encodings if they are,
> e.g., a Windows-only shop. I think it very much would be.
> I think it would also be useful to encourage UTF-8 by default even if that
> would have no impact whatsoever on existing toolchains.
> I agree. I strongly think the right approach is:
> 1. Keep source file encoding implementation defined.
> 2. Introduce the pragma option to explicitly specify per-source-file
> encoding.
> 3. Encourage implementors to provide options to default the assumed
> source file encoding to UTF-8 (in practice, most already provide this)
> 4. Encourage projects to pass /source-file-encoding-is-utf-8 (however
> spelled) to their compiler invocations.
> That approach approximates the "right" default fairly closely if (4) is
> followed (which may be an existing trend).
That would work.
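
As a rough illustration of what "obviously not UTF-8" could mean for a default-to-UTF-8 option, here is a minimal sketch of a well-formedness check over a file's bytes. The function name is hypothetical and the code is not taken from any compiler; it follows the table of well-formed UTF-8 byte sequences in the Unicode Standard and assumes C++17 for std::string_view. Passing proves little on its own: pure ASCII always passes, and a legacy-encoded file can be well-formed UTF-8 by accident.

```cpp
#include <cstddef>
#include <string_view>

// Returns true if `bytes` is well-formed UTF-8. Overlong forms, surrogates
// (U+D800..U+DFFF) and values above U+10FFFF are rejected.
inline bool is_well_formed_utf8(std::string_view bytes) {
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        if (b0 <= 0x7F) {  // ASCII, one byte
            ++i;
            continue;
        }
        std::size_t len = 0;
        unsigned char lo = 0x80, hi = 0xBF;  // allowed range for the second byte
        if (b0 >= 0xC2 && b0 <= 0xDF) { len = 2; }
        else if (b0 == 0xE0) { len = 3; lo = 0xA0; }   // excludes overlong forms
        else if (b0 >= 0xE1 && b0 <= 0xEC) { len = 3; }
        else if (b0 == 0xED) { len = 3; hi = 0x9F; }   // excludes surrogates
        else if (b0 >= 0xEE && b0 <= 0xEF) { len = 3; }
        else if (b0 == 0xF0) { len = 4; lo = 0x90; }   // excludes overlong forms
        else if (b0 >= 0xF1 && b0 <= 0xF3) { len = 4; }
        else if (b0 == 0xF4) { len = 4; hi = 0x8F; }   // caps at U+10FFFF
        else { return false; }  // 0x80..0xC1 and 0xF5..0xFF never start a sequence

        if (n - i < len) return false;  // sequence truncated at end of input
        const unsigned char b1 = static_cast<unsigned char>(bytes[i + 1]);
        if (b1 < lo || b1 > hi) return false;
        for (std::size_t k = 2; k < len; ++k) {
            const unsigned char bk = static_cast<unsigned char>(bytes[i + k]);
            if (bk < 0x80 || bk > 0xBF) return false;  // not a continuation byte
        }
        i += len;
    }
    return true;
}
```

A Latin-1 or Windows-1252 file containing a lone 0xE9 byte fails this check, which is the sort of case where falling back to another encoding, or at least warning, would be reasonable.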

> But at the same time it seems it would be beneficial to restrict the
> features that require Unicode, including literals and identifiers outside
> of the basic character sets, to Unicode source files.
> The intent is that making a program ill-formed (NDR) encourages a warning,
> which I really want to have when the compiler is not interpreting my UTF-8
> source as UTF-8.
> I strongly disagree with this. I think you are conflating two distinct
> things (source file encoding and support for Unicode) as a proxy to get a
> diagnostic that, in practice, would not be reliable.

I am not. Of all the things for which implementation-defined behavior might
be beneficial, how identifiers and text are handled by compilers is not one
of them. I know we disagree on that. But it will be hard to convince me that
predictable compiler behavior is not valuable in this instance.

> You could argue that people on Windows
> can just compile with /source-charset:utf-8, which yes they can and
> should (it's standard practice in Qt, vcpkg, etc.), but avoiding potentially
> lossy encoding due to a wrong presumption of how a text file was encoded
> would help people write portable code with the assurance that the compiler
> would not silently misinterpret their intent.
> I agree with you that reinterpreting all existing code overnight as UTF-8
> would hinder the adoption of future C++ versions enough that we should
> probably avoid doing that, but maybe a slight encouragement to use UTF-8
> would be beneficial to everyone.
> I agree with Niall, people in NA/Europe underestimate the extent of the
> issue with source encoding.
> I agree with this. But I think there is a reverse underestimation as well
> - that being the extent to which people outside English speaking regions
> use non-UTF-8 encodings. IBM/Windows code pages and the ISO-8859 series of
> character sets have a long history. I think there is good reason to
> believe they are still in use, particularly in older code bases.

Generalisation: no C++ developer has the tools to internationalize their
software. And all of the ISO encodings combined still cover a very small
subset of the characters in use compared to Unicode.

> Tom.
>> >
>> >
>> > 2. How much do we care if code containing non-UTF8 high bit characters
>> > in its string literals breaks when the compiler language version is set
>> > to C++ 23 or higher?
>> >
>> > (My opinion: people using non-ASCII in string literals without an
>> > accompanying unit test to verify the compiler is doing what you assumed
>> > deserve to experience breakage)
>> Instead of non-ASCII, I think you mean characters outside the basic
>> source character set.
>> Testing practices have varied widely over time and across projects. I
>> don't think it is acceptable to think it ok for other people's code to
>> break because it wasn't developed to your standards.
>> >
>> >
>> > 3. What is the benefit to the ecosystem if the committee standardises
>> > Unicode source files moving forwards?
>> >
>> > (My opinion: people consistently underestimate the benefit if they live
>> > in North America and work only with North American source code. I've had
>> > contracts in the past where a full six weeks of my life went on
>> > attempting mostly lossless up-conversions from multiple legacy encoded
>> > source files into UTF-8 source files. Consider that most, but not all,
>> > use of high bit characters in string literals is typically for testing
>> > that i18n code works right in various borked character encodings, so
>> > yes, fun few weeks. And by the way, there is an *amazing* Python module
>> > full of machine learning heuristics for lossless upconverting legacy
>> > encodings to UTF-8, it saved me a ton of work)
>> I agree we need to provide better means for handling source file
>> encodings. But this all-or-nothing approach strikes me as very costly.
>> Many applications are composed from multiple projects. Improving support
>> for UTF-8 encoded source files will require means to adopt them
>> gradually. That means that there will be scenarios where a single TU is
>> built from differently encoded source files. We need a more fine grained
>> solution.
>> >
>> >
>> > But all the above said:
>> >
>> > 4. Is this a productive use of committee time, when it would displace
>> > other items?
>> >
>> > (My opinion: No, probably not, we have much more important stuff before
>> > WG21 for C++ 23. However I wouldn't say the same for WG14, personally, I
>> > think there is a much bigger bang for the buck over there. Hence I ask
>> > here for objections, if none, I'll ask WG14 what they think of the idea)
>> I think this is a productive use of SG16's time. I don't think it is a
>> productive use of the rest of the committee's time until we have a
>> proposal to offer.
>> Tom.
>> >
>> >
>> > Niall
>> > _______________________________________________
>> > Liaison mailing list
>> > Liaison_at_[hidden]
>> > Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/liaison
>> > Link to this post: http://lists.isocpp.org/liaison/2019/08/0009.php
>> _______________________________________________
>> Core mailing list
>> Core_at_[hidden]
>> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
>> Link to this post: http://lists.isocpp.org/core/2019/08/7045.php

SG16 list run by sg16-owner@lists.isocpp.org