Re: [SG16-Unicode] [isocpp-core] Source file encoding

From: Niall Douglas <s_sourceforge_at_[hidden]>
Date: Wed, 14 Aug 2019 15:50:28 +0100

On 14/08/2019 14:37, Tom Honermann wrote:
> On 8/14/19 7:19 AM, Niall Douglas wrote:
>> Lots of great points earlier. I mostly agree with them.
>>
>>> I would support such a thing. All other languages went there and it
>>> works great for them. Python will for example assume utf8 in the absence
>>> of pragma.
>> This will be probably an underappreciated point: Python started off
>> pre-Unicode, same as C++, and later on switched the default from "your
>> current C locale" (i.e. only 7-bit ASCII was portable) into utf-8.
>>
>> Their world did not end. Some users complained, sure, but because it was
>> announced in advance, and one could pragma opt-out, it was fine.
>
> I suggest you read https://snarky.ca/why-python-3-exists. Some choice
> quotes:

The switchover happened long before Python 3.

However, it actually turns out that my memory is incorrect, and more
importantly, what Python actually did is instructive.

I had been thinking of when Python encoding pragmas came in. I
remembered that old Python source written in Latin-1 stopped being
accepted; this happened to my own code at the time, and I had to convert
its encoding. I had thought that utf-8 became the new default, but that
is in fact wrong for Python 2.

What actually happened is that from Python 2.5 onwards, source code must
be 7-bit clean **ASCII**. Anything else produces an error.

If you want utf-8 source code for Python, you have several mechanisms
for telling Python it is utf-8. But if you choose none of them, any
byte with the high bit set is rejected, because it is unclear what the
programmer intended.

That seems to me a great precedent to choose here.
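
To make the "7-bit clean or refuse" rule concrete, here is a minimal
sketch of the check in C++17. The function name first_non_ascii is made
up for illustration; this is not how Python or any compiler actually
implements it, only the rule as described above.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper: returns the offset of the first byte with the
// high bit set, or std::nullopt if the buffer is 7-bit clean ASCII.
std::optional<std::size_t> first_non_ascii(std::string_view source) {
    for (std::size_t i = 0; i < source.size(); ++i) {
        // Cast first: plain char may be signed, so a raw comparison
        // against 0x7F would miss bytes in the 0x80..0xFF range.
        if (static_cast<unsigned char>(source[i]) > 0x7F) {
            return i;
        }
    }
    return std::nullopt;
}
```

A front end following this rule would emit a hard error at the returned
offset when no encoding has been declared, rather than guess.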

>> C++ could do with being bolder in becoming simpler and less surprising
>> for end users. It is not unreasonable for a German to type an umlaut
>> into a string literal, and expect that C++ source code to be portable
>> and unsurprising by default.
>
> Personally, I appreciate that the C++ committee is sensitive to backward
> compatibility. I agree we need to make things easier for programmers,
> and there are steps we can take that don't require a
> utf-8-all-the-things approach.

How about an ASCII-all-the-source approach instead?

#pragma encoding <encoding> switches encoding from the #pragma onwards.

If the compiler is in C++23 mode and source code containing high-bit-set
characters does not specify an encoding, the compiler refuses to
compile it.
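
As a rough sketch of that rule, assuming the directive is spelled
exactly "#pragma encoding" as above (no such pragma exists in any
compiler today), a checker could look like the following. The function
name first_undeclared_line is invented for illustration, the prefix
match is deliberately naive, and the "C++23 mode" condition and
validation of the named encoding are left out.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical directive spelling, taken from the proposal in this message.
constexpr std::string_view kEncodingPragma = "#pragma encoding";

// Returns the 1-based number of the first line that contains a high-bit
// byte before any "#pragma encoding" line, or std::nullopt if the source
// would be accepted under the proposed rule.
std::optional<std::size_t> first_undeclared_line(std::string_view source) {
    std::size_t line_no = 0;
    while (!source.empty()) {
        ++line_no;
        const std::size_t eol = source.find('\n');
        const std::string_view line = source.substr(0, eol);
        source.remove_prefix(eol == std::string_view::npos ? source.size()
                                                           : eol + 1);

        // The pragma applies from its own line onwards, so everything
        // after it is covered by the declared encoding.
        const std::size_t first = line.find_first_not_of(" \t");
        if (first != std::string_view::npos &&
            line.substr(first, kEncodingPragma.size()) == kEncodingPragma) {
            return std::nullopt;
        }

        // No encoding declared yet: any high-bit byte is an error.
        for (const char c : line) {
            if (static_cast<unsigned char>(c) > 0x7F) {
                return line_no;
            }
        }
    }
    return std::nullopt;
}
```

Pure-ASCII source passes untouched, so existing portable code would not
need the pragma at all.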

Niall

Received on 2019-08-14 16:50:42