Subject: Re: [SG16-Unicode] [isocpp-core] Source file encoding
From: Tom Honermann (tom_at_[hidden])
Date: 2019-08-14 22:31:16

On 8/14/19 10:50 AM, Niall Douglas wrote:
> On 14/08/2019 14:37, Tom Honermann wrote:
>> On 8/14/19 7:19 AM, Niall Douglas wrote:
>>> Lots of great points earlier. I mostly agree with them.
>>>> I would support such a thing. All other languages went there and it
>>>> works great for them. Python will for example assume utf8 in the absence
>>>> of pragma.
>>> This is probably an underappreciated point: Python started off
>>> pre-Unicode, same as C++, and later on switched the default from "your
>>> current C locale" (i.e. only 7-bit ASCII was portable) into utf-8.
>>> Their world did not end. Some users complained, sure, but because it was
>>> announced in advance, and one could pragma opt-out, it was fine.
>> I suggest you read https://snarky.ca/why-python-3-exists.  Some choice
>> quotes:
> The switchover happened long before Python 3.
> However, it actually turns out that my memory is incorrect, and more
> importantly, what Python actually did is instructive.
> I had been thinking of when Python encoding pragmas came in, I had
> remembered that old Python written in Latin1 stopped being accepted. I
> remembered this occurring for my own code at the time, and I had to
> upgrade encoding. I had thought that utf-8 was the new default, but this
> is in fact wrong for Python 2.
> What actually happened is that from Python 2.5 onwards, source code must
> be 7-bit clean **ASCII**. Anything else produces an error.
> If you want utf-8 source code for Python, you have a long list of
> mechanisms for telling Python it is utf-8. But lack of choosing any
> mechanism means any 8-bit-set characters = refusal to consume, because
> it is unclear what the programmer intends here.
> That seems to me a great precedent to choose here.
>>> C++ could do with being bolder in becoming simpler and less surprising
>>> for end users. It is not unreasonable for a German to type an umlaut
>>> into a string literal, and expect that C++ source code to be portable
>>> and unsurprising by default.
>> Personally, I appreciate that the C++ committee is sensitive to backward
>> compatibility.  I agree we need to make things easier for programmers,
>> and there are steps we can take that don't require a
>> utf-8-all-the-things approach.
> How about an ASCII-all-the-source approach instead?

Copying from an earlier reply to you...

There are existing implementations where, by default, source files are
assumed to be encoded with some EBCDIC code page. I don’t want to break
those implementations, nor impose the significant burden such a change
would place on users of those implementations.

> #pragma encoding <encoding> switches encoding from the #pragma onwards.

I'm opposed to this behavior because the result is source files that
don't have just one encoding.  This is what Python does and it enables
strange things like files that start out in ASCII and then switch to
UTF-16.  My preferred approach is a #pragma directive that may appear
at most once, must appear before any preprocessor directives or source
tokens (but may follow comments), and must appear in the first 4096
bytes of the source file (a common size for a single page of memory) so
that it can be found with a quick scan.
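
To make the placement rule concrete, here is a rough sketch of the kind of
quick scan I have in mind.  Nothing here is proposed wording: the spelling
`#pragma encoding NAME` is borrowed from the suggestion quoted above, the
function name is hypothetical, and the sketch assumes the leading bytes of
the file are ASCII-compatible.  It only locates the first directive; it
does not diagnose a second occurrence later in the file, and it ignores
line splicing inside comments.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Window taken from the text above: the directive must appear in the
// first 4096 bytes of the source file.
constexpr std::size_t kScanLimit = 4096;

// Hypothetical helper.  Skips whitespace and comments, then expects the
// first remaining content to be `#pragma encoding NAME` (assumed spelling).
// Returns NAME as a view into `source`, or std::nullopt if anything else
// comes first or the directive is not complete within the window.
constexpr std::optional<std::string_view>
find_encoding_pragma(std::string_view source) {
    const std::string_view s = source.substr(0, kScanLimit);
    std::size_t i = 0;
    while (i < s.size()) {
        const char c = s[i];
        if (c == ' ' || c == '\t' || c == '\r' || c == '\n') {
            ++i;  // plain whitespace
        } else if (s.substr(i, 2) == "//") {
            const std::size_t nl = s.find('\n', i);
            if (nl == std::string_view::npos) return std::nullopt;
            i = nl + 1;  // skip the line comment
        } else if (s.substr(i, 2) == "/*") {
            const std::size_t end = s.find("*/", i + 2);
            if (end == std::string_view::npos) return std::nullopt;
            i = end + 2;  // skip the block comment
        } else {
            break;  // first directive or source token
        }
    }

    constexpr std::string_view kPrefix = "#pragma encoding ";
    if (s.substr(i, kPrefix.size()) != kPrefix) return std::nullopt;
    i += kPrefix.size();

    // The directive must end (newline) inside the scanned window.
    const std::size_t eol = s.find_first_of("\r\n", i);
    if (eol == std::string_view::npos) return std::nullopt;

    // Trim blanks around the encoding name.
    std::string_view name = s.substr(i, eol - i);
    while (!name.empty() && (name.front() == ' ' || name.front() == '\t'))
        name.remove_prefix(1);
    while (!name.empty() && (name.back() == ' ' || name.back() == '\t'))
        name.remove_suffix(1);
    if (name.empty()) return std::nullopt;
    return name;
}
```

Because the scan stops at the first directive or token, a pragma that
follows an #include or a declaration is simply not found, which is the
point: the whole file has exactly one encoding, known before lexing starts.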


> Failing to specify encoding for any source code containing high bit set
> characters, and the compiler being in C++ 23 mode, equals refusal to
> compile.
> Niall
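
For reference, the rule quoted above (no declared encoding plus any
high-bit-set byte means the file is rejected) amounts to a check like the
following.  This is only an illustrative sketch; the function names are
hypothetical, and how an encoding gets declared is left as a plain boolean.

```cpp
#include <string_view>

// True if any byte has its high bit set, i.e. lies outside 7-bit ASCII.
constexpr bool has_high_bit_byte(std::string_view source) {
    for (const char c : source) {
        if (static_cast<unsigned char>(c) & 0x80u) return true;
    }
    return false;
}

// Hypothetical policy: accept the file if an encoding was declared, or if
// it is pure 7-bit ASCII; otherwise refuse to compile it.
constexpr bool accept_source(std::string_view source, bool encoding_declared) {
    return encoding_declared || !has_high_bit_byte(source);
}
```

Note that a check like this presumes an ASCII-based default, which is
exactly what does not hold for the EBCDIC implementations mentioned above.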
> _______________________________________________
> SG16 Unicode mailing list
> Unicode_at_[hidden]
> http://www.open-std.org/mailman/listinfo/unicode
