sg16: Re: [SG16-Unicode] [wg14/wg21 liaison] [isocpp-core] Source file encoding (was: What is the proper term for the locale dependent run-time character set/encoding used for the character classification and conversion functions?)

From: Ed Catmur <ed_at_[hidden]>
Date: Thu, 15 Aug 2019 01:54:25 +0100

On 14 August 2019 23:18:45 Niall Douglas via Liaison
<liaison_at_[hidden]> wrote:

> On 14/08/2019 19:24, Billy O'Neal (VC LIBS) wrote:
>>> Far more importantly, if the committee can assume unicode-clean
>>> source code going forth, that makes far more tractable lots of other
>>> problems such as how char string literals ought to be interpreted.
>>
>> I don't think this actually matters for implementations. The standard
>> can describe what happens for Unicode and let implementations figure out
>> what that means for the legacy encodings they target. An implementation
>> on an EBCDIC machine, for example, can do an 'as if' notional conversion
>> into UTF-8 for the purposes of following the standard's rules.
>
> Just to be clear, I'm not referring to anything about implementation
> quality nor correctness wrt source files here. That all pretty much
> "just works" for each compiler, or rather, each compiler can be poked
> and prodded to just work eventually.
>
> I *am* speaking about the user experience, where if the standard insists
> on ASCII-only-if-otherwise-not-specified, then typing umlauts into the
> source code will yield a useful compiler error saying "Please add a
> #pragma encoding to tell me what encoding this source file is". Like
> with Python 2.

"Unless otherwise specified" is a loophole large enough to drive a bus
through. What's to prevent implementors from adding a mode that retains the
current behavior of inferring the source file encoding from the user's
environment, and making that mode the default? And how would you as a user
(i.e. the programmer) be able to tell the difference? Or rather, how can a
source file or header file tell what encoding the compiler thinks it has? If
the user misinforms the compiler as to the encoding of that source file, what
actually changes or breaks?

> Then because we always know the source file encoding, we can make other
> end user experience improvements. Most of the problems with encoding
> are, of course, the fact it isn't specified. This would fix that for one
> situation, which is C and C++ source code.

What improvements do you have in mind? I rather feel that following this
line might help to clarify things.

Note that the compiler already necessarily knows the source file encoding
and the execution encoding, in order to perform the translation phases in
[lex.phases]. Would it be enough, or at least help, to expose those to the
program, or at least the latter?
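
To illustrate what "exposing the latter" could look like today, here is a
rough sketch (not a proposal) that probes the encoding of narrow string
literals by checking how the compiler encoded U+00FC. The names
is_utf8_u_umlaut and narrow_literals_look_utf8 are made up for this
example; nothing like them exists in the standard. It assumes only that a
universal-character-name in a narrow literal is converted to the execution
encoding, and it can tell "looks like UTF-8" from "something else", no more.

```cpp
#include <cstddef>

// Hypothetical helper: true if the literal consists of exactly the two
// UTF-8 code units for U+00FC (0xC3 0xBC) followed by the terminating NUL.
template <std::size_t N>
constexpr bool is_utf8_u_umlaut(const char (&lit)[N]) {
    return N == 3
        && static_cast<unsigned char>(lit[0]) == 0xC3
        && static_cast<unsigned char>(lit[1]) == 0xBC;
}

// Hypothetical constant: the universal-character-name is independent of the
// source file encoding, so this observes only the encoding the compiler
// chose for narrow literals.
constexpr bool narrow_literals_look_utf8 = is_utf8_u_umlaut("\u00FC");
```

Because the probe is spelled with a universal-character-name, it says
nothing about what encoding the compiler assumed for the source file itself,
which is the part a program currently has no direct way to ask about.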

> That's my pitch. I pitch nothing regarding runtime encoding, which is a
> viper's nest, and will remain so for decades to come.
>
> Niall
>
> Link to this post: http://lists.isocpp.org/liaison/2019/08/0024.php

Received on 2019-08-15 03:01:28