On 5/6/21 3:22 PM, Thiago Macieira via SG16 wrote:

On Thursday, 6 May 2021 12:14:35 PDT Ville Voutilainen wrote:

Of course it does. It always has.

Thanks, Ville.

That was a strawman argument to show that the barrier to the feature can be 
unreasonably high, thus making it as good as useless. That is what I'd like to 
see fixed next.

Not only should there be an easy way to enable the UTF-8 support, it should be 
enabled by something in the source file itself, not something external to it.

My plan is to submit a paper that discusses the following possibilities:

1. A new pragma directive. There is existing practice in the form of IBM's #pragma filetag directive.
     #pragma encoding(encoding-name)
2. A magic comment. Very likely the Python encoding declaration.
     // -*- coding: <encoding-name> -*-
3. Use of a BOM.

In all three cases, the intent is that differently encoded source files will be usable within the same translation unit.
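
To make the BOM option concrete, here is a minimal sketch of how a BOM check on the leading bytes of a source file might look. The names BomEncoding and detect_bom are hypothetical, invented for this illustration; nothing here is taken from an existing implementation or from the planned paper. The sketch deliberately ignores UTF-32, whose little-endian BOM shares its first two bytes with UTF-16LE.

```cpp
#include <string_view>

// Hypothetical result type for this sketch; not from any real compiler.
enum class BomEncoding { None, Utf8, Utf16LE, Utf16BE };

// Inspect the first bytes of a source file and report which BOM, if any,
// is present. UTF-32 is intentionally not handled (assumption of the sketch).
BomEncoding detect_bom(std::string_view bytes) {
    if (bytes.size() >= 3 && bytes.substr(0, 3) == "\xEF\xBB\xBF") {
        return BomEncoding::Utf8;      // EF BB BF
    }
    if (bytes.size() >= 2 && bytes.substr(0, 2) == "\xFF\xFE") {
        return BomEncoding::Utf16LE;   // FF FE
    }
    if (bytes.size() >= 2 && bytes.substr(0, 2) == "\xFE\xFF") {
        return BomEncoding::Utf16BE;   // FE FF
    }
    return BomEncoding::None;          // no BOM: fall back to other means
}
```

Since the check is made per file, each header pulled into a translation unit could be examined independently of the file that includes it.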

In the first two cases, there will be restrictions regarding where in the file the encoding declaration may appear; e.g., it must be wholly contained within the first 4k bytes of the file. The paper will discuss how implementations with a default encoding that differs from the encoding specified by the encoding declaration will identify the declaration. This is really only relevant for ASCII-based vs EBCDIC-based concerns.
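
As a rough illustration of the "wholly contained within the first 4k bytes" restriction, the following sketch scans a bounded prefix of a file for a magic comment of the form shown above. The function name find_encoding_declaration, the 4096-byte default, and the exact matching rules (the "-*- coding:" opener, the closing "-*-" on the same line) are assumptions made for this example only. It also assumes an ASCII-compatible default encoding, so it does not address the EBCDIC question.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: look for "-*- coding: <name> -*-" inside the first
// `limit` bytes of `source`. Returns the encoding name, or nullopt if no
// complete declaration lies within that window.
std::optional<std::string> find_encoding_declaration(std::string_view source,
                                                     std::size_t limit = 4096) {
    // Only the leading window is examined; anything past it is ignored.
    std::string_view head = source.substr(0, limit);

    constexpr std::string_view open = "-*- coding:";
    constexpr std::string_view close = "-*-";

    std::size_t start = head.find(open);
    if (start == std::string_view::npos) return std::nullopt;

    std::size_t name_begin = start + open.size();
    std::size_t name_end = head.find(close, name_begin);
    // A declaration cut off by the window boundary is treated as absent.
    if (name_end == std::string_view::npos) return std::nullopt;

    std::string_view name = head.substr(name_begin, name_end - name_begin);
    // The declaration must not span lines (assumption of this sketch).
    if (name.find('\n') != std::string_view::npos) return std::nullopt;

    // Trim surrounding whitespace from the encoding name.
    while (!name.empty() && std::isspace(static_cast<unsigned char>(name.front()))) {
        name.remove_prefix(1);
    }
    while (!name.empty() && std::isspace(static_cast<unsigned char>(name.back()))) {
        name.remove_suffix(1);
    }
    if (name.empty()) return std::nullopt;
    return std::string(name);
}
```

For a file beginning with "// -*- coding: utf-8 -*-", the function yields "utf-8"; a declaration that starts beyond the window, or is split by its boundary, is reported as missing, and the implementation would then fall back to whatever other means it has.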

My present intent is to propose the magic comment solution since it avoids the but-my-compiler-warns-about-unrecognized-pragmas-even-though-it-shouldn't issue. Per Corentin's paper, implementations will still be able to rely on a command line option, BOM, pragma directive, filesystem metadata, whatever, to determine an encoding in the absence of an encoding declaration. The paper will also discuss the what-if-the-encoding-declaration-doesn't-match-the-actual-file-encoding issue (UB of course).

Tom.