Re: [SG16] P2295R3 Support for UTF-8 as a portable source file encoding

From: Tom Honermann <tom_at_[hidden]>
Date: Thu, 6 May 2021 18:38:11 -0400
On 5/6/21 3:22 PM, Thiago Macieira via SG16 wrote:
> On Thursday, 6 May 2021 12:14:35 PDT Ville Voutilainen wrote:
>> Of course it does. It always has.
> Thanks, Ville.
>
> That was a strawman argument to show that the barrier to the feature can be
> unreasonably high, thus making it as good as useless. That is what I'd like to
> see fixed next.
>
> Not only should there be an easy way to enable the UTF-8 support, it should be
> enabled by something in the source file itself, not something external to it.
>
My plan is to submit a paper that discusses the following possibilities:

  * A new pragma directive. There is existing practice in the form of
    IBM's #pragma filetag directive
    <https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.3.0/com.ibm.zos.v2r3.cbclx01/zos_pragma_filetag.htm>.
    #pragma encoding(encoding-name)
  * A magic comment, very likely in the form of the Python encoding declaration
    <https://docs.python.org/3/reference/lexical_analysis.html#encoding-declarations>.
    // -*- coding: <encoding-name> -*-
  * Use of a BOM

In all three cases, the intent is that differently encoded source files
will be usable within the same translation unit.
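As a rough illustration of the BOM option, a per-file check is what would let each file in a translation unit be decoded independently. The sketch below assumes the file's leading bytes are already in memory; has_utf8_bom is a hypothetical helper name, not something from any paper or implementation.

```cpp
#include <string_view>

// Hypothetical helper: reports whether a buffer starts with the UTF-8
// byte order mark (the three bytes EF BB BF).
bool has_utf8_bom(std::string_view bytes) {
    // substr clamps the count, so buffers shorter than 3 bytes are safe.
    return bytes.substr(0, 3) == "\xEF\xBB\xBF";
}
```

A file without the mark says nothing about its encoding, so a check like this could only ever be one input among several.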

In the first two cases, there will be restrictions regarding where in
the file the encoding declaration may appear; e.g., it must be wholly
contained within the first 4k bytes of the file. The paper will discuss how
implementations with a default encoding that differs from the encoding
specified by the encoding declaration will identify the declaration.
This is really only relevant for ASCII-based vs EBCDIC-based concerns.

My present intent is to propose the magic comment solution since it
avoids the
but-my-compiler-warns-about-unrecognized-pragmas-even-though-it-shouldn't
issue. Per Corentin's paper, implementations will still be able to rely
on a command line option, BOM, pragma directive, filesystem metadata,
whatever, to determine an encoding in the absence of an encoding
declaration. The paper will also discuss the
what-if-the-encoding-declaration-doesn't-match-the-actual-file-encoding
issue (UB of course).

Tom.


Received on 2021-05-06 17:38:14