sg16: Re: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature

From: Tom Honermann <tom_at_[hidden]>
Date: Sun, 18 Oct 2020 12:59:13 -0400

On 10/18/20 12:44 PM, Thiago Macieira wrote:
> On Sunday, 18 October 2020 09:29:59 PDT Tom Honermann wrote:
>> I see it a little differently. I think the marker is most useful to enable
>> use of UTF-8 projects by non-UTF-8 projects that can't move to UTF-8 for
>> "reasons". Such projects could use the marker in a few ways: 1) to mark
>> their own non-UTF-8 source files, 2) to mark UTF-8 source files, or 3) to
>> mark all source files. I see little motivation for the third option;
>> defaults are useful. Likewise, the second option competes with the
>> long-term direction to make UTF-8 the default (thanks to Corentin for
>> making the point that marking UTF-8 files would continue the C++ trend of
>> getting the defaults wrong), so it would not be considered a best
>> practice. It could still make sense for projects that have "ANSI" or
>> EBCDIC encoded source files and complicated non-UTF-8 build/CI/CD
>> processes, but where those processes are less complicated for third-party
>> dependencies.
>> Option 1 is then left as best practice.
> That I agree with.
>
> To be clear what I'm agreeing with: UTF-8 source files aren't generally marked
> in any special way, but non-UTF-8 source files get a marker that indicates
> their actual encoding. That is aligned with the general direction of not using
> BOM and of defaulting to UTF-8 unless otherwise instructed.
Good, we're on the same page there.
>
> Corollary:
>
> Said marker must be a no-op for backwards compatibility as we're talking about
> tools and environments that "for reasons" can't update to tools that can handle
> UTF-8 properly. That means we can't have a C++ attribute:
> [[encoding: cp1252]]
>
> Ditto for a pragma:
>
> #pragma encoding "cp1252"
>
> That leaves out-of-band markers (filesystem metadata) or specially-formatted
> comments.

Mostly agreed. The standard states that unrecognized pragma directives
<http://eel.is/c++draft/cpp.pragma#1> and unrecognized attributes
<http://eel.is/c++draft/dcl.attr#grammar-6.sentence-2> are to be
ignored, but we know that doesn't reflect existing practice. That
doesn't mean that we can't use those; it means that they would have to
be guarded by a predefined macro:

    #if defined(__cpp_encoding_attribute)
    [[ encoding: cp1252 ]];
    #endif

or

    #if defined(__cpp_encoding_pragma)
    #pragma encoding "cp1252"
    #endif

That would complicate both the implementability and usability of the
feature, so I wouldn't expect WG21 to want to go that route.
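
For comparison, the specially-formatted comment alternative mentioned above
needs no guard at all, since older tools simply skip the comment. Below is a
rough sketch of how a tool might detect such a marker. Everything in it is an
assumption made for illustration: the marker format (`// encoding: <name>` on
the first line of the file) and the function name `find_encoding_marker` are
hypothetical and not taken from any proposal or existing tool.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical marker format (an assumption, not from any standard or tool):
// a line comment of the form "// encoding: <name>" on the first line.
// Returns the encoding name, or std::nullopt if no marker is present.
std::optional<std::string> find_encoding_marker(std::string_view source) {
    // Only the first line of the file is inspected.
    std::string_view line = source.substr(0, source.find('\n'));

    constexpr std::string_view prefix = "// encoding:";
    if (line.substr(0, prefix.size()) != prefix) {
        return std::nullopt;  // no marker: the default encoding applies
    }
    line.remove_prefix(prefix.size());

    // Trim surrounding spaces, tabs, and a trailing carriage return.
    const auto first = line.find_first_not_of(" \t\r");
    if (first == std::string_view::npos) {
        return std::nullopt;  // marker present but names no encoding
    }
    const auto last = line.find_last_not_of(" \t\r");
    return std::string(line.substr(first, last - first + 1));
}
```

With this sketch, a file beginning with `// encoding: cp1252` yields "cp1252",
while an unmarked file yields no value and would be treated as UTF-8 under the
default discussed above.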

Tom.

Received on 2020-10-18 11:59:17