sg16: Re: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature

From: Thiago Macieira <thiago_at_[hidden]>
Date: Sun, 18 Oct 2020 09:44:23 -0700

On Sunday, 18 October 2020 09:29:59 PDT Tom Honermann wrote:
> I see it a little differently. I think the marker is most useful to enable
> use of UTF-8 projects by non-UTF-8 projects that can't move to UTF-8 for
> "reasons". Such projects could use the marker in a few ways; 1) to mark
> their own non-UTF-8 source files, 2) to mark UTF-8 source files, or 3) to
> mark all source files. I see little motivation for the third option;
> defaults are useful. Likewise, the second option competes with the
> long-term direction to make UTF-8 the default (thanks to Corentin for
> making the point that marking UTF-8 files would continue the C++ trend of
> getting the defaults wrong), so that would not be considered a best
> practice, but could make sense for projects that have "ANSI" or EBCDIC
> encoded source files and complicated non-UTF-8 build/CI/CD processes, but
> where those processes are less complicated for third party dependencies.
> Option 1 is then left as best practice.

That I agree with.

To be clear what I'm agreeing with: UTF-8 source files aren't generally marked
in any special way, but non-UTF-8 source files get a marker that indicates
their actual encoding. That is aligned with the general direction of not using
BOM and of defaulting to UTF-8 unless otherwise instructed.

Corollary:

Said marker must be a no-op for backwards compatibility, as we're talking about
tools and environments that "for reasons" can't update to tools that can handle
UTF-8 properly. That means we can't have a C++ attribute:

[[encoding: cp1252]]

Ditto for a pragma:

#pragma encoding "cp1252"

That leaves out-of-band markers (filesystem metadata) or specially-formatted
comments.

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel DPG Cloud Engineering

Received on 2020-10-18 11:44:29