Subject: Re: Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature
From: Thiago Macieira (thiago_at_[hidden])
Date: 2020-10-18 11:44:23
On Sunday, 18 October 2020 09:29:59 PDT Tom Honermann wrote:
> I see it a little differently. I think the marker is most useful to enable
> use of UTF-8 projects by non-UTF-8 projects that can't move to UTF-8 for
> "reasons". Such projects could use the marker in a few ways; 1) to mark
> their own non-UTF-8 source files, 2) to mark UTF-8 source files, or 3) to
> mark all source files. I see little motivation for the third option;
> defaults are useful. Likewise, the second option competes with the
> long-term direction to make UTF-8 the default (thanks to Corentin for
> making the point that marking UTF-8 files would continue the C++ trend of
> getting the defaults wrong), so that would not be considered a best
> practice, but could make sense for projects that have "ANSI" or EBCDIC
> encoded source files and complicated non-UTF-8 build/CI/CD processes, but
> where those processes are less complicated for third party dependencies.
> Option 1 is then left as best practice.
That I agree with.
To be clear what I'm agreeing with: UTF-8 source files aren't generally marked
in any special way, but non-UTF-8 source files get a marker that indicates
their actual encoding. That is aligned with the general direction of not using
BOM and of defaulting to UTF-8 unless otherwise instructed.
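A minimal sketch of that rule, from the point of view of a tool reading a
source file. The name effective_encoding is hypothetical, and how the marker
is stored or parsed is deliberately left out; the only thing shown is the
default:

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical helper: decide which encoding a tool should assume for a
// source file. `marker` is the encoding named by the file's marker, if the
// file has one; extracting the marker is out of scope for this sketch.
std::string effective_encoding(const std::optional<std::string>& marker) {
    // Assumption: a marker, when present, must actually name an encoding.
    assert(!marker || !marker->empty());
    // Unmarked files are treated as UTF-8; marked files say what they are.
    return marker.value_or("UTF-8");
}
```

With this shape, UTF-8 files need no marker at all, and only the legacy files
carry one.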
Said marker must be a no-op for backwards compatibility, as we're talking about
tools and environments that "for reasons" can't update to tools that can handle
UTF-8 properly. That means we can't have a C++ attribute:
Ditto for a pragma:
#pragma encoding "cp1252"
That leaves out-of-band markers (filesystem metadata) or specially-formatted
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel DPG Cloud Engineering