sg16: Re: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature

From: Thiago Macieira <thiago_at_[hidden]>
Date: Fri, 16 Oct 2020 08:44:58 -0700

On Tuesday, 13 October 2020 22:16:14 PDT Tom Honermann via SG16 wrote:
> > Everyone already knows the best practice: “Use UTF-8”. Any
> > resources/effort is going to be getting toward that best practice, not
> > edge cases of legacy behaviors that are offshoots of something that
> > isn’t the desired end state of “use UTF-8”.
>
> My goal is exactly to ease migration to that end state. We can't
> reasonably synchronize a migration of all C++ projects to UTF-8. To get
> to that end state, we'll have to enable C++ projects to independently
> transition to UTF-8.

I think we can. We just need critical mass.

The status quo has remained because there has been nothing forcing a change to
it. Yes, there are a lot of old codebases that, for example, might have
comments written in Chinese or Finnish or something else. But nothing has
forced those to update. If the critical mass of software is UTF-8, that will
force those codebases to recode. And unlike Microsoft's fixing of their own
SDK header files to comply with the language, this is a simple recode
operation. It can be done by downstream users, with little to no danger.
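
To illustrate how small that recode step can be, here is a minimal sketch. It
assumes the legacy encoding is ISO-8859-1 (plausible for Finnish comments, not
for Chinese ones); the function name latin1_to_utf8 is made up for this
example, and any other legacy encoding would need a real conversion tool such
as iconv or ICU.

```cpp
#include <string>

// Hypothetical helper: recode ISO-8859-1 (Latin-1) bytes to UTF-8.
// Assumption: the input really is Latin-1. Every Latin-1 byte value equals
// its Unicode code point, so bytes >= 0x80 become a two-byte UTF-8 sequence.
std::string latin1_to_utf8(const std::string& in)
{
    std::string out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        if (c < 0x80) {
            // ASCII is identical in both encodings.
            out.push_back(static_cast<char>(c));
        } else {
            // U+0080..U+00FF: lead byte 110xxxxx, continuation byte 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

Reading a whole source file into a std::string, passing it through this
function and writing the result back is the entire operation; ASCII-only files
come out byte-for-byte unchanged.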

For my (Qt's) part, we're already doing it: Qt 6.0 is UTF-8, uses UTF-8 in its
headers, which have no BOM markers, assumes char is UTF-8 and inserts the MSVC
option "/utf-8" by default in the buildsystem.

> Use of a BOM would be one way to get to that desired end state but, as
> you mentioned, a BOM isn't a great way to identify UTF-8 data. The
> Unicode standard already admits this with the quoted "not recommended"
> text, but it lacks the rationale to defend that recommendation or to
> explain when it may be appropriate to disregard that recommendation. My
> goal with this paper is to fill that hole. If you don't care for how
> I've proposed it to be filled, that is certainly ok and alternative
> suggestions are welcome.

Your paper discourages the use of BOM, as I think it should. It does not say
what solution tooling should adopt to figure out whether a source is UTF-8 or
not.

Or, if I read it differently, we can apply the Sherlock Holmes solution: when
you remove the impossible, whatever remains, however improbable, must be the
truth. We don't have any mechanism for tagging files out-of-band to indicate
they are UTF-8. Therefore, the only solution is an in-band marker: the BOM. Is
that what you're proposing?
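
For concreteness, the in-band check a tool would have to make is tiny. The
sketch below is only an illustration: the name has_utf8_bom is hypothetical,
and it is not taken from the paper or from any particular compiler. It tests
whether a buffer starts with the UTF-8 encoding of U+FEFF, the bytes EF BB BF.

```cpp
#include <string>

// Hypothetical helper: does the buffer start with the UTF-8 signature?
// U+FEFF encoded as UTF-8 is the three bytes EF BB BF.
bool has_utf8_bom(const std::string& bytes)
{
    return bytes.size() >= 3
        && static_cast<unsigned char>(bytes[0]) == 0xEF
        && static_cast<unsigned char>(bytes[1]) == 0xBB
        && static_cast<unsigned char>(bytes[2]) == 0xBF;
}
```

Note that the absence of those three bytes tells the tool nothing: a file
without a BOM may still be UTF-8, which is exactly the case for the Qt headers
mentioned above.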

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel DPG Cloud Engineering

Received on 2020-10-16 10:45:54