sg16: Re: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature

From: Victor Zverovich <victor.zverovich_at_[hidden]>
Date: Fri, 16 Oct 2020 09:09:45 -0700

> I think we can. We just need critical mass.

+1

As another data point we (Facebook) also have a massive body of code that
already assumes char strings are UTF-8.

- Victor

On Fri, Oct 16, 2020 at 8:46 AM Thiago Macieira via SG16 <
sg16_at_[hidden]> wrote:

> On Tuesday, 13 October 2020 22:16:14 PDT Tom Honermann via SG16 wrote:
> > > Everyone already knows the best practice: “Use UTF-8”. Any
> > > resources/effort is going to be getting toward that best practice, not
> > > edge cases of legacy behaviors that are offshoots of something that
> > > isn’t the desired end state of “use UTF-8”.
> >
> > My goal is exactly to ease migration to that end state. We can't
> > reasonably synchronize a migration of all C++ projects to UTF-8. To get
> > to that end state, we'll have to enable C++ projects to independently
> > transition to UTF-8.
>
> I think we can. We just need critical mass.
>
> The status quo has remained because there has been nothing forcing a
> change to the status quo. Yes, there are a lot of old codebases that, for
> example, might have comments written in Chinese or Finnish or something
> else. But nothing has forced those to update. If the critical mass of
> software is UTF-8, that will force those codebases to recode. And unlike
> Microsoft's fixing of their own SDK header files to comply with the
> language, this is a simple recode operation. It can be done by downstream
> users, with little to no danger.
>
> For my (Qt's) part, we're already doing it: Qt 6.0 is UTF-8, uses UTF-8 in
> its headers, which have no BOM markers, assumes char is UTF-8 and inserts
> MSVC option "/utf-8" by default in the buildsystem.
>
> > Use of a BOM would be one way to get to that desired end state but, as
> > you mentioned, a BOM isn't a great way to identify UTF-8 data. The
> > Unicode standard already admits this with the quoted "not recommended"
> > text, but it lacks the rationale to defend that recommendation or to
> > explain when it may be appropriate to disregard that recommendation. My
> > goal with this paper is to fill that hole. If you don't care for how
> > I've proposed it to be filled, that is certainly ok and alternative
> > suggestions are welcome.
>
> Your paper discourages the use of BOM, as I think it should. It does not
> say what solution tooling should adopt to figure out whether a source is
> UTF-8 or not.
>
> Or, if I read it differently, we can apply the Sherlock Holmes solution:
> when you remove the impossible, whatever remains, however improbable, must
> be the truth. We don't have any mechanism for tagging files out-of-band to
> indicate they are UTF-8. Therefore, the only solution is an in-band marker:
> the BOM. Is that what you're proposing?
>
> --
> Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
> Software Architect - Intel DPG Cloud Engineering
>

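For readers unfamiliar with the in-band marker discussed in the quoted message: the UTF-8 BOM is the three bytes EF BB BF (the UTF-8 encoding of U+FEFF) at the very start of a file. A minimal sketch of how a tool might check for it is below. The names utf8_bom, has_utf8_bom and strip_utf8_bom are hypothetical, chosen for illustration; they are not taken from the paper, Qt, or any compiler. The sketch assumes C++17 for std::string_view.

```cpp
#include <string_view>

// UTF-8 encoding of U+FEFF (the byte order mark): EF BB BF.
// Hypothetical name, used only in this sketch.
inline constexpr std::string_view utf8_bom = "\xEF\xBB\xBF";

// Returns true if the buffer begins with the UTF-8 BOM.
// substr(0, n) is safe on buffers shorter than n: it returns the whole buffer.
constexpr bool has_utf8_bom(std::string_view bytes) {
    return bytes.substr(0, utf8_bom.size()) == utf8_bom;
}

// Returns a view of the buffer with a leading UTF-8 BOM removed, if present.
constexpr std::string_view strip_utf8_bom(std::string_view bytes) {
    if (has_utf8_bom(bytes)) {
        bytes.remove_prefix(utf8_bom.size());
    }
    return bytes;
}
```

Note that the check only works in one direction: a leading BOM is a strong hint that the file is UTF-8, but its absence says nothing either way, which is the gap the thread is arguing about.
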
Received on 2020-10-16 11:09:58