sg16: Re: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature

From: Tom Honermann <tom_at_[hidden]>
Date: Fri, 16 Oct 2020 17:09:59 -0400

On 10/16/20 11:44 AM, Thiago Macieira via SG16 wrote:
> On Tuesday, 13 October 2020 22:16:14 PDT Tom Honermann via SG16 wrote:
>>> Everyone already knows the best practice: “Use UTF-8”. Any
>>> resources/effort is going to be getting toward that best practice, not
>>> edge cases of legacy behaviors that are offshoots of something that
>>> isn’t the desired end state of “use UTF-8”.
>> My goal is exactly to ease migration to that end state. We can't
>> reasonably synchronize a migration of all C++ projects to UTF-8. To get
>> to that end state, we'll have to enable C++ projects to independently
>> transition to UTF-8.
> I think we can. We just need critical mass.
>
> The status quo has remained because there has been nothing forcing a change to
> status quo. Yes, there's a lot of old codebase that, for example, might have
> comments written in Chinese or Finnish or something else. But nothing has
> forced those to update. If the critical mass of software is UTF-8, that will
> force those codebases to recode. And unlike Microsoft's fixing of their own
> SDK header files to comply with the language, this is a simple recode
> operation. It can be done by downstream users, with little to no danger.
>
> For my (Qt's) part, we're already doing it: Qt 6.0 is UTF-8, uses UTF-8 in its
> headers, which have no BOM markers, assumes char is UTF-8 and inserts MSVC
> option "/utf-8" by default in the buildsystem.
>
>> Use of a BOM would be one way to get to that desired end state but, as
>> you mentioned, a BOM isn't a great way to identify UTF-8 data. The
>> Unicode standard already admits this with the quoted "not recommended"
>> text, but it lacks the rationale to defend that recommendation or to
>> explain when it may be appropriate to disregard that recommendation. My
>> goal with this paper is to fill that hole. If you don't care for how
>> I've proposed it to be filled, that is certainly ok and alternative
>> suggestions are welcome.
> Your paper discourages the use of BOM, as I think it should. It does not say
> what solution tooling should adopt to figure out whether a source is UTF-8 or not.
That is a different paper that I have yet to finish.
>
> Or, if I read it differently, we can apply the Sherlock Holmes solution: when
> you remove the impossible, whatever remains, however improbable, must be the
> truth. We don't have any mechanism for tagging files out-of-band to indicate
> they are UTF-8. Therefore, the only solution is an in-band marker: the BOM. Is
> that what you're proposing?
>
No, the goal of this paper is solely to clarify the UTC's position on
use of a BOM in UTF-8. Based on both private and public responses I've
received, there is an apparent lack of agreement on what the Unicode
standard states.

My response to Shawn Steele listed the options up for consideration;
copied below. I haven't settled on a recommendation yet, but lean
towards #3 with room for implementation-defined behavior to make use of
the other three.

1. Use of a BOM to indicate UTF-8 encoded source files. This matches
    existing practice for the Microsoft compiler.
2. Use of a #pragma. This matches existing practice
    <https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.3.0/com.ibm.zos.v2r3.cbclx01/zos_pragma_filetag.htm>
    for the IBM compiler.
3. Use of a "magic" or "semantic" comment. This matches existing
    practice
    <https://docs.python.org/3/reference/lexical_analysis.html#encoding-declarations>
    in Python.
4. Use of filesystem metadata. This is an option for some compilers
    and is being considered for Clang on z/OS.
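
As a rough illustration of how options 1 and 3 could coexist in a tool,
here is a minimal sketch, written in Python for brevity. It is not taken
from the paper or from any compiler. The function name
detect_source_encoding, the "coding" keyword, and the limit to the first
two lines are assumptions loosely borrowed from Python's
encoding-declaration convention; checking the BOM before the comment is
likewise just one possible choice.

```python
import re

# The three bytes that U+FEFF encodes to in UTF-8.
UTF8_BOM = b"\xef\xbb\xbf"

# Hypothetical "magic comment" for C++ sources, modeled on Python's
# encoding declaration: a // or /* comment containing "coding:" or
# "coding=" followed by an encoding name.
MAGIC_COMMENT = re.compile(rb"^[ \t\f]*(?://|/\*).*?coding[:=][ \t]*([-\w.]+)")


def detect_source_encoding(data, default=None):
    """Return (how, encoding) for the raw bytes of a source file.

    how is "bom", "comment", or None when no in-band marker is found,
    in which case encoding is the caller-supplied default.
    """
    # Option 1: a leading BOM acts as a UTF-8 signature.
    if data.startswith(UTF8_BOM):
        return ("bom", "utf-8")
    # Option 3: look for a magic comment on the first two lines only
    # (an assumed rule, mirroring where Python looks).
    for line in data.splitlines()[:2]:
        match = MAGIC_COMMENT.match(line)
        if match:
            return ("comment", match.group(1).decode("ascii").lower())
    # No in-band marker: defer to an out-of-band default such as a
    # compiler option or filesystem metadata.
    return (None, default)
```

A caller would pass the raw bytes of the file plus whatever default the
implementation already uses, for example
detect_source_encoding(raw_bytes, default="windows-1252").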

Tom.

Received on 2020-10-16 16:10:04