SG16

Subject: Re: Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature
From: Tom Honermann (tom_at_[hidden])
Date: 2020-10-13 09:22:42


On 10/12/20 10:43 AM, Niall Douglas via SG16 wrote:
> 1. Unicode BOM is a perfectly legal codepoint which can appear
> anywhere in a text.
Correct [1], but this paper concerns guidelines for use, not what is
legal text.
>
> 2. For me, if the input has a BOM, I'm requesting that the output
> should have a BOM. All that well designed applications need to do is
> ignore BOM. Your rules below seem to overly favour specially treating
> BOM, in my opinion, when passthrough and ignoring it, if it is or is
> not present, seem to me better.

That seems reasonable for pass-through protocols, such as a file
posting/copying service.  How does the following update sound?

    Protocol designers:

      * If possible, mandate use of UTF-8 without a BOM; diagnose the
        presence of a BOM in consumed text as an error, and produce text
        without a BOM.
      * Otherwise, if possible, mandate use of UTF-8 with or without a
        BOM; accept and discard a BOM in consumed text, produce text
        without a BOM, **and preserve a BOM when copying text**.
      * Otherwise, if possible, use UTF-8 as the default encoding with
        use of other encodings negotiated using information other than a
        BOM; accept and discard a BOM in consumed text, produce text
        without a BOM, **and preserve a BOM when copying text**.
      * Otherwise, require the presence of a BOM to differentiate UTF-8
        encoded text in both consumed and produced text unless the
        absence of a BOM would result in the text being interpreted as
        an ASCII-based encoding and the UTF-8 text contains no non-ASCII
        characters (the exception is intended to avoid the addition of a
        BOM to ASCII text thus rendering such text as non-ASCII). This
        approach should be reserved for scenarios in which UTF-8 cannot
        be adopted as a default due to backward compatibility concerns.
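To make the tiers above concrete, here is a minimal C++17 sketch of how an implementation might apply them. All names (BomPolicy, consume, produce, copy_through) are hypothetical and are not part of the draft. The second and third bullets differ only in how the encoding is chosen, so they share one policy value here. An empty result from consume means "not accepted as UTF-8 under this policy": an error under the first tier, a fall back to the legacy encoding under the fourth.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical policy names; one value per distinct BOM behavior in the list.
enum class BomPolicy {
    Forbid,             // bullet 1: UTF-8 only, a BOM is diagnosed as an error
    Discard,            // bullets 2 and 3: accept and discard a BOM
    RequireForNonAscii  // bullet 4: BOM required unless the text is pure ASCII
};

constexpr std::string_view utf8_bom = "\xEF\xBB\xBF";  // U+FEFF encoded as UTF-8

bool has_bom(std::string_view text) {
    return text.substr(0, utf8_bom.size()) == utf8_bom;
}

bool is_ascii(std::string_view text) {
    for (unsigned char c : text) {
        if (c > 0x7F) return false;
    }
    return true;
}

// Returns the text without a BOM, or std::nullopt if the text is not
// accepted as UTF-8 under the given policy.
std::optional<std::string_view> consume(std::string_view text, BomPolicy policy) {
    const bool bom = has_bom(text);
    switch (policy) {
    case BomPolicy::Forbid:
        if (bom) return std::nullopt;
        return text;
    case BomPolicy::Discard:
        return bom ? text.substr(utf8_bom.size()) : text;
    case BomPolicy::RequireForNonAscii:
        if (bom) return text.substr(utf8_bom.size());
        // Without a BOM only ASCII text reads the same in either encoding.
        if (is_ascii(text)) return text;
        return std::nullopt;
    }
    return std::nullopt;
}

// Produces text: a BOM is added only when the fourth bullet requires it.
std::string produce(std::string_view payload, BomPolicy policy) {
    std::string out;
    if (policy == BomPolicy::RequireForNonAscii && !is_ascii(payload)) {
        out.append(utf8_bom);
    }
    out.append(payload);
    return out;
}

// Copying is pass-through: the bytes, including any BOM, are preserved.
std::string copy_through(std::string_view text) {
    return std::string(text);
}
```

Keeping copy_through separate from consume and produce reflects the distinction between interpreting text and merely relaying it.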

Tom.

[1]:   Pedantic: U+FEFF is allowed anywhere in a text, but is only a BOM
if it is the first code point in the text.  A U+FEFF that is not a BOM
is a ZWNBSP character.  U+2060 WORD JOINER should be used instead of
U+FEFF in modern text.
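
The distinction in this footnote can be sketched in C++17 as follows. The names FeffReport and classify_feff are hypothetical, and the sketch assumes well-formed UTF-8 input, in which the byte sequence EF BB BF can only be the encoding of U+FEFF.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical report type: how each U+FEFF in the text is classified.
struct FeffReport {
    bool leading_bom = false;         // U+FEFF as the first code point
    std::size_t interior_zwnbsp = 0;  // U+FEFF anywhere else (ZWNBSP)
};

// Assumes well-formed UTF-8, so a byte-level search cannot match
// in the middle of another code point's encoding.
FeffReport classify_feff(std::string_view utf8) {
    constexpr std::string_view feff = "\xEF\xBB\xBF";
    FeffReport report;
    std::size_t pos = utf8.find(feff);
    while (pos != std::string_view::npos) {
        if (pos == 0) {
            report.leading_bom = true;
        } else {
            ++report.interior_zwnbsp;
        }
        pos = utf8.find(feff, pos + feff.size());
    }
    return report;
}
```

Only a leading occurrence is a candidate for the accept-and-discard handling; interior occurrences are ordinary text content.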

>
> Niall
>
> On 12/10/2020 15:02, Tom Honermann via SG16 wrote:
>> Great, here is the change I'm making to address this:
>>
>>     Protocol designers:
>>
>>       * If possible, mandate use of UTF-8 without a BOM; diagnose the
>>         presence of a BOM in consumed text as an error, and produce text
>>         without a BOM.
>>       * Otherwise, if possible, mandate use of UTF-8 with or without a
>>         BOM; accept and discard a BOM in consumed text, and produce text
>>         without a BOM.
>>       * Otherwise, if possible, use UTF-8 as the default encoding with
>>         use of other encodings negotiated using information other than a
>>         BOM; accept and discard a BOM in consumed text, and produce text
>>         without a BOM.
>>       * Otherwise, require the presence of a BOM to differentiate UTF-8
>>         encoded text in both consumed and produced text *unless the
>>         absence of a BOM would result in the text being interpreted as
>>         an ASCII-based encoding and the UTF-8 text contains no non-ASCII
>>         characters (the exception is intended to avoid the addition of a
>>         BOM to ASCII text thus rendering such text as non-ASCII)*. This
>>         approach should be reserved for scenarios in which UTF-8 cannot
>>         be adopted as a default due to backward compatibility concerns.
>>
>> Tom.
>>
>> On 10/12/20 8:40 AM, Alisdair Meredith wrote:
>>> That addresses my main concern.  Essentially, best practice (for
>>> UTF-8) would be no BOM unless the document contains code points that
>>> require multiple code units to express.
>>>
>>> AlisdairM
>>>
>>>> On Oct 11, 2020, at 23:22, Tom Honermann <tom_at_[hidden]> wrote:
>>>>
>>>> On 10/10/20 7:58 PM, Alisdair Meredith via SG16 wrote:
>>>>> One concern I have, that might lead into rationale for the current
>>>>> discouragement,
>>>>> is that I would hate to see a best practice that pushes a BOM into
>>>>> ASCII files.
>>>>> One of the nice properties of UTF-8 is that a valid ASCII file
>>>>> (still very common) is
>>>>> also a valid UTF-8 file.  Changing best practice would encourage
>>>>> updating those
>>>>> files to be no longer ASCII.
>>>>
>>>> Thanks, Alisdair.  I think that concern is implicitly addressed by
>>>> the suggested resolutions, but perhaps that can be made more
>>>> clear.  One possibility would be to modify the "protocol designer"
>>>> guidelines to address the case where a protocol's default encoding
>>>> is ASCII based and to specify that a BOM is only required for UTF-8
>>>> text that contains non-ASCII characters.  Would that be helpful?
>>>>
>>>> Tom.
>>>>
>>>>>
>>>>> AlisdairM
>>>>>
>>>>>> On Oct 10, 2020, at 14:54, Tom Honermann via SG16
>>>>>> <sg16_at_[hidden]> wrote:
>>>>>>
>>>>>> Attached is a draft proposal for the Unicode standard that
>>>>>> intends to clarify the current recommendation regarding use of a
>>>>>> BOM in UTF-8 text. This is a follow-up to discussion on the Unicode
>>>>>> mailing list
>>>>>> <https://corp.unicode.org/pipermail/unicode/2020-June/008713.html>
>>>>>> back in June.
>>>>>>
>>>>>> Feedback is welcome.  I plan to submit
>>>>>> <https://www.unicode.org/pending/docsubmit.html> this to the UTC
>>>>>> in a week or so pending review feedback.
>>>>>>
>>>>>> Tom.
>>>>>>
>>>>>> <Unicode-BOM-guidance.pdf>
>>>>>> --
>>>>>> SG16 mailing list
>>>>>> SG16_at_[hidden]
>>>>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>>>
>>>>>
>>>>
>>>
>>
>>



SG16 list run by sg16-owner@lists.isocpp.org