sg16: Re: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature

From: Niall Douglas <s_sourceforge_at_[hidden]>
Date: Mon, 12 Oct 2020 15:43:04 +0100

1. The Unicode BOM (U+FEFF) is a perfectly legal code point which can
appear anywhere in a text.

2. For me, if the input has a BOM, the output should have a BOM too. All
that well-designed applications need to do is ignore the BOM. In my
opinion, your rules below overly favour treating the BOM specially, when
passing it through and otherwise ignoring it, whether or not it is
present, seems better to me.
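
A minimal sketch of the passthrough behaviour described in point 2,
assuming Python. The helper names split_bom, join_bom and passthrough are
hypothetical and do not come from the draft proposal. Only a leading BOM
is treated as a signature; a U+FEFF elsewhere in the text is left alone,
as in point 1.

```python
import codecs
from typing import Callable, Tuple

# b'\xef\xbb\xbf', the UTF-8 encoding of U+FEFF
BOM = codecs.BOM_UTF8


def split_bom(data: bytes) -> Tuple[bool, str]:
    """Return (had_bom, text); only a BOM at the very start is dropped."""
    had_bom = data.startswith(BOM)
    body = data[len(BOM):] if had_bom else data
    return had_bom, body.decode("utf-8")


def join_bom(had_bom: bool, text: str) -> bytes:
    """Encode text as UTF-8, restoring the BOM if the input had one."""
    encoded = text.encode("utf-8")
    return BOM + encoded if had_bom else encoded


def passthrough(data: bytes, transform: Callable[[str], str] = str.upper) -> bytes:
    """Apply transform to the text while ignoring, but preserving, the BOM."""
    had_bom, text = split_bom(data)
    return join_bom(had_bom, transform(text))
```

With this sketch, input that starts with a BOM produces output that
starts with a BOM, and input without one produces output without one; the
transform itself never sees the signature.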

Niall

On 12/10/2020 15:02, Tom Honermann via SG16 wrote:
> Great, here is the change I'm making to address this:
>
> Protocol designers:
>
> * If possible, mandate use of UTF-8 without a BOM; diagnose the
> presence of a BOM in consumed text as an error, and produce text
> without a BOM.
> * Otherwise, if possible, mandate use of UTF-8 with or without a
> BOM; accept and discard a BOM in consumed text, and produce text
> without a BOM.
> * Otherwise, if possible, use UTF-8 as the default encoding with
> use of other encodings negotiated using information other than a
> BOM; accept and discard a BOM in consumed text, and produce text
> without a BOM.
> * Otherwise, require the presence of a BOM to differentiate UTF-8
> encoded text in both consumed and produced text *unless the
> absence of a BOM would result in the text being interpreted as
> an ASCII-based encoding and the UTF-8 text contains no non-ASCII
> characters (the exception is intended to avoid the addition of a
> BOM to ASCII text thus rendering such text as non-ASCII)*. This
> approach should be reserved for scenarios in which UTF-8 cannot
> be adopted as a default due to backward compatibility concerns.
>
> Tom.
>
> On 10/12/20 8:40 AM, Alisdair Meredith wrote:
>> That addresses my main concern. Essentially, best practice (for
>> UTF-8) would be no BOM unless the document contains code points that
>> require multiple code units to express.
>>
>> AlisdairM
>>
>>> On Oct 11, 2020, at 23:22, Tom Honermann <tom_at_[hidden]> wrote:
>>>
>>> On 10/10/20 7:58 PM, Alisdair Meredith via SG16 wrote:
>>>> One concern I have, that might lead into rationale for the current
>>>> discouragement, is that I would hate to see a best practice that
>>>> pushes a BOM into ASCII files. One of the nice properties of UTF-8
>>>> is that a valid ASCII file (still very common) is also a valid
>>>> UTF-8 file. Changing best practice would encourage updating those
>>>> files to be no longer ASCII.
>>>
>>> Thanks, Alisdair. I think that concern is implicitly addressed by
>>> the suggested resolutions, but perhaps that can be made more clear.
>>> One possibility would be to modify the "protocol designer" guidelines
>>> to address the case where a protocol's default encoding is ASCII
>>> based and to specify that a BOM is only required for UTF-8 text that
>>> contains non-ASCII characters. Would that be helpful?
>>>
>>> Tom.
>>>
>>>>
>>>> AlisdairM
>>>>
>>>>> On Oct 10, 2020, at 14:54, Tom Honermann via SG16
>>>>> <sg16_at_[hidden]> wrote:
>>>>>
>>>>> Attached is a draft proposal for the Unicode standard that intends
>>>>> to clarify the current recommendation regarding use of a BOM in
>>>>> UTF-8 text. This is a follow-up to discussion on the Unicode
>>>>> mailing list
>>>>> <https://corp.unicode.org/pipermail/unicode/2020-June/008713.html>
>>>>> back in June.
>>>>>
>>>>> Feedback is welcome. I plan to submit
>>>>> <https://www.unicode.org/pending/docsubmit.html> this to the UTC in
>>>>> a week or so pending review feedback.
>>>>>
>>>>> Tom.
>>>>>
>>>>> <Unicode-BOM-guidance.pdf>
>>>>> --
>>>>> SG16 mailing list
>>>>> SG16_at_[hidden]
>>>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16

Received on 2020-10-12 09:43:08