sg16: Re: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 13 Oct 2020 14:49:30 -0400

On 10/13/20 1:00 PM, Niall Douglas via SG16 wrote:
> On 13/10/2020 15:22, Tom Honermann wrote:
>
>>> 2. For me, if the input has a BOM, I'm requesting that the output
>>> should have a BOM. All that well designed applications need to do is
>>> ignore BOM. Your rules below seem to overly favour specially
>>> treating BOM, in my opinion, when passthrough and ignoring it, if it
>>> is or is not present, seem to me better.
>>
>> That seems reasonable for pass through protocols such as for a file
>> posting/copying service. How does the following update sound?
>
> I'm happy to defer to SG16's opinion on this if I'm an opinion of one.
> However, here would be my protocol recommendations:
>
> * U+FEFF is allowed anywhere in a UTF text, including at the very
> beginning.
>
> * Well designed UTF applications should offer complete functionality
> irrespective of whether input has a BOM or not.
>
> * The presence or not of a U+FEFF code point in UTF to UTF renditions
> should *as a default* be propagated if that makes sense to a use case
> e.g. UTF-8 input being rendered to a UTF-16 output simply passes
> through any U+FEFF code point, including if at the beginning.
>
> * In the case of UTF to non-UTF lossy renditions, the U+FEFF code
> point should be ignored during rendition.
>
> * If presented with narrow system encoded data, and there is no better
> way of disambiguating its encoding between non-UTF and UTF-8,
> implementations may take the presence of an initial U+FEFF code point
> as an unreliable hint that the remaining data is in UTF-8.
>
The above points are mostly outside the scope of this paper. Perhaps
that means the paper is not sufficiently clear regarding what is being
discussed. The paper is specifically about use of U+FEFF as a BOM in
UTF-8 and, more specifically, about its use as an encoding signature.

The term "protocol" is used loosely in the paper. It could be referring
to the format of text in memory, in a database field, in a portion of a
network packet, or a file used for a particular purpose. A BOM is not
appropriate in all such cases. For example, one would not, in general,
expect to have to check for a leading U+FEFF character each time a u8
string is consumed in a C++ program.
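
To illustrate the distinction, here is a minimal sketch, assuming C++17,
of handling the signature once at an input boundary (for example, right
after reading a file) so that code consuming the text afterwards never
needs to look for it. The helper name strip_utf8_signature is
hypothetical and does not come from the paper.

```cpp
#include <string_view>

// Hypothetical helper, not from the paper: returns the input without a
// leading UTF-8 encoded U+FEFF (bytes EF BB BF), if one is present.
// Only a single leading signature is removed; any later U+FEFF is kept.
inline std::string_view strip_utf8_signature(std::string_view bytes) {
    const std::string_view signature("\xEF\xBB\xBF");
    // substr clamps the count, so inputs shorter than 3 bytes are safe.
    if (bytes.substr(0, signature.size()) == signature) {
        bytes.remove_prefix(signature.size());
    }
    return bytes;
}
```

Under this assumption, only the code that reads external data calls
strip_utf8_signature; text already held in memory is taken to carry no
signature.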

>
> Now those are far weaker than what you proposed Tom. But my concern
> there is that there are non-UTF encodings where a leading 0xFE 0xFF
> two bytes is perfectly legal. For example, in Latin1, it means the top
> of the file starts with:
>
> þÿ
>
> That is a perfectly valid way of beginning a Latin1 file. I agree that
> the chances are extremely low that the remaining file is not UTF-8,
> but it's not about chances, it's about correctness.

Agreed, and this is one reason for discouraging use of a BOM in UTF-8 as
an encoding signature. This is why the suggested resolution encourages
such use only in legacy situations where there is no other means to
specify encoding. I.e., use of a BOM is a last resort.
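
A minimal sketch of that "last resort" ordering, assuming C++17: an
explicitly declared encoding always wins, and the signature is consulted
only when nothing else is known, with the result marked as a guess. For
reference, U+FEFF encoded in UTF-8 is the three bytes EF BB BF, which are
also a valid Latin-1 sequence; the two bytes FE FF are the UTF-16BE form.
The names EncodingChoice, has_utf8_signature, and choose_encoding are
hypothetical and do not come from the paper.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical result type: the chosen encoding name, plus a flag saying
// whether it was inferred rather than explicitly declared.
struct EncodingChoice {
    std::string name;
    bool is_guess;
};

// True when the data starts with the UTF-8 encoding of U+FEFF (EF BB BF).
inline bool has_utf8_signature(std::string_view bytes) {
    return bytes.substr(0, 3) == std::string_view("\xEF\xBB\xBF");
}

// Prefer a declared encoding; use the signature only as an unreliable
// hint; otherwise fall back to a caller-supplied default.
inline EncodingChoice choose_encoding(std::string_view bytes,
                                      const std::optional<std::string>& declared,
                                      const std::string& fallback) {
    if (declared) {
        return {*declared, false};
    }
    if (has_utf8_signature(bytes)) {
        return {"UTF-8", true};
    }
    return {fallback, true};
}
```

With this ordering, data declared as ISO-8859-1 that happens to begin
with EF BB BF is still treated as ISO-8859-1, which avoids the
misdetection described above.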

Tom.

>
> Niall

Received on 2020-10-13 13:49:32