sg16: Re: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature

From: Niall Douglas <s_sourceforge_at_[hidden]>
Date: Tue, 13 Oct 2020 18:00:17 +0100

On 13/10/2020 15:22, Tom Honermann wrote:

>> 2. For me, if the input has a BOM, I'm requesting that the output
>> should have a BOM. All that well designed applications need to do is
>> ignore BOM. Your rules below seem to overly favour specially treating
>> BOM, in my opinion, when passthrough and ignoring it, if it is or is
>> not present, seem to me better.
>
> That seems reasonable for pass through protocols such as for a file
> posting/copying service. How does the following update sound?

I'm happy to defer to SG16's opinion on this if I'm in a minority of one.
However, here would be my protocol recommendations:

* U+FEFF is allowed anywhere in a UTF text, including at the very beginning.

* Well-designed UTF applications should offer complete functionality
irrespective of whether input has a BOM or not.

* The presence or absence of a U+FEFF code point in UTF to UTF renditions
should *as a default* be propagated if that makes sense for the use case,
e.g. UTF-8 input being rendered to UTF-16 output simply passes through
any U+FEFF code point, including one at the beginning.

* In the case of UTF to non-UTF lossy renditions, the U+FEFF code point
should be ignored during rendition.

* If presented with narrow system encoded data, and there is no better
way of disambiguating its encoding between non-UTF and UTF-8,
implementations may take the presence of an initial U+FEFF code point as
an unreliable hint that the remaining data is in UTF-8.
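
To make the recommendations above concrete, here is a minimal Python sketch. The function names (`looks_like_utf8`, `utf8_to_utf16le`, `utf8_to_latin1`) are hypothetical and only illustrate the rules; Latin1 stands in for "some non-UTF encoding", and this is one possible reading of the list, not a reference implementation.

```python
import codecs

BOM = "\ufeff"  # U+FEFF as a decoded code point


def looks_like_utf8(data: bytes) -> bool:
    """Unreliable hint only: a leading EF BB BF suggests UTF-8, nothing more."""
    return data.startswith(codecs.BOM_UTF8)


def utf8_to_utf16le(data: bytes) -> bytes:
    """UTF to UTF: pass every U+FEFF through unchanged, including a leading one."""
    # Plain "utf-8" (not "utf-8-sig") keeps a leading U+FEFF in the decoded
    # text, and "utf-16-le" does not add a BOM of its own, so the output has
    # a BOM exactly when the input had one.
    return data.decode("utf-8").encode("utf-16-le")


def utf8_to_latin1(data: bytes) -> bytes:
    """UTF to non-UTF lossy rendition: ignore U+FEFF wherever it appears."""
    text = data.decode("utf-8").replace(BOM, "")
    # Code points outside Latin1 become "?" because the rendition is lossy.
    return text.encode("latin-1", errors="replace")
```

The caller is expected to treat `looks_like_utf8` as a last resort, used only when nothing better identifies the encoding.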

Now those are far weaker than what you proposed, Tom. But my concern
there is that there are non-UTF encodings where a leading 0xEF 0xBB 0xBF
three-byte sequence (U+FEFF as encoded in UTF-8) is perfectly legal. For
example, in Latin1, it means the top of the file starts with:

ï»¿

That is a perfectly valid way of beginning a Latin1 file. I agree that
the chances are extremely low that the remaining file is not UTF-8, but
it's not about chances, it's about correctness.
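
As a quick check of the point above, this small Python sketch decodes the byte forms of a leading U+FEFF as Latin1. Note that the UTF-8 form is the three bytes EF BB BF, while FE FF is the UTF-16 big-endian form; both are legal Latin1 text. The variable names are only illustrative.

```python
import codecs

utf8_bom = codecs.BOM_UTF8          # b"\xef\xbb\xbf"
utf16be_bom = codecs.BOM_UTF16_BE   # b"\xfe\xff"

# Latin1 maps every byte value to a code point, so neither decode can fail.
latin1_from_utf8_bom = utf8_bom.decode("latin-1")
latin1_from_utf16be_bom = utf16be_bom.decode("latin-1")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without error."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

So a Latin1 file may begin with the same bytes as a UTF-8 BOM, which is why the BOM can only ever be a hint.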

Niall

Received on 2020-10-13 12:01:09