On 10/12/20 8:09 PM, J Decker via Unicode wrote:

On Sun, Oct 11, 2020 at 8:24 PM Tom Honermann via Unicode <unicode@unicode.org> wrote:

On 10/10/20 7:58 PM, Alisdair Meredith via SG16 wrote:

One concern I have, that might lead into rationale for the current discouragement,
is that I would hate to see a best practice that pushes a BOM into ASCII files.

One of the nice properties of UTF-8 is that a valid ASCII file (still very common) is
also a valid UTF-8 file. Changing best practice would encourage updating those
files to be no longer ASCII.
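
A minimal sketch of the property described above, assuming Python and a made-up one-line source file (the sample content and variable names are illustrative only). It shows that ASCII bytes decode identically as UTF-8, and that prepending a BOM leaves the data valid UTF-8 but no longer ASCII:

```python
import codecs

# Hypothetical sample content: a small, pure-ASCII source file.
ascii_bytes = b"int main() { return 0; }\n"

# Any ASCII byte sequence decodes to the same text under ASCII and UTF-8.
same_text = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")

# Prepending the UTF-8 BOM (bytes EF BB BF) keeps the data valid UTF-8 ...
with_bom = codecs.BOM_UTF8 + ascii_bytes
decoded = with_bom.decode("utf-8")  # first character is U+FEFF

# ... but the bytes are no longer ASCII, so a strict ASCII decode fails.
try:
    with_bom.decode("ascii")
    still_ascii = True
except UnicodeDecodeError:
    still_ascii = False
```

Here `same_text` ends up true and `still_ascii` ends up false, which is the concern in a nutshell: a tool that only understands ASCII would reject the file once a BOM is added.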

Thanks, Alisdair. I think that concern is implicitly addressed by the suggested resolutions, but perhaps that can be made clearer. One possibility would be to modify the "protocol designer" guidelines to address the case where a protocol's default encoding is ASCII-based and to specify that a BOM is only required for UTF-8 text that contains non-ASCII characters. Would that be helpful?

'and to specify that a BOM is only required for UTF-8': this should NEVER be 'required' or 'must'; it shouldn't even be 'suggested'. Fortunately, the BOM is just a ZWNBSP (U+FEFF), so at most the wording should be that text 'may' start with one.

These days the standard assumption that 'everything IS UTF-8' works really well, except in Firefox, where the charset must be specified for JS scripts (but that's a bug in Firefox).

EBCDIC should be converted at the edge to an internal ASCII representation since, thankfully, it is a niche application and everything else thinks in ASCII or some derivative thereof.

The byte order mark is irrelevant to UTF-8, since UTF-8 code units are single bytes and there is only one possible byte order.

I have run into several editors that insist on emitting a BOM for UTF-8 when a file is initially promoted from ASCII, but subsequently deleting it doesn't bother anything.
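
As a rough illustration of why deleting the BOM is harmless, here is a small Python sketch. The helper name `strip_utf8_bom` and the sample content are hypothetical; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the Python standard library:

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# U+FEFF encodes to the three bytes EF BB BF in UTF-8.
bom_bytes = "\ufeff".encode("utf-8")

# Hypothetical file content as an editor might save it.
saved = codecs.BOM_UTF8 + b"hello\n"
cleaned = strip_utf8_bom(saved)

# Python's "utf-8-sig" codec performs the same stripping while decoding.
text = saved.decode("utf-8-sig")
```

Files without a BOM pass through `strip_utf8_bom` unchanged, so applying it unconditionally should be safe for UTF-8 input.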

I mostly agree. Please note that the paper suggests use of a BOM only as a last resort. The goal is to further discourage its use with rationale.

I am curious though, what was the actual problem you ran into that makes you even consider this modification?

I'm working on improving support for portable C++ source code. Today, there is no character encoding that is supported by all C++ implementations (not even ASCII). I'd like to make UTF-8 that commonly supported character encoding. For backward compatibility reasons, compilers cannot change their default source code character encoding to UTF-8.

Most C++ applications are created from components that have different release schedules and that are maintained by different organizations. Synchronizing a conversion to UTF-8 across dependent projects isn't feasible, nor is converting all of the source files used by an application to UTF-8 as simple as just running them through 'iconv'. Migration to UTF-8 will therefore require an incremental approach for at least some applications, though many are likely to find success by simply invoking their compiler with the appropriate -everything-is-utf8 option since most source files are ASCII.
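
One way to gauge how much of a code base could simply be compiled as UTF-8 is to classify each file's bytes. The following Python sketch is only an assumption about how such a survey might be done; the function name and the three labels are hypothetical, and a file labelled "utf-8" is merely well-formed UTF-8, not proven to have been authored as such:

```python
def classify_source(data: bytes) -> str:
    """Classify raw source bytes for a hypothetical UTF-8 migration survey."""
    # Pure ASCII files mean the same thing under ASCII and UTF-8.
    if data.isascii():
        return "ascii"
    # Non-ASCII bytes that form well-formed UTF-8 are probably UTF-8 already.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Some other encoding (for example Latin-1); needs explicit conversion.
        return "needs-conversion"
    return "utf-8"

# Hypothetical inputs: the same comment saved in two different encodings.
as_utf8 = "// caf\u00e9\n".encode("utf-8")
as_latin1 = "// caf\u00e9\n".encode("latin-1")
```

Only the files in the "needs-conversion" group would require the incremental treatment described above; the "ascii" group is unaffected by a switch to UTF-8.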

Microsoft Visual C++ recognizes a UTF-8 BOM as an encoding signature and allows differently encoded source files to be used in the same translation unit. Support for differently encoded source files in the same translation unit is the feature that will be needed to enable incremental migration. Normative discouragement (with rationale) for use of a BOM by the Unicode standard would be helpful to explain why a solution other than a BOM (perhaps something like Python's encoding declaration) should be standardized in favor of the existing practice demonstrated by Microsoft's solution.
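
To make the two approaches concrete, here is a hedged Python sketch of per-file encoding detection. It checks for a UTF-8 BOM first and then for a declaration in the first two lines, using a pattern loosely modelled on Python's encoding declaration. Applying such a declaration to C++ comments is purely hypothetical, as are the function name and the default encoding; this is not how any existing compiler is known to behave:

```python
import codecs
import re

# Simplified pattern modelled on Python's encoding declaration.
# Using it inside C++ comments is a hypothetical illustration only.
_DECLARATION = re.compile(rb"coding[:=]\s*([-\w.]+)")

def guess_source_encoding(data: bytes, default: str = "windows-1252") -> str:
    """Guess a source file's encoding from a BOM or an in-file declaration."""
    # A UTF-8 BOM acts as an encoding signature.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # Otherwise look for a declaration in the first two lines only.
    for line in data.splitlines()[:2]:
        match = _DECLARATION.search(line)
        if match:
            return match.group(1).decode("ascii")
    # Fall back to the (assumed) implementation default.
    return default
```

With this sketch, a file beginning with `// -*- coding: utf-8 -*-` is reported as UTF-8 without any BOM, so it remains pure ASCII unless it actually contains non-ASCII characters.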

Tom.

J

Tom.

AlisdairM

On Oct 10, 2020, at 14:54, Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:

Attached is a draft proposal for the Unicode standard that intends to clarify the current recommendation regarding use of a BOM in UTF-8 text. This is a follow-up to the discussion on the Unicode mailing list back in June.

Feedback is welcome. I plan to submit this to the UTC in a week or so pending review feedback.

Tom.

<Unicode-BOM-guidance.pdf>
--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16