sg16: Re: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature

From: Tom Honermann <tom_at_[hidden]>
Date: Fri, 16 Oct 2020 09:22:46 -0400

On 10/14/20 3:21 PM, Shawn Steele wrote:
>
> How are you going to #include differently encoded source files? I
> don’t see anything in this document that would make it possible to
> #include a file in a different encoding. It’s unclear to me how your
> proposed document could be utilized to enable the scenario you’re
> interested in.
>
My intention is to present various options for WG21 to consider along
with a recommendation. The options that have been identified so far are
listed below. Combinations of some of these options are a possibility.

1. Use of a BOM to indicate UTF-8 encoded source files. This matches
    existing practice for the Microsoft compiler.
2. Use of a #pragma. This matches existing practice
    <https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.3.0/com.ibm.zos.v2r3.cbclx01/zos_pragma_filetag.htm>
    for the IBM compiler.
3. Use of a "magic" or "semantic" comment. This matches existing
    practice
    <https://docs.python.org/3/reference/lexical_analysis.html#encoding-declarations>
    in Python.
4. Use of filesystem metadata. This is an option for some compilers
    and is being considered for Clang on z/OS.
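
As a rough illustration of option 1 (not taken from any compiler), a tool
could check for the UTF-8 encoding signature, the bytes EF BB BF, before
decoding a source file. The sketch below is in Python for brevity; the
function name split_utf8_bom is hypothetical.

```python
import codecs
from typing import Tuple


def split_utf8_bom(data: bytes) -> Tuple[bool, bytes]:
    """Return (has_bom, payload) for the raw bytes of a source file.

    codecs.BOM_UTF8 is the byte sequence EF BB BF, i.e. U+FEFF encoded
    as UTF-8. The function name is hypothetical, chosen for illustration.
    """
    if data.startswith(codecs.BOM_UTF8):
        return True, data[len(codecs.BOM_UTF8):]
    return False, data


# A file that starts with the signature is flagged and the signature removed.
with_bom = codecs.BOM_UTF8 + "int main() {}\n".encode("utf-8")
print(split_utf8_bom(with_bom))

# A pure ASCII file carries no signature, so nothing is stripped.
print(split_utf8_bom(b"int main() {}\n"))
```

Note that the absence of a BOM does not identify the encoding; it only
means this particular signal is not present.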

The goal of this paper is to clarify guidance in the Unicode standard in
order to better inform and justify a recommendation. If the UTC were to
provide a strong recommendation either for or against use of a BOM in
UTF-8 files, that would be a point either in favor or in opposition to
option 1 above. As is, based on my reading and a number of the
responses I've seen, the guidance is murky.

> For mixed-encoding behavior the only thing I could imagine is adding
> some sort of preprocessor #codepage or something to the standard.
> (Which would again take a while to reach critical mass.)
>
Yes, deployment will take time in any case. A goal would be to choose
an option that can be used as an extension for previous C++ standards.
This may rule out option 2 above since some compilers diagnose use of an
unrecognized pragma.
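
By contrast, comments are replaced by whitespace during translation, so
an encoding declaration placed in a comment (option 3) would not be
diagnosed by a compiler that does not recognize it. A minimal sketch of
how a tool might look for such a declaration follows, in Python for
brevity. The "// coding: <name>" syntax, the two-line limit (borrowed
from Python's rule), and the function name are assumptions made for
illustration only. The sketch also assumes the declaration itself is
written in an ASCII-compatible encoding, which would not hold for EBCDIC
source files.

```python
import re
from typing import Optional

# Hypothetical declaration syntax, modeled on Python's encoding declaration
# but using a C++ line comment: // coding: <encoding-name>
_DECLARATION = re.compile(rb"^[ \t\f]*//.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(data: bytes) -> Optional[str]:
    """Return the encoding named in the first two lines of data, or None."""
    for line in data.splitlines()[:2]:
        match = _DECLARATION.match(line)
        if match:
            return match.group(1).decode("ascii")
    return None


source = b"// -*- coding: utf-8 -*-\n#include <cstdio>\nint main() {}\n"
print(declared_encoding(source))  # utf-8
```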

Tom.

> -Shawn
>
> *From:* Tom Honermann <tom_at_[hidden]>
> *Sent:* Tuesday, October 13, 2020 9:47 PM
> *To:* Shawn Steele <Shawn.Steele_at_[hidden]>; J Decker
> <d3ck0r_at_[hidden]>
> *Cc:* sg16_at_[hidden]
> *Subject:* Re: [SG16] Draft proposal: Clarify guidance for use of a
> BOM as a UTF-8 encoding signature
>
> On 10/13/20 5:19 PM, Shawn Steele wrote:
>
> IMO this document doesn’t solve your problem. Encouraging use of
> UTF-8 in C++ source code is a goal that most
> compilers/source code authors/etc are totally onboard with.
>
> The source is already in an indeterminate state. The desired end
> state is to have UTF-8 source code (without BOM), which is
> typically supported. The difficulty is therefore getting from
> point A to point B. As far as “Use Unicode” goes, there’s no
> issue, but trying to specify BOM as a protocol doesn’t really
> solve the problem, particularly in complex environments.
>
> I think there is a misunderstanding. The intent of the paper is to
> provide rationale for the existing discouragement for use of a BOM in
> UTF-8 while acknowledging that, in some cases, it may remain useful.
> My intent is to discourage use of a BOM for UTF-8 encoded source files
> - thereby arguing against standardizing the behavior exhibited by
> Microsoft Visual C++ today.
>
> If the compiler doesn’t handle BOM as expected, then you’ll get
> errors. This can be further complicated by preprocessors,
> #include, resources, etc. If “specifying BOM behavior in Unicode”
> could help solve the problem, then all of the tooling used by
> everyone would have to be updated to handle that (new)
> requirement. If you could get everyone on the same page, they’d
> all use UTF-8, so you wouldn’t need to update the tooling. If you
> don’t need to update the tooling, you wouldn’t need to update the
> best practices for BOMs.
>
> This paper does not propose "specifying BOM behavior in Unicode". If
> you feel that it does, please read it again and let me know what leads
> you to believe that it does.
>
> The tooling isn't the problem. The problem is the existing source
> code that is not UTF-8 encoded or that is UTF-8 encoded with a BOM.
> The deployment challenge is with those existing source files.
> Microsoft Visual C++ is going to continue consuming source files using
> the Active Code Page (ACP) and IBM compilers on EBCDIC platforms are
> going to continue consuming source files using EBCDIC code pages. The
> goal is to provide a mechanism where a UTF-8 encoded source file can
> #include a source file in another encoding or vice versa. Any
> solution for that will require tooling updates (and that is ok).
>
> Personally, I’d prefer if cases like this ignore BOMs (or use them
> to switch to UTF-8); eg: treat BOMs like whitespace. But this
> isn’t a problem solvable by any recommendation by Unicode.
>
> When consuming text as UTF-8, I agree that ignoring a BOM is usually
> the right thing to do and would be the right thing to do when
> consuming source code.
>
> As you noted, many systems provide mechanisms for indicating that
> code is UTF-8 or compiling with UTF-8, regardless of BOM.
>
> Yes, but there is no standard solution, not even a de facto one, for
> consuming differently encoded source files in the same translation unit.
>
> A rather large codebase I’ve been working with has been working to
> remove encoding confusion, and it’s a big task 😁
>
> Yes, yes it is.
>
> Tom.
>
> -Shawn
>
> *From:* Unicode <unicode-bounces_at_[hidden]>
> *On Behalf Of* Tom Honermann
> via Unicode
> *Sent:* Tuesday, October 13, 2020 1:47 PM
> *To:* J Decker <d3ck0r_at_[hidden]>;
> Unicode List <unicode_at_[hidden]>
> *Cc:* sg16_at_[hidden]
> *Subject:* Re: [SG16] Draft proposal: Clarify guidance for use of
> a BOM as a UTF-8 encoding signature
>
> On 10/12/20 8:09 PM, J Decker via Unicode wrote:
>
> On Sun, Oct 11, 2020 at 8:24 PM Tom Honermann via Unicode
> <unicode_at_[hidden]> wrote:
>
> On 10/10/20 7:58 PM, Alisdair Meredith via SG16 wrote:
>
> One concern I have, that might lead into rationale for
> the current discouragement,
>
> is that I would hate to see a best practice that
> pushes a BOM into ASCII files.
>
> One of the nice properties of UTF-8 is that a valid
> ASCII file (still very common) is
>
> also a valid UTF-8 file. Changing best practice would
> encourage updating those
>
> files to be no longer ASCII.
>
> Thanks, Alisdair. I think that concern is implicitly
> addressed by the suggested resolutions, but perhaps that
> can be made more clear. One possibility would be to
> modify the "protocol designer" guidelines to address the
> case where a protocol's default encoding is ASCII based
> and to specify that a BOM is only required for UTF-8 text
> that contains non-ASCII characters. Would that be helpful?
>
> 'and to specify that a BOM is only required for UTF-8 ' this
> should NEVER be 'required' or 'must', it shouldn't even be
> 'suggested'; fortunately BOM is just a ZWNBSP, so it's
> certainly a 'may' start with a such and such.
>
> These days the standard 'everything IS utf-8' works really
> well, except in firefox where the charset is required to be
> specified for JS scripts (but that's a bug in that)
>
> EBCDIC should be converted on the edge to internal ascii,
> since, thankfully, this is a niche application and everything
> thinks in ASCII or some derivative thereof.
>
> Byte Order Mark is irrelevant to utf-8 since bytes are
> ordered in the correct order.
>
> I have run into several editors that have insisted on
> emitting a BOM for UTF8 when initially promoted from ASCII, but
> subsequently deleting it doesn't bother anything.
>
> I mostly agree. Please note that the paper suggests use of a BOM
> only as a last resort. The goal is to further discourage its use
> with rationale.
>
>
> I am curious though, what was the actual problem you ran into
> that makes you even consider this modification?
>
> I'm working on improving support for portable C++ source code.
> Today, there is no character encoding that is supported by all C++
> implementations (not even ASCII). I'd like to make UTF-8 that
> commonly supported character encoding. For backward compatibility
> reasons, compilers cannot change their default source code
> character encoding to UTF-8.
>
> Most C++ applications are created from components that have
> different release schedules and that are maintained by different
> organizations. Synchronizing a conversion to UTF-8 across
> dependent projects isn't feasible, nor is converting all of the
> source files used by an application to UTF-8 as simple as just
> running them through 'iconv'. Migration to UTF-8 will therefore
> require an incremental approach for at least some applications,
> though many are likely to find success by simply invoking their
> compiler with the appropriate -everything-is-utf8 option since
> most source files are ASCII.
>
> Microsoft Visual C++ recognizes a UTF-8 BOM as an encoding
> signature and allows differently encoded source files to be used
> in the same translation unit. Support for differently encoded
> source files in the same translation unit is the feature that will
> be needed to enable incremental migration. Normative
> discouragement (with rationale) for use of a BOM by the Unicode
> standard would be helpful to explain why a solution other than a
> BOM (perhaps something like Python's encoding declaration
> <https://docs.python.org/3/reference/lexical_analysis.html#encoding-declarations>)
> should be standardized in favor of the existing practice
> demonstrated by Microsoft's solution.
>
> Tom.
>
> J
>
> Tom.
>
> AlisdairM
>
>
>
>
> On Oct 10, 2020, at 14:54, Tom Honermann via SG16
> <sg16_at_[hidden]> wrote:
>
> Attached is a draft proposal for the Unicode
> standard that intends to clarify the current
> recommendation regarding use of a BOM in UTF-8
> text. This is a follow-up to discussion on the
> Unicode mailing list
> <https://corp.unicode.org/pipermail/unicode/2020-June/008713.html>
> back in June.
>
> Feedback is welcome. I plan to submit
> <https://www.unicode.org/pending/docsubmit.html>
> this to the UTC in a week or so pending review
> feedback.
>
> Tom.
>
> <Unicode-BOM-guidance.pdf>

Received on 2020-10-16 08:22:51