Subject: Re: Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature
From: Tom Honermann (tom_at_[hidden])
Date: 2020-10-16 08:22:46
On 10/14/20 3:21 PM, Shawn Steele wrote:
> How are you going to #include differently encoded source files? I
> don't see anything in this document that would make it possible to
> #include a file in a different encoding. It's unclear to me how your
> proposed document could be utilized to enable the scenario you're
> interested in.
My intention is to present various options for WG21 to consider along
with a recommendation. The options that have been identified so far are
listed below. Combinations of some of these options are a possibility.
1. Use of a BOM to indicate UTF-8 encoded source files. This matches
existing practice for the Microsoft compiler.
2. Use of a #pragma. This matches existing practice
for the IBM compiler.
3. Use of a "magic" or "semantic" comment. This matches existing
4. Use of filesystem metadata. This is an option for some compilers
and is being considered for Clang on z/OS.
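To make options 1 and 3 concrete, here is a minimal sketch in Python of how a tool might guess a source file's encoding from its leading bytes. The function name, the default code page, and the comment syntax are all hypothetical choices for illustration. The pattern is adapted from Python's own encoding declaration (PEP 263), and the magic-comment check assumes an ASCII-compatible encoding, so it would not work as written for EBCDIC sources.

```python
import codecs
import re

# Adapted from the PEP 263 pattern; accepts "//" or "#" comment leaders.
# The comment syntax here is an assumption, not anything proposed for C++.
MAGIC_COMMENT = re.compile(rb"^[ \t\f]*(?://|#).*?coding[:=][ \t]*([-\w.]+)")


def guess_source_encoding(data: bytes, default: str = "cp1252") -> str:
    """Guess a source file's encoding from its first bytes.

    Hypothetical helper. `default` stands in for whatever encoding the
    compiler would otherwise assume (for example, the Active Code Page).
    """
    # Option 1: a leading UTF-8 BOM (EF BB BF) is treated as a signature.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # Option 3: look for a magic comment on the first two lines only.
    for line in data.splitlines()[:2]:
        match = MAGIC_COMMENT.match(line)
        if match:
            return match.group(1).decode("ascii")
    # Neither signal is present, so fall back to the assumed default.
    return default
```

In this sketch a BOM takes precedence over a magic comment; that ordering is a design choice made for the example, not a recommendation.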
The goal of this paper is to clarify guidance in the Unicode standard in
order to better inform and justify a recommendation. If the UTC were to
provide a strong recommendation either for or against use of a BOM in
UTF-8 files, that would be a point either in favor or in opposition to
option 1 above. As is, based on my reading and a number of the
responses I've seen, the guidance is murky.
> For mixed-encoding behavior the only thing I could imagine is adding
> some sort of preprocessor #codepage or something to the standard.
> (Which would again take a while to reach critical mass.)
Yes, deployment will take time in any case. A goal would be to choose
an option that can be used as an extension for previous C++ standards.
This may rule out option 2 above since some compilers diagnose use of an
> *From:* Tom Honermann <tom_at_[hidden]>
> *Sent:* Tuesday, October 13, 2020 9:47 PM
> *To:* Shawn Steele <Shawn.Steele_at_[hidden]>; J Decker
> *Cc:* sg16_at_[hidden]
> *Subject:* Re: [SG16] Draft proposal: Clarify guidance for use of a
> BOM as a UTF-8 encoding signature
> On 10/13/20 5:19 PM, Shawn Steele wrote:
> IMO this document doesn't solve your problem. The problem of
> encouraging use of UTF-8 in C++ source code is a goal that most
> compilers/source code authors/etc are totally onboard with.
> The source is already in an indeterminate state. The desired end
> state is to have UTF-8 source code (without BOM), which is
> typically supported. The difficulty is therefore getting from
> point A to point B. As far as "Use Unicode" goes, there's no
> issue, but trying to specify BOM as a protocol doesn't really
> solve the problem, particularly in complex environments.
> I think there is a misunderstanding. The intent of the paper is to
> provide rationale for the existing discouragement for use of a BOM in
> UTF-8 while acknowledging that, in some cases, it may remain useful.
> My intent is to discourage use of a BOM for UTF-8 encoded source files
> - thereby arguing against standardizing the behavior exhibited by
> Microsoft Visual C++ today.
> If the compiler doesn't handle BOM as expected, then you'll get
> errors. This can be further complicated by preprocessors,
> #include, resources, etc. If "specifying BOM behavior in Unicode"
> could help solve the problem, then all of the tooling used by
> everyone would have to be updated to handle that (new)
> requirement. If you could get everyone on the same page, they'd
> all use UTF-8, so you wouldn't need to update the tooling. If you
> don't need to update the tooling, you wouldn't need to update the
> best practices for BOMs.
> This paper does not propose "specifying BOM behavior in Unicode". If
> you feel that it does, please read it again and let me know what leads
> you to believe that it does.
> The tooling isn't the problem. The problem is the existing source
> code that is not UTF-8 encoded or that is UTF-8 encoded with a BOM.
> The deployment challenge is with those existing source files.
> Microsoft Visual C++ is going to continue consuming source files using
> the Active Code Page (ACP) and IBM compilers on EBCDIC platforms are
> going to continue consuming source files using EBCDIC code pages. The
> goal is to provide a mechanism where a UTF-8 encoded source file can
> #include a source file in another encoding or vice versa. Any
> solution for that will require tooling updates (and that is ok).
> Personally, I'd prefer if cases like this ignore BOMs (or use them
> to switch to UTF-8); e.g., treat BOMs like whitespace. But this
> isn't a problem solvable by any recommendation by Unicode.
> When consuming text as UTF-8, I agree that ignoring a BOM is usually
> the right thing to do and would be the right thing to do when
> consuming source code.
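As a small illustration of ignoring a BOM when consuming text as UTF-8, the sketch below uses Python's standard "utf-8-sig" codec, which decodes UTF-8 and drops a single leading BOM if one is present. The function name is hypothetical; this only shows the behavior being described and is not a proposal for how a compiler should implement it.

```python
def decode_utf8_source(data: bytes) -> str:
    """Decode UTF-8 source text, ignoring a leading BOM if present.

    Hypothetical helper for illustration only.
    """
    # "utf-8-sig" strips one leading EF BB BF sequence when decoding and
    # otherwise behaves like the plain "utf-8" codec.
    return data.decode("utf-8-sig")
```

With this approach a file with a BOM and the same file without one decode to identical text, so downstream processing does not need to care which form it received.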
> As you noted, many systems provide mechanisms for indicating that
> code is UTF-8 or compiling with UTF-8, regardless of BOM.
> Yes, but there is no standard solution, not even a de facto one, for
> consuming differently encoded source files in the same translation unit.
> A rather large codebase I've been working with has been working to
> remove encoding confusion, and it's a big task
> Yes, yes it is.
> *From:* Unicode <unicode-bounces_at_[hidden]> *On Behalf Of* Tom Honermann
> via Unicode
> *Sent:* Tuesday, October 13, 2020 1:47 PM
> *To:* J Decker <d3ck0r_at_[hidden]>; Unicode List <unicode_at_[hidden]>
> *Cc:* sg16_at_[hidden]
> *Subject:* Re: [SG16] Draft proposal: Clarify guidance for use of
> a BOM as a UTF-8 encoding signature
> On 10/12/20 8:09 PM, J Decker via Unicode wrote:
> On Sun, Oct 11, 2020 at 8:24 PM Tom Honermann via Unicode
> <unicode_at_[hidden]> wrote:
> On 10/10/20 7:58 PM, Alisdair Meredith via SG16 wrote:
> One concern I have, that might lead into rationale for
> the current discouragement,
> is that I would hate to see a best practice that
> pushes a BOM into ASCII files.
> One of the nice properties of UTF-8 is that a valid
> ASCII file (still very common) is
> also a valid UTF-8 file. Changing best practice would
> encourage updating those
> files to be no longer ASCII.
> Thanks, Alisdair. I think that concern is implicitly
> addressed by the suggested resolutions, but perhaps that
> can be made more clear. One possibility would be to
> modify the "protocol designer" guidelines to address the
> case where a protocol's default encoding is ASCII based
> and to specify that a BOM is only required for UTF-8 text
> that contains non-ASCII characters. Would that be helpful?
> 'and to specify that a BOM is only required for UTF-8' this
> should NEVER be 'required' or 'must', it shouldn't even be
> 'suggested'; fortunately BOM is just a ZWNBSP, so it's
> certainly a 'may' start with a such and such.
> These days the standard 'everything IS utf-8' works really
> well, except in firefox where the charset is required to be
> specified for JS scripts (but that's a bug in that)
> EBCDIC should be converted on the edge to internal ascii,
> since, thankfully, this is a niche application and everything
> thinks in ASCII or some derivative thereof.
> Byte Order Mark is irrelevant to utf-8 since bytes are
> ordered in the correct order.
> I have run into several editors that have insisted on
> emitting a BOM for UTF8 when initially promoted from ASCII, but
> subsequently deleting it doesn't bother anything.
> I mostly agree. Please note that the paper suggests use of a BOM
> only as a last resort. The goal is to further discourage its use
> with rationale.
> I am curious though, what was the actual problem you ran into
> that makes you even consider this modification?
> I'm working on improving support for portable C++ source code.
> Today, there is no character encoding that is supported by all C++
> implementations (not even ASCII). I'd like to make UTF-8 that
> commonly supported character encoding. For backward compatibility
> reasons, compilers cannot change their default source code
> character encoding to UTF-8.
> Most C++ applications are created from components that have
> different release schedules and that are maintained by different
> organizations. Synchronizing a conversion to UTF-8 across
> dependent projects isn't feasible, nor is converting all of the
> source files used by an application to UTF-8 as simple as just
> running them through 'iconv'. Migration to UTF-8 will therefore
> require an incremental approach for at least some applications,
> though many are likely to find success by simply invoking their
> compiler with the appropriate -everything-is-utf8 option since
> most source files are ASCII.
> Microsoft Visual C++ recognizes a UTF-8 BOM as an encoding
> signature and allows differently encoded source files to be used
> in the same translation unit. Support for differently encoded
> source files in the same translation unit is the feature that will
> be needed to enable incremental migration. Normative
> discouragement (with rationale) for use of a BOM by the Unicode
> standard would be helpful to explain why a solution other than a
> BOM (perhaps something like Python's encoding declaration)
> should be standardized in favor of the existing practice
> demonstrated by Microsoft's solution.
> On Oct 10, 2020, at 14:54, Tom Honermann via SG16
> <sg16_at_[hidden]> wrote:
> Attached is a draft proposal for the Unicode
> standard that intends to clarify the current
> recommendation regarding use of a BOM in UTF-8
> text. This is a follow-up to discussion on the
> Unicode mailing list
> back in June.
> Feedback is welcome. I plan to submit
> this to the UTC in a week or so pending review