On 10/16/20 11:44 AM, Thiago Macieira via SG16 wrote:
> On Tuesday, 13 October 2020 22:16:14 PDT Tom Honermann via SG16 wrote:
>>> Everyone already knows the best practice: “Use UTF-8”.  Any
>>> resources/effort should go toward reaching that best practice, not
>>> toward edge cases of legacy behaviors that are offshoots of something
>>> that isn’t the desired end state of “use UTF-8”.
>> My goal is exactly to ease migration to that end state.  We can't
>> reasonably synchronize a migration of all C++ projects to UTF-8.  To get
>> to that end state, we'll have to enable C++ projects to independently
>> transition to UTF-8.
> I think we can. We just need critical mass.
> 
> The status quo has remained because there has been nothing forcing a
> change to the status quo. Yes, there are a lot of old codebases that, for
> example, might have comments written in Chinese or Finnish or some other
> language, but nothing has forced those to update. If the critical mass of
> software is UTF-8, that will force those codebases to recode. And unlike
> Microsoft's fixing of their own SDK header files to comply with the
> language, this is a simple recode operation: it can be done by downstream
> users, with little to no danger.
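> As a sketch of how mechanical that recode is (assuming, say, a Latin-1
> legacy encoding; this is illustrative code, not from any real tool), it
> amounts to a few lines of C++:
> 
>     #include <fstream>
> 
>     // Recode a Latin-1 file to UTF-8. Every Latin-1 byte value is also
>     // the Unicode code point of the same value, so bytes >= 0x80 simply
>     // become two-byte UTF-8 sequences.
>     int main(int argc, char *argv[]) {
>         if (argc != 3) return 1;
>         std::ifstream in(argv[1], std::ios::binary);
>         std::ofstream out(argv[2], std::ios::binary);
>         char ch;
>         while (in.get(ch)) {
>             unsigned char c = static_cast<unsigned char>(ch);
>             if (c < 0x80) {
>                 out.put(static_cast<char>(c));                 // ASCII unchanged
>             } else {
>                 out.put(static_cast<char>(0xC0 | (c >> 6)));   // lead byte
>                 out.put(static_cast<char>(0x80 | (c & 0x3F))); // continuation byte
>             }
>         }
>     }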
> 
> For my (Qt's) part, we're already doing it: Qt 6.0 is UTF-8. It uses
> UTF-8 in its headers (which have no BOM markers), assumes char is UTF-8,
> and enables the MSVC option "/utf-8" by default in the build system.
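> To illustrate why that flag matters: a BOM-less UTF-8 header such as the
> hypothetical one below compiles as intended under /utf-8, but without the
> flag MSVC decodes the file using the active code page, so what the
> literals contain can depend on the machine doing the compiling:
> 
>     // greeting.h (hypothetical), saved as UTF-8 with no BOM.
>     // "Grüße" is the UTF-8 byte sequence 47 72 C3 BC C3 9F 65.
>     inline const char *greeting = "Grüße";
>     // Under /utf-8 the array holds exactly those bytes. Under, say, a
>     // DBCS code page, the same bytes decode as different characters,
>     // and u8"..." literals get mis-transcoded.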
> 
>> Use of a BOM would be one way to get to that desired end state but, as
>> you mentioned, a BOM isn't a great way to identify UTF-8 data.  The
>> Unicode standard already admits this with the quoted "not recommended"
>> text, but it lacks the rationale to defend that recommendation or to
>> explain when it may be appropriate to disregard it.  My goal with this
>> paper is to fill that hole.  If you don't care for how I've proposed it
>> be filled, that is certainly OK, and alternative suggestions are welcome.
> Your paper discourages the use of a BOM, as I think it should. It does not
> say what solution tooling should adopt to figure out whether a source file
> is UTF-8 or not.
That is a different paper that I have yet to finish.

> Or, if I read it differently, we can apply the Sherlock Holmes solution:
> when you remove the impossible, whatever remains, however improbable, must
> be the truth. We don't have any mechanism for tagging files out-of-band to
> indicate they are UTF-8. Therefore, the only solution is an in-band
> marker: the BOM. Is that what you're proposing?

No, the goal of this paper is solely to clarify the UTC's (Unicode
Technical Committee's) position on the use of a BOM in UTF-8.  Based on
both private and public responses I've received, there is an apparent lack
of agreement on what the Unicode standard states.

My response to Shawn Steele listed the options under consideration; they
are copied below.  I haven't settled on a recommendation yet, but I lean
towards #3, with room for implementation-defined behavior to make use of
the other three.  A sketch of what these markers look like in practice
follows the list.

  1. Use of a BOM to indicate UTF-8 encoded source files.  This matches existing practice for the Microsoft compiler.
  2. Use of a #pragma.  This matches existing practice for the IBM compiler.
  3. Use of a "magic" or "semantic" comment.  This matches existing practice in Python.
  4. Use of filesystem metadata.  This is an option for some compilers and is being considered for Clang on z/OS.
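
For concreteness, option 1 amounts to checking the first three bytes of a
file. A minimal sketch (the function name is mine, purely illustrative):

    #include <fstream>

    // A UTF-8 BOM is the byte sequence EF BB BF at the very start of the
    // file. (For comparison, option 2 looks like z/OS's
    // #pragma filetag("IBM-1047"), and option 3 looks like Python's
    // PEP 263 comment: # -*- coding: utf-8 -*-.)
    bool starts_with_utf8_bom(const char *path) {
        std::ifstream f(path, std::ios::binary);
        unsigned char b[3] = {};
        f.read(reinterpret_cast<char *>(b), 3);
        return f.gcount() == 3 &&
               b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF;
    }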

Tom.