That is a different paper that I have yet to finish.

On Tuesday, 13 October 2020 22:16:14 PDT Tom Honermann via SG16 wrote:
> > Everyone already knows the best practice: "Use UTF-8". Any resources or
> > effort are going to go toward that best practice, not toward edge cases
> > of legacy behaviors that are offshoots of something that isn't the
> > desired end state of "use UTF-8".
>
> My goal is exactly to ease migration to that end state. We can't
> reasonably synchronize a migration of all C++ projects to UTF-8. To get
> to that end state, we'll have to enable C++ projects to independently
> transition to UTF-8.

I think we can. We just need critical mass. The status quo has remained
because nothing has forced a change to it. Yes, there are a lot of old
codebases that, for example, might have comments written in Chinese or
Finnish or something else, but nothing has forced those to update. If the
critical mass of software is UTF-8, that will force those codebases to
recode. And unlike Microsoft's fixing of their own SDK header files to
comply with the language, this is a simple recode operation. It can be done
by downstream users, with little to no danger. For my (Qt's) part, we're
already doing it: Qt 6.0 is UTF-8, uses UTF-8 in its headers (which have no
BOM markers), assumes char is UTF-8, and passes the MSVC option "/utf-8" by
default in the build system.

> Use of a BOM would be one way to get to that desired end state but, as
> you mentioned, a BOM isn't a great way to identify UTF-8 data. The
> Unicode standard already admits this with the quoted "not recommended"
> text, but it lacks the rationale to defend that recommendation or to
> explain when it may be appropriate to disregard it. My goal with this
> paper is to fill that hole. If you don't care for how I've proposed it to
> be filled, that is certainly ok, and alternative suggestions are welcome.

Your paper discourages the use of a BOM, as I think it should. It does not
say what solution tooling should adopt to figure out whether a source file
is UTF-8 or not.

Or, if I read it differently, we can apply the Sherlock Holmes solution:
when you remove the impossible, whatever remains, however improbable, must
be the truth. We don't have any mechanism for tagging files out-of-band to
indicate that they are UTF-8. Therefore, the only solution is an in-band
marker: the BOM. Is that what you're proposing?
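[Editorial aside: the in-band marker described above is trivial to probe for. A minimal sketch in C++, assuming we only care about the first three bytes of a file; the helper name starts_with_utf8_bom is hypothetical and not taken from any paper discussed in this thread:]

```cpp
#include <array>
#include <cstdio>

// Hypothetical helper: reports whether the file at 'path' begins with the
// UTF-8 encoding of U+FEFF, i.e. the byte sequence EF BB BF, used as an
// in-band BOM marker.
bool starts_with_utf8_bom(const char *path)
{
    std::FILE *f = std::fopen(path, "rb");
    if (!f)
        return false;   // unreadable file: treat as "no BOM"
    std::array<unsigned char, 3> buf{};
    std::size_t n = std::fread(buf.data(), 1, buf.size(), f);
    std::fclose(f);
    return n == buf.size() && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF;
}
```

[Note that the converse is exactly the problem raised above: the absence of these three bytes tells a tool nothing, since most UTF-8 files carry no BOM.]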
No, the goal of this paper is solely to clarify the UTC's position on use of a BOM in UTF-8. Based on both private and public responses I've received, there is an apparent lack of agreement on what the Unicode standard states.
My response to Shawn Steele listed the options up for consideration; copied below. I haven't settled on a recommendation yet, but lean towards #3 with room for implementation-defined behavior to make use of the other three.
Tom.