On 10/16/20 6:02 PM, Thiago Macieira wrote:
On Friday, 16 October 2020 13:59:15 PDT Tom Honermann wrote:
I strongly agree that any recommendation that is not aligned with that
approach is doomed to fail.  I'd like to enable a solution that supports
a best practice in which files that are *not* UTF-8 are marked in some
way while acknowledging and supporting exceptional circumstances in
which it makes sense to mark UTF-8 files instead, hopefully as a short-term solution.
I can sympathise with the desire, but I have to question whether it's worth 
it.

That requires:
* tooling be updated to support the markers (not just compilers, but mainly them)
* older source code be updated to have said markers
* codebases using that older third-party code be updated to adopt them

Isn't it far more likely for the older source to be updated to UTF-8 instead?
Unfortunately, I don't know.  There may be a lot of deployed tools in older projects that either don't support UTF-8 or expect a non-UTF-8 encoding by default.  Such projects may require changes to build systems, CI/CD pipelines, code review processes, and so on.

It seems this marker would be added to source code that can't move to UTF-8 
because some important customer can't make the switch, so it's added to tell 
everyone else that it's some other encoding and those other people need to 
update their compilers. And because said customer won't upgrade, this marker 
has to be a no-op for them, unlike for everyone else.

I see it a little differently.  I think the marker is most useful for enabling use of UTF-8 encoded projects by non-UTF-8 projects that can't move to UTF-8 for "reasons".  Such projects could use the marker in a few ways: 1) to mark their own non-UTF-8 source files, 2) to mark UTF-8 source files, or 3) to mark all source files.  I see little motivation for the third option; defaults are useful.  The second option competes with the long-term direction of making UTF-8 the default (thanks to Corentin for making the point that marking UTF-8 files would continue the C++ trend of getting the defaults wrong), so it would not be considered a best practice, though it could make sense for projects that have "ANSI" or EBCDIC encoded source files and complicated non-UTF-8 build/CI/CD processes, but where those processes are less complicated for third-party dependencies.  That leaves option 1 as the best practice.
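To make that concrete, here's a rough sketch of what "marking" amounts to today and what option 1 might look like in-source.  The compiler options are real; the pragma is made-up syntax, shown purely for illustration:

    // Status quo: the "marker" lives outside the source, as a per-TU option.
    //   GCC:   g++ -finput-charset=CP1252 legacy.cpp
    //   MSVC:  cl /source-charset:.1252 legacy.cpp
    // A hypothetical in-source marker for option 1 (marking the non-UTF-8
    // files themselves); this pragma does not exist in any standard or
    // compiler and is shown only to illustrate the idea:
    // #pragma source_encoding("windows-1252")

    // If this file is Windows-1252 encoded, the 'é' below is stored as the
    // single byte 0xE9, which is an ill-formed UTF-8 sequence.  Without a
    // marker (or the right flag), a UTF-8-assuming consumer mojibakes or
    // rejects it.
    const char *motd = "café";

In other words, option 1 would just move information that currently lives in build scripts into the file that actually needs it.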

I see the goal as being to free up library developers to use UTF-8 without restraint.  Yes, many already do.


It's far easier to simply keep the source code 7-bit US-ASCII clean instead 
and cater to all non-IBM needs.
Yes, but 7-bit US-ASCII clean isn't a particularly attractive future direction.  Standards support for UTF-8 (in some way) would allow us, in limited ways, to move beyond the constraints of the basic source character set without having to rely on universal-character-names.  That would help to reduce the scope of MISRA and AUTOSAR compliance rules that forbid use of characters outside the basic source character set.  Additionally, in conjunction with P1949, it could set the stage for adopting additional characters as operators in the future.
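As a small illustration of what that buys (assuming P1949-style identifiers), here are two spellings of the same identifier; only the first is available to a codebase confined to the basic source character set:

    // Universal-character-name spelling: stays within the basic source
    // character set, but is hard to read and write.
    double h\u00F8yde = 1.80;

    // Direct spelling in a UTF-8 source file; this denotes the very same
    // identifier as the UCN spelling above:
    //   double høyde = 1.80;
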

EBCDIC is a completely different story. EBCDIC is not widespread at all and it 
also has a completely different solution, because mojibake is not possible. A 
compiler failing to make the correct decision on EBCDIC- vs ASCII-based 
encoding will encounter hard syntax errors. But this is a very limited 
environment, where I expect recoding and other marker techniques are already 
in use. It's also the only environment that retains trigraphs, and I say that, as with trigraphs, we shouldn't all be made to suffer -- "the needs of the many".

I agree, and that may be a reason for the standard to simply mandate support for UTF-8 source files via some implementation-defined means.
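For what it's worth, each of the major implementations already provides such a means, so this would largely standardize existing practice.  A sketch of the status quo (real options, not a proposal):

    // Implementation-defined ways of telling the compiler the source is UTF-8:
    //   GCC:   g++ -finput-charset=UTF-8 tu.cpp
    //   MSVC:  cl /source-charset:utf-8 tu.cpp   (/utf-8 also sets the
    //          execution character set)
    //   Clang: no option needed; input is always treated as UTF-8
    //
    // With any of those in effect, the non-ASCII characters below are decoded
    // from the source as UTF-8 and re-encoded into the UTF-8 execution
    // encoding of the u8 literal (char8_t is C++20):
    const char8_t greeting[] = u8"grüß dich";
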

Tom.