sg16: Re: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature

From: Tom Honermann <tom_at_[hidden]>
Date: Sun, 18 Oct 2020 12:29:59 -0400

On 10/16/20 6:02 PM, Thiago Macieira wrote:
> On Friday, 16 October 2020 13:59:15 PDT Tom Honermann wrote:
>> I strongly agree that any recommendation that is not aligned with that
>> approach is doomed to fail. I'd like to enable a solution that supports
>> a best practice in which files that are *not* UTF-8 are marked in some
>> way while acknowledging and supporting exceptional circumstances in
>> which it makes sense to mark UTF-8 files instead, hopefully as a short
>> term solution.
> I can sympathise with the desire, but I have to question whether it's worth
> it.
>
> That requires:
> * tooling be updated to support the markers (not just compilers, but mainly)
> * older source code be updated to have said markers
> * codebases using that old third party be updated to adopt them
>
> Isn't it far more likely for the older source to be updated to UTF-8 instead?
Unfortunately, I don't know. There may be a lot of deployed tools in
older projects that either don't support UTF-8 or expect a non-UTF-8
encoding by default. Such projects may require changes to build
systems, CI/CD pipelines, code review processes, etc.
>
> It seems this marker would be added to source code that can't move to UTF-8
> because some important customer can't make the switch, so it's added to tell
> everyone else that it's some other encoding and those other people need to
> update their compilers. And because said customer won't upgrade, this marker
> has to be a no-op for them, unlike for everyone else.

I see it a little differently. I think the marker is most useful to
enable use of UTF-8 projects by non-UTF-8 projects that can't move to
UTF-8 for "reasons". Such projects could use the marker in a few ways:
1) to mark their own non-UTF-8 source files, 2) to mark UTF-8 source
files, or 3) to mark all source files. I see little motivation for the
third option; defaults are useful. The second option competes with the
long-term direction to make UTF-8 the default (thanks to Corentin for
making the point that marking UTF-8 files would continue the C++ trend
of getting the defaults wrong), so it would not be considered a best
practice. It could still make sense for projects that have "ANSI" or
EBCDIC encoded source files and complicated non-UTF-8 build/CI/CD
processes, but where those processes are less complicated for third
party dependencies. That leaves option 1 as the best practice.
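
To make option 2 concrete: if the marker on a UTF-8 file were the BOM
from the subject line (U+FEFF encoded as the bytes EF BB BF), the check
a tool performs could look roughly like the sketch below. This is only
an illustration under that assumption; has_utf8_bom is a hypothetical
name, not something from the draft or from any compiler.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (an assumption for illustration, not part of any
// proposal): reports whether a buffer begins with the UTF-8 encoding of
// U+FEFF, i.e. the three bytes EF BB BF.
bool has_utf8_bom(const std::string& bytes) {
    static const unsigned char bom[] = {0xEF, 0xBB, 0xBF};
    if (bytes.size() < sizeof bom) {
        return false;  // too short to carry the signature
    }
    for (std::size_t i = 0; i < sizeof bom; ++i) {
        // Compare as unsigned so the result does not depend on whether
        // plain char is signed on the target.
        if (static_cast<unsigned char>(bytes[i]) != bom[i]) {
            return false;
        }
    }
    return true;
}
```

A file without the signature tells such a tool nothing either way, which
is why the presence of the marker alone cannot distinguish unmarked
UTF-8 from unmarked non-UTF-8 sources.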

I see the goal as being to free up library developers to use UTF-8
without restraint. Yes, many already are.

>
> It's far easier to simply keep the source code 7-bit US-ASCII clean instead
> and cater to all non-IBM needs.
Yes, but 7-bit US-ASCII clean isn't a particularly attractive future
direction. Standards support for UTF-8 (in some way) would allow us to,
in limited ways, break constraints on the basic source character set
without having to rely on universal-character-names. This would help to
reduce the scope of MISRA and Autosar compliance rules that forbid use
of characters outside the basic source character set. Additionally, in
conjunction with P1949 <https://wg21.link/p1949>, this could set the
stage for adopting additional characters as operators in the future.
>
> EBCDIC is a completely different story. EBCDIC is not widespread at all and it
> also has a completely different solution, because mojibake is not possible. A
> compiler failing to make the correct decision on EBCDIC- vs ASCII-based
> encoding will encounter hard, syntax errors. But this is a very limited
> environment, where I expect recoding and other marker techniques are already
> in use. It's also the only environment that retains trigraphs and I say that,
> like them, we shouldn't all be made to suffer -- "the needs of the many".
>
I agree, and that may be a reason for the standard to simply mandate
support for UTF-8 via some implementation-defined means.

Tom.

Received on 2020-10-18 11:30:11