sg16: Re: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature

From: Alisdair Meredith <alisdairm_at_[hidden]>
Date: Sun, 18 Oct 2020 17:18:53 -0400

There are a variety of problems and assumptions coupled together
throughout this thread - so I am going to try to tease apart my own
assumptions and concerns, to raise some ideas I have not heard
addressed.

My assumption is that any given environment will assume a default
encoding for all source files, potentially with an override on the
command line to change the universal default for that whole invocation.

Problems arise when a source file does not match the assumed encoding.

Problems are significantly harder when multiple files are involved, typically
headers, that may vary in their encoding (although this is impractical
enough that it should rarely happen).

Modules go a long way to solving some of these issues, as while all the
issues remain for building a module, there are no text encoding concerns
when importing one.

Mark-up in a source file is superficially appealing, but suffers from the
need to know the encoding to read that markup *before* reading the
file, at which point the markup is redundant.

In a world where all source homogeneously uses the same encoding, the
only issue is telling the tool chain which default encoding to use.

It strikes me that in a mixed encoding world, for practical purposes,
whole subsystems from a single supplier should be homogeneous, using
a single encoding. The practical problem comes from combining source files
(headers) from multiple vendors in a single compile. Specifically, I see
the problem as #include-ing a file with the “wrong” encoding.

Therefore, I wonder if we have looked into adding markup to #include
to specify a foreign encoding for the file that it is about to include? This
seems the appropriate place to specify a transition to a different encoding
without requiring magic guesses at the start of phase 1.
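
As a rough illustration of the idea only, the lookup a tool would perform
might be modelled as below. This is a sketch under assumptions: the names
IncludeDirective and effective_encoding are hypothetical, no concrete
spelling for the markup is implied, and the rule that an unmarked #include
inherits the encoding of the including file is my assumption, not something
settled in this thread.

```cpp
#include <optional>
#include <string>

// Hypothetical model of an #include that may carry an encoding annotation.
// No concrete syntax for the annotation is assumed here.
struct IncludeDirective {
    std::string path;
    // Empty when the directive carries no annotation.
    std::optional<std::string> foreign_encoding;
};

// Pick the encoding used to decode the included file.
// Assumption: an unannotated #include inherits the encoding of the file
// that contains the directive; an annotated one switches to the named
// encoding, so no guessing is needed at the start of phase 1.
inline std::string effective_encoding(const IncludeDirective& inc,
                                      const std::string& including_file_encoding) {
    return inc.foreign_encoding.value_or(including_file_encoding);
}
```

With a UTF-8 default, an unmarked include stays UTF-8, while one marked as
"windows-1252" is decoded as windows-1252, along with (under the assumption
above) anything that header includes in turn.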

AlisdairM

> On Oct 18, 2020, at 12:44, Thiago Macieira via SG16 <sg16_at_[hidden]> wrote:
>
> On Sunday, 18 October 2020 09:29:59 PDT Tom Honermann wrote:
>> I see it a little differently. I think the marker is most useful to enable
>> use of UTF-8 projects by non-UTF-8 projects that can't move to UTF-8 for
>> "reasons". Such projects could use the marker in a few ways: 1) to mark
>> their own non-UTF-8 source files, 2) to mark UTF-8 source files, or 3) to
>> mark all source files. I see little motivation for the third option;
>> defaults are useful. Likewise, the second option competes with the
>> long-term direction to make UTF-8 the default (thanks to Corentin for
>> making the point that marking UTF-8 files would continue the C++ trend of
>> getting the defaults wrong), so that would not be considered a best
>> practice, but could make sense for projects that have "ANSI" or EBCDIC
>> encoded source files and complicated non-UTF-8 build/CI/CD processes, but
>> where those processes are less complicated for third party dependencies.
>> Option 1 is then left as best practice.
>
> That I agree with.
>
> To be clear what I'm agreeing with: UTF-8 source files aren't generally marked
> in any special way, but non-UTF-8 source files get a marker that indicates
> their actual encoding. That is aligned with the general direction of not using
> BOM and of defaulting to UTF-8 unless otherwise instructed.
>
> Corollary:
>
> Said marker must be a no-op for backwards compatibility as we're talking about
> tools and environments that "for reasons" can't update to tools that can handle
> UTF-8 properly. That means we can't have a C++ attribute:
> [[encoding: cp1252]]
>
> Ditto for a pragma:
>
> #pragma encoding "cp1252"
>
> That leaves out-of-band markers (filesystem metadata) or specially-formatted
> comments.
> --
> Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
> Software Architect - Intel DPG Cloud Engineering

Received on 2020-10-18 16:18:59