sg16: Re: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature

From: Tom Honermann <tom_at_[hidden]>
Date: Sun, 18 Oct 2020 22:30:33 -0400

On 10/18/20 5:18 PM, Alisdair Meredith wrote:
> There are a variety of problems and assumptions coupled together
> throughout this thread - so I am going to try to tease apart my own
> assumptions and concerns, to raise some ideas I have not heard
> addressed.
>
> My assumption is that any given environment will assume a default
> encoding for all source files, potentially with an override on the
> command line to change the universal default for that whole invocation.
>
> Problems arise when a source file does not match the assumed encoding.
>
> Problems are significantly harder when multiple files are involved, typically
> headers, that may vary in their encoding (although this is impractical
> enough that it should rarely happen).
>
> Modules go a long way to solving some of these issues, as while all the
> issues remain for building a module, there are no text encoding concerns
> when importing one.
I agree with all the above.
> Mark-up in a source file is superficially appealing, but suffers from the
> need to know the encoding to read that markup *before* reading the
> file, at which point the markup is redundant.
This problem exists, but is easily surmounted. HTML and Python both
support encoding declarations and address this problem by limiting where
the encoding declaration can appear (in the first 1024 bytes for HTML
and in the first or second line for Python). Preceding text can be
limited to members of the basic source character set, at least for
characters outside of comments. For almost all implementations,
assuming ASCII when scanning for an encoding declaration will suffice;
complications only arise for EBCDIC environments and even there, an
ASCII scan could be performed, perhaps only if scanning/lexing fails
with EBCDIC.
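
As a rough illustration of how cheap such a scan can be, here is a minimal
sketch in C++17. It assumes a Python-style "coding:" or "coding=" marker
inside a // comment on the first or second line, and it treats the bytes as
ASCII. The marker syntax and the name find_encoding_declaration are
hypothetical; nothing like this has been proposed or implemented.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: scan at most the first two lines of a source file for
// a "coding:" or "coding=" marker inside a // comment and return the encoding
// name that follows it. The bytes are examined as ASCII only.
std::optional<std::string> find_encoding_declaration(std::string_view bytes) {
    auto is_name_char = [](unsigned char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
               (c >= '0' && c <= '9') || c == '-' || c == '_' || c == '.';
    };

    std::size_t pos = 0;
    for (int line = 0; line < 2 && pos < bytes.size(); ++line) {
        // Isolate the current line (without its terminating '\n').
        std::size_t eol = bytes.find('\n', pos);
        if (eol == std::string_view::npos) eol = bytes.size();
        std::string_view text = bytes.substr(pos, eol - pos);
        pos = eol + 1;

        // The marker only counts when it appears after a // comment opener.
        std::size_t comment = text.find("//");
        if (comment == std::string_view::npos) continue;
        std::size_t key = text.find("coding", comment + 2);
        if (key == std::string_view::npos) continue;

        // Require ':' or '=' directly after "coding", then skip blanks.
        std::size_t i = key + 6;
        if (i >= text.size() || (text[i] != ':' && text[i] != '=')) continue;
        ++i;
        while (i < text.size() && (text[i] == ' ' || text[i] == '\t')) ++i;

        // Collect the encoding name.
        std::size_t start = i;
        while (i < text.size() &&
               is_name_char(static_cast<unsigned char>(text[i]))) ++i;
        if (i > start) return std::string(text.substr(start, i - start));
    }
    return std::nullopt;  // No declaration found: use the default encoding.
}
```

A file that starts with "// -*- coding: cp1252 -*-" would yield "cp1252". A
marker on the third line or later is ignored, which is what keeps the scan
bounded.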
>
> In a world where all source homogeneously uses the same encoding, the
> only issue is telling the tool chain which default encoding to use.
>
> It strikes me that in a mixed encoding world, for practical purposes,
> whole subsystems from a single supplier should be homogeneous, using
> a single encoding. The practical problem comes from combining source files
> (headers) from multiple vendors in a single compile. Specifically, I see
> the problem is with #include-ing a file with the “wrong” encoding.
Yes, agreed.
>
> Therefore, I wonder if we have looked into adding a markup for #include
> to specify a foreign encoding for the file that it is about to include? This
> seems the appropriate place to specify a transition to a different encoding
> without requiring magic guesses at the start of phase 1.

I have considered this, but it creates additional complexities. Consider
a UTF-8 project with headers 'x.h' and 'y.h' where 'x.h' has a #include
for 'y.h'. The normal #include is desired there because use of a
#include_utf8 would be a case of the wrong default. Now consider that
project being used by a Windows-1252 project in which 'x.h' is included
by 'a.h'. If 'a.h' uses #include_utf8 to include 'x.h', what happens
when 'x.h' includes 'y.h'? Is the encoding applied transitively? If
so, when does it stop?

I think a better solution in this vein would be for an implementation to
support associating an encoding with an include path. Include paths are
outside the scope of the standard, so this isn't something that could be
standardized, but it could be used as part of an implementation-defined
mechanism used to enable mixed encoding support.
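
To make the idea concrete, here is a minimal sketch in C++17 of what such a
lookup might look like. Everything in it is an assumption made for
illustration: the names IncludeDirEncoding and encoding_for, the
representation of paths as normalized '/'-separated strings without a
trailing slash, and the rule that the deepest matching include directory
wins. No existing compiler option is being described.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical association of an include directory with an encoding name.
struct IncludeDirEncoding {
    std::string dir;       // e.g. "third_party/utf8lib" (no trailing '/')
    std::string encoding;  // e.g. "utf-8"
};

// True if 'file' lies somewhere below directory 'dir'.
bool is_under(std::string_view dir, std::string_view file) {
    return file.size() > dir.size() &&
           file.substr(0, dir.size()) == dir &&
           file[dir.size()] == '/';
}

// Pick the encoding for a file from the deepest include directory that
// contains it; fall back to the default encoding when none matches.
std::string encoding_for(std::string_view file,
                         const std::vector<IncludeDirEncoding>& dirs,
                         std::string_view default_encoding) {
    std::string_view best = default_encoding;
    std::size_t best_len = 0;
    bool found = false;
    for (const auto& entry : dirs) {
        if (is_under(entry.dir, file) &&
            (!found || entry.dir.size() > best_len)) {
            best = entry.encoding;
            best_len = entry.dir.size();
            found = true;
        }
    }
    return std::string(best);
}
```

With "third_party/utf8lib" mapped to "utf-8", both 'x.h' and 'y.h' in that
directory are read as UTF-8 no matter which file includes them, and 'a.h' in
the Windows-1252 project keeps the default. The encoding follows the
location of the file rather than the #include directive, so the question of
transitivity does not arise.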

Tom.

>
> AlisdairM
>
>> On Oct 18, 2020, at 12:44, Thiago Macieira via SG16 <sg16_at_[hidden]> wrote:
>>
>> On Sunday, 18 October 2020 09:29:59 PDT Tom Honermann wrote:
>>> I see it a little differently. I think the marker is most useful to enable
>>> use of UTF-8 projects by non-UTF-8 projects that can't move to UTF-8 for
>>> "reasons". Such projects could use the marker in a few ways; 1) to mark
>>> their own non-UTF-8 source files, 2) to mark UTF-8 source files, or 3) to
>>> mark all source files. I see little motivation for the third option;
>>> defaults are useful. Likewise, the second option competes with the
>>> long-term direction to make UTF-8 the default (thanks to Corentin for
>>> making the point that marking UTF-8 files would continue the C++ trend of
>>> getting the defaults wrong), so that would not be considered a best
>>> practice, but could make sense for projects that have "ANSI" or EBCDIC
>>> encoded source files and complicated non-UTF-8 build/CI/CD processes, but
>>> where those processes are less complicated for third party dependencies.
>>> Option 1 is then left as best practice.
>> That I agree with.
>>
>> To be clear what I'm agreeing with: UTF-8 source files aren't generally marked
>> in any special way, but non-UTF-8 source files get a marker that indicates
>> their actual encoding. That is aligned with the general direction of not using
>> BOM and of defaulting to UTF-8 unless otherwise instructed.
>>
>> Corollary:
>>
>> Said marker must be a no-op for backwards compatibility as we're talking about
>> tools and environment that "for reasons" can't update to tools that can handle
>> UTF-8 properly. That means we can't have a C++ attribute:
>> [[encoding: cp1252]]
>>
>> Ditto for a pragma:
>>
>> #pragma encoding "cp1252"
>>
>> That leaves out-of-band markers (filesystem metadata) or specially-formatted
>> comments.
>> --
>> Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
>> Software Architect - Intel DPG Cloud Engineering

Received on 2020-10-18 21:30:38