sg16: Re: [SG16] On the character encoding of diagnostic text

From: Steve Downey <sdowney_at_[hidden]>
Date: Thu, 3 Sep 2020 21:04:07 -0400

This is also generally true. We cannot mandate that output be sensible.
There is no way to ensure that the octets received will be
interpreted correctly, or even to ask how they will be interpreted.

The most we can ask of output is that the octets pushed out are the ones
expected.
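
To make that concrete, here is a minimal sketch of checking the octets a
program is about to push out. The helper name hex_octets is hypothetical, and
the sample text is spelled with explicit hex escapes on the assumption that we
want UTF-8 regardless of the implementation's literal encoding. It says
nothing about how a terminal or IDE will interpret those octets afterwards.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: render each octet of `bytes` as two uppercase hex
// digits separated by spaces, so the exact bytes to be written can be
// inspected independently of how any terminal would display them.
std::string hex_octets(std::string_view bytes) {
    static constexpr char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : bytes) {
        if (!out.empty()) out += ' ';
        out += digits[c >> 4];    // high nibble
        out += digits[c & 0x0F];  // low nibble
    }
    return out;
}
```

For example, hex_octets("caf\xC3\xA9") yields "63 61 66 C3 A9". Whether the
receiver shows the last two octets as U+00E9 or as two unrelated characters is
outside the program's control.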

I think it is worth noting, because it is surprising, that this is terribly
difficult in places where we are mandating "should" or "shall", and worth
making sure the treaty between compiler implementors and users is something
everyone understands.

I am also mindful of the recent discussion of assignments to string from
int. The C++ standard is not the place to mandate warnings. I suspect that
the C standard is not either, given how many more C implementations there
are.

I think the existing wording is overly ambitious, and that, given someone
wrote an official paper, is sufficient reason to reconsider. We should not get
ambitious, though. A "diagnostic character set" is overkill and uncalled
for.

This might be something SG15 could pick up, but I would want to defer that
until after modules, lest we further embarrass ourselves.

On Thu, Sep 3, 2020, 19:00 JF Bastien via SG16 <sg16_at_[hidden]>
wrote:

>
>
> On Thu, Sep 3, 2020 at 3:36 PM Tom Honermann via SG16 <
> sg16_at_[hidden]> wrote:
>
>> On 9/2/20 5:33 AM, Peter Brett via SG16 wrote:
>>
>> Hi all,
>>
>>
>>
>> We allow Unicode identifiers (if/when P1949 is adopted, UAX31
>> identifiers). Implementations will therefore need to have a mechanism for
>> communicating those identifiers to the user via their diagnostics. Let us
>> assume that such mechanism exists as a necessary implementation detail of
>> any reasonable C++ implementation.
>>
>>
>>
>> By the point at which static_assert() is evaluated, its string argument
>> will have already been converted to its associated implementation-defined
>> literal encoding. For some implementations, this may be a lossy conversion.
>>
>>
>>
>> I am wary of mandating specific handling of this in the standard because
>> the way in which diagnostics are communicated to the user seems to be
>> something that really should be a quality of implementation issue.
>>
>>
>>
>> If we were to adjust the standard, then the adjustment should not
>> preclude constexpr computation of the static_assert message, in
>> anticipation of reflection making it possible to format type names into it
>> with constexpr std::format.
>>
>>
>>
>> static_assert(std::is_base_of_v<MyBase, Arg>,
>>               std::format("Cannot my_cast to {} because {} is not derived from MyBase",
>>                           /* reflection expressions here */));
>>
>>
>>
>> We must not confuse 2 separate concerns:
>>
>>
>>
>> 1. Whether implementations correctly process strings from their
>> internal representation for display in diagnostic messages
>> 2. Whether implementations correctly handle situations in which the
>> literal encoding and the encoding required for displaying diagnostic
>> messages are different.
>>
>>
>>
>> I am strongly opposed to a solution that restricts static_assert()
>> messages to the basic source character set.
>>
>> I would also be strongly opposed to a solution that prohibits characters
>> outside of the basic source character set, but I think it would be
>> reasonable to specify that characters outside the basic source character
>> set may be subject to substitution (potentially lossy), presentation in
>> non-glyph form (as a UCN), or perhaps even dropped (mildly opposed). Is
>> that a view point that you could support?
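
As an illustration of the "non-glyph form (as a UCN)" option, here is a
minimal sketch rather than anything an existing compiler is known to do. The
names append_ucn and escape_non_ascii are hypothetical. It assumes the message
is held as UTF-8, and it simplifies "outside the basic source character set"
to "outside ASCII". It does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical helper: append `cp` as a universal-character-name,
// \uXXXX for code points up to U+FFFF and \UXXXXXXXX otherwise.
void append_ucn(std::string& out, std::uint32_t cp) {
    static constexpr char digits[] = "0123456789ABCDEF";
    const int width = (cp <= 0xFFFF) ? 4 : 8;
    out += (width == 4) ? "\\u" : "\\U";
    for (int shift = (width - 1) * 4; shift >= 0; shift -= 4)
        out += digits[(cp >> shift) & 0xF];
}

// Hypothetical helper: copy ASCII through unchanged and rewrite every other
// code point of a UTF-8 string as a UCN. Malformed sequences are reported
// as U+FFFD.
std::string escape_non_ascii(std::string_view utf8) {
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        ++i;
        if (lead < 0x80) {
            out += static_cast<char>(lead);
            continue;
        }
        // Number of continuation bytes and payload bits from the lead byte.
        int extra = 0;
        std::uint32_t cp = 0;
        if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
        bool ok = extra != 0;  // false for a stray continuation or invalid lead
        for (int k = 0; ok && k < extra; ++k) {
            if (i >= utf8.size()) { ok = false; break; }
            const unsigned char next = static_cast<unsigned char>(utf8[i]);
            if ((next & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (next & 0x3F);
            ++i;
        }
        append_ucn(out, ok ? cp : 0xFFFD);
    }
    return out;
}
```

With this sketch, escape_non_ascii("caf\xC3\xA9") produces the ten characters
caf\u00E9, which survive any ASCII-compatible output path at the cost of
readability.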
>>
>>
> I don't think we should specify what happens to the string. Rather we
> should specify what kind of string literals are accepted (and I'd accept
> any valid string literal).
>
> First, what happens to diagnostics is outside the abstract machine, we
> don't legislate that. Second, it's not the source character set nor is it
> the execution one. What I mean by this is that the source character set is
> what the compiler consumes and my editor shows, but diagnostics are what my
> shell shows (that's not the compiler, nor is it the editor), but it can be
> in an IDE. Imagine that I run clang in my favorite PDP-11 shell emulator...
> clang might be nice to check if Unicode is supported and then escape what's
> not supported, but does the Standard need to say anything? Now imagine I
> pipe stderr to /dev/null, have I now made my compiler non-conformant? What
> if I pipe it to a file? I can't see it unless I open the file... is it
> still conformant? What about diagnostics in an IDE, where I only see
> diagnostics for the code currently open in the IDE, the others are
> "hidden". Say the IDE colors the diagnostics, is it still conformant? etc.
> The Standard doesn't care about any of this, it's not useful for us to
> care, let's not say anything. Trying to say something is legislating away
> implementation freedom, let's just trust that implementations aren't
> adversarial and they're actually trying to help users.
>
>
>
>> Tom.
>>
>>
>>
>> Best regards,
>>
>>
>>
>> Peter
>>
>>
>>
>>
>>
>> *From:* SG16 <sg16-bounces_at_[hidden]> *On Behalf Of* Martinho Fernandes via
>> SG16
>> *Sent:* 01 September 2020 18:16
>> *To:* sg16_at_[hidden]
>> *Cc:* Martinho Fernandes <rmf_at_[hidden]>
>> *Subject:* Re: [SG16] On the character encoding of diagnostic text
>>
>>
>>
>> On Tue, Sep 1, 2020 at 7:05 PM Aaron Ballman via SG16 <
>> sg16_at_[hidden]> wrote:
>>
>> On Tue, Sep 1, 2020 at 12:08 PM Alisdair Meredith via SG16
>> <sg16_at_[hidden]> wrote:
>> >
>> > For a cross compiler, the basic execution character set should
>> correspond to the target platform, but the diagnostics character set should
>> be for the host?
>>
>> That matches my understanding.
>>
>> I suppose a question I could add is whether anyone would like to see a
>> new character set introduced for diagnostics. My intuition is that it
>> would be a pretty heavy hammer to bring to bear and that the basic
>> source character set is probably Good Enough (tm).
>>
>>
>>
>> Wouldn't these diagnostics be the place people are more likely to use
>> non-basic source characters, though? When it comes to identifiers people
>> will sometimes compromise and restrict themselves and e.g. avoid
>> diacritics, but in error messages I feel like it makes a lot more sense to
>> want to write with the full expression of their native script.
>>
>>
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>
>

Received on 2020-09-03 20:07:48