sg16: Re: [SG16] String literals and diagnostics

From: Corentin <corentin.jabot_at_[hidden]>
Date: Mon, 25 Jan 2021 23:11:01 +0100

On Mon, Jan 25, 2021 at 11:06 PM JF Bastien <cxx_at_[hidden]> wrote:

> Your comment makes me realize that my thinking would probably make more
> sense if I shared its motivation: as a developer writing C++ code, I would
> like to be able to report facts to developers through static_assert /
> deprecated / etc. These facts are usually in ASCII, but some folks use
> other languages and supporting Unicode languages therefore makes sense.
> Those facts might be more nicely displayed with newlines. Those facts don't
> need control characters, even if color, blink, and changing the console's
> window title would be funny. I (as a C++ programmer, not committee member)
> would like this to be uniform between implementations.
>
> This is why I proposed the approach I did: as a user I don't really care
> about phases, QoI, etc. I think it's a useful framing for this problem,
> because agreeing on it helps figure out whether our solution actually helps
> people :)
>

I'm trying to understand what you actually need, i.e. whether it is:

1/ I want the compiler to escape all control/invisible characters in the
diagnostic message
2/ If an escape sequence appears verbatim in a string (e.g. "\u0042\x0042\t"),
then it should appear exactly like that in the diagnostic message, even if
those escape sequences represent displayable characters
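
To make the difference between 1/ and 2/ concrete, here is a minimal sketch. It assumes a hypothetical helper, escape_controls, that is not taken from any compiler: it escapes only C0 control bytes and DEL, passes every other byte through untouched (so UTF-8 sequences survive), and ignores C1 controls and invisible Unicode characters. A raw string literal stands in for the "source spelling" of option 2/.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical helper illustrating option 1/: escape control characters in
// an already-cooked message. Bytes >= 0x80 are copied as-is so that UTF-8
// multi-byte sequences are preserved (a simplification: C1 controls and
// invisible Unicode characters are not detected here).
std::string escape_controls(std::string_view msg) {
    std::string out;
    for (unsigned char c : msg) {
        if (c < 0x20 || c == 0x7F) {
            char buf[5];  // backslash, 'x', two hex digits, terminator
            std::snprintf(buf, sizeof buf, "\\x%02X", static_cast<unsigned>(c));
            out += buf;
        } else {
            out += static_cast<char>(c);
        }
    }
    return out;
}

void check_examples() {
    // Option 1/: the escapes were replaced when the literal was cooked, so
    // \x42 is already 'B'; only the tab is re-escaped for display.
    assert(escape_controls("B\x42\t") == "BB\\x09");
    // Option 2/: the source spelling is kept as written. The raw literal
    // models that spelling: six characters, none of them a control character.
    assert(std::string(R"(\x42\t)") == "\\x42\\t");
}
```

With 1/ the displayable escape is lost (the reader sees "BB"), while with 2/ the message shows the escape sequences exactly as typed. Note that this sketch also escapes newlines, which would have to be exempted if newlines are meant to be supported.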

>
>
> On Mon, Jan 25, 2021 at 1:59 PM Corentin <corentin.jabot_at_[hidden]> wrote:
>
>>
>>
>> On Mon, Jan 25, 2021 at 10:51 PM JF Bastien <cxx_at_[hidden]> wrote:
>>
>>> I will offer this most wonderful example as a test suite for the
>>> discussion: https://twitter.com/jfbastien/status/1298307325443231744
>>> (sorry not sorry that I tweet randomly bad C++ as bait for y'all)
>>>
>>> I believe that most printable characters should be preserved by the
>>> compiler and printed as-is, Unicode included. I would escape control
>>> characters, null terminators, be careful around RTL but support it, and
>>> support newlines too (because why not?).
>>>
>>
>> I think most of that is QoI - specifically how non-printable characters
>> and control characters are displayed.
>> Not replacing escape sequences however is probably problematic.
>>
>> First, is it the behavior we want for static_assert?
>> And if so, do we then want to do something different in static_assert vs
>> attributes? (the two places where we currently have diagnostics after
>> phase 3, the latter of which might become program observable through
>> reflection)
>> So I think the use case for "not replacing escape sequences" should either
>> be super motivated, or QoI?
>>
>> Otherwise I agree
>>
>>
>>>
>>> I think clang's current behavior is pretty much what I'd want, except
>>> for newlines.
>>>
>>> <chair-hat>As Jens said, please send to EWG once SG16 is
>>> happy.</chair-hat>
>>>
>>> On Mon, Jan 25, 2021 at 5:02 AM Corentin via SG16 <sg16_at_[hidden]>
>>> wrote:
>>>
>>>> Hello SG16.
>>>> Following last week's discussion on diagnostic messages, I would like
>>>> to come back to the topic.
>>>>
>>>> What follows specifically excludes the preprocessor, and has no bearing
>>>> on Aaron's paper which I think is strictly a bug fix of a status quo.
>>>>
>>>> For those who are just joining us, the question at hand is: given
>>>> static_assert(foo, "message");, what is the encoding of "message"?
>>>>
>>>> Of course, we know that after phase 1, it will be utf-8 (or otherwise
>>>> representable in utf-8), and upon being displayed or otherwise written out
>>>> somewhere, it will be converted to something the compiler deems suitable
>>>> for that.
>>>>
>>>> So the question really is: is there an intermediate step wherein the
>>>> string is converted to the execution encoding in phase 5?
>>>> There is currently nothing in the standard that says that does not
>>>> happen; all string-literals presumably go through phase 5.
>>>>
>>>> And so, the status quo leads to implementation divergence such that a
>>>> fix is needed: GCC does the useful thing while MSVC/ICC do the
>>>> standard-conforming thing: https://godbolt.org/z/MEsbY5
>>>>
>>>>
>>>> What the correct behavior would be is slightly less clear, and future
>>>> evolutions make it more complicated.
>>>>
>>>> There is a good argument for saying that everything that looks like a
>>>> string literal is a string literal and therefore, static_assert and
>>>> attribute parameters should go through phase 5, and then be converted from
>>>> the execution encoding to whatever encoding the compiler uses for
>>>> diagnostic purposes.
>>>> This has some interesting ramifications, i.e. static_assert(false, "😀")
>>>> might be ill-formed if the string cannot be encoded to the execution
>>>> encoding in phase 5 (or it might do character replacement in phase 5 under
>>>> the current rules, which is what MSVC does with regular string literals)
>>>> We could then allow static_assert(false, u8"😀") to avoid the above
>>>> issues.
>>>> This first solution has the very clear advantage that it makes the
>>>> model very simple, "" is an execution-encoding encoded string literal, u8""
>>>> is utf-8.
>>>>
>>>> The opposite argument of course is that forcing people to prefix
>>>> everything by u8"" is a bit hostile, and, as it is a departure from the
>>>> current behavior, would break code.
>>>>
>>>>
>>>> We also need to consider possible evolutions of the language, notably
>>>> * diagnostic or compiler output constructed from constant expressions
>>>> at compile time wg21.link/p0596r1
>>>> * reflection on attributes https://wg21.link/p1887r1
>>>> * attributes taking constant-expression parameters, although I don't
>>>> know if that has been proposed
>>>>
>>>> So, we can imagine something like
>>>>
>>>> static_assert(false, std::format(...)); which would be neat indeed.
>>>> At this point, we would be very much past phase 5 and it becomes
>>>> critical to have a good model indeed.
>>>>
>>>> I also would like to point out that, at compile time, there is never,
>>>> everything else being equal, a good reason to prefer the execution encoding
>>>> over utf-8.
>>>>
>>>> Given these observations and constraints I think a possible, pragmatic
>>>> and simple course of action would be
>>>>
>>>>
>>>> - Redefine deprecated, nodiscard, static_assert, etc. to take a new
>>>> grammar, say "diagnostic-string-literal", which would follow all the rules
>>>> of string literals (concatenation, escape sequences and so forth), but would
>>>> NOT be converted to the execution encoding at any point. Note that this
>>>> does not introduce a new encoding, things stay utf-8.
>>>> - In the future, static_assert and attributes can accept other
>>>> forms which would take constant expressions of u8string_view (or so I hope,
>>>> see wg21.link/p1953r0). Because all of these things require compiler
>>>> support anyway, parsing has no ambiguity.
>>>> - In this model, reflecting on [[deprecated("foo")]] would give a
>>>> utf-8 string back, because we decided to make these strings magic for
>>>> convenience and backward compatibility
>>>>
>>>>
>>>>
>>>> The alternative solution (which is less pragmatic) would be:
>>>>
>>>> - Allow u8 string literals in attributes and static_assert in
>>>> addition to string literals
>>>> - Pass everything through phase 5, always
>>>>
>>>> That second solution, being a breaking change, would require EWG input.
>>>> Its sole benefit is to make the model brutally consistent, which would not
>>>> be without value either.
>>>>
>>>> I'm planning to put all of that in a paper but I would like to hear
>>>> your thoughts before doing so.
>>>>
>>>> Thanks,
>>>> Have a great week,
>>>>
>>>> Corentin
>>>>
>>>> --
>>>> SG16 mailing list
>>>> SG16_at_[hidden]
>>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>>
>>>

Received on 2021-01-25 16:11:14