sg16: Re: [SG16] String literals and diagnostics

From: JF Bastien <cxx_at_[hidden]>
Date: Mon, 25 Jan 2021 14:06:41 -0800

Your comment makes me realize that my thinking would probably make more
sense if I shared its motivation: as a developer writing C++ code, I would
like to be able to report facts to developers through static_assert /
deprecated / etc. These facts are usually in ASCII, but some folks use
other languages and supporting Unicode languages therefore makes sense.
Those facts might be more nicely displayed with newlines. Those facts don't
need control characters, even if color, blink, and changing the console's
window title would be funny. I (as a C++ programmer, not committee member)
would like this to be uniform between implementations.
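
A minimal sketch of the kind of messages in question, for concreteness. The
names `fits_in_byte` and `old_fits` are hypothetical and exist only for
illustration; nothing here is taken from a real codebase or from any proposal.

```cpp
// Hypothetical helper, named for illustration only.
constexpr bool fits_in_byte(int value) { return value >= 0 && value <= 255; }

// Plain ASCII message: the compiler shows it only if the condition is false.
static_assert(fits_in_byte(200), "value must fit in one byte");

// Non-ASCII message (assumes a UTF-8 source file).
static_assert(fits_in_byte(0), "la valeur doit être positive");

// Message containing an escape sequence for a newline.
static_assert(fits_in_byte(255), "first line\nsecond line");

// Attribute message: shown as a warning wherever old_fits is used.
[[deprecated("use fits_in_byte instead")]]
constexpr bool old_fits(int value) { return fits_in_byte(value); }
```

Every condition above holds, so this compiles silently. Flipping one to false
is what makes the compiler print the message, and how the accented character
and the "\n" are then rendered is where implementations may differ.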

This is why I proposed the approach I did: as a user I don't really care
about phases, QoI, etc. I think it's a useful framing for this problem,
because agreeing on it helps figure out whether our solution actually helps
people :)

On Mon, Jan 25, 2021 at 1:59 PM Corentin <corentin.jabot_at_[hidden]> wrote:

>
>
> On Mon, Jan 25, 2021 at 10:51 PM JF Bastien <cxx_at_[hidden]> wrote:
>
>> I will offer this most wonderful example as a test suite for the
>> discussion: https://twitter.com/jfbastien/status/1298307325443231744
>> (sorry not sorry that I tweet randomly bad C++ as bait for y'all)
>>
>> I believe that most printable characters should be preserved by the
>> compiler and printed as-is, Unicode included. I would escape control
>> characters, null terminators, be careful around RTL but support it, and
>> support newlines too (because why not?).
>>
>
> I think most of that is QoI, specifically how non-printable characters
> and control characters are displayed.
> Not replacing escape sequences however is probably problematic.
>
> First, is it the behavior we want for static_assert?
> And if so, do we then want to do something different in static_assert vs
> attributes? (the two places where we currently have diagnostics after
> phase 3, the latter of which might become program observable through
> reflection)
> So I think the use case for "not replacing escape sequences" should either
> be super motivated, or QoI?
>
> Otherwise I agree
>
>
>>
>> I think clang's current behavior is pretty much what I'd want, except for
>> newlines.
>>
>> <chair-hat>As Jens said, please send to EWG once SG16 is
>> happy.</chair-hat>
>>
>> On Mon, Jan 25, 2021 at 5:02 AM Corentin via SG16 <sg16_at_[hidden]>
>> wrote:
>>
>>> Hello SG16.
>>> Following last week's discussion on diagnostic messages, I would like to
>>> come back to the topic.
>>>
>>> What follows specifically excludes the preprocessor, and has no bearing
>>> on Aaron's paper which I think is strictly a bug fix of a status quo.
>>>
>>> For those who are just joining us, the question at hand is: given
>>> static_assert(foo, "message");, what is the encoding of "message"?
>>>
>>> Of course, we know that after phase 1, it will be utf-8 (or otherwise
>>> representable in utf-8), and upon being displayed or otherwise written out
>>> somewhere, it will be converted to something the compiler deems suitable
>>> for that.
>>>
>>> So the question really is: is there an intermediate step wherein the
>>> string is converted to the execution encoding in phase 5?
>>> There is currently nothing in the standard that says that does not
>>> happen; all string-literals presumably go through phase 5.
>>>
>>> And so, the status quo leads to implementation divergence such that a
>>> fix is needed: GCC does the useful thing while MSVC/ICC do the
>>> standard-conforming thing: https://godbolt.org/z/MEsbY5
>>>
>>>
>>> What would be the correct behavior is slightly less clear and future
>>> evolutions make it more complicated.
>>>
>>> There is a good argument for saying that everything that looks like a
>>> string literal is a string literal and therefore, static_assert and
>>> attribute parameters should go through phase 5, and then be converted from
>>> the execution encoding to whatever encoding the compiler uses for
>>> diagnostic purposes.
>>> This has some interesting ramifications, i.e., static_assert(false, "😀")
>>> might be ill-formed if the string cannot be encoded to the execution
>>> encoding in phase 5 (or it might do character replacement in phase 5 under
>>> the current rules, which is what MSVC does with regular string literals).
>>> We could then allow static_assert(false, u8"😀") to avoid the above
>>> issues.
>>> This first solution has the very clear advantage that it makes the
>>> model very simple: "" is an execution-encoding encoded string literal, u8""
>>> is utf-8.
>>>
>>> The opposite argument of course is that forcing people to prefix
>>> everything with u8 is a bit hostile, and, as it is a departure from the
>>> current behavior, would break code.
>>>
>>>
>>> We also need to consider possible evolutions of the language, notably:
>>> * diagnostic or compiler output constructed from constant expressions at
>>> compile time wg21.link/p0596r1
>>> * reflection on attributes https://wg21.link/p1887r1
>>> * attributes using constant-expression parameters, although I don't know
>>> if that has been proposed
>>>
>>> so, we can imagine something like
>>>
>>> static_assert(false, std::format(...)); which would be neat indeed.
>>> At this point, we would be very much past phase 5 and it becomes
>>> critical to have a good model indeed.
>>>
>>> I also would like to point out that, at compile time, there is never,
>>> everything else being equal, a good reason to prefer the execution encoding
>>> over utf-8.
>>>
>>> Given these observations and constraints I think a possible, pragmatic
>>> and simple course of action would be
>>>
>>>
>>> - Redefine deprecated, nodiscard, static_assert, etc. to take a new
>>> grammar, say "diagnostic-string-literal", which would follow all the rules
>>> of string literals (concatenation, escape sequences and so forth), but would
>>> NOT be converted to the execution encoding at any point. Note that this
>>> does not introduce a new encoding, things stay utf-8.
>>> - In the future, static_assert and attributes can accept other forms
>>> which would take constant expressions of u8string_view (or so I hope, see
>>> wg21.link/p1953r0). Because all of these things require compiler support
>>> anyway, parsing has no ambiguity.
>>> - In this model, reflecting on [[deprecated("foo")]] would give a
>>> utf-8 string back, because we decided to make these strings magic for
>>> convenience and backward compatibility.
>>>
>>>
>>> The alternative solution (which is less pragmatic) would be:
>>>
>>> - Allow u8 string literals in attributes and static_assert in
>>> addition to string literals
>>> - Pass everything through phase 5, always
>>>
>>> That second solution, being a breaking change, would require EWG input.
>>> Its sole benefit is to make the model brutally consistent, which would not
>>> be without value either.
>>>
>>> I'm planning to put all of that in a paper but I would like to hear your
>>> thoughts before doing so.
>>>
>>> Thanks,
>>> Have a great week,
>>>
>>> Corentin
>>>
>>> --
>>> SG16 mailing list
>>> SG16_at_[hidden]
>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>
>>

Received on 2021-01-25 16:06:54