I will offer this most wonderful example as a test suite for the discussion: https://twitter.com/jfbastien/status/1298307325443231744
(sorry not sorry that I tweet randomly bad C++ as bait for y'all)

I believe that most printable characters should be preserved by the compiler and printed as-is, Unicode included. I would escape control characters and null terminators, be careful around RTL but support it, and support newlines too (because why not?).

I think clang's current behavior is pretty much what I'd want, except for newlines.

<chair-hat>As Jens said, please send to EWG once SG16 is happy.</chair-hat>

On Mon, Jan 25, 2021 at 5:02 AM Corentin via SG16 <sg16@lists.isocpp.org> wrote:
Hello SG16.
Following last week's discussion on diagnostic messages, I would like to come back to the topic.

What follows specifically excludes the preprocessor and has no bearing on Aaron's paper, which I think is strictly a bug fix of the status quo.

For those who are just joining us, the question at hand is: Given static_assert(foo, "message");, what is the encoding of "message"?

Of course, we know that after phase 1, it will be utf-8 (or otherwise representable in utf-8), and upon being displayed or otherwise written out somewhere, it will be converted to something the compiler deems suitable for that.

So the question really is: is there an intermediate step wherein the string is converted to the execution encoding in phase 5?
There is currently nothing in the standard that says that this does not happen; all string-literals presumably go through phase 5.

And so, the status quo leads to implementation divergence such that a fix is needed: GCC does the useful thing while MSVC/ICC do the standard-conforming thing https://godbolt.org/z/MEsbY5


What the correct behavior would be is slightly less clear, and future evolutions make it more complicated.

There is a good argument for saying that everything that looks like a string literal is a string literal and, therefore, static_assert and attribute parameters should go through phase 5 and then be converted from the execution encoding to whatever encoding the compiler uses for diagnostic purposes.
This has some interesting ramifications, i.e. static_assert(false, "😀") might be ill-formed if the string cannot be encoded to the execution encoding in phase 5 (or it might do character replacement in phase 5 under the current rules, which is what MSVC does with regular string literals).
We could then allow static_assert(false, u8"😀") to avoid the above issues.
This first solution has the very clear advantages that it makes the model very simple, "" is an execution-encoding encoded string literal, u8"" is utf-8.
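To make the "" versus u8"" distinction concrete, here is a minimal sketch. The helper names code_units and unit_at are made up for illustration. It only checks the u8 side, since the code units of a u8 literal are fixed by the standard, whereas the contents (or even the validity) of the ordinary literal "😀" depend on the execution encoding chosen by the implementation. The emoji is spelled as a universal-character-name so the example does not depend on the source file encoding.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating NUL.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// Hypothetical helper: value of code unit i as an unsigned byte.
// Works whether u8"" yields char (C++17) or char8_t (C++20).
template <typename CharT, std::size_t N>
constexpr unsigned char unit_at(const CharT (&s)[N], std::size_t i) {
    return static_cast<unsigned char>(s[i]);
}

// U+1F600 is always F0 9F 98 80 in a u8 literal, on every implementation.
static_assert(code_units(u8"\U0001F600") == 4, "four UTF-8 code units");
static_assert(unit_at(u8"\U0001F600", 0) == 0xF0, "lead byte");
static_assert(unit_at(u8"\U0001F600", 1) == 0x9F, "continuation byte");
static_assert(unit_at(u8"\U0001F600", 2) == 0x98, "continuation byte");
static_assert(unit_at(u8"\U0001F600", 3) == 0x80, "continuation byte");
```

No equivalent portable assertions can be written for the unprefixed literal, which is exactly the property that makes the first solution simple to specify but awkward for diagnostics.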

The opposite argument of course is that forcing people to prefix everything by u8"" is a bit hostile, and, as it is a departure from the current behavior, would break code.


We also need to consider possible evolutions of the language, notably 
* diagnostic or compiler output constructed from constant expressions at compile time wg21.link/p0596r1
* reflection on attributes https://wg21.link/p1887r1
* attributes using constant-expression parameters, although I don't know if that has been proposed

so, we can imagine something like

static_assert(false, std::format(...)); which would be neat indeed.
At this point, we would be very much past phase 5 and it becomes critical to have a good model indeed. 
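As a rough illustration of what "past phase 5" means, the sketch below builds a message during constant evaluation. It is not the facility proposed in p0596r1 and it does not use std::format; fixed_message, size_mismatch and equals are hypothetical names invented for this example. The point is only that the characters manipulated here are ordinary char values that already went through phase 5, so a compiler printing such a message would have to know which encoding they are in.

```cpp
#include <cstddef>

// Hypothetical fixed-capacity text buffer usable in constant expressions.
template <std::size_t Cap>
struct fixed_message {
    char data[Cap + 1] = {};
    std::size_t size = 0;

    // Append a NUL-terminated string, truncating at capacity.
    constexpr void append(const char* s) {
        while (*s != '\0' && size < Cap) data[size++] = *s++;
    }

    // Append the decimal representation of v, truncating at capacity.
    constexpr void append_number(unsigned v) {
        char digits[10] = {};
        std::size_t n = 0;
        do {
            digits[n++] = static_cast<char>('0' + v % 10);
            v /= 10;
        } while (v != 0);
        while (n > 0 && size < Cap) data[size++] = digits[--n];
    }
};

using message = fixed_message<64>;

// Builds the message entirely at compile time when used in a constant expression.
constexpr message size_mismatch(unsigned expected, unsigned actual) {
    message m;
    m.append("expected ");
    m.append_number(expected);
    m.append(" bytes, got ");
    m.append_number(actual);
    return m;
}

// Compares a built message against a NUL-terminated string.
constexpr bool equals(const message& m, const char* s) {
    for (std::size_t i = 0; i < m.size; ++i) {
        if (s[i] == '\0' || s[i] != m.data[i]) return false;
    }
    return s[m.size] == '\0';
}

static_assert(equals(size_mismatch(4, 8), "expected 4 bytes, got 8"),
              "message is computed during constant evaluation");
```

Handing the result of size_mismatch to static_assert as its message is the part that would need the language change discussed here.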

I also would like to point out that, at compile time, there is never, everything else being equal, a good reason to prefer the execution encoding over utf-8.

Given these observations and constraints I think a possible, pragmatic and simple course of action would be

  • Redefine deprecated, nodiscard, static_assert, etc. to take a new grammar, say "diagnostic-string-literal", which would follow all the rules of string literals (concatenation, escape sequences and so forth), but would NOT be converted to the execution encoding at any point. Note that this does not introduce a new encoding; things stay utf-8.
  • In the future, static_assert and attributes can accept other forms which would take constant expressions of u8string_view (or so I hope, see wg21.link/p1953r0). Because all of these things require compiler support anyway, parsing has no ambiguity.
  • In this model, reflecting on [[deprecated("foo")]] would give a utf-8 string back, because we decided to make these strings magic for convenience and backward compatibility.


The alternative solution (which is less pragmatic) would be:
  • Allow u8 string literals in attributes and static_assert in addition to ordinary string literals
  • Pass everything through phase 5, always
That second solution, being a breaking change, would require EWG input. Its sole benefit is to make the model brutally consistent, which would not be without value either.

I'm planning to put all of that in a paper but I would like to hear your thoughts before doing so.

Thanks, 
Have a great week, 

Corentin

--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16