sg16: [SG16] String literals and diagnostics

From: Corentin <corentin.jabot_at_[hidden]>
Date: Mon, 25 Jan 2021 14:01:54 +0100

Hello SG16.
Following last week's discussion on diagnostic messages, I would like to
come back to the topic.

What follows specifically excludes the preprocessor and has no bearing on
Aaron's paper, which I think is strictly a bug fix of the status quo.

For those who are just joining us, the question at hand is: given
static_assert(foo, "message");, what is the encoding of "message"?

Of course, we know that after phase 1, it will be utf-8 (or otherwise
representable in utf-8), and upon being displayed or otherwise written out
somewhere, it will be converted to something the compiler deems suitable
for that.

So the question really is: is there an intermediate step wherein the string
is converted to the execution encoding in phase 5?
There is currently nothing in the standard that says this does not happen;
all string-literals presumably go through phase 5.

And so, the status quo leads to implementation divergence such that a fix
is needed: GCC does the useful thing while MSVC/ICC do the
standard-conforming thing: https://godbolt.org/z/MEsbY5

What the correct behavior would be is slightly less clear, and future
evolutions make it more complicated.

There is a good argument for saying that everything that looks like a
string literal is a string literal and therefore, static_assert and
attribute parameters should go through phase 5, and then be converted from
the execution encoding to whatever encoding the compiler uses for
diagnostic purposes.
This has some interesting ramifications, i.e. static_assert(false, "😀")
might be ill-formed if the string cannot be encoded in the execution
encoding in phase 5 (or it might do character replacement in phase 5 under
the current rules, which is what MSVC does with regular string literals).
We could then allow static_assert(false, u8"😀") to avoid the above issues.
This first solution has the very clear advantage that it makes the model
very simple: "" is an execution-encoding encoded string literal, u8"" is
utf-8.
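
To make the "" versus u8"" distinction concrete, here is a minimal sketch, not taken from any implementation. The names utf8_size, exec_size, utf8_byte and check_utf8_bytes are hypothetical and only there for illustration. It uses a universal-character-name for U+1F600 so that the result does not depend on the source file encoding.

```cpp
#include <cassert>
#include <cstddef>

// u8 literal: always UTF-8, whatever the execution encoding is.
// U+1F600 encodes as the four bytes F0 9F 98 80, plus the terminating null.
constexpr std::size_t utf8_size = sizeof(u8"\U0001F600");
static_assert(utf8_size == 5, "u8 literals are UTF-8 regardless of the execution encoding");

// Ordinary literal: converted to the execution encoding in phase 5.
// Its size is implementation-dependent: 5 with a UTF-8 execution encoding,
// possibly something else (for example after character replacement) otherwise.
constexpr std::size_t exec_size = sizeof("\U0001F600");

// Hypothetical helper: byte i of the UTF-8 literal as an unsigned value.
// Works whether u8"" has element type char (C++17) or char8_t (C++20).
constexpr unsigned utf8_byte(std::size_t i) {
    return static_cast<unsigned char>(u8"\U0001F600"[i]);
}

// Runtime spot check of the UTF-8 code units.
inline void check_utf8_bytes() {
    assert(utf8_byte(0) == 0xF0);
    assert(utf8_byte(1) == 0x9F);
    assert(utf8_byte(2) == 0x98);
    assert(utf8_byte(3) == 0x80);
    assert(utf8_byte(4) == 0x00);
}
```

Only the u8 half is portable here. What exec_size ends up being, or whether the ordinary literal is accepted at all, depends on the execution encoding the compiler was configured with, which is exactly the problem for diagnostics.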

The opposite argument of course is that forcing people to prefix everything
with u8 is a bit hostile, and, as it is a departure from the current
behavior, would break code.

We also need to consider possible evolutions of the language, notably:
* diagnostic or compiler output constructed from constant expressions at
compile time https://wg21.link/p0596r1
* reflection on attributes https://wg21.link/p1887r1
* attributes taking constant-expression parameters, although I don't know if
that has been proposed

So we can imagine something like

static_assert(false, std::format(...)); which would be neat indeed.
At this point, we would be very much past phase 5, and it becomes critical
to have a good model.

I also would like to point out that, at compile time, there is never,
everything else being equal, a good reason to prefer the execution encoding
over utf-8.

Given these observations and constraints, I think a possible, pragmatic and
simple course of action would be:

   - Redefine deprecated, nodiscard, static_assert, etc. to take a new
   grammar term, say "diagnostic-string-literal", which would follow all the rules
   of string literals (concatenation, escape sequences and so forth), but would
   NOT be converted to the execution encoding at any point. Note that this
   does not introduce a new encoding; things stay utf-8.
   - In the future, static_assert and attributes can accept other forms
   which would take constant expressions of type u8string_view (or so I hope, see
   wg21.link/p1953r0). Because all of these things require compiler support
   anyway, parsing has no ambiguity.
   - In this model, reflecting on [[deprecated("foo")]] would give a utf-8
   string back, because we decided to make these strings magic for
   convenience and backward compatibility.
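
As a rough illustration of where such a "diagnostic-string-literal" would appear, here is a small sketch using only what compiles today. old_api and new_api are hypothetical names. Under the model above, neither message below would ever be converted to the execution encoding, while concatenation and escape sequences keep working as they do now.

```cpp
#include <cassert>
#include <climits>

// Adjacent literals are concatenated, as with ordinary string literals.
static_assert(sizeof(long long) * CHAR_BIT >= 64,
              "long long must be at least " "64 bits wide");

// Escape sequences and universal-character-names are allowed in the message.
// Hypothetical function, declared only to carry the attribute.
[[deprecated("use new_api() instead (caf\u00E9 edition)")]]
int old_api();

// Hypothetical replacement function.
int new_api() { return 42; }
```

Today, whether the \u00E9 in the deprecation message survives on its way to the terminal depends on whether the implementation routes the string through the execution encoding first.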

The alternative solution (which is less pragmatic) would be:

   - Allow u8 string literals in attributes and static_assert in addition
   to ordinary string literals
   - Pass everything through phase 5, always

That second solution, being a breaking change, would require EWG input. Its
sole benefit is to make the model brutally consistent, which would not be
without value either.

I'm planning to put all of that in a paper but I would like to hear your
thoughts before doing so.

Thanks,
Have a great week,

Corentin

Received on 2021-01-25 07:02:08