sg16: Re: [SG16] String literals and diagnostics

From: JF Bastien <cxx_at_[hidden]>
Date: Mon, 25 Jan 2021 13:51:15 -0800

I will offer this most wonderful example as a test suite for the
discussion: https://twitter.com/jfbastien/status/1298307325443231744
(sorry not sorry that I tweet randomly bad C++ as bait for y'all)

I believe that most printable characters should be preserved by the
compiler and printed as-is, Unicode included. I would escape control
characters, null terminators, be careful around RTL but support it, and
support newlines too (because why not?).
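
To make that policy concrete, here is a minimal sketch of what such a filter could look like. The function name escape_for_diagnostic is hypothetical and is not taken from clang or any other compiler; it assumes the message is already UTF-8, escapes only C0 control characters and DEL, and does nothing about RTL or C1 controls.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper (an assumption for illustration, not a compiler API).
// Passes printable bytes through unchanged, including the bytes of UTF-8
// multi-byte sequences, keeps newlines, and escapes other control bytes.
std::string escape_for_diagnostic(std::string_view utf8) {
    static constexpr char hex[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(utf8.size());
    for (unsigned char c : utf8) {
        if (c == '\n') {
            out += '\n';                  // newlines are preserved
        } else if (c < 0x20 || c == 0x7F) {
            out += "\\x";                 // NUL and other control bytes become \xHH
            out += hex[c >> 4];
            out += hex[c & 0xF];
        } else {
            out += static_cast<char>(c);  // ASCII printable or part of a UTF-8 sequence
        }
    }
    return out;
}
```

With this sketch, an embedded NUL comes out as the four characters \x00, while the UTF-8 bytes of "é" are left untouched.
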

I think clang's current behavior is pretty much what I'd want, except for
newlines.

<chair-hat>As Jens said, please send to EWG once SG16 is happy.</chair-hat>

On Mon, Jan 25, 2021 at 5:02 AM Corentin via SG16 <sg16_at_[hidden]>
wrote:

> Hello SG16.
> Following last week's discussion on diagnostic messages, I would like to
> come back to the topic.
>
> What follows specifically excludes the preprocessor, and has no bearing on
> Aaron's paper, which I think is strictly a bug fix of the status quo.
>
> For those who are just joining us, the question at hand is: given
> static_assert(foo, "message");, what is the encoding of "message"?
>
> Of course, we know that after phase 1, it will be utf-8 (or otherwise
> representable in utf-8), and upon being displayed or otherwise written out
> somewhere, it will be converted to something the compiler deems suitable
> for that.
>
> So the question really is: is there an intermediate step wherein the
> string is converted to the execution encoding in phase 5?
> There is currently nothing in the standard that says this does not happen;
> all string-literals presumably go through phase 5.
>
> And so, the status quo leads to implementation divergence such that a fix
> is needed: GCC does the useful thing while MSVC/ICC do the
> standard-conforming thing: https://godbolt.org/z/MEsbY5
>
>
> What the correct behavior would be is slightly less clear, and future
> evolutions make it more complicated.
>
> There is a good argument for saying that everything that looks like a
> string literal is a string literal and therefore, static_assert and
> attribute parameters should go through phase 5, and then be converted from
> the execution encoding to whatever encoding the compiler uses for
> diagnostic purposes.
> This has some interesting ramifications, i.e. static_assert(false, "😀")
> might be ill-formed if the string cannot be encoded in the execution
> encoding in phase 5 (or it might do character replacement in phase 5 under
> the current rules, which is what MSVC does with regular string literals).
> We could then allow static_assert(false, u8"😀") to avoid the above issues.
> This first solution has the very clear advantage that it makes the model
> very simple: "" is an execution-encoding encoded string literal, u8"" is
> utf-8.
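
The distinction between the two kinds of literal can be observed directly. The following is a small sketch, not part of the original message; the variable names are made up for illustration. It uses a universal-character-name so that the result does not depend on the encoding of the source file.

```cpp
#include <cassert>
#include <cstddef>

// U+00E9 (e with acute accent), spelled as a universal-character-name.

// Ordinary literal: encoded in the execution encoding, so its size depends on
// the implementation and its flags (for example 3 with a UTF-8 execution
// encoding, 2 with a single-byte code page such as Windows-1252).
constexpr std::size_t ordinary_size = sizeof("\u00e9");

// u8 literal: always UTF-8, so two code units plus the terminator.
constexpr std::size_t utf8_size = sizeof(u8"\u00e9");

static_assert(utf8_size == 3, "u8 literals are UTF-8 on every implementation");
```

Only the u8 size can be asserted portably; the ordinary literal's size is exactly the implementation-dependent part that the model above would expose in diagnostics.
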
>
> The opposite argument of course is that forcing people to prefix
> everything by u8"" is a bit hostile, and, as it is a departure from the
> current behavior, would break code.
>
>
> We also need to consider possible evolutions of the language, notably
> * diagnostic or compiler output constructed from constant expressions at
> compile time wg21.link/p0596r1
> * reflection on attributes https://wg21.link/p1887r1
> * attributes taking constant-expression parameters, although I don't know
> if that has been proposed
>
> so, we can imagine something like
>
> static_assert(false, std::format(...)); which would be neat indeed.
> At this point, we would be very much past phase 5 and it becomes critical
> to have a good model indeed.
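
As a rough illustration of a message "constructed from constant expressions", the sketch below builds a string at compile time, assuming C++17 or later. The helper concat and the variable names are hypothetical, and the sketch only shows that the text exists as a constant expression; it does not claim that static_assert accepts such a value as its message.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper: concatenates two string literals at compile time.
// The result keeps a single trailing NUL.
template <std::size_t N, std::size_t M>
constexpr std::array<char, N + M - 1> concat(const char (&a)[N],
                                             const char (&b)[M]) {
    std::array<char, N + M - 1> out{};
    for (std::size_t i = 0; i < N - 1; ++i) out[i] = a[i];     // a without its NUL
    for (std::size_t i = 0; i < M; ++i) out[N - 1 + i] = b[i]; // b with its NUL
    return out;
}

// The message is fully formed during constant evaluation, long after phase 5.
constexpr auto message = concat("expected 4-byte int, got ", "something else");
constexpr std::string_view message_view(message.data(), message.size() - 1);

static_assert(message_view == "expected 4-byte int, got something else",
              "compile-time concatenation produced an unexpected result");
```

Whatever encoding the char elements of such a buffer are assumed to carry is precisely what the model needs to pin down.
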
>
> I also would like to point out that, at compile time, there is never,
> everything else being equal, a good reason to prefer the execution encoding
> over utf-8.
>
> Given these observations and constraints, I think a possible, pragmatic and
> simple course of action would be:
>
>
> - Redefine deprecated, nodiscard, static_assert, etc. to take a new
> grammar, say "diagnostic-string-literal", which would follow all the rules
> of string literals (concatenation, escape sequences and so forth), but would
> NOT be converted to the execution encoding at any point. Note that this
> does not introduce a new encoding; things stay utf-8.
> - In the future, static_assert and attributes can accept other forms
> which would take constant expressions of u8string_view (or so I hope, see
> wg21.link/p1953r0). Because all of these things require compiler support
> anyway, parsing has no ambiguity.
> - In this model, reflecting on [[deprecated("foo")]] would give a utf8
> string back, because we decided to make these strings magic for
> convenience and backward compatibility.
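
For reference, these are the kinds of places where such a "diagnostic-string-literal" would appear. The sketch uses only syntax that is valid today; the function names old_api and new_api are made up for illustration.

```cpp
#include <cassert>

// The message uses an escaped quote and a universal-character-name, both of
// which follow the ordinary string-literal rules. Under the first proposal
// this text would stay UTF-8 and never be converted to the execution encoding.
[[deprecated("old_api() is obsolete; use \"new_api()\" (caf\u00e9 build)")]]
int old_api();

int new_api() { return 42; }

// Adjacent literals are concatenated, as with any other string literal.
static_assert(sizeof(int) >= 2,
              "int is narrower than 16 bits; "
              "this code assumes at least 16");
```

Nothing in this snippet inspects the message bytes at run time, which is why skipping the conversion to the execution encoding would not be observable by existing programs.
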
>
>
>
> The alternative solution (which is less pragmatic) would be:
>
> - Allow u8 string literals in attributes and static_assert in addition
> to string literals
> - Pass everything through phase 5, always
>
> That second solution, being a breaking change, would require EWG input.
> Its sole benefit is to make the model brutally consistent, which would not
> be without value either.
>
> I'm planning to put all of that in a paper but I would like to hear your
> thoughts before doing so.
>
> Thanks,
> Have a great week,
>
> Corentin
>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2021-01-25 15:51:28