Hi JF,

Banning unescaped control characters entirely sounds easy, but things like RTL control characters are something that might reasonably appear in a diagnostic string that mixes English and Arabic, say.

Another consideration is destination. If a diagnostic is emitted to the console, then control characters might cause problems. On the other hand, if it’s emitted into a machine-readable representation such as a JSON file, then it seems better to preserve the diagnostic in all its ugliness: warts, control characters, embedded nuls (!) and all.
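
To make the destination point concrete, here is a minimal sketch in C++. Both helper names (escape_for_console, escape_for_json) are hypothetical and are not taken from any compiler. The console path keeps newlines and replaces other C0 control bytes and DEL with visible \xNN escapes. The JSON path keeps every byte recoverable by using JSON string escapes. Bytes at or above 0x80 pass through untouched in both, so UTF-8 text and RTL marks survive; C1 controls encoded in UTF-8 are not detected by this simple byte-wise check.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical helper: make a diagnostic safe to show on a terminal.
// Newlines are kept; other C0 control bytes and DEL become "\xNN".
// Bytes >= 0x80 (UTF-8 sequences, including RTL marks) are left alone.
std::string escape_for_console(std::string_view msg) {
    std::string out;
    for (unsigned char c : msg) {
        if (c == '\n') {
            out += '\n';
        } else if (c < 0x20 || c == 0x7F) {
            char buf[5];  // backslash, 'x', two hex digits, terminator
            std::snprintf(buf, sizeof buf, "\\x%02X", static_cast<unsigned>(c));
            out += buf;
        } else {
            out += static_cast<char>(c);
        }
    }
    return out;
}

// Hypothetical helper: emit the diagnostic as a JSON string so that a
// consumer can recover it exactly, control characters and nuls included.
std::string escape_for_json(std::string_view msg) {
    std::string out = "\"";
    for (unsigned char c : msg) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            case '\t': out += "\\t";  break;
            default:
                if (c < 0x20) {
                    char buf[7];  // backslash, 'u', four hex digits, terminator
                    std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(c));
                    out += buf;
                } else {
                    out += static_cast<char>(c);
                }
        }
    }
    out += '"';
    return out;
}
```

The same message therefore ends up in two different forms: lossy but safe for a terminal, and lossless for a machine-readable file.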

Peter

From: SG16 <sg16-bounces@lists.isocpp.org> On Behalf Of JF Bastien via SG16
Sent: 25 January 2021 22:15
To: Corentin <corentin.jabot@gmail.com>
Cc: JF Bastien <cxx@jfbastien.com>; SG16 <sg16@lists.isocpp.org>; Aaron Ballman <aaron.ballman@gmail.com>
Subject: Re: [SG16] String literals and diagnostics

On Mon, Jan 25, 2021 at 2:11 PM Corentin <corentin.jabot@gmail.com> wrote:

On Mon, Jan 25, 2021 at 11:06 PM JF Bastien <cxx@jfbastien.com> wrote:

Your comment makes me realize that my thinking would probably make more sense if I shared its motivation: as a developer writing C++ code, I would like to be able to report facts to developers through static_assert / deprecated / etc. These facts are usually in ASCII, but some folks use other languages and supporting Unicode languages therefore makes sense. Those facts might be more nicely displayed with newlines. Those facts don't need control characters, even if color, blink, and changing the console's window title would be funny. I (as a C++ programmer, not committee member) would like this to be uniform between implementations.

This is why I proposed the approach I did: as a user I don't really care about phases, QoI, etc. I think it's a useful framing for this problem, because agreeing on it helps figure out whether our solution actually helps people :)

I'm actually trying to understand whether what you need is:

1/ I want the compiler to escape all control/invisible characters in the diagnostic message

2/ If an escape sequence appears verbatim in a string (aka "\u0042\x0042\t" ) then it should appear exactly like that in the diagnostic message even if those escape sequences represent displayable characters
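
As a small illustration of what option 2 is about (a sketch only; the variable names cooked and raw are made up for this example): in an ordinary literal the escape sequences have been replaced by the characters they denote by the time the string exists, whereas a raw literal keeps the spelling. Option 2 asks the diagnostic to show something like the second form even though the program wrote the first.

```cpp
#include <cassert>
#include <string_view>

// Ordinary literal: each escape sequence is replaced, leaving 'B', 'B', TAB.
constexpr std::string_view cooked = "\u0042\x0042\t";

// Raw literal: the backslashes and digits are kept exactly as spelled.
constexpr std::string_view raw = R"(\u0042\x0042\t)";

static_assert(cooked == std::string_view("BB\t"));
static_assert(cooked.size() == 3);
static_assert(raw.size() == 14);
```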

Yeah good point: we can ban unescaped control characters entirely, instead of escaping them :)

On Mon, Jan 25, 2021 at 1:59 PM Corentin <corentin.jabot@gmail.com> wrote:

On Mon, Jan 25, 2021 at 10:51 PM JF Bastien <cxx@jfbastien.com> wrote:

I will offer this most wonderful example as a test suite for the discussion: https://twitter.com/jfbastien/status/1298307325443231744

(sorry not sorry that I tweet randomly bad C++ as bait for y'all)

I believe that most printable characters should be preserved by the compiler and printed as-is, Unicode included. I would escape control characters, null terminators, be careful around RTL but support it, and support newlines too (because why not?).

I think most of that is QoI - specifically how non-printable characters and control characters are displayed.

Not replacing escape sequences however is probably problematic.

First, is it the behavior we want for static_assert?

And if so, do we then want to do something different in static_assert vs attributes? (the two places where we currently have diagnostics after phase 3, the latter of which might become program observable through reflection)

So I think the use case for "not replacing escape sequences" should either be strongly motivated, or left to QoI?

Otherwise I agree

I think clang's current behavior is pretty much what I'd want, except for newlines.

<chair-hat>As Jens said, please send to EWG once SG16 is happy.</chair-hat>

On Mon, Jan 25, 2021 at 5:02 AM Corentin via SG16 <sg16@lists.isocpp.org> wrote:

Hello SG16.

Following last week's discussion on diagnostic messages, I would like to come back to the topic.

What follows specifically excludes the preprocessor, and has no bearing on Aaron's paper, which I think is strictly a bug fix of the status quo.

For those who are just joining us, the question at hand is: Given static_assert(foo, "message");, what is the encoding of "message"?

Of course, we know that after phase 1, it will be utf-8 (or otherwise representable in utf-8), and upon being displayed or otherwise written out somewhere, it will be converted to something the compiler deems suitable for that.

So the question really is: is there an intermediate step wherein the string is converted to the execution encoding in phase 5?

Nothing in the standard currently says that this does not happen; all string-literals presumably go through phase 5.

And so, the status-quo leads to implementation divergence such that a fix is needed: GCC does the useful thing while MSVC/ICC do the standard conforming thing https://godbolt.org/z/MEsbY5

What would be the correct behavior is slightly less clear and future evolutions make it more complicated.

There is a good argument for saying that everything that looks like a string literal is a string literal and therefore, static_assert and attribute parameters should go through phase 5, and then be converted from the execution encoding to whatever encoding the compiler uses for diagnostic purposes.

This has some interesting ramifications, i.e. static_assert(false, "😀") might be ill-formed if the string cannot be encoded in the execution encoding in phase 5 (or it might do character replacement in phase 5 under the current rules, which is what MSVC does with regular string literals)

We could then allow static_assert(false, u8"😀") to avoid the above issues.

This first solution has the very clear advantage that it makes the model very simple: "" is an execution-encoding encoded string literal, u8"" is utf-8.
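
A minimal sketch of that distinction, assuming nothing beyond the language itself (code_units is a hypothetical helper written for this example): a u8"" literal has the same size on every implementation, while the size of the ordinary literal is whatever the execution encoding chosen in phase 5 produces, which is exactly where implementations can diverge.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// not counting the terminating null. Works for char and char8_t arrays.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// u8"" is always UTF-8: U+1F600 takes four code units everywhere.
static_assert(code_units(u8"\U0001F600") == 4);

// An ordinary "" literal is converted to the execution encoding in phase 5.
// Its length is implementation-dependent (4 if that encoding is UTF-8), and
// an implementation may reject or substitute the character if the encoding
// cannot represent it, so no particular value is asserted here.
constexpr std::size_t ordinary_units = code_units("\U0001F600");
```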

The opposite argument of course is that forcing people to prefix everything by u8"" is a bit hostile, and, as it is a departure from the current behavior, would break code.

We also need to consider possible evolutions of the language, notably

* diagnostic or compiler output constructed from constant expressions at compile time wg21.link/p0596r1

* reflection on attributes https://wg21.link/p1887r1

* attribute using constant expressions parameters, although I don't know if that has been proposed

so, we can imagine something like

static_assert(false, std::format(...)); which would be neat indeed.

At this point, we would be very much past phase 5 and it becomes critical to have a good model indeed.

I also would like to point out that, at compile time, there is never, everything else being equal, a good reason to prefer the execution encoding over utf-8.

Given these observations and constraints I think a possible, pragmatic and simple course of action would be

* Redefine deprecated, nodiscard, static_assert, etc. to take a new grammar production, say "diagnostic-string-literal", which would follow all the rules of string literals (concatenation, escape sequences, and so forth), but would NOT be converted to the execution encoding at any point. Note that this does not introduce a new encoding; things stay utf-8.

* In the future, static_assert and attributes can accept other forms which would take constant expressions of u8string_view (or so I hope, see wg21.link/p1953r0). Because all of these things require compiler support anyway, parsing has no ambiguity.

* In this model, reflecting on [[deprecated("foo")]] would give a utf-8 string back, because we decided to make these strings magic for convenience and backward compatibility.

The alternative solution (which is less pragmatic) would be:

* Allow u8 string literals in attributes and static_assert in addition to string literals

* Pass everything through phase 5, always

That second solution, being a breaking change, would require EWG input. Its sole benefit is to make the model brutally consistent, which would not be without value either.

I'm planning to put all of that in a paper but I would like to hear your thoughts before doing so.

Thanks,

Have a great week,

Corentin

--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16