sg16: Re: [SG16] String literals and diagnostics

From: JF Bastien <cxx_at_[hidden]>
Date: Mon, 25 Jan 2021 14:29:08 -0800

On Mon, Jan 25, 2021 at 2:19 PM Peter Brett <pbrett_at_[hidden]> wrote:

> Hi JF,
>
>
>
> Banning unescaped control characters entirely sounds easy, but things like
> RTL control characters are something that might reasonably appear in a
> diagnostic string that mixes English and Arabic, say.
>

Yes, IMO we should allow RTL, I hope my example made this clear :)

> Another consideration is destination. If a diagnostic is emitted to the
> console, then control characters might cause problems. On the other hand,
> if it’s emitted into a machine-readable representation such as a JSON file,
> then it seems better to preserve the diagnostic in all its ugliness: warts,
> control characters, embedded nuls (!) and all.
>

I don't think control characters and NUL are particularly useful in
diagnostics, ever. Based on the discussion with Corentin it seems better to
just ban them entirely. Unless there's a use case? I agree that we can be
clever about the output device, but it still doesn't seem useful to do so.

Say we decided "printing colored output would be nice when the terminal
supports it"... we shouldn't just work around unescaped control characters
IMO, we should have higher-level support for this, which turns colors to
nothing if the device doesn't support them. I don't think we should be
doing this, but whether we do it or not, I think we should just ban control
characters in diagnostics.
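
A minimal sketch of what such a ban could look like, under stated assumptions: the message is already available as code points, "control character" means Unicode general category Cc (U+0000..U+001F and U+007F..U+009F), and line feed is exempted because newlines were called out as useful. RTL marks are in category Cf, so they pass. The names is_banned_in_diagnostic and first_banned are hypothetical, not from any compiler or proposal.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical policy: reject Unicode control characters (general category Cc),
// i.e. U+0000..U+001F and U+007F..U+009F, with line feed as the one exception.
constexpr bool is_banned_in_diagnostic(char32_t c) {
    if (c == U'\n') return false;
    return c <= 0x1F || (c >= 0x7F && c <= 0x9F);
}

// Returns the index of the first banned code point, or nullopt if there is none.
constexpr std::optional<std::size_t> first_banned(std::u32string_view msg) {
    for (std::size_t i = 0; i < msg.size(); ++i) {
        if (is_banned_in_diagnostic(msg[i])) return i;
    }
    return std::nullopt;
}

// Newlines are allowed.
static_assert(!first_banned(U"plain ASCII\nsecond line"));
// U+200F RIGHT-TO-LEFT MARK is a format character (Cf), not Cc, so it is allowed.
static_assert(!first_banned(U"\u200F\u0645\u0631\u062D\u0628\u0627"));
// An ANSI color sequence starts with ESC (U+001B), which is banned.
static_assert(*first_banned(U"\x1B[31mred") == 0);
```

This needs C++17. A compiler applying such a rule would presumably reject the declaration rather than escape the offending character, which is the difference between "ban" and "escape" discussed in the quoted text below.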

> Peter
>
>
>
> *From:* SG16 <sg16-bounces_at_[hidden]> *On Behalf Of *JF Bastien
> via SG16
> *Sent:* 25 January 2021 22:15
> *To:* Corentin <corentin.jabot_at_[hidden]>
> *Cc:* JF Bastien <cxx_at_[hidden]>; SG16 <sg16_at_[hidden]>; Aaron
> Ballman <aaron.ballman_at_[hidden]>
> *Subject:* Re: [SG16] String literals and diagnostics
>
>
>
> On Mon, Jan 25, 2021 at 2:11 PM Corentin <corentin.jabot_at_[hidden]> wrote:
>
>
>
>
>
> On Mon, Jan 25, 2021 at 11:06 PM JF Bastien <cxx_at_[hidden]> wrote:
>
> Your comment makes me realize that my thinking would probably make more
> sense if I shared its motivation: as a developer writing C++ code, I would
> like to be able to report facts to developers through static_assert /
> deprecated / etc. These facts are usually in ASCII, but some folks use
> other languages and supporting Unicode languages therefore makes sense.
> Those facts might be more nicely displayed with newlines. Those facts don't
> need control characters, even if color, blink, and changing the console's
> window title would be funny. I (as a C++ programmer, not committee member)
> would like this to be uniform between implementations.
>
>
>
> This is why I proposed the approach I did: as a user I don't really care
> about phases, QoI, etc. I think it's a useful framing for this problem,
> because agreeing on it helps figure out whether our solution actually helps
> people :)
>
>
>
> I'm trying to understand your needs, actually whether it is:
>
>
>
> 1/ I want the compiler to escape all control/invisible characters in the
> diagnostic message
>
> 2/ If an escape sequence appears verbatim in a string (aka "\u0042\x0042\t"
> ) then it should appear exactly like that in the diagnostic message even if
> those escape sequences represent displayable characters
>
>
>
> Yeah good point: we can ban unescaped control characters entirely, instead
> of escaping them :)
>
>
>
>
>
> On Mon, Jan 25, 2021 at 1:59 PM Corentin <corentin.jabot_at_[hidden]> wrote:
>
>
>
>
>
> On Mon, Jan 25, 2021 at 10:51 PM JF Bastien <cxx_at_[hidden]> wrote:
>
> I will offer this most wonderful example as a test suite for the
> discussion: https://twitter.com/jfbastien/status/1298307325443231744
>
> (sorry not sorry that I tweet randomly bad C++ as bait for y'all)
>
>
>
> I believe that most printable characters should be preserved by the
> compiler and printed as-is, Unicode included. I would escape control
> characters, null terminators, be careful around RTL but support it, and
> support newlines too (because why not?).
>
>
>
> I think most of that is QoI, specifically how non-printable characters
> and control characters are displayed.
>
> Not replacing escape sequences however is probably problematic.
>
>
>
> First, is it the behavior we want for static_assert?
>
> And if so, do we then want to do something different in static_assert vs
> attributes? (the two places where we currently have diagnostics after
> phase 3, the latter of which might become program-observable through
> reflection)
>
> So I think the use case for "not replacing escape sequences" should either
> be super motivated, or QoI?
>
>
>
> Otherwise I agree
>
>
>
>
>
> I think clang's current behavior is pretty much what I'd want, except for
> newlines.
>
>
>
> <chair-hat>As Jens said, please send to EWG once SG16 is happy.</chair-hat>
>
>
>
> On Mon, Jan 25, 2021 at 5:02 AM Corentin via SG16 <sg16_at_[hidden]>
> wrote:
>
> Hello SG16.
>
> Following last week's discussion on diagnostic messages, I would like to
> come back to the topic.
>
>
>
> What follows specifically excludes the preprocessor, and has no bearing on
> Aaron's paper which I think is strictly a bug fix of a status quo.
>
>
>
> For those who are just joining us, the question at hand is: Given
> static_assert(foo, "message");, what is the encoding of "message"?
>
>
>
> Of course, we know that after phase 1, it will be utf-8 (or otherwise
> representable in utf-8), and upon being displayed or otherwise written out
> somewhere, it will be converted to something the compiler deems suitable
> for that.
>
>
>
> So the question really is: is there an intermediate step wherein the
> string is converted to the execution encoding in phase 5?
>
> There is currently nothing in the standard that says that does not happen;
> all string-literals presumably go through phase 5.
>
>
>
> And so, the status-quo leads to implementation divergence such that a fix
> is needed: GCC does the useful thing while MSVC/ICC do the standard
> conforming thing https://godbolt.org/z/MEsbY5
>
>
>
>
>
>
> What would be the correct behavior is slightly less clear and future
> evolutions make it more complicated.
>
>
>
> There is a good argument for saying that everything that looks like a
> string literal is a string literal and therefore, static_assert and
> attribute parameters should go through phase 5, and then be converted from
> the execution encoding to whatever encoding the compiler uses for
> diagnostic purposes.
>
> This has some interesting ramifications, i.e. static_assert(false, "😀")
> might be ill-formed if the string cannot be encoded to the execution
> encoding in phase 5 (or it might do character replacement in phase 5 under
> the current rules, which is what MSVC does with regular string literals)
>
> We could then allow static_assert(false, u8"😀") to avoid the above
> issues.
>
> This first solution has the very clear advantage that it makes the model
> very simple: "" is a string literal in the execution encoding, u8"" is
> utf-8.
>
>
>
> The opposite argument of course is that forcing people to prefix
> everything by u8"" is a bit hostile, and, as it is a departure from the
> current behavior, would break code.
>
>
>
>
>
> We also need to consider possible evolutions of the language, notably
>
> * diagnostic or compiler output constructed from constant expressions at
> compile time wg21.link/p0596r1
>
> * reflection on attributes https://wg21.link/p1887r1
>
> * attributes taking constant-expression parameters, although I don't know
> if that has been proposed
>
>
>
> so, we can imagine something like
>
>
>
> static_assert(false, std::format(...)); which would be neat indeed.
>
> At this point, we would be very much past phase 5 and it becomes critical
> to have a good model indeed.
>
>
>
> I also would like to point out that, at compile time, there is never,
> everything else being equal, a good reason to prefer the execution encoding
> over utf-8.
>
>
>
> Given these observations and constraints I think a possible, pragmatic and
> simple course of action would be
>
>
>
> - Redefine deprecated, nodiscard, static_assert, etc. to take a new
> grammar, say "diagnostic-string-literal", which would follow all the rules
> of string literals (concatenation, escape sequences and so forth), but would
> NOT be converted to the execution encoding at any point. Note that this
> does not introduce a new encoding, things stay utf-8.
> - In the future, static_assert and attributes can accept other forms
> which would take constant expressions of u8string_view (or so I hope, see
> wg21.link/p1953r0). Because all of these things require compiler support
> anyway, parsing has no ambiguity.
> - In this model, reflecting on [[deprecated("foo")]] would give a utf-8
> string back, because we decided to make these strings magic for
> convenience and backward compatibility
>
>
>
>
>
> The alternative solution (which is less pragmatic) would be:
>
> - Allow u8 string literals in attributes and static_assert in addition
> to string literals
> - Pass everything through phase 5, always
>
> That second solution, being a breaking change, would require EWG input.
> Its sole benefit is to make the model brutally consistent, which would not
> be without value either.
>
>
>
> I'm planning to put all of that in a paper but I would like to hear your
> thoughts before doing so.
>
>
>
> Thanks,
>
> Have a great week,
>
>
>
> Corentin
>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
>

Received on 2021-01-25 16:29:22