Subject: Re: String literals and diagnostics
From: JF Bastien (cxx_at_[hidden])
Date: 2021-01-25 16:14:49


On Mon, Jan 25, 2021 at 2:11 PM Corentin <corentin.jabot_at_[hidden]> wrote:

>
>
> On Mon, Jan 25, 2021 at 11:06 PM JF Bastien <cxx_at_[hidden]> wrote:
>
>> Your comment makes me realize that my thinking would probably make more
>> sense if I shared its motivation: as a developer writing C++ code, I would
>> like to be able to report facts to developers through static_assert /
>> deprecated / etc. These facts are usually in ASCII, but some folks use
>> other languages and supporting Unicode languages therefore makes sense.
>> Those facts might be more nicely displayed with newlines. Those facts don't
>> need control characters, even if color, blink, and changing the console's
>> window title would be funny. I (as a C++ programmer, not committee member)
>> would like this to be uniform between implementations.
>>
>> This is why I proposed the approach I did: as a user I don't really care
>> about phases, QoI, etc. I think it's a useful framing for this problem,
>> because agreeing on it helps figure out whether our solution actually helps
>> people :)
>>
>
> I'm trying to understand your needs, actually whether it is:
>
> 1/ I want the compiler to escape all control/invisible characters in the
> diagnostic message
> 2/ If an escape sequence appears verbatim in a string (e.g. "\u0042\x0042\t")
> then it should appear exactly like that in the diagnostic message, even if
> those escape sequences represent displayable characters
>

Yeah good point: we can ban unescaped control characters entirely, instead
of escaping them :)
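
To make "ban unescaped control characters" concrete, here is a minimal sketch of the kind of check an implementation could run on a diagnostic message. The helper name has_banned_control and its allow_newline parameter are hypothetical, not taken from any paper or compiler. It assumes the message is UTF-8, so bytes >= 0x80 (non-ASCII text) pass through untouched, and it only looks at C0 controls and DEL, not C1 controls or RTL overrides.

```cpp
#include <string_view>

// Hypothetical helper, for illustration only: returns true if `message`
// contains a C0 control character (0x00..0x1F) or DEL (0x7F).
// Newlines are tolerated by default, since they can make messages nicer.
constexpr bool has_banned_control(std::string_view message,
                                  bool allow_newline = true) {
    for (char c : message) {
        const auto u = static_cast<unsigned char>(c);
        if (u == '\n' && allow_newline) continue;  // permitted control
        if (u < 0x20 || u == 0x7F) return true;    // C0 control or DEL
    }
    return false;
}

// Plain text and newlines are accepted.
static_assert(!has_banned_control("plain ASCII message"));
static_assert(!has_banned_control("line one\nline two"));
// ESC (used for terminal colors) and tab are rejected.
static_assert(has_banned_control("\x1b[31mred\x1b[0m"));
static_assert(has_banned_control("tab\there"));
// Newlines can be rejected too if an implementation prefers that.
static_assert(has_banned_control("line one\nline two", false));
```

Whether newlines belong on the allowed side is exactly the kind of thing I'd like to be uniform between implementations; the default above is just my preference.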

> On Mon, Jan 25, 2021 at 1:59 PM Corentin <corentin.jabot_at_[hidden]> wrote:
>>
>>>
>>>
>>> On Mon, Jan 25, 2021 at 10:51 PM JF Bastien <cxx_at_[hidden]> wrote:
>>>
>>>> I will offer this most wonderful example as a test suite for the
>>>> discussion: https://twitter.com/jfbastien/status/1298307325443231744
>>>> (sorry not sorry that I tweet randomly bad C++ as bait for y'all)
>>>>
>>>> I believe that most printable characters should be preserved by the
>>>> compiler and printed as-is, Unicode included. I would escape control
>>>> characters, null terminators, be careful around RTL but support it, and
>>>> support newlines too (because why not?).
>>>>
>>>
>>> I think most of that is QoI, specifically how non-printable characters
>>> and control characters are displayed.
>>> Not replacing escape sequences, however, is probably problematic.
>>>
>>> First, is it the behavior we want for static_assert?
>>> And if so, do we then want to do something different in static_assert vs
>>> attributes? (the two places where we currently have diagnostics after
>>> phase 3, the latter of which might become program observable through
>>> reflection)
>>> So I think the use case for "not replacing escape sequences" should
>>> either be super motivated, or QoI?
>>>
>>> Otherwise I agree
>>>
>>>
>>>>
>>>> I think clang's current behavior is pretty much what I'd want, except
>>>> for newlines.
>>>>
>>>> <chair-hat>As Jens said, please send to EWG once SG16 is
>>>> happy.</chair-hat>
>>>>
>>>> On Mon, Jan 25, 2021 at 5:02 AM Corentin via SG16 <
>>>> sg16_at_[hidden]> wrote:
>>>>
>>>>> Hello SG16.
>>>>> Following last week's discussion on diagnostic messages, I would like
>>>>> to come back to the topic.
>>>>>
>>>>> What follows specifically excludes the preprocessor, and has no bearing
>>>>> on Aaron's paper which I think is strictly a bug fix of a status quo.
>>>>>
>>>>> For those who are just joining us, the question at hand is: Given static_assert(foo,
>>>>> "message");, what is the encoding of "message"?
>>>>>
>>>>> Of course, we know that after phase 1, it will be utf-8 (or otherwise
>>>>> representable in utf-8), and upon being displayed or otherwise written out
>>>>> somewhere, it will be converted to something the compiler deems suitable
>>>>> for that.
>>>>>
>>>>> So the question really is: is there an intermediate step wherein the
>>>>> string is converted to the execution encoding in phase 5?
>>>>> There is currently nothing in the standard that says that does not
>>>>> happen; all string-literals presumably go through phase 5.
>>>>>
>>>>> And so, the status-quo leads to implementation divergence such that a
>>>>> fix is needed: GCC does the useful thing while MSVC/ICC do the standard
>>>>> conforming thing: https://godbolt.org/z/MEsbY5
>>>>>
>>>>>
>>>>> What the correct behavior would be is slightly less clear, and future
>>>>> evolutions make it more complicated.
>>>>>
>>>>> There is a good argument for saying that everything that looks like a
>>>>> string literal is a string literal and therefore, static_assert and
>>>>> attribute parameters should go through phase 5, and then be converted from
>>>>> the execution encoding to whatever encoding the compiler uses for
>>>>> diagnostic purposes.
>>>>> This has some interesting ramifications, i.e. static_assert(false, "😀")
>>>>> might be ill-formed if the string cannot be encoded to the execution
>>>>> encoding in phase 5 (or it might do character replacement in phase 5 under
>>>>> the current rules, which is what MSVC does with regular string literals).
>>>>> We could then allow static_assert(false, u8"😀") to avoid the above
>>>>> issues.
>>>>> This first solution has the very clear advantage that it makes the
>>>>> model very simple: "" is an execution-encoding encoded string literal, u8""
>>>>> is utf-8.
>>>>>
>>>>> The opposite argument of course is that forcing people to prefix
>>>>> everything by u8"" is a bit hostile, and, as it is a departure from the
>>>>> current behavior, would break code.
>>>>>
>>>>>
>>>>> We also need to consider possible evolutions of the language, notably
>>>>> * diagnostic or compiler output constructed from constant expressions
>>>>> at compile time wg21.link/p0596r1
>>>>> * reflection on attributes https://wg21.link/p1887r1
>>>>> * attributes taking constant-expression parameters, although I don't
>>>>> know if that has been proposed
>>>>>
>>>>> so, we can imagine something like
>>>>>
>>>>> static_assert(false, std::format(...)); which would be neat indeed.
>>>>> At this point, we would be very much past phase 5 and it becomes
>>>>> critical to have a good model indeed.
>>>>>
>>>>> I also would like to point out that, at compile time, there is never,
>>>>> everything else being equal, a good reason to prefer the execution encoding
>>>>> over utf-8.
>>>>>
>>>>> Given these observations and constraints I think a possible, pragmatic
>>>>> and simple course of action would be
>>>>>
>>>>>
>>>>> - Redefine deprecated, nodiscard, static_assert, etc. to take a new
>>>>> grammar, say "diagnostic-string-literal", which would follow all the rules
>>>>> of string literals (concatenation, escape sequences and so forth), but would
>>>>> NOT be converted to the execution encoding at any point. Note that this
>>>>> does not introduce a new encoding, things stay utf-8.
>>>>> - In the future, static_assert and attributes can accept other
>>>>> forms which would take constant expressions of u8string_view (or so I hope,
>>>>> see wg21.link/p1953r0). Because all of these things require compiler
>>>>> support anyway, parsing has no ambiguity.
>>>>> - In this model, reflecting on [[deprecated("foo")]] would give a
>>>>> utf-8 string back, because we decided to make these strings magic for
>>>>> convenience and backward compatibility
>>>>>
>>>>>
>>>>>
>>>>> The alternative solution (which is less pragmatic) would be:
>>>>>
>>>>> - Allow u8 string literals in attributes and static_assert in
>>>>> addition to string literals
>>>>> - Pass everything through phase 5, always
>>>>>
>>>>> That second solution, being a breaking change, would require EWG
>>>>> input. Its sole benefit is to make the model brutally consistent, which
>>>>> would not be without value either.
>>>>>
>>>>> I'm planning to put all of that in a paper but I would like to hear
>>>>> your thoughts before doing so.
>>>>>
>>>>> Thanks,
>>>>> Have a great week,
>>>>>
>>>>> Corentin
>>>>>
>>>>> --
>>>>> SG16 mailing list
>>>>> SG16_at_[hidden]
>>>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>>>
>>>>
