sg16: Re: [SG16] String literals and diagnostics

From: Steve Downey <sdowney_at_[hidden]>
Date: Mon, 25 Jan 2021 18:28:14 -0500

To be clear, it's OK if we can't mandate that, but it would be nice if the
standard was clear that doing something awful wasn't required.

On Mon, Jan 25, 2021, 18:26 Steve Downey <sdowney_at_[hidden]> wrote:

> User Story: As a compiler and library user, I would like to be able to
> stuff the text of the message that arrived in an email from the CI build
> into opengrok and find the source code that emitted it.
> The equivalent case of building within emacs/vscode/vi is largely solved
> already.
>
> On Mon, Jan 25, 2021, 17:46 Corentin via SG16 <sg16_at_[hidden]>
> wrote:
>
>>
>>
>> On Mon, Jan 25, 2021 at 11:13 PM Corentin <corentin.jabot_at_[hidden]>
>> wrote:
>>
>>>
>>>
>>> On Mon, Jan 25, 2021 at 10:35 PM Jens Maurer <Jens.Maurer_at_[hidden]>
>>> wrote:
>>>
>>>> On 25/01/2021 14.01, Corentin via SG16 wrote:
>>>> > So the question really is: is there an intermediate step wherein the
>>>> > string is converted to the execution encoding in phase 5?
>>>> > There is currently nothing in the standard that says that does not
>>>> > happen; all string-literals presumably go through phase 5.
>>>>
>>>> Yes. So any change in this area probably needs EWG input.
>>>>
>>>> > And so, the status quo leads to implementation divergence such that a
>>>> > fix is needed: GCC does the useful thing while MSVC/ICC do the
>>>> > standard-conforming thing: https://godbolt.org/z/MEsbY5
>>>>
>>>> > We also need to consider possible evolutions of the language, notably
>>>> > * diagnostic or compiler output constructed from constant expressions
>>>> >   at compile time: wg21.link/p0596r1
>>>> > * reflection on attributes: https://wg21.link/p1887r1
>>>> > * attributes taking constant-expression parameters, although I don't
>>>> >   know if that has been proposed
>>>> >
>>>> > so, we can imagine something like
>>>> >
>>>> > static_assert(false, std::format(...)); which would be neat indeed.
>>>> > At this point, we would be very much past phase 5 and it becomes
>>>> > critical to have a good model indeed.
>>>>
>>>> The model would be to perform the transcoding to execution character set
>>>> when a (runtime) object for the string (literal) is created by the
>>>> compiler. This is the same moment when we turn the memory of a
>>>> compile-time std::vector<T> into a runtime data structure.
>>>> It might be hard to differentiate a compile-time string in a std::string
>>>> from some compile-time bytes that happen to exist in a std::vector<char>,
>>>> though. In order to make this right, we probably need some machinery
>>>> to say "here comes a string".
>>>>
>>>
>>> I think this would be a terrible idea because it's observable, i.e. this
>>> function would return wildly different results depending on the execution
>>> encoding:
>>>
>>> constexpr int count_codepoints(std::string_view);
>>>
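A minimal sketch of why that is observable, assuming a hypothetical
definition of count_codepoints that treats its input as UTF-8 and counts
every byte that is not a continuation byte. The two byte arrays are
hand-written stand-ins (an assumption, for illustration only) for what the
literal "£" might become under a UTF-8 and an ISO-8859-1 execution encoding:

```cpp
#include <string_view>

// Hypothetical definition (the thread only declares the function):
// counts code points assuming UTF-8, i.e. counts every byte that is
// not a continuation byte of the form 10xxxxxx.
constexpr int count_codepoints(std::string_view s) {
    int n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// Stand-ins for the literal "£" after conversion to two different
// execution encodings, spelled with hex escapes so the bytes are fixed.
constexpr char pound_utf8[]   = "\xC2\xA3";  // UTF-8: two code units
constexpr char pound_latin1[] = "\xA3";      // ISO-8859-1: one code unit

// Same source text, different answers once the bytes differ.
static_assert(count_codepoints(pound_utf8) == 1);
static_assert(count_codepoints(pound_latin1) == 0);
```

If the transcoding happened only when the runtime object is created, a
constant-evaluated call and a runtime call on the same literal could
disagree in this way.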
>>>>
>>>> > * Redefine deprecated, nodiscard, static_assert, etc. to take a new
>>>> >   grammar, say "diagnostic-string-literal", which would follow all the
>>>> >   rules of string literals (concatenation, escape sequences and so
>>>> >   forth), but would NOT be converted to the execution encoding at any
>>>> >   point. Note that this does not introduce a new encoding; things stay
>>>> >   UTF-8.
>>>> > * In the future, static_assert and attributes can accept other
>>>> >   forms which would take constant expressions of u8string_view (or so
>>>> >   I hope, see wg21.link/p1953r0). Because all of these things require
>>>> >   compiler support anyway, parsing has no ambiguity.
>>>> > * In this model, reflecting on [[deprecated("foo")]] would give a
>>>> >   UTF-8 string back, because we decided to make these strings magic for
>>>> >   convenience and backward compatibility.
>>>>
>>>> Sounds about right, minus the "UTF-8" parts, which are private parts of
>>>> the compiler not specified by the standard.
>>>>
>>>
>>> The encoding of the strings returned by reflection would have to be
>>> specified; the compiler might have to do some conversion from its
>>> representation.
>>>
>>>>
>>>> > I'm planning to put all of that in a paper but I would like to hear
>>>> > your thoughts before doing so.
>>>>
>>>> Since you need phase 7 context to know when to transcode and when not,
>>>> some of the phase 5+6 machinery probably needs to move to phase 7.
>>>>
>>>
>>> Right, we can't actually distinguish the context in phase 5+6 yet.
>>> Gosh, for some reason I didn't identify this issue.
>>>
>>> If at phase 7 we want string literals except in specific contexts...
>>> does that mean that the wording would have to perform some sort of
>>> reversal?
>>> Moving phase 5 after 7 seems like major surgery, especially as we
>>> established that concatenation and encoding are certainly best left in the
>>> same step.
>>> Wouldn't it require identifying all the places where a string-literal
>>> may appear, which is probably quite a few?
>>>
>>
>> Actually....
>> It seems (you will certainly correct me if I am wrong) that
>> string-literals only need to be converted to the execution encoding in
>> places where a (primary-)expression is expected.
>> Neither the preprocessor, nor user-defined literals, nor
>> linkage specifications, nor assembler statements, nor import names should
>> go through phase 5.
>> It would seem then that there is only one place where encoding to
>> execution encoding should happen, so maybe there would be a way to move
>> part of phase 5 to [expr.prim.literal], but then we are back to the
>> issue of how to separate hex escape sequence replacement from encoding.
>> I guess we could replace \x42 by \x{42} or something like that.
>>
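A small sketch of why hex escapes and encoding are hard to pull apart,
assuming a UTF-8 encoded u8 literal (which the standard guarantees) and a
hypothetical helper code_units that is not part of the thread. A hex escape
names a code unit directly and is not transcoded, while a
universal-character-name names a code point that still has to be encoded:

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// \x42 is a single code unit whose value is taken as written.
static_assert(code_units("\x42") == 1);

// \u00E9 names the code point U+00E9, which UTF-8 encodes as two
// code units, 0xC3 0xA9.
static_assert(code_units(u8"\u00E9") == 2);
static_assert(static_cast<unsigned char>(u8"\u00E9"[0]) == 0xC3);
static_assert(static_cast<unsigned char>(u8"\u00E9"[1]) == 0xA9);

// Spelling the same two code units with hex escapes produces the same
// bytes, but no encoding step was involved in producing them.
static_assert(code_units(u8"\xC3\xA9") == 2);
```

So a literal can mix elements that must be encoded with elements that must
be passed through untouched, and any step that defers encoding has to keep
track of which is which.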
>>
>>
>>>
>>>
>>>
>>>>
>>>> Jens
>>>>
>>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>
>

Received on 2021-01-25 17:28:28