Re: [SG16] String literals and diagnostics

From: Corentin <corentin.jabot_at_[hidden]>
Date: Mon, 25 Jan 2021 23:45:36 +0100
On Mon, Jan 25, 2021 at 11:13 PM Corentin <corentin.jabot_at_[hidden]> wrote:

> On Mon, Jan 25, 2021 at 10:35 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>> On 25/01/2021 14.01, Corentin via SG16 wrote:
>> > So the question really is: is there an intermediate step wherein the
>> > string is converted to the execution encoding in phase 5?
>> > There is currently nothing in the standard that says that does not
>> > happen; all string-literals presumably go through phase 5.
>> Yes. So any change in this area probably needs EWG input.
>> > And so, the status quo leads to implementation divergence such that a
>> > fix is needed: GCC does the useful thing while MSVC/ICC do the
>> > standard-conforming thing: https://godbolt.org/z/MEsbY5
>> > We also need to consider possible evolutions of the language, notably
>> > * diagnostic or compiler output constructed from constant expressions
>> > at compile time wg21.link/p0596r1
>> > * reflection on attributes https://wg21.link/p1887r1
>> > * attributes using constant-expression parameters, although I don't
>> > know if that has been proposed
>> >
>> > so, we can imagine something like
>> >
>> > static_assert(false, std::format(...)); which would be neat indeed.
>> > At this point, we would be very much past phase 5 and it becomes
>> > critical to have a good model indeed.
>> The model would be to perform the transcoding to execution character set
>> when a (runtime) object for the string (literal) is created by the
>> compiler.
>> This is the same moment when we turn the memory of a compile-time
>> std::vector<T>
>> into a runtime data structure.
>> It might be hard to differentiate a compile-time string in a std::string
>> from some compile-time bytes that happen to exist in a std::vector<char>,
>> though. In order to make this right, we probably need some machinery
>> to say "here comes a string".
> I think this would be a terrible idea because it's observable, i.e. this
> function would return widely different results depending on the execution
> encoding:
> constexpr int count_codepoints(std::string_view);
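
To make the observability concern concrete, here is a minimal sketch. The
message above only gives the declaration of count_codepoints, so the body
below is an assumption: it treats its input as UTF-8 and counts every byte
that is not a continuation byte. The names as_utf8 and as_latin1 are
illustrative, and both byte sequences are spelled with hex escapes so that the
example does not depend on the source or execution encoding of whoever
compiles it.

```cpp
#include <string_view>

// Assumed definition: counts code points by counting every byte that is
// not a UTF-8 continuation byte (continuation bytes look like 10xxxxxx).
constexpr int count_codepoints(std::string_view s) {
    int n = 0;
    for (char c : s) {
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// U+00A9 COPYRIGHT SIGN as it would be stored under two execution encodings.
inline constexpr std::string_view as_utf8   = "\xC2\xA9";  // UTF-8: two code units
inline constexpr std::string_view as_latin1 = "\xA9";      // ISO 8859-1: one code unit
```

With these definitions, count_codepoints(as_utf8) is 1 while
count_codepoints(as_latin1) is 0, so a constant expression that inspects the
bytes can tell whether a transcoding step happened before it ran.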
>> > * Redefine deprecated, nodiscard, static_assert, etc. to take a new
>> > grammar, say "diagnostic-string-literal", which would follow all the rules
>> > of string literals (concatenation, escape sequences and so forth), but would
>> > NOT be converted to the execution encoding at any point. Note that this
>> > does not introduce a new encoding, things stay UTF-8.
>> > * In the future, static_assert and attributes can accept other forms
>> > which would take constant expressions of u8string_view (or so I hope, see
>> > wg21.link/p1953r0). Because all of these things require compiler support
>> > anyway, parsing has no ambiguity.
>> > * In this model, reflecting on [[deprecated("foo")]] would give a
>> > UTF-8 string back, because we decided to make these strings magic for
>> > convenience and backward compatibility
>> Sounds about right, minus the "UTF-8" parts, which are private parts of
>> the compiler
>> not specified by the standard.
> The encoding of the strings returned by reflection would have to be
> specified - the compiler might have to do some conversion from its
> representation
>> > I'm planning to put all of that in a paper but I would like to hear
>> > your thoughts before doing so.
>> Since you need phase 7 context to know when to transcode and when not,
>> some of the phase 5+6
>> machinery probably needs to move to phase 7.
> Right, we can't actually distinguish the context in phase 5+6 yet.
> Gosh, for some reason I didn't identify this issue.
> If at phase 7 we want string literals except in specific contexts... does
> that mean that the wording would have to operate some sort of reversal?
> Moving phase 5 after 7 seems like major surgery, especially as we
> established that concatenation and encoding are certainly best left in the
> same step.
> Wouldn't it require identifying all the places where a string-literal may
> appear, which is probably quite a few?

It seems (you will certainly correct me if I am wrong) that a string-literal
only needs to be converted to the execution encoding in places where a
(primary-)expression is expected.
Neither the preprocessor, nor user-defined literals, nor
linkage specifications, nor assembler statements, nor import names should go
through phase 5.
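
As a rough illustration of that split, the sketch below lines up a few places
where a string-literal token is consumed by the compiler itself against the
one place where it becomes an object. All names (old_api, legacy_entry_point,
_len, runtime_text) are made up for the example, and reading "user-defined
literals" as the empty "" in a literal operator declaration is my
interpretation. Assembler statements and import names are only mentioned in
comments because they are not portable across toolchains.

```cpp
#include <cstddef>

// Attribute argument: only ever shown in a diagnostic.
[[deprecated("use runtime_text() instead")]] inline int old_api() { return 1; }

// Linkage specification: the literal selects a language linkage.
extern "C" int legacy_entry_point();  // declared only, never called here

// static_assert message: again diagnostic-only text.
static_assert(sizeof(char) == 1, "char is one byte by definition");

// Literal operator declaration: the "" token is pure syntax.
constexpr std::size_t operator""_len(const char*, std::size_t n) { return n; }

// Not shown: asm("...") statements, import "header.h"; and preprocessor
// uses such as #line or _Pragma, which also take string-literal tokens.

// Primary expression: the only context here where the literal turns into
// an array object and therefore needs the execution encoding.
inline const char* runtime_text() { return "foo"; }
```

Only the literal inside runtime_text (and the operand handed to _len at a call
site such as "foo"_len) denotes data that the program can observe.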
It would seem, then, that there is only one place where encoding to the
execution encoding should happen, so maybe there would be a way to move part
of phase 5 to [expr.prim.literal], but then we are back to the
issue of how to separate hex escape sequence replacement from encoding.
I guess we could replace \x42 with \x{42} or something like that.
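
For reference, a small sketch of why the two steps are hard to pull apart: a
hex escape names a code unit value directly and must not be transcoded, while
an ordinary character or universal-character-name is encoded. The helper
code_units and the variable names are illustrative only.

```cpp
#include <cstddef>

// Number of code units in a string literal, excluding the terminating null.
template <class CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// A hex escape is a raw code unit value: always 0x42, whatever the
// execution encoding is.
inline constexpr unsigned char from_escape =
    static_cast<unsigned char>("\x42"[0]);

// The character B is encoded: 0x42 in ASCII-compatible encodings,
// but a different value in EBCDIC, so its value is not asserted here.
inline constexpr unsigned char from_character =
    static_cast<unsigned char>("B"[0]);

// One escape gives one code unit; the UTF-8 encoding of U+00E9 takes two.
inline constexpr std::size_t escaped_units = code_units("\xE9");
inline constexpr std::size_t utf8_units    = code_units(u8"\u00E9");
```

If encoding moved to [expr.prim.literal], the token would have to remember
which of its elements came from escapes, which is presumably what a marker
like the suggested \x{42} would be for.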

>> Jens

Received on 2021-01-25 16:45:50