On Mon, Jan 25, 2021 at 11:13 PM Corentin <corentin.jabot@gmail.com> wrote:

On Mon, Jan 25, 2021 at 10:35 PM Jens Maurer <Jens.Maurer@gmx.net> wrote:
On 25/01/2021 14.01, Corentin via SG16 wrote:
> So the question really is: is there an intermediate step wherein the string is converted to the execution encoding in phase 5?
> Nothing in the standard currently says that this does not happen; all string-literals presumably go through phase 5.

Yes. So any change in this area probably needs EWG input.

> And so, the status quo leads to implementation divergence such that a fix is needed: GCC does the useful thing while MSVC/ICC do the standard-conforming thing: https://godbolt.org/z/MEsbY5

> We also need to consider possible evolutions of the language, notably
> * diagnostic or compiler output constructed from constant expressions at compile time wg21.link/p0596r1
> * reflection on attributes https://wg21.link/p1887r1
> * attribute using constant expressions parameters, although I don't know if that has been proposed
>
> so, we can imagine something like
>
> static_assert(false, std::format(...)); which would be neat indeed.
> At this point, we would be very much past phase 5 and it becomes critical to have a good model indeed.

The model would be to perform the transcoding to execution character set
when a (runtime) object for the string (literal) is created by the compiler.
This is the same moment when we turn the memory of a compile-time std::vector<T>
into a runtime data structure.
It might be hard to differentiate a compile-time string in a std::string
from some compile-time bytes that happen to exist in a std::vector<char>,
though. In order to make this right, we probably need some machinery
to say "here comes a string".

I think this would be a terrible idea because it's observable, i.e. a function such as this one would return widely different results depending on the execution encoding:

constexpr int count_codepoints(std::string_view);
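To make the concern concrete, here is a minimal sketch. The body of count_codepoints is my own assumption (it counts every byte that is not a UTF-8 continuation byte), and the two inputs spell U+00A3 with explicit hex code units so the example does not depend on the compiler's own execution encoding. The names pound_utf8 and pound_latin1 are hypothetical.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical definition: treats the input as UTF-8 and counts every byte
// that is not a continuation byte (bit pattern 10xxxxxx).
constexpr int count_codepoints(std::string_view s) {
    int n = 0;
    for (char c : s) {
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// U+00A3 POUND SIGN as it would be stored under two execution encodings.
constexpr std::string_view pound_utf8   = "\xC2\xA3"; // UTF-8: lead byte + continuation byte
constexpr std::string_view pound_latin1 = "\xA3";     // ISO 8859-1: a single byte

// Same abstract character, different answers at compile time.
static_assert(count_codepoints(pound_utf8) == 1);
static_assert(count_codepoints(pound_latin1) == 0);
```

If the literal were transcoded only when the runtime object is created, a constant expression like this one would see one set of code units and the running program another.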

> * Redefine deprecated, nodiscard, static_assert, etc. to take a new grammar production, say "diagnostic-string-literal", which would follow all the rules of string literals (concatenation, escape sequences and so forth), but would NOT be converted to the execution encoding at any point. Note that this does not introduce a new encoding; things stay UTF-8.
> * In the future, static_assert and attributes can accept other forms which would take constant expressions of u8string_view (or so I hope, see wg21.link/p1953r0). Because all of these things require compiler support anyway, parsing has no ambiguity.
> * In this model, reflecting on [[deprecated("foo")]] would give a UTF-8 string back, because we decided to make these strings magic for convenience and backward compatibility.

Sounds about right, minus the "UTF-8" parts, which are private parts of the compiler
not specified by the standard.

The encoding of the strings returned by reflection would have to be specified; the compiler might have to do some conversion from its internal representation.

> I'm planning to put all of that in a paper but I would like to hear your thoughts before doing so.

Since you need phase 7 context to know when to transcode and when not, some of the phase 5+6
machinery probably needs to move to phase 7.

Right, we can't actually distinguish the context in phase 5+6 yet.
Gosh, for some reason I didn't identify this issue.

If at phase 7 we want string literals to be transcoded except in specific contexts... does that mean that the wording would have to perform some sort of reversal?
Moving phase 5 after phase 7 seems like major surgery, especially as we established that concatenation and encoding are certainly best left in the same step.
Wouldn't it require identifying all the places where a string-literal may appear, which is probably quite a few?

Actually....

It seems (you will certainly correct me if I am wrong) that string-literals only need to be converted to the execution encoding in places where a (primary-)expression is expected.

Neither the preprocessor, nor user-defined literals, nor linkage specifications, nor assembler statements, nor import names should go through phase 5.
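To illustrate the split, here is a small sketch of where string-literal tokens show up without ever becoming an object, versus the one place where they do. All the names (legacy_add, _len, old_api, greeting) are hypothetical and only there for the example.

```cpp
#include <cassert>
#include <cstddef>

// Linkage specification: "C" is consumed by the compiler, never stored.
extern "C" int legacy_add(int a, int b) { return a + b; }

// Literal operator declaration: the empty "" is pure syntax.
constexpr std::size_t operator""_len(const char*, std::size_t n) { return n; }

// Attribute argument: only ever shown in a diagnostic.
[[deprecated("use new_api instead")]] inline int old_api() { return 1; }

// static_assert message: likewise only ever shown in a diagnostic.
static_assert(sizeof(int) >= 2, "int must be at least 16 bits");

// Primary-expression: this literal becomes an array of char, so this is
// where conversion to the execution encoding actually matters.
constexpr const char* greeting = "hello";
```

Only the last literal has code units a program can observe, e.g. greeting[0] or "hello"_len.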

It would seem, then, that there is only one place where conversion to the execution encoding should happen, so maybe there would be a way to move part of phase 5 to [expr.prim.literal], but then we are back to the issue of how to separate hex escape sequence replacement from encoding.

I guess we could replace \x42 by \x{42} or something like that
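For reference, a minimal sketch of why the two cannot simply be done in one pass: a hex escape names a code unit value directly and is not transcoded, unlike an ordinary character in the literal. The helper code_unit is hypothetical.

```cpp
#include <cassert>

// Hypothetical helper: reads the i-th code unit of a narrow string as an
// unsigned value.
constexpr unsigned char code_unit(const char* s, int i) {
    return static_cast<unsigned char>(s[i]);
}

// \x42 yields the code unit 0x42 whatever the execution encoding is.
static_assert(code_unit("\x42", 0) == 0x42, "hex escape is taken as written");

// Two hex escapes give two code units plus the terminating null; nothing
// interprets them as a single UTF-8 sequence to be re-encoded.
static_assert(sizeof("\xC3\xA9") == 3, "two code units and a null");
static_assert(code_unit("\xC3\xA9", 0) == 0xC3, "first code unit preserved");
static_assert(code_unit("\xC3\xA9", 1) == 0xA9, "second code unit preserved");
```

So whatever is left behind for the later step has to keep escaped code units distinguishable from characters that still need encoding.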

Jens