
SG16


Subject: Re: Handling of non-basic characters in early translation phases
From: Jens Maurer (Jens.Maurer_at_[hidden])
Date: 2020-06-20 09:51:58


On 20/06/2020 15.47, Corentin Jabot wrote:

> What's the implementation strategy for an implementation that wishes to provide
> byte pass-through in string literals under your approach, which tunnels everything
> through Unicode?
>
>
> As long as a source character can be converted to a Unicode character, and that Unicode character can be converted back to the same original character,
> does it matter if it was or not?
>
> If A1 -> B is a valid transcoding operation, then B -> A1 is a valid transcoding operation, whether or not there exists a separate B -> A2 transcoding operation.
> An implementation strategy would be to keep track of the original character (maybe by lexing character by character in the source file encoding), or use some other form of tracking. But does that strategy have to be specified in the standard?

"Silently keep track of the original character" is what
we currently have with reverting raw string literals.
It feels like cheating to require phase 5 to rely on information
stashed away in some hiding place in phase 1.
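To make the "hiding place" concrete, here is a minimal sketch of what such tracking could look like. Everything in it is hypothetical: the toy source encoding (ASCII plus two bytes, 0x80 and 0x81, that both denote the same abstract character), the choice of U+20AC as the target, and the names phase1_decode, phase5_encode_canonical and phase5_encode_tracked. It is not how any particular compiler works, just an illustration of the A1 -> B -> A1 versus A2 -> B -> A1 distinction.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical toy source encoding: ASCII plus two bytes, 0x80 and 0x81,
// that are assumed to denote the same abstract character (U+20AC here).
constexpr char32_t kEuro = U'\u20AC';

struct TrackedChar {
    char32_t code_point;    // what the later phases operate on
    std::uint8_t original;  // the "hiding place": source byte kept from phase 1
};

// Phase 1 (sketch): map a source byte to Unicode, remembering the byte.
inline TrackedChar phase1_decode(std::uint8_t byte) {
    assert(byte <= 0x81);  // the toy encoding defines nothing above 0x81
    if (byte == 0x80 || byte == 0x81) return {kEuro, byte};
    return {static_cast<char32_t>(byte), byte};
}

// Phase 5 without tracking: B -> A1 only, so 0x81 cannot be recovered.
inline std::uint8_t phase5_encode_canonical(char32_t cp) {
    return cp == kEuro ? std::uint8_t{0x80} : static_cast<std::uint8_t>(cp);
}

// Phase 5 with tracking: byte pass-through by consulting the stashed byte.
inline std::uint8_t phase5_encode_tracked(const TrackedChar& c) {
    return c.original;
}
```

With this sketch, the byte 0x81 comes back as 0x80 when going through the code point alone, and as 0x81 only when phase 5 looks at the stashed original, which is exactly the reliance on phase 1 information described above.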

However:
My reading of the standard is that we don't allow this behavior
currently, because we tunnel everything through UCNs. So, allowing
more seems evolutionary, which is probably a different paper.

> In particular, "there are numbers greater than 10FFFF that can be used" may not be the best implementation strategy.

Sure, but that's for the implementation to decide. After all, we're only talking
about abstract characters, not about numbers.
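For illustration only, the "numbers greater than 10FFFF" strategy mentioned in the quote could be sketched as follows. The internal character type, the base value 0x110000 and the function names are assumptions made up for this example; an implementation is free to use something entirely different, and nothing here is implied by the standard.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical internal character representation used between phases.
using InternalChar = std::uint32_t;

constexpr InternalChar kMaxUnicode = 0x10FFFF;  // last Unicode code point
constexpr InternalChar kTunnelBase = 0x110000;  // assumed: first value past Unicode

// True if the value is not a Unicode code point but a tunneled source byte.
constexpr bool is_tunneled(InternalChar c) { return c > kMaxUnicode; }

// Represent a source byte with no Unicode mapping as an out-of-range value.
inline InternalChar tunnel_byte(std::uint8_t byte) { return kTunnelBase + byte; }

// Recover the original byte when emitting the literal.
inline std::uint8_t untunnel_byte(InternalChar c) {
    assert(c >= kTunnelBase && c <= kTunnelBase + 0xFF);
    return static_cast<std::uint8_t>(c - kTunnelBase);
}
```

The point of the sketch is only that such values never collide with real code points; whether that is preferable to a side table is an implementation choice.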

> > #define CONCAT(x,y) x##y 
> > CONCAT(\, U0001F431); 
> >
> > Is valid in all implementations I tested, implementation-defined in the standard.
>
> Is the result the named Unicode character?
>
> Ok, so be it.  Having this as valid is fall-out from the
> currently-described approach, and if it's consistent with
> what implementations already do, we're good.
>
> > Do you see a reason not to allow it? In particular, as we move UCN handling later
> > in the process, it would make sense to allow these escape sequences to be created in phases 2 and 4 (might be evolutionary, there is a paper)
>
> I think the status quo already allows creating UCNs like that,
> so this doesn't seem to be evolutionary at all.
>
>
> Isn't changing "it is implementation-defined whether UCNs are formed" to "UCNs are formed" evolutionary?

Yes, it is. Do you have a reference for the "implementation-defined" part
here? [cpp.concat] doesn't seem to speak to the issue.

Jens

