sg16: Re: [SG16] Handling of non-basic characters in early translation phases

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Sat, 20 Jun 2020 17:42:35 +0200

On 20/06/2020 17.28, Corentin Jabot wrote:
>
>
> On Sat, 20 Jun 2020 at 16:52, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>
> On 20/06/2020 15.47, Corentin Jabot wrote:
>
> > What's the implementation strategy for an implementation that wishes to provide
> > byte pass-through in string literals under your approach, which tunnels everything
> > through Unicode?
> >
> >
> > As long as a source character can be converted to a Unicode character, and that Unicode character can be converted back to the same original character,
> > does it matter if it was or not?
> >
> > If A1 -> B is a valid transcoding operation, then B -> A1 is a valid transcoding operation whether or not there exists a separate B -> A2 transcoding operation.
> > An implementation strategy would be to keep track of the original character (maybe by lexing character by character in the source file encoding), or use some other form of tracking. But does that strategy have to be specified in the standard?
>
> "Silently keep track of the original character" is what
> we currently have with reverting raw string literals.
> It feels like cheating to require phase 5 to rely on information
> stashed away in some hiding place in phase 1.
>
> However:
> My reading of the standard is that we don't allow this behavior
> currently, because we tunnel everything through UCNs. So, allowing
> more seems evolutionary, which is probably a different paper.
>
>
> My point is that which character we convert TO in phase 5 isn't mandated.
> If an implementation decides to use information it knows in phase 1 to make decisions in phase 5 it can - and it's not observable how or why that decision was made
> (sure, the byte value can be observed, but not the reasoning the compiler took)

It's implementation-defined, so there's a requirement to document the
phase 5 mapping. I agree that document could say "we map back to the
original character", re-introducing the raw string reversal oddity.
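
To make the "map back to the original character" option concrete, here is a minimal sketch, not a description of any real compiler. Everything in it is an assumption made for illustration: the toy source encoding (bytes 0x80 and 0x81 both denoting U+00C5), the `TrackedChar` type, and the `phase1` / `phase5_*` function names.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical toy source encoding, not a real code page:
// bytes 0x80 and 0x81 both denote U+00C5; every other byte maps to
// the code point with the same numeric value.
struct TrackedChar {
    char32_t code_point;     // what the later phases operate on
    unsigned char original;  // byte stashed away in phase 1
};

inline char32_t to_unicode(unsigned char b) {
    return (b == 0x80 || b == 0x81) ? U'\u00C5' : static_cast<char32_t>(b);
}

// Canonical reverse mapping: U+00C5 always becomes 0x80 (A2 -> B -> A1).
inline unsigned char from_unicode(char32_t c) {
    return c == U'\u00C5' ? 0x80 : static_cast<unsigned char>(c);
}

// "Phase 1": convert source bytes to Unicode, remembering each original byte.
inline std::vector<TrackedChar> phase1(const std::string& source) {
    std::vector<TrackedChar> out;
    for (unsigned char b : source) out.push_back({to_unicode(b), b});
    return out;
}

// "Phase 5", variant A: use only the Unicode value (no hidden state).
inline std::string phase5_canonical(const std::vector<TrackedChar>& chars) {
    std::string out;
    for (const TrackedChar& c : chars)
        out.push_back(static_cast<char>(from_unicode(c.code_point)));
    return out;
}

// "Phase 5", variant B: map back to the original byte (byte pass-through).
inline std::string phase5_pass_through(const std::vector<TrackedChar>& chars) {
    std::string out;
    for (const TrackedChar& c : chars)
        out.push_back(static_cast<char>(c.original));
    return out;
}
```

For a source byte 0x81, variant A yields 0x80 and variant B yields 0x81. The difference is observable in the resulting bytes, so the documented phase 5 mapping would need to say which one applies.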

> > Isn't changing "it is implementation-defined whether UCNs are formed" to "UCNs are formed" evolutionary?
>
> Yes, it is. Do you have a reference for the "implementation-defined" part
> here? [cpp.concat] doesn't seem to speak to the issue.
>
> Actually, I misremembered: it's described as UB, which is worse (implementation-defined was the fix I proposed to remove the UB; sorry about that) http://eel.is/c++draft/lex.phases#1.4 http://eel.is/c++draft/lex.phases#1.2

Thanks. The phase 4 phrasing should be moved to [cpp.concat] where it belongs,
but otherwise left alone. We don't want to step on SG12's toes.

(We already have undefined behavior there if we form something
that's not a valid pp-token.)
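
For illustration only, a small sketch of the token concatenation cases in question. The `CAT` macro and the function names are hypothetical; the two problematic uses are left in comments because they are exactly the cases with undefined behavior under the wording linked above.

```cpp
#include <cassert>

#define CAT(a, b) a##b

// `val` ## `ue` forms the single identifier `value`: well-defined.
inline int CAT(val, ue)() { return 42; }

// `1` ## `2` forms the pp-number `12`: well-defined.
inline int twelve() { return CAT(1, 2); }

// Not compiled: `+` ## `/` would form `+/`, which is not a valid
// pp-token, so the behavior is undefined.
//   int bad = 1 CAT(+, /) 2;

// Not compiled: concatenating the tokens `\` and `u00C5` would produce
// a character sequence matching the syntax of a universal-character-name,
// which the phase 4 wording also makes undefined.
//   CAT(\, u00C5)
```

Compilers commonly diagnose the first commented-out case rather than silently accepting it, but that is a quality-of-implementation choice, not something the wording requires.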

Jens

Received on 2020-06-20 10:45:50