SG16

Subject: Re: Handling of non-basic characters in early translation phases
From: Corentin Jabot (corentinjabot_at_[hidden])
Date: 2020-06-20 10:45:11


On Sat, 20 Jun 2020 at 17:42, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:

> On 20/06/2020 17.28, Corentin Jabot wrote:
> >
> >
> > On Sat, 20 Jun 2020 at 16:52, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
> >
> > On 20/06/2020 15.47, Corentin Jabot wrote:
> >
> > > What's the implementation strategy for an implementation that
> > > wishes to provide byte pass-through in string literals under your
> > > approach, which tunnels everything through Unicode?
> > >
> > >
> > > As long as a source character can be converted to a Unicode
> > > character, and that Unicode character can be converted back to the
> > > same original character, does it matter if it was or not?
> > >
> > > If A1 -> B is a valid transcoding operation, then B -> A1 is a valid
> > > transcoding operation, whether or not a separate B -> A2 transcoding
> > > operation exists.
> > > An implementation strategy would be to keep track of the original
> > > character (maybe by lexing character by character in the source file
> > > encoding), or use some other form of tracking. But does that strategy
> > > have to be specified in the standard?
> >
> > "Silently keep track of the original character" is what
> > we currently have with reverting raw string literals.
> > It feels like cheating to require phase 5 to rely on information
> > stashed away in some hiding place in phase 1.
> >
> > However:
> > My reading of the standard is that we don't allow this behavior
> > currently, because we tunnel everything through UCNs. So, allowing
> > more seems evolutionary, which is probably a different paper.
> >
> >
> > My point is that which character we convert TO in phase 5 isn't
> > mandated. If an implementation decides to use information it knows in
> > phase 1 to make decisions in phase 5, it can, and it's not observable
> > how or why that decision was made (sure, the byte value can be
> > observed, but not the reasoning the compiler took).
>
> It's implementation-defined, so there's a requirement to document the
> phase 5 mapping. I agree that document could say "we map back to the
> original character", re-introducing the raw string reversal oddity.
>
> > > Isn't changing "it is implementation-defined whether UCNs are
> > > formed" to "UCNs are formed" evolutionary?
> >
> > Yes, it is. Do you have a reference for the "implementation-defined"
> > part here? [cpp.concat] doesn't seem to speak to the issue.
> >
> > Actually, I misremembered: it's described as UB, which is worse
> > (implementation-defined was the fix I proposed to remove the UB; sorry
> > about that).
> > http://eel.is/c++draft/lex.phases#1.4
> > http://eel.is/c++draft/lex.phases#1.2
>
> Thanks. The phase 4 phrasing should be moved to [cpp.concat], where it
> belongs, but otherwise left alone. We don't want to stomp onto SG12's feet.
>

Unless we make it well-defined by virtue of shuffling the phases around,
right?
(Again, I suspect it is evolutionary and on JF's plate.)

>
> (We already have undefined behavior there if we form something
> that's not a valid pp-token.)
>
> Jens
>
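
To make the round-trip argument quoted above concrete, here is a minimal
sketch of the "keep track of the original character" strategy. Everything
in it is an assumption made for illustration: the single-byte source
encoding is made up (bytes 0xA1 and 0xA2 both map to U+00C5), and the names
TrackedChar, decode, encode_canonical and encode_tracked are hypothetical.
It is not how any particular compiler does it, and nothing in the standard
describes it.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical single-byte source encoding: ASCII maps to itself, and two
// distinct bytes (0xA1 and 0xA2) both map to the same code point U+00C5.
constexpr char32_t kSharedCodePoint = 0x00C5;
constexpr char32_t kReplacement = 0xFFFD;

// One character as seen after the phase 1 stand-in: the Unicode value used
// by the later phases, plus the original byte stashed away alongside it.
struct TrackedChar {
    char32_t code_point;
    unsigned char source_byte;
};

// Phase 1 stand-in: source byte -> Unicode (A1 -> B and A2 -> B).
char32_t to_unicode(unsigned char b) {
    if (b < 0x80) return b;
    if (b == 0xA1 || b == 0xA2) return kSharedCodePoint;
    return kReplacement;
}

// Phase 5 stand-in without tracking: Unicode -> one fixed byte (B -> A1).
unsigned char from_unicode(char32_t c) {
    if (c < 0x80) return static_cast<unsigned char>(c);
    if (c == kSharedCodePoint) return 0xA1;
    return '?';
}

// Decode the source bytes, remembering each original byte.
std::vector<TrackedChar> decode(const std::string& src) {
    std::vector<TrackedChar> out;
    for (unsigned char b : src) out.push_back({to_unicode(b), b});
    return out;
}

// Re-encode using only the Unicode values; 0xA2 comes back as 0xA1.
std::string encode_canonical(const std::vector<TrackedChar>& chars) {
    std::string out;
    for (const TrackedChar& tc : chars)
        out.push_back(static_cast<char>(from_unicode(tc.code_point)));
    return out;
}

// Re-encode using the stashed original bytes; this gives byte pass-through.
std::string encode_tracked(const std::vector<TrackedChar>& chars) {
    std::string out;
    for (const TrackedChar& tc : chars)
        out.push_back(static_cast<char>(tc.source_byte));
    return out;
}
```

With the input bytes 'a', 0xA2, 'b', encode_tracked reproduces the input
exactly, while encode_canonical yields 'a', 0xA1, 'b'. Both are valid
transcodings of the same Unicode sequence; the only difference is whether
phase 5 consults what phase 1 remembered.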



SG16 list run by sg16-owner@lists.isocpp.org