Re: [SG16] Handling of non-basic characters in early translation phases

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Sat, 20 Jun 2020 17:28:12 +0200
On Sat, 20 Jun 2020 at 16:52, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:

> On 20/06/2020 15.47, Corentin Jabot wrote:
>
> > What's the implementation strategy for an implementation that wishes
> > to provide byte pass-through in string literals under your approach,
> > which tunnels everything through Unicode?
> >
> >
> > As long as a source character can be converted to a Unicode character,
> > and that Unicode character can be converted back to the same original
> > character, does it matter if it was or not?
> >
> > If A1 -> B is a valid transcoding operation, then B -> A1 is a valid
> > transcoding operation whether there exists or not a separate B -> A2
> > transcoding operation.
> > An implementation strategy would be to keep track of the original
> > character (maybe by lexing character by character in the source file
> > encoding), or use some other form of tracking. But does that strategy
> > have to be specified in the standard?
>
> "Silently keep track of the original character" is what
> we currently have with reverting raw string literals.
> It feels like cheating to require phase 5 to rely on information
> stashed away in some hiding place in phase 1.
>
> However:
> My reading of the standard is that we don't allow this behavior
> currently, because we tunnel everything through UCNs. So, allowing
> more seems evolutionary, which is probably a different paper.
>

My point is that the character we convert to in phase 5 isn't mandated.
If an implementation decides to use information it has from phase 1 to make
decisions in phase 5, it can, and how or why that decision was made is not
observable (sure, the byte value can be observed, but not the reasoning the
compiler followed).
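
To make that concrete, here is a minimal sketch of the "keep track of the
original character" strategy. Everything in it is hypothetical: the toy
source encoding (bytes 0x80 and 0x81 both denote U+00C5, every other byte
maps to the code point of the same value) and the names TrackedChar,
phase1, phase5_tracked and phase5_canonical are made up for illustration
and do not describe any real compiler.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One translation character: the Unicode code point the later phases
// see, plus the source byte it came from (the side information).
struct TrackedChar {
    char32_t code_point;
    unsigned char original;
};

// Hypothetical encoding: two distinct bytes denote the same code point.
char32_t to_unicode(unsigned char byte) {
    if (byte == 0x80 || byte == 0x81) return U'\u00C5';
    return static_cast<char32_t>(byte);
}

// "Phase 1": convert to Unicode and remember the original byte.
std::vector<TrackedChar> phase1(const std::string& source) {
    std::vector<TrackedChar> out;
    for (unsigned char byte : source) out.push_back({to_unicode(byte), byte});
    return out;
}

// "Phase 5", variant A: reuse the remembered byte (byte pass-through).
std::string phase5_tracked(const std::vector<TrackedChar>& chars) {
    std::string out;
    for (const TrackedChar& c : chars) out.push_back(static_cast<char>(c.original));
    return out;
}

// "Phase 5", variant B: ignore the side information and always pick
// 0x80 for U+00C5. Also a valid conversion of the same code point.
std::string phase5_canonical(const std::vector<TrackedChar>& chars) {
    std::string out;
    for (const TrackedChar& c : chars) {
        out.push_back(c.code_point == U'\u00C5' ? static_cast<char>(0x80)
                                                : static_cast<char>(c.code_point));
    }
    return out;
}

void check_round_trip() {
    const std::string source = "a\x81" "b";
    const std::vector<TrackedChar> chars = phase1(source);
    assert(chars[1].code_point == U'\u00C5');
    assert(phase5_tracked(chars) == source);    // original byte restored
    assert(phase5_canonical(chars) != source);  // different, still U+00C5
}
```

Both variants turn U+00C5 into a byte that denotes U+00C5; a program can
see which byte it got, but not which of the two strategies produced it.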

(My assumption is that many implementations only pretend to have
translation phases, see http://eel.is/c++draft/lex.phases#footnote-6 , so
compilers usually do have this information locally.)


>
> > In particular, "there are numbers greater than 10FFFF that can be used"
> > may not be the best implementation strategy.
>
> Sure, but that's for the implementation to decide. After all, we're only
> talking about abstract characters, not about numbers.
>
> > > #define CONCAT(x,y) x##y
> > > CONCAT(\, U0001F431);
> > >
> > > Is valid in all implementations I tested, implementation-defined
> > > in the standard.
> >
> > Is the result the named Unicode character?
> >
> > Ok, so be it. Having this as valid is fall-out from the
> > currently-described approach, and if it's consistent with
> > what implementations already do, we're good.
> >
> > > Do you see a reason to not allow it? In particular, as we move
> > > UCN handling later in the process, it would make sense to allow
> > > these escape sequences to be created in phases 2 and 4 (might be
> > > evolutionary, there is a paper)
> >
> > I think the status quo already allows creating UCNs like that,
> > so this doesn't seem to be evolutionary at all.
> >
> >
> > Isn't changing "it is implementation-defined whether UCNs are formed" to
> > "UCNs are formed" evolutionary?
>
> Yes, it is. Do you have a reference for the "implementation-defined" part
> here? [cpp.concat] doesn't seem to speak to the issue.
>

Actually, I misremembered: it's described as UB, which is worse
(implementation-defined was the fix I proposed to remove the UB; sorry
about that). See http://eel.is/c++draft/lex.phases#1.4 and
http://eel.is/c++draft/lex.phases#1.2

Interestingly, C only has UB in phase 4:

Phase 2
Each instance of a backslash character (\) immediately followed by a
new-line character is deleted, splicing physical source lines to form
logical source lines. Only the last backslash on any physical source line
shall be eligible for being part of such a splice. A source file that is
not empty shall end in a new-line character, which shall not be immediately
preceded by a backslash character before any such splicing takes place.
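
For illustration only (this is not part of the quoted standard text), a
small C++ sketch of what phase 2 line splicing does in the ordinary,
well-defined case. The names SUM, spliced_literal and check_splices are
made up for the example; the splice deliberately does not form anything
resembling a universal character name.

```cpp
#include <cassert>
#include <string>

// The backslash-newline pair is deleted in phase 2, so the preprocessor
// sees this definition as one logical line.
#define SUM(a, b) \
    ((a) + (b))

// A splice can also fall inside a token: the two physical lines below
// form the single string literal "splice". The continuation line starts
// in column 0, otherwise its leading spaces would be part of the literal.
std::string spliced_literal() {
    return "spl\
ice";
}

void check_splices() {
    assert(SUM(2, 3) == 5);
    assert(spliced_literal() == "splice");
}
```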

Phase 4
Preprocessing directives are executed, macro invocations are expanded, and
_Pragma unary operator expressions are executed. If a character sequence
that matches the syntax of a universal character name is produced by token
concatenation (6.10.3.3), the behavior is undefined. A #include
preprocessing directive causes the named header or source file to be
processed from phase 1 through phase 4, recursively. All preprocessing
directives are then deleted.


>
> Jens
>

Received on 2020-06-20 10:31:35