sg16: Re: [SG16] Handling of non-basic characters in early translation phases

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Sat, 20 Jun 2020 12:10:54 -0400

On Sat, Jun 20, 2020 at 10:52 AM Jens Maurer via SG16 <sg16_at_[hidden]>
wrote:

> On 20/06/2020 15.47, Corentin Jabot wrote:
>
> > What's the implementation strategy for an implementation that wishes
> > to provide byte pass-through in string literals under your approach,
> > which tunnels everything through Unicode?
> >
> >
> > As long as a source character can be converted to a Unicode character,
> > and that Unicode character can be converted back to the same original
> > character, does it matter if it was or not?
> >
> > If A1 -> B is a valid transcoding operation, then B -> A1 is a valid
> > transcoding operation whether there exists or not a separate B -> A2
> > transcoding operation.
> > An implementation strategy would be to keep track of the original
> > character (maybe by lexing character by character in the source file
> > encoding), or use some other form of tracking. But does that strategy
> > have to be specified in the standard?
>
> "Silently keep track of the original character" is what
> we currently have with reverting raw string literals.
> It feels like cheating to require phase 5 to rely on information
> stashed away in some hiding place in phase 1.
>
> However:
> My reading of the standard is that we don't allow this behavior
> currently, because we tunnel everything through UCNs. So, allowing
> more seems evolutionary, which is probably a different paper.
>
> > In particular, "there are numbers greater than 10FFFF that can be used"
> > may not be the best implementation strategy.
>
> Sure, but that's for the implementation to decide. After all, we're only
> talking about abstract characters, not about numbers.
>
> > > #define CONCAT(x,y) x##y
> > > CONCAT(\, U0001F431);
> > >
> > > Is valid in all implementations I tested, implementation-defined
> > > in the standard.
> >
> > Is the result the named Unicode character?
> >
> > Ok, so be it. Having this as valid is fall-out from the
> > currently-described approach, and if it's consistent with
> > what implementations already do, we're good.
> >
> > > Do you see a reason to not allow it? in particular, as we move
> > > ucns handling later in the process, it would make sense to allow these
> > > escape sequences to be created in phase 2 and 4 (might be evolutionary,
> > > there is a paper)
> >
> > I think the status quo already allows creating UCNs like that,
> > so this doesn't seem to be evolutionary at all.
> >
> >
> > Isn't changing "it is implementation-defined whether ucns are formed" to
> > "ucns are formed" evolutionary?
>
> Yes, it is. Do you have a reference for the "implementation-defined" part
> here? [cpp.concat] doesn't seem to speak to the issue.
>
I don't want to throw a wrench into everything; however, this is what I
believe the situation is (with the caveat that the C99 Rationale document
is a product of WG 14 and not of WG 21):

The "status quo" is the result of wording defects. The design intent is
that the three models are isomorphic by way of making it impossible for the
user to observe the differences between models. The undefined behaviour
cases were designed to prevent observation of the model actually used by the
compiler. The removal of the undefined behaviour is a departure from the
original design intent.

The
fopen("\\ubeda\\file.txt","r")
example in the rationale document is meant to indicate that
"\{U+BEDA}"
is problematic, as is plain
"\\ubeda".

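To make the concern concrete, here is a minimal sketch, assuming an
implementation that treats "\\" as an escape sequence and does not re-scan
the result for UCNs. The helper names (rationale_path, has_backslash_u) are
hypothetical and only serve the illustration; they come from neither the
rationale document nor the standard.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: the path from the rationale example, spelled as in
// the source. Read as an escape sequence, "\\" yields one backslash, so the
// resulting string begins with a backslash followed by the letters "ubeda".
std::string rationale_path() {
    return "\\ubeda\\file.txt";
}

// Hypothetical helper: true if the text contains a backslash immediately
// followed by 'u', i.e. a spelling that a model converting to or from UCNs
// at the wrong point could confuse with the start of a UCN.
bool has_backslash_u(const std::string& s) {
    for (std::size_t i = 0; i + 1 < s.size(); ++i) {
        if (s[i] == '\\' && s[i + 1] == 'u') {
            return true;
        }
    }
    return false;
}
```

Here has_backslash_u(rationale_path()) is true. The string is meant to be
a backslash and five ordinary letters, which is exactly the spelling that
must not be mistaken for a UCN if the models are to stay indistinguishable.
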
Similarly, the observability of funnelling through UCNs is a wording defect.

Noting that "undefined behaviour" was the tool of choice at the time the
wording was produced:
In a hypothetical world where the wording had indicated that the presence of
characters not mappable to UCNs is undefined behaviour, the compiler would
have been free to "do the right thing".

>
> Jens

Received on 2020-06-20 11:14:22