Re: [SG16] Handling of non-basic characters in early translation phases

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Sat, 20 Jun 2020 15:05:40 -0400
On Sat, Jun 20, 2020 at 1:15 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:

> On 20/06/2020 18.10, Hubert Tong wrote:
> > I don't want to throw a wrench into everything; however, this is what I
> > believe the situation is (with the caveat that the C99 Rationale document
> > is a product of WG 14 and not of WG 21):
> >
> > The "status quo" is the result of wording defects. The design intent is
> > that the three models are isomorphic by way of making it impossible for the
> > user to observe the differences between models. The undefined behaviour
> > cases were designed to prevent observance of the model actually used by the
> > compiler. The removal of the undefined behaviour is a departure from the
> > original design intent.
> Agreed; the "undefined behavior" needs to stay
> to avoid stepping on SG12's toes.
> > The
> > fopen("\\ubeda\\file.txt","r")
> > example in the rationale document is meant to indicate that
> > "\{U+BEDA}"
> > is problematic as is plain
> > "\\ubeda"
> Do we have to do something special about that situation?
This is very similar to the existing Core Issue we have in the area. We
have a defect in terms of whether we expect the former to produce
\U0000BEDA or \ubeda, etc. (although the formation of the UCN is probably
undesirable to begin with).

For the latter case, the rationale document seems to say that supporting
Model C means that the apparent UCN could be replaced by the implementation
early. What I believe was the design intent in terms of supporting all of
the models is probably something we're going to step away from in any case.
The reason why I brought it up is that I believe it informs us in terms of
the range of concerns that were "addressed".
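
As a rough illustration of the two spellings discussed above, here is a minimal sketch. It assumes an implementation that processes escape sequences in ordinary string literals from left to right and leaves raw string contents untouched, which is what current compilers appear to do; it is not a statement of what the wording guarantees. The names length, path, and raw are made up for the example.

```cpp
#include <cstddef>

// Number of characters in a string literal, excluding the terminating NUL.
// (Helper name is hypothetical, introduced only for this sketch.)
template <std::size_t N>
constexpr std::size_t length(const char (&)[N]) { return N - 1; }

// "\\ubeda\\file.txt": assumed to be read as the escape sequence \\ followed
// by the letters u, b, e, d, a, i.e. no UCN is formed from the apparent \ubeda.
constexpr char path[] = "\\ubeda\\file.txt";
static_assert(length(path) == 15, "backslash + ubeda + backslash + file.txt");
static_assert(path[0] == '\\' && path[1] == 'u', "no UCN was formed");

// In a raw string literal the same six source characters stay as spelled.
constexpr char raw[] = R"(\ubeda)";
static_assert(length(raw) == 6, "raw string keeps the spelling");

// Outside of spelling-preserving contexts, the short and long UCN forms
// name the same character.
static_assert(U'\ubeda' == U'\U0000BEDA', "both spellings denote U+BEDA");
static_assert(U'\ubeda' == 0xBEDA, "code point value");
```

The block has no main; it compiles on its own as a translation unit, and the static_asserts fail at compile time on an implementation that replaces the apparent UCN early.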

> > Similarly, the observability of funnelling through UCNs is a wording
> > defect.
> Is it observable anywhere?
An extended character mapped to a UCN during input is required to have the
same behaviour as the UCN itself. If we believe that funnelling through
UCNs in any way limits the number of characters that can be distinguished
by the implementation, then yes: either some characters are conflated or
some input characters will be ill-formed. We lose the nuance of being able
to treat the extended character as okay to encode into "plain" or wide
strings and not desirable to encode into "Unicode" strings.
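
To make the distinction between literal kinds concrete, here is a small sketch that only covers the "Unicode" side, where the encoding of a UCN is fixed. Nothing is asserted about plain or wide literals, since their execution encoding is implementation-defined. The variable names are hypothetical, and the character is written as a UCN rather than as an extended source character so that the example does not depend on the source file encoding.

```cpp
// U+BEDA written as a UCN in each of the Unicode literal kinds.
constexpr char32_t utf32[] = U"\ubeda";
constexpr char16_t utf16[] = u"\ubeda";
constexpr auto& utf8 = u8"\ubeda";  // char or char8_t array, depending on the standard mode

// UTF-32 and UTF-16 each need a single code unit for a BMP character.
static_assert(utf32[0] == 0xBEDA, "one UTF-32 code unit");
static_assert(utf16[0] == 0xBEDA, "one UTF-16 code unit");

// UTF-8 needs three code units (EB BB 9A) plus the terminating NUL.
static_assert(sizeof(utf8) == 4, "three UTF-8 code units + NUL");
static_assert(static_cast<unsigned char>(utf8[0]) == 0xEB, "lead byte");
static_assert(static_cast<unsigned char>(utf8[1]) == 0xBB, "continuation byte");
static_assert(static_cast<unsigned char>(utf8[2]) == 0x9A, "continuation byte");
```

An extended character that is funnelled through this UCN on input would have to produce exactly these code units in the Unicode literals, whatever the implementation might otherwise have preferred to do with it in plain or wide strings.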

> Jens

Received on 2020-06-20 14:09:08