sg16: Re: [SG16] Handling of non-basic characters in early translation phases

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Sat, 20 Jun 2020 21:50:52 +0200

On 20/06/2020 21.05, Hubert Tong wrote:
> On Sat, Jun 20, 2020 at 1:15 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>
> On 20/06/2020 18.10, Hubert Tong wrote:
> > I don't want to throw a wrench into everything; however, this is what I believe the situation is (with the caveat that the C99 Rationale document is a product of WG 14 and not of WG 21):
> >
> > The "status quo" is the result of wording defects. The design intent is that the three models are isomorphic by way of making it impossible for the user to observe the differences between models. The undefined behaviour cases were designed to prevent observance of the model actually used by the compiler. The removal of the undefined behaviour is a departure from the original design intent.
>
> Agreed; the "undefined behavior" needs to stay
> to avoid stepping on SG12's toes.
>
> > The
> > fopen("\\ubeda\\file.txt","r")
> > example in the rationale document is meant to indicate that
> > "\{U+BEDA}"
> > is problematic as is plain
> > "\\ubeda"
>
> Do we have to do something special about that situation?
>
> This is very similar to the existing Core Issue we have in the area. We have a defect in terms of whether we expect the former to produce \U0000BEDA or \ubeda, etc. (although the formation of the UCN is probably undesirable to begin with).

Right, and it seems we'll fix that if we keep input characters
as-is instead of translating to UCNs in phase 1.
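
To make the concern concrete, here is a toy sketch of a phase 1 that spells
every non-ASCII input character as a UCN. It is not any compiler's actual
behavior; the function names and the choice of lowercase \uXXXX spelling are
assumptions made purely for illustration. It shows that a backslash followed
by the character U+BEDA ends up with the same spelling as an escaped backslash
followed by "ubeda", so later phases can no longer tell the two inputs apart.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical model of a phase 1 that funnels input through UCNs:
// ASCII is copied unchanged, anything else becomes \uXXXX or \UXXXXXXXX.
// The lowercase hex spelling is an arbitrary choice made for this sketch.
std::string to_ucn_spelling(const std::u32string& source) {
    std::string out;
    for (char32_t c : source) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            char buf[16];
            if (c <= 0xFFFF)
                std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(c));
            else
                std::snprintf(buf, sizeof buf, "\\U%08x", static_cast<unsigned>(c));
            out += buf;
        }
    }
    return out;
}

// Source text: quote, backslash, the character U+BEDA, quote.
std::u32string source_with_character() {
    return {U'"', U'\\', char32_t{0xBEDA}, U'"'};
}

// Source text: quote, backslash, backslash, 'u' 'b' 'e' 'd' 'a', quote.
std::u32string source_with_escaped_backslash() {
    return {U'"', U'\\', U'\\', U'u', U'b', U'e', U'd', U'a', U'"'};
}

// After the toy phase 1, both inputs have the same spelling.
void check_spellings_collide() {
    assert(to_ucn_spelling(source_with_character()) ==
           to_ucn_spelling(source_with_escaped_backslash()));
}
```

Calling check_spellings_collide() from main passes under this model. Had the
sketch emitted \U0000beda or uppercase hex digits instead, the two inputs
would differ, which is one way the choice of UCN spelling becomes observable.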

> > Similarly, the observability of funnelling through UCNs is a wording defect.
>
> Is it observable anywhere?
>
> An extended character mapped to a UCN during input is required to have the same behaviour as the UCN itself. If we believe that funnelling through UCNs in any way limits the number of characters that can be distinguished by the implementation, then yes: either some characters are conflated or some input characters will be ill-formed. We lose the nuance of being able to treat the extended character as okay to encode into "plain" or wide strings and not desirable to encode into "Unicode" strings.

That is a good additional argument for why having an "implementation-defined
set of characters beyond Unicode" in the specification is a good idea for
those platforms that need the freedom.

Jens

Received on 2020-06-20 14:54:08