sg16: Re: [SG16] Is it an error to encounter a character without a valid UCN?

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Wed, 3 Jun 2020 19:23:21 +0200

On 03/06/2020 03.38, Hubert Tong via SG16 wrote:
> I'm not sure where we are expecting this diagnostic to come into play. If a vendor is dealing with an encoding that has such characters and it is both the source and assumed execution character set, then I doubt they are interested in telling their users that their strings have been outlawed by the committee.

I'm reading [lex.phases] p1.1 as implicitly assuming
that any character not in the basic source
character set can be represented as a UCN.

In particular, it seems to prescribe that the implementation
translate any character (except those from the basic source
character set) to a UCN. If a valid UCN doesn't exist for that
character, presumably you can't translate that program.

Note that "valid UCN" means "value between 0 and 0x10ffff except
surrogate code points", but does not mean "has an assignment in
Unicode", so an implementation could use values from 0x10ffff
downward (hopefully unassigned by Unicode) for its special
characters. Yup, seems that would work:
https://en.wikipedia.org/wiki/Private_Use_Areas
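
A minimal sketch of that reading of "valid UCN", in case it helps. The
function names are hypothetical, and the mapping is only one possible
scheme: it assumes the implementation numbers its special characters
from 0 and counts downward from U+10FFFD, the top of Supplementary
Private Use Area-B, rather than from 0x10ffff itself. Nothing here is
prescribed by the standard.

```cpp
#include <cstdint>

// Hypothetical helper: true if `v` is a value a UCN may designate under the
// reading above, i.e. 0..0x10FFFF excluding surrogate code points.
// It deliberately does not ask whether Unicode has assigned the code point.
constexpr bool is_valid_ucn_value(std::uint32_t v) {
    const bool in_range = v <= 0x10FFFF;
    const bool is_surrogate = v >= 0xD800 && v <= 0xDFFF;
    return in_range && !is_surrogate;
}

// Hypothetical mapping for an implementation-specific character with 0-based
// index `n`: count downward from U+10FFFD. Assumption of this sketch: the
// caller keeps n <= 0xFFFD so the result stays inside U+100000..U+10FFFD.
constexpr std::uint32_t special_char_to_ucn_value(std::uint32_t n) {
    return 0x10FFFD - n;
}

// Range boundaries and surrogate boundaries.
static_assert(is_valid_ucn_value(0x10FFFF));
static_assert(!is_valid_ucn_value(0x110000));
static_assert(is_valid_ucn_value(0xD7FF));
static_assert(!is_valid_ucn_value(0xD800));
static_assert(!is_valid_ucn_value(0xDFFF));
static_assert(is_valid_ucn_value(0xE000));

// A mapped special character is representable as a UCN.
static_assert(is_valid_ucn_value(special_char_to_ucn_value(0)));
```

The point of the check is that it tests range and surrogates only; a
value with no Unicode assignment still passes.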

So, there is a distinction between "Unicode character set"
and "representable as UCN", and the latter appears to offer
enough of an escape hatch to make non-Unicode environments
happy. We shouldn't needlessly stomp on that happiness.

Jens

Received on 2020-06-03 12:26:33