sg16: Re: [SG16] Handling of non-basic characters in early translation phases

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Sat, 20 Jun 2020 15:47:55 +0200

On Sat, 20 Jun 2020 at 15:16, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:

> On 20/06/2020 13.54, Corentin Jabot wrote:
>
> > * Should the standard mandate that byte values are preserved? ( I
think this would put severe constraints on implementations )
>
> My view: allowed, but not required. But we should acknowledge that
> state of affairs by allowing characters beyond Unicode in the source
> character set. (In particular, nobody should be forced to map
> non-Unicode chars to the Unicode private use area.)
>
> > If we do believe that, then talking of source encoding from phase 1 to 5
> is useful, otherwise, if we believe that preservation of byte value is in
> the domain of QOI, an implementation can choose any path through phase 1
> and 5 as long as the _semantic_ is preserved.
>
> I don't think it's useful to talk about "semantic" here,
> and I think we can avoid talking about source encoding by
> simply allowing enough space in the definition of "source
> character set" so that implementation-defined mappings can
> avoid non-unique mappings if they so prefer.
>

semantic == preservation of abstract characters ( not replacing an A by a B )

>
> > In phase 1, if one source character can be represented by more than one
> unicode codepoint sequence, an implementation can choose which ( in a
> scenario where phase 1 is semantically preserving, which we haven't decided
> to do yet). For example Å (Angstrom sign) can be either U+212B or U+00C5 in
> phase 1.
>
> Right, and I would expect that an implementation chooses depending on the
> physical input encoding, which makes U+00C5 the far more likely choice
> (because that's the character that's usually part of Nordic alphabets).
>
> > Similarly, ≒ (U+2252, Approximately Equal to or the Image Of) can be
> encoded in phase 5 as either 0x8790 or 0x81e0, an implementation can choose
> which.
>
> My guess is you're referring to Shift-JIS here, and the following
> statements are a part of this.
>

Oops, glad you understood despite my glaring omission.

>
> > If the internal Character set we choose in the standard is Unicode, the
> wording would lose the source information such that we wouldn't be able to
> prescribe
> > byte value information, but an implementation could, if it desired.
> That last part I don't get. Suppose I'm in a Shift-JIS-only world, my
> input text
> is Shift-JIS, and my input contains the two characters 0x8790 0x81e0 .
>
> If the only available internal characters are Unicode, the implementation
> has little choice but to map both to U+2252.
>
> Now, when the wchar_t[] literal object is produced later in translation,
> the
> information is lost which of the two Shift-JIS values you started with.
>
> However, if we allow (but not require) the implementation to (say)
> map 0x8790 to U+2252 and 0x81e0 to non-Unicode-char-1 in phase 1, then
> (per QoI) a compiler could offer the service to reproduce the original
> Shift-JIS values unharmed.
>

Alternatively, a compiler can, per QoI, map U+2252 to either 0x8790
or 0x81e0 in phase 5. Does the standard need to know why or how that
decision was made?
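
To make the question concrete, here is a rough sketch of what I have in
mind. The names (SjisChoice, encode_sjis) are made up for illustration and
do not come from any compiler; the only facts taken from this thread are
that 0x8790 and 0x81e0 are both Shift-JIS spellings of U+2252.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical policy knob: which of the two spellings phase 5 emits.
// Nothing like this exists in the standard; it is purely QoI.
enum class SjisChoice { Use8790, Use81E0 };

// Phase 5 sketch: map one code point to a two-byte Shift-JIS value.
// Only U+2252 is handled here; a real table would be much larger.
inline std::optional<std::uint16_t> encode_sjis(char32_t cp, SjisChoice choice) {
    if (cp == U'\u2252') {
        if (choice == SjisChoice::Use8790) {
            return std::uint16_t{0x8790};
        }
        return std::uint16_t{0x81E0};
    }
    return std::nullopt;  // not covered by this sketch
}
```

Either result is a faithful encoding of the same abstract character, and
how `choice` gets its value (a flag, the source encoding, anything else)
stays invisible to the wording.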

>
> My proposal and my intent is all about allowing (but not requiring) the
> latter,
> but I don't know how to allow this cleanly if we force the compiler to
> translate everything to Unicode characters in phase 1.
>
> > Overall the source question can be abstracted away:
> > Given the string literal "\u2252", assuming a shift jis narrow encoding,
> what should its byte value in the program be?
> >
> > * 0x8790
> > * 0x81e0
> > * Implementation-defined?
>
> > > Again, 2 issues:
> > > * This describes an internal encoding, not a source encoding. We
> should not talk about "source" past phase 1
> >
> > It's still "source code", maybe internally represented, as opposed to
> > compiled machine code. Given that we already have the term "(basic)
> source
> > character set" in the standard, I don't see a need to invent
> something new.
> > I'm particularly non-enthused about the phrase "internal encoding"
> (internal
> > relative to what?)
> >
> >
> > The goal is to make it clear that it is not the encoding of source files.
>
> I'm not worried about that. First, don't talk about "source encoding", just
> about "character set". Also, we already have "physical source files" in
> phase 1
> to distinguish from internal and abstract representations.
>

I know that the standard is not a tutorial and all, but, "source" seems to
imply a relation to source files, even if none is meant.

>
> > > * There is no use case for a super set of Unicode. I described
> the EBCDIC control character issue to the Unicode mailing list, it was
> qualified as "daft".
> >
> > As I said earlier, it appears that Unicode says that control
> characters
> > are essentially out-of-scope for them (which I sympathize with, from
> their
> > viewpoint), so I would not turn to Unicode for insight how to handle
> > EBCDIC control characters that don't have a semantic equivalent in
> > Unicode.
> >
> > In an EBCDIC-only world, I think there is a real conflict between
> > an EBCDIC control character mapped to a C1 control character in
> phase 1
> > the presence of a UCN naming that same control character somewhere
> > in the original source code. The presence of the UCN may or may not
> > be intentional, I would like to allow implementations to flag this
> > situation.
> >
> >
> > Their position is that a compiler cannot know what the semantic of a
> codepoint which is a C1 or C0
> > control character is, as they don't have semantic.
>
> Again, they don't have semantic from a Unicode viewpoint (which is fine),
> but in a larger system context, they sure have semantics (otherwise
> they wouldn't have a reason to exist in the first place). How much of
> those semantics is known to the compiler is a separate question.
>
> > A compiler could flag C0/C1 UCN escape sequences in literals, if it
wanted to.
> > And again I'm trying to be pragmatic here. The work IBM is doing to get
> clang to support ebcdic is converting that ebcdic to utf-8.
>
> Maybe that's because it's the only option under the status quo of C++,
> which needs to tunnel everything through UCNs.
>

I think it's more about the cost/benefits of supporting that use case.
I really would like to know from IBM people if and how much they are
actually concerned about this point as it is driving many decisions.

>
> > > All characters that a C++ compiler ever has cared, does care, or will
care about have a mapping to Unicode.
> >
> > But possibly not a unique mapping from Unicode back to the original
> character,
> > which seems useful for transparent string-literal pass-through.
> >
> > Tom, I think the question whether there should be allowance for
> pass-through of
> > characters beyond Unicode should be up for a straw poll at the next
> telecon so
> > that we can make progress here.
> >
> >
> > Again, for the record, my position is "should be allowed, not mandated"
>
> It feels we're in violent agreement here, so the remaining question maybe
> just is:
>
> What's the implementation strategy for an implementation that wishes to
> provide
> byte pass-through in string literals under your approach, which tunnels
> everything
> through Unicode?
>

As long as a source character can be converted to a Unicode character, and
that Unicode character can be converted back to the same original character,
does it matter whether the conversion actually took place?

If A1 -> B is a valid transcoding operation, then B -> A1 is a valid
transcoding operation, whether or not a separate B -> A2 transcoding
operation also exists.
An implementation strategy would be to keep track of the original character
(maybe by lexing character by character in the source file encoding), or
use some other form of tracking. But does that strategy have to be
specified in the standard?
In particular, "there are numbers greater than 10FFFF that can be used" may
not be the best implementation strategy.
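
As a sketch only, one possible shape for that kind of tracking is below.
TrackedChar, decode_sjis_pair and encode_for_literal are hypothetical names,
not anything an existing compiler is known to use, and the sketch assumes
the literal encoding is the same as the source encoding.

```cpp
#include <string>

// Hypothetical internal character: the Unicode code point plus the bytes it
// was spelled with in the source file (empty if it came from a UCN).
struct TrackedChar {
    char32_t code_point;
    std::string source_bytes;
};

// Phase 1 sketch for a Shift-JIS source: both spellings map to U+2252,
// but each one remembers its own bytes.
inline TrackedChar decode_sjis_pair(unsigned char lead, unsigned char trail) {
    std::string bytes{static_cast<char>(lead), static_cast<char>(trail)};
    if ((lead == 0x87 && trail == 0x90) || (lead == 0x81 && trail == 0xE0)) {
        return {U'\u2252', bytes};
    }
    return {U'\uFFFD', bytes};  // placeholder: the rest of the table is omitted
}

// Phase 5 sketch: reuse the original bytes when there are any, otherwise
// fall back to whatever default spelling the implementation prefers.
inline std::string encode_for_literal(const TrackedChar& c,
                                      const std::string& fallback) {
    return c.source_bytes.empty() ? fallback : c.source_bytes;
}
```

Both inputs end up as the same Unicode character, yet each literal gets its
original bytes back, and no value above 10FFFF is needed.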

>
> > > - Define "basic source character set" as a subset of the
> "source character set"
> > > with an explicit list of Unicode characters.
> > >
> > >
> > > There is no need for that construct -
> >
> > These are the use-cases for the term "basic source character set":
> >
> > - keywords are spelled in the basic source character set
> >
> > - basic source characters can be represented in a single byte in
> plain "char" literals
> >
> >
> > - UCNs denoting characters in the basic source character set are
> ill-formed
> > [lex.charset] p2
> >
> > - Timezone parsing (the table in [time.parse], flag %Z)
> >
> > - do_widen / do_narrow [locale.ctype.virtuals]
> >
> > So, the term seems to be useful as a descriptive tool when we're
> intentionally
> > referring to a that subset.
> >
> >
> > I agree that there is a need for a term, but many of these can be
better described in terms of, for example, "basic literal character set" or,
more accurately, "basic literal character repertoire"
>
> Since we're only talking about an (abstract) set ("repertoire") of
> characters,
> it doesn't really matter whether we apply such a concept to the source code
> or the literal-encoded character domains.
>
> If it feels better, we can rename "basic source character set" to "basic
> character set".
>

+1

>
> > > I would actually prefer we don't try to define the execution
> character set in terms of a basic one which is tied to the internal
> > > representation.
> >
> > I don't think the standard needs to talk about internal
> "representation",
> > understood as specific code point values, at all, so I don't see the
> confusion
> > here.
> >
> > We talk about source to refer to something that is not related to source
> ... not exclusively related to source, yes.
>
> > I don't think we need an execution character set per se, but it
> seems worthwhile to
> > be able to say "for this particular small set of ASCII characters,
> special constraints
> > for the literal encoding/representation apply".
> >
> >
> > Agreed.
> > And I think we might need slightly different definitions for some of the
points you cited (which I was aware of; I meant that these things would
need to be described differently, not that removing the term would have no
ripple effect). My goal is that literals are not defined in terms of source
>
> Literals bridge the gap between source and execution, so there is some
> relationship.
>
> > "identifier" is ambiguous between phase 4 and phase 7 identifiers.
> > We can't translate UCNs in phase 4 (due to stringizing), but we want
> > a single spelling in phase 7 (so no confusion arises what goes into
> > linker symbols etc). Previously, the single spelling was "UCNs
> > everywhere"; now, the single spelling is "(extended) characters
> > everywhere".
> >
> >
> > Right, I guess identifiers can be handled in phase 4+ or 7-.
> > Which means that pp-identifiers and pp-tokens can be composed of both UCN
escape sequences and XID_Start/XID_Continue code points.
>
> Yes, unfortunately (curse stringizing).
>
> > #define CONCAT(x,y) x##y
> > CONCAT(\, U0001F431);
> >
> > Is valid in all implementations I tested, implementation-defined in the
> standard.
>
> Is the result the named Unicode character?
>
> Ok, so be it. Having this as valid is fall-out from the
> currently-described approach, and if it's consistent with
> what implementations already do, we're good.
>
> > Do you see a reason to not allow it? in particular, as we move ucns
> handling later
> > in the process, it would make sense to allow these escape sequences to
> be created in phase 2 and 4 (might be evolutionary, there is a paper)
>
> I think the status quo already allows creating UCNs like that,
> so this doesn't seem to be evolutionary at all.
>

Isn't changing "it is implementation-defined whether UCNs are formed" to
"UCNs are formed" evolutionary?
I still don't really grasp the threshold for EWG involvement.

> > What semantic? A string literal consists of a sequence of source
> characters.
> > At the end, we get a sequence of integer values in an array object.
> > We can certainly weave a "corresponding" into the process, but that's
> > essentially vacuous handwaving from a normative standpoint.
> > (The preceding statement only applies to char and wchar_t encodings,
> > of course, not to the well-defined UTF-x encodings.)
> >
> >
> > In particular, an implementation can do any conversion it wants,
including replacing characters that have no representation by another.
> > That is currently implementation-defined behavior, which I would
like to make ill-formed. I know it's evolutionary; there is a paper.
>
> I'm not trying to make EWG-level changes like that part of my proposal,
> in particular since it seems to invalidate current implementation
> practice.
>

Yes, I know - and it's a good thing to be able to make progress on wording
independently of design changes!

>
> Jens
>

Received on 2020-06-20 08:51:18