
Re: [SG16] Handling of non-basic characters in early translation phases

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Sat, 20 Jun 2020 15:16:27 +0200
On 20/06/2020 13.54, Corentin Jabot wrote:

> * Should the standard mandate that byte values are preserved? (I think this would put severe constraints on implementations.)

My view: allowed, but not required. But we should acknowledge that
state of affairs by allowing characters beyond Unicode in the source
character set. (In particular, nobody should be forced to map
non-Unicode chars to the Unicode private use area.)

> If we do believe that, then talking of a source encoding from phase 1 to 5 is useful; otherwise, if we believe that preservation of byte values is in the domain of QoI, an implementation can choose any path through phases 1 to 5 as long as the _semantics_ are preserved.

I don't think it's useful to talk about "semantics" here,
and I think we can avoid talking about source encoding by
simply allowing enough space in the definition of "source
character set" so that implementation-defined mappings can
avoid non-unique mappings if they so prefer.

> In phase 1, if one source character can be represented by more than one Unicode code point sequence, an implementation can choose which (in a scenario where phase 1 is semantically preserving, which we haven't decided to do yet). For example, Å (Angstrom sign) can be either U+212B or U+00C5 in phase 1.

Right, and I would expect that an implementation chooses depending on the
physical input encoding, which makes U+00C5 the far more likely choice
(because that's the character that's usually part of Nordic alphabets).
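
For illustration, a minimal sketch (the choice of scalar value is the implementation's; the assert merely shows that either mapping is conforming):

  #include <cassert>

  int main() {
      char32_t c = U'Å';     // the physical source file contains the character Å
      // Phase 1 decides which scalar value the source character denotes;
      // both choices are conforming.
      assert(c == U'\u00C5' || c == U'\u212B');
  }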

> Similarly, ≒ (U+2252, Approximately Equal to or the Image Of) can be encoded in phase 5 as either 0x8790 or 0x81e0; an implementation can choose which.

My guess is that you're referring to Shift-JIS here, and that the following statements assume that context.

> If the internal Character set we choose in the standard is Unicode, the wording would lose the source information such that we wouldn't be able to prescribe
> byte value information, but an implementation could, if it desired.

That last part I don't get. Suppose I'm in a Shift-JIS-only world, my input text
is Shift-JIS, and my input contains the two characters 0x8790 and 0x81e0.

If the only available internal characters are Unicode, the implementation
has little choice but to map both to U+2252.

Now, when the wchar_t[] literal object is produced later in translation, the
information about which of the two Shift-JIS values you started with is lost.

However, if we allow (but do not require) the implementation to (say)
map 0x8790 to U+2252 and 0x81e0 to non-Unicode-char-1 in phase 1, then
(per QoI) a compiler could offer the service to reproduce the original
Shift-JIS values unharmed.

My proposal and my intent are all about allowing (but not requiring) the latter,
but I don't know how to allow this cleanly if we force the compiler to
translate everything to Unicode characters in phase 1.
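
A crude sketch of the kind of compiler-internal mapping I have in mind (all names and the internal value are made up for illustration; nothing here is proposed wording):

  #include <cstdint>
  #include <map>

  // First internal character value beyond the Unicode code space.
  constexpr std::uint32_t beyond_unicode = 0x110000;

  // Phase 1: physical Shift-JIS code units -> internal characters.
  const std::map<std::uint16_t, std::uint32_t> phase1 = {
      {0x8790, 0x2252},            // mapped to U+2252
      {0x81e0, beyond_unicode + 0} // mapped to a non-Unicode internal character
  };

  // Phase 5: internal characters -> the Shift-JIS literal encoding.
  const std::map<std::uint32_t, std::uint16_t> phase5 = {
      {0x2252, 0x8790},
      {beyond_unicode + 0, 0x81e0} // the original byte value round-trips
  };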

> Overall the source question can be abstracted away:
> Given the string literal "\u2252", assuming a Shift-JIS narrow encoding, what should its byte value in the program be?
>
> * 0x8790
> * 0x81e0
> * Implementation-defined?
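
To make the question concrete, a small sketch (which bytes come out is exactly the implementation's choice being asked about):

  #include <cstdio>

  int main() {
      const char s[] = "\u2252";        // Shift-JIS narrow literal encoding assumed
      for (unsigned char c : s)
          std::printf("%02X ", c);      // "87 90 00" or "81 E0 00" (incl. terminator)
      std::printf("\n");
  }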

> > Again, 2 issues:
> > * This describes an internal encoding, not a source encoding. We should not talk about "source" past phase 1
>
> It's still "source code", maybe internally represented, as opposed to
> compiled machine code. Given that we already have the term "(basic) source
> character set" in the standard, I don't see a need to invent something new.
> I'm particularly non-enthused about the phrase "internal encoding" (internal
> relative to what?)
>
>
> The goal is to make it clear that it is not the encoding of source files.

I'm not worried about that. First, don't talk about "source encoding", just
about "character set". Also, we already have "physical source files" in phase 1
to distinguish from internal and abstract representations.

> > * There is no use case for a superset of Unicode. I described the EBCDIC control character issue to the Unicode mailing list; it was called "daft".
>
> As I said earlier, it appears that Unicode says that control characters
> are essentially out-of-scope for them (which I sympathize with, from their
> viewpoint), so I would not turn to Unicode for insight how to handle
> EBCDIC control characters that don't have a semantic equivalent in
> Unicode.
>
> In an EBCDIC-only world, I think there is a real conflict between
> an EBCDIC control character mapped to a C1 control character in phase 1 and
> the presence of a UCN naming that same control character somewhere
> in the original source code. The presence of the UCN may or may not
> be intentional, I would like to allow implementations to flag this
> situation.
>
>
> Their position is that a compiler cannot know what the semantics of a code point which is a C1 or C0
> control character are, as they have no semantics.

Again, they don't have semantics from a Unicode viewpoint (which is fine),
but in a larger system context, they sure have semantics (otherwise
they wouldn't have a reason to exist in the first place). How much of
those semantics is known to the compiler is a separate question.
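
To restate the conflict I have in mind with a (hypothetical) sketch, using the usual mapping of the EBCDIC NL control character to U+0085:

  // EBCDIC source file: one literal contains the raw NL control character
  // (byte 0x15 in code page 037), the other spells U+0085 explicitly as a UCN.
  const char* a = "<raw EBCDIC NL byte here>";   // mapped to U+0085 in phase 1
  const char* b = "\u0085";                      // written by the author as a UCN
  // Once both are reduced to the same internal character, the implementation
  // can no longer tell them apart, and thus cannot flag the explicit UCN as
  // possibly unintended.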

> A compiler could flag C0/C1 UCN escape sequences in literals, if it wanted to.
> And again, I'm trying to be pragmatic here. The work IBM is doing to get clang to support EBCDIC is converting that EBCDIC to UTF-8.

Maybe that's because it's the only option under the status quo of C++,
which needs to tunnel everything through UCNs.

> > All characters that a C++ compiler ever has, does, or will care about have a mapping to Unicode.
>
> But possibly not a unique mapping from Unicode back to the original character,
> which seems useful for transparent string-literal pass-through.
>
> Tom, I think the question whether there should be allowance for pass-through of
> characters beyond Unicode should be up for a straw poll at the next telecon so
> that we can make progress here.
>
>
> Again, for the record, my position is "should be allowed, not mandated".

It feels like we're in violent agreement here, so maybe the remaining question is just this:

What's the implementation strategy for an implementation that wishes to provide
byte pass-through in string literals under your approach, which tunnels everything
through Unicode?

> > - Define "basic source character set" as a subset of the "source character set"
> > with an explicit list of Unicode characters.
> >
> >
> > There is no need for that construct -
>
> These are the use-cases for the term "basic source character set":
>
> - keywords are spelled in the basic source character set
>
> - basic source characters can be represented in a single byte in plain "char" literals
>
>
> - UCNs denoting characters in the basic source character set are ill-formed
> [lex.charset] p2
>
> - Timezone parsing (the table in [time.parse], flag %Z)
>
> - do_widen / do_narrow [locale.ctype.virtuals]
>
> So, the term seems to be useful as a descriptive tool when we're intentionally
> referring to that subset.
>
>
> I agree that there is a need for a term, but many of these can be better described in terms of, for example, "basic literal character set" or, more accurately, "basic literal character repertoire".

Since we're only talking about an (abstract) set ("repertoire") of characters,
it doesn't really matter whether we apply such a concept to the source code
or the literal-encoded character domains.

If it feels better, we can rename "basic source character set" to "basic character set".

> > I would actually prefer we don't try to define the execution character set in terms of a basic one which is tied to the internal
> > representation.
>
> I don't think the standard needs to talk about internal "representation",
> understood as specific code point values, at all, so I don't see the confusion
> here.
>
> We talk about source to refer to something that is not related to source

... not exclusively related to source, yes.

> I don't think we need an execution character set per se, but it seems worthwhile to
> be able to say "for this particular small set of ASCII characters, special constraints
> for the literal encoding/representation apply".
>
>
> Agreed.
> And I think we might need slightly different definitions for some of the points you cited (which I was aware of; I meant that these things would need to be described differently, not that removing the term would have no ripple effect). My goal is that literals are not defined in terms of source.

Literals bridge the gap between source and execution, so there is some
relationship.

> "identifier" is ambiguous between phase 4 and phase 7 identifiers.
> We can't translate UCNs in phase 4 (due to stringizing), but we want
> a single spelling in phase 7 (so no confusion arises what goes into
> linker symbols etc). Previously, the single spelling was "UCNs
> everywhere"; now, the single spelling is "(extended) characters
> everywhere".
>
>
> Right, I guess identifiers can be handled in phase 4+ or 7-.
> Which means that pp-identifiers and pp-tokens can be composed of both UCN escape sequences and XID_Start/XID_Continue code points.

Yes, unfortunately (curse stringizing).
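
For concreteness, a sketch of why the spelling has to survive until stringizing has happened:

  #define STR(x) #x
  const char* s = STR(\u00C5);   // # reproduces the spelling of the argument: "\u00C5"
  // If the UCN had already been replaced by the character Å before phase 4,
  // the # operator could no longer reproduce what the user actually wrote.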

> #define CONCAT(x,y) x##y
> CONCAT(\, U0001F431);
>
> This is valid in all implementations I tested, and implementation-defined in the standard.

Is the result the named Unicode character?

Ok, so be it. Having this as valid is fall-out from the
currently-described approach, and if it's consistent with
what implementations already do, we're good.

> Do you see a reason not to allow it? In particular, as we move UCN handling later
> in the process, it would make sense to allow these escape sequences to be created in phases 2 and 4 (this might be evolutionary; there is a paper).

I think the status quo already allows creating UCNs like that,
so this doesn't seem to be evolutionary at all.

> What semantics? A string literal consists of a sequence of source characters.
> At the end, we get a sequence of integer values in an array object.
> We can certainly weave a "corresponding" into the process, but that's
> essentially vacuous handwaving from a normative standpoint.
> (The preceding statement only applies to char and wchar_t encodings,
> of course, not to the well-defined UTF-x encodings.)
>
>
> In particular, an implementation can do any conversion it wants, including replacing characters that have no representation with another character,
> currently implementation-defined behavior, which is something I would like to make ill-formed. I know it's evolutionary; there is a paper.

I'm not trying to make EWG-level changes like that part of my proposal,
in particular since it seems to invalidate current implementation
practice.
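
(For reference, a sketch of the practice I mean; the details vary by implementation:)

  // With a Latin-1 narrow literal encoding, U+20AC (euro sign) has no
  // representation. Today this is implementation-defined: some compilers
  // substitute a replacement character such as '?' (usually with a warning),
  // while others reject the program.
  const char* s = "\u20AC";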

Jens

Received on 2020-06-20 08:19:41