
Re: [SG16] Handling of non-basic characters in early translation phases

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Sat, 20 Jun 2020 13:54:24 +0200
On Sat, 20 Jun 2020 at 12:14, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:

> On 20/06/2020 10.05, Corentin Jabot wrote:
> > On Sat, 20 Jun 2020 at 09:31, Jens Maurer via SG16 <
> sg16_at_[hidden] <mailto:sg16_at_[hidden]>> wrote:
>
> > http://www.open-std.org/jtc1/sc22/wg14/www/C99RationaleV5.10.pdf
> > section 5.2.1
>
> > A. Convert everything to UCNs in basic source characters as soon as
> possible, that is, in
> > translation phase 1.
> > B. Use native encodings where possible, UCNs otherwise.
> > C. Convert everything to wide characters as soon as possible using
> an internal encoding that
> > encompasses the entire source character set and all UCNs.
> >
> >
> > C++ has chosen model A, C has chosen model B.
> > The express intent is that which model is chosen is unobservable
> > for a conforming program.
> >
> > Problems that will be solved with model B:
> > - raw string literals don't need some funny "reversal"
> > - stringizing can use the original spelling reliably
> >
> > - fringe characters / encodings beyond Unicode can be transparently
> passed
> > through string literals
> >
> >
> > There is no such thing :)
>
> That seems a point of disagreement. I thought Tom had a list of situations
> quite recently where "Unicode alone" isn't enough, for example for certain
> Big5 characters or Shift-JIS encodings, if you are in a world that just
> wants transparent pass-through in string literals.
>

I agree on supporting transparent pass-through (note that not all compilers
do that).
There are three different things here, which should not be mixed up:

   - Can a character be represented in the Unicode code space? The exception
   is a small number of Big5 characters that no compiler supports (there are
   many character sets called "Big5"; the Windows "Big5" code page has a
   complete mapping to Unicode, as does HKSCS, and both are widely used).
   - Can a character be represented in the Unicode code space by a character
   that has the exact same semantics? We can argue whether that is the case
   for a subset of EBCDIC characters that map to Unicode code points whose
   semantics are "application defined", but there is a mapping. The wider
   point is that Unicode control characters do not have semantics of their
   own, so in that case Unicode acts as a pass-through. The scenario in which
   an implementation wants to support the semantics of control characters
   from multiple source character sets is exotic at best. Nevertheless, the
   good people of Unicode pointed me towards
   http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-048.pdf
   which is a general mechanism for encoding arbitrary numbers of control
   character sequences, if an implementation wanted to do that.
   - Can a character UNIQUELY map TO Unicode, and can a character UNIQUELY
   map FROM Unicode? This is not the case for, for example, Shift-JIS, which
   for historical reasons has duplicated characters, and Unicode itself has
   duplicate characters. Because of these duplications, a naive
   implementation that follows the phases in order and does a verbatim
   mapping source -> Unicode -> execution may lose information (namely, which
   of the duplicated characters was used). The information lost is the byte
   value rather than the semantics.

There is a very important question here:

   - Should the standard mandate that byte values are preserved? (I think
   this would put severe constraints on implementations.)

If we do believe that, then talking of source encoding from phases 1 to 5 is
useful; otherwise, if we believe that preservation of byte values is in the
domain of QoI, an implementation can choose any path through phases 1 to 5
as long as the _semantics_ are preserved.
In phase 1, if one source character can be represented by more than one
Unicode code point sequence, an implementation can choose which (in a
scenario where phase 1 is semantically preserving, which we haven't decided
to do yet). For example, Å (the Angstrom sign) can become either U+212B or
U+00C5 in phase 1.
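
Here is a minimal sketch (my own, not from the standard) of how that choice
would be observable, assuming a UTF-8 literal encoding and an implementation
that is free to pick either code point for the source character Å:

// Sketch: if phase 1 maps the source character Å to U+212B, the UTF-8
// bytes are e2 84 ab; if it maps it to U+00C5, they are c3 85.
const char8_t angstrom[] = u8"Å";
static_assert(sizeof(angstrom) == 4 || sizeof(angstrom) == 3,
              "3 bytes for U+212B or 2 bytes for U+00C5, plus the null");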

Similarly, ≒ (U+2252, APPROXIMATELY EQUAL TO OR THE IMAGE OF) can be
encoded in phase 5 as either 0x8790 or 0x81e0; an implementation can choose
which.

If the internal character set we choose in the standard is Unicode, the
wording would lose the source information, such that we wouldn't be able to
prescribe byte values, but an implementation could preserve them if it
desired.

Overall, the source question can be abstracted away.
Given the string literal "\u2252", assuming a Shift-JIS narrow encoding,
what should its byte value in the program be (see the sketch after this
list)?

   - 0x8790
   - 0x81e0
   - Implementation-defined?
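
A minimal sketch of how that choice would be observable (it assumes an
implementation whose narrow literal encoding is Shift-JIS; the byte values
shown are the two duplicate Shift-JIS encodings of U+2252):

#include <cstdio>

int main() {
    // Under such an implementation, this may print "87 90" or "81 e0",
    // depending on which duplicate the phase 5 conversion picks.
    const char s[] = "\u2252";
    for (const char* p = s; *p != '\0'; ++p)
        std::printf("%02x ", static_cast<unsigned char>(*p));
    std::printf("\n");
}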




> Also, C seems to support such transparent pass-through, and I think there
> is value to keep the C and C++ lexing behaviors as closely together as
> possible.


> > In short, C++ should switch to a model B', omitting any mention of
> "encoding"
> > or "multibyte characters" for the early phases.
> >
> > Details:
> >
> > - Define "source character set" as having the following distinct
> elements:
> >
> > * all Unicode characters (where character means "as identified
> by a code point")
> >
> > * invented/hypothetical characters for all other Unicode code
> points
> > (where "Unicode code point" means integer values in the range
> [0..0x10ffff],
> > excluding [0xd800-0xdfff])
> > Rationale: We want to be forward-compatible with future Unicode
> standards
> > that add more characters (and thus more assigned code points).
> >
> > * an implementation-defined set of additional elements
> > (this is empty in a Unicode-only world)
> >
> >
> > Again, 2 issues:
> > * This describes an internal encoding, not a source encoding. We
> should not talk about "source" past phase 1
>
> It's still "source code", maybe internally represented, as opposed to
> compiled machine code. Given that we already have the term "(basic) source
> character set" in the standard, I don't see a need to invent something new.
> I'm particularly non-enthused about the phrase "internal encoding"
> (internal
> relative to what?)
>

The goal is to make it clear that it is not the encoding of source files.


>
> > * There is no use case for a super set of Unicode. I described the
> EBCDIC control character issue to the Unicode mailing list, it was
> qualified as "daft".
>
> As I said earlier, it appears that Unicode says that control characters
> are essentially out-of-scope for them (which I sympathize with, from their
> viewpoint), so I would not turn to Unicode for insight how to handle
> EBCDIC control characters that don't have a semantic equivalent in
> Unicode.
>
> In an EBCDIC-only world, I think there is a real conflict between
> an EBCDIC control character mapped to a C1 control character in phase 1
> and the presence of a UCN naming that same control character somewhere
> in the original source code. The presence of the UCN may or may not
> be intentional, I would like to allow implementations to flag this
> situation.
>

Their position is that a compiler cannot know what the semantics of a code
point which is a C0 or C1 control character are, as these don't have
semantics of their own.
A compiler could flag C0/C1 UCN escape sequences in literals, if it wanted
to.
And again, I'm trying to be pragmatic here. The work IBM is doing to get
Clang to support EBCDIC converts that EBCDIC to UTF-8.


>
> > All characters that a C++ compiler ever has, does, or will care about
> > have a mapping to Unicode.
>
> But possibly not a unique mapping from Unicode back to the original
> character,
> which seems useful for transparent string-literal pass-through.
>
> Tom, I think the question whether there should be allowance for
> pass-through of
> characters beyond Unicode should be up for a straw poll at the next
> telecon so
> that we can make progress here.
>

Again, for the record, my position is "should be allowed, not mandated".
But I agree polling might help.


>
> > - Define "basic source character set" as a subset of the "source
> character set"
> > with an explicit list of Unicode characters.
> >
> >
> > There is no need for that construct -
>
> These are the use-cases for the term "basic source character set":
>
> - keywords are spelled in the basic source character set
>
> - basic source characters can be represented in a single byte in plain
> "char" literals


> - UCNs denoting characters in the basic source character set are
> ill-formed
> [lex.charset] p2
>
> - Timezone parsing (the table in [time.parse], flag %Z)
>
> - do_widen / do_narrow [locale.ctype.virtuals]

> So, the term seems to be useful as a descriptive tool when we're
> intentionally referring to that subset.
>

I agree that there is a need for a term, but many of these can be better
described in terms of, for example, a "basic literal character set" or, more
accurately, a "basic literal character repertoire".


>
> > I would actually prefer we don't try to define the execution character
> set in terms of a basic one which is tied to the internal
> > representation.
>
> I don't think the standard needs to talk about internal "representation",
> understood as specific code point values, at all, so I don't see the
> confusion
> here.
>

We talk about "source" to refer to something that is not related to the
source files.


> I don't think we need an execution character set per se, but it seems
> worthwhile to
> be able to say "for this particular small set of ASCII characters, special
> constraints
> for the literal encoding/representation apply".
>

Agreed.
And I think we might need slightly different definitions for some of the
points you cited (which I was aware of; I meant that these things would need
to be described differently, not that removing the term would have no ripple
effect). My goal is that literals are not defined in terms of the source.

>
> > We need however to specify that the grammar uses characters from basic
> latin in case anybody is confused.
>
> Agreed.
>
> > - Translation phase 1 is reduced to
> >
> > "Physical source file characters are mapped, in an
> implementation-defined manner,
> > to the <del>basic</del> source character set (introducing new-line
> characters for
> > end-of-line indicators) if necessary. The set of physical source
> file characters
> > accepted is implementation-defined."
> >
> >
> > I think the "introducing new-line characters for end-of-line indicators"
> is confusing for reversal in raw string literals
>
> I'm defining the new-line question as out-of-scope for this particular
> endeavor.
>
> > - Add a new phase 4+ that translates UCNs everywhere except
> > in raw string literals to (non-basic) source characters.
> > (This is needed to retain the status quo behavior that a UCN
> > cannot be formed by concatenating string literals.)
> >
> > Is there a value of not doing it for identifiers and string literals
> explicitly ?
>
> "identifier" is ambiguous between phase 4 and phase 7 identifiers.
> We can't translate UCNs in phase 4 (due to stringizing), but we want
> a single spelling in phase 7 (so no confusion arises what goes into
> linker symbols etc). Previously, the single spelling was "UCNs
> everywhere"; now, the single spelling is "(extended) characters
> everywhere".
>

Right, I guess identifiers can be handled in phase 4+ or 7-.
Which means that pp-identifiers and pp-tokens can be composed of both UCN
escape sequences and XID_Start/XID_Continue code points.
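
For example (a sketch; it assumes the implementation accepts extended
characters in identifiers directly, which not every compiler did at the
time), both spellings below denote the same identifier regardless of whether
UCNs are translated in phase 4+ or 7-:

int caf\u00e9 = 1;   // identifier spelled with a UCN escape sequence
int copy = café;     // same identifier, spelled with the code point U+00E9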


>
> For string and char literals, it simplifies the treatment a bit,
> because we only have to discuss an abstract "source character"
> instead of branching off into "oh, and we're mapping UCNs here"
> every so often.
>
> Hm... I'm wondering whether ## token concatenation can form a UCN,
> e.g. via bla\ ## u ## 0301 . We should make that ill-formed
> or somehow prevent interpretation as a UCN.
>

#define CONCAT(x,y) x##y
CONCAT(\, U0001F431); // token paste forms the UCN \U0001F431

This is valid in all implementations I tested, and implementation-defined in
the standard.
Do you see a reason not to allow it? In particular, as we move UCN handling
later in the process, it would make sense to allow these escape sequences to
be created in phases 2 and 4 (this might be evolutionary; there is a paper).



> >> For example, in string literals, we want to allow Latin-1 encoding of
> umlauts expressed as a Unicode base vowel plus combining mark, if an
> implementation so chooses.
> >
> > I think people in the mailing list agreed that individual c-char should
> be encoded independently (i thought that was your opinion too?), which I
> have come to agree with.
>
> Fine with me.
>
> > - In phase 5, we should go to "literal encoding" right away:
> > There is no point in discussing a "character set" here; all
> > we're interested in is a (sequence of) integer values that end
> > up in the execution-time scalar value or array object corresponding
> > to the source-code literal.
> >
> >
> > Yep, agreed, as long as you find a way to describe that encoding
> preserves semantic
>
> What semantic? A string literal consists of a sequence of source
> characters.
> At the end, we get a sequence of integer values in an array object.
> We can certainly weave a "corresponding" into the process, but that's
> essentially vacuous handwaving from a normative standpoint.
> (The preceding statement only applies to char and wchar_t encodings,
> of course, not to the well-defined UTF-x encodings.)
>

In particular, an implementation can do any conversion it wants, including
replacing characters that have no representation with some other character.
This is currently implementation-defined behavior, which is something I
would like to make ill-formed (I know it's evolutionary; there is a paper).
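
For example, a sketch of the behavior being discussed (the fallback byte is
an assumption; implementations differ, and some diagnose instead):

// With a narrow literal encoding that cannot represent U+1F431, some
// implementations substitute a replacement character such as '?', usually
// with a warning, rather than rejecting the program.
const char cat[] = "\U0001F431";
// possible contents under such an implementation: { '?', '\0' }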


>
> Jens
>

Received on 2020-06-20 06:57:48