sg16: Re: [SG16] Handling of non-basic characters in early translation phases

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Sat, 20 Jun 2020 12:14:33 +0200

On 20/06/2020 10.05, Corentin Jabot wrote:
> On Sat, 20 Jun 2020 at 09:31, Jens Maurer via SG16 <sg16_at_[hidden]> wrote:

> http://www.open-std.org/jtc1/sc22/wg14/www/C99RationaleV5.10.pdf
> section 5.2.1

> A. Convert everything to UCNs in basic source characters as soon as possible, that is, in
> translation phase 1.
> B. Use native encodings where possible, UCNs otherwise.
> C. Convert everything to wide characters as soon as possible using an internal encoding that
> encompasses the entire source character set and all UCNs.
>
>
> C++ has chosen model A, C has chosen model B.
> The express intent is that which model is chosen is unobservable
> for a conforming program.
>
> Problems that will be solved with model B:
> - raw string literals don't need some funny "reversal"
> - stringizing can use the original spelling reliably
>
> - fringe characters / encodings beyond Unicode can be transparently passed
> through string literals
>
>
> There is no such thing :)

That seems a point of disagreement. I thought Tom had a list of situations
quite recently where "Unicode alone" isn't enough, for example for certain
Big5 characters or Shift-JIS encodings, if you are in a world that just
wants transparent pass-through in string literals.

Also, C seems to support such transparent pass-through, and I think there
is value in keeping the C and C++ lexing behaviors as close together as
possible.

> In short, C++ should switch to a model B', omitting any mention of "encoding"
> or "multibyte characters" for the early phases.
>
> Details:
>
> - Define "source character set" as having the following distinct elements:
>
> * all Unicode characters (where character means "as identified by a code point")
>
> * invented/hypothetical characters for all other Unicode code points
> (where "Unicode code point" means integer values in the range [0..0x10ffff],
> excluding [0xd800-0xdfff])
> Rationale: We want to be forward-compatible with future Unicode standards
> that add more characters (and thus more assigned code points).
>
> * an implementation-defined set of additional elements
> (this is empty in a Unicode-only world)
>
>
> Again, 2 issues:
> * This describes an internal encoding, not a source encoding. We should not talk about "source" past phase 1

It's still "source code", maybe internally represented, as opposed to
compiled machine code. Given that we already have the term "(basic) source
character set" in the standard, I don't see a need to invent something new.
I'm particularly non-enthused about the phrase "internal encoding" (internal
relative to what?)
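
To make the quoted definition concrete: the range of integer values it calls
"Unicode code points" ([0..0x10ffff], excluding [0xd800-0xdfff]) can be
expressed as a small predicate. This is only a sketch; the function name
is_source_code_point is hypothetical and appears in no proposal, and the
implementation-defined additional elements are deliberately not modeled.

```cpp
#include <cassert>

// Hypothetical helper (name invented for this sketch): true for the integer
// values the quoted proposal treats as Unicode code points, i.e. the range
// [0, 0x10FFFF] without the surrogate range [0xD800, 0xDFFF].
constexpr bool is_source_code_point(char32_t c) {
    return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}

static_assert(is_source_code_point(U'A'), "assigned character");
static_assert(is_source_code_point(0x10FFFF), "upper bound is included");
static_assert(!is_source_code_point(0xD800), "surrogates are excluded");
static_assert(!is_source_code_point(0x110000), "beyond the Unicode range");
```

Note that the predicate accepts unassigned code points as well, which matches
the forward-compatibility rationale given in the quoted text.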

> * There is no use case for a super set of Unicode. I described the EBCDIC control character issue to the Unicode mailing list, it was qualified as "daft".

As I said earlier, it appears that Unicode says that control characters
are essentially out-of-scope for them (which I sympathize with, from their
viewpoint), so I would not turn to Unicode for insight into how to handle
EBCDIC control characters that don't have a semantic equivalent in
Unicode.

In an EBCDIC-only world, I think there is a real conflict between
an EBCDIC control character mapped to a C1 control character in phase 1
and the presence of a UCN naming that same control character somewhere
in the original source code. The presence of the UCN may or may not
be intentional; I would like to allow implementations to flag this
situation.

> All characters that a C++ compiler ever has cared, does care, or will care about have a mapping to Unicode.

But possibly not a unique mapping from Unicode back to the original character,
which seems useful for transparent string-literal pass-through.
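
The round-trip concern can be sketched with a toy conversion table. Everything
here is an assumption for illustration: the byte values, the table, and the
function names are made up and do not describe any real Big5, Shift-JIS, or
EBCDIC code page. The only point is that when two source bytes map to the same
code point, mapping back cannot recover which byte was written.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical legacy encoding: two distinct byte values (invented for this
// sketch) are mapped by the converter to the same Unicode code point.
inline const std::map<std::uint8_t, char32_t>& to_unicode_table() {
    static const std::map<std::uint8_t, char32_t> table{
        {0x41, U'A'},
        {0xA1, 0x00AC},  // assumed primary spelling of a character
        {0xF1, 0x00AC},  // assumed duplicate spelling of the same character
    };
    return table;
}

// Maps a code point back to the lowest-valued legacy byte that produces it.
inline std::uint8_t from_unicode(char32_t c) {
    for (const auto& [byte, code_point] : to_unicode_table())
        if (code_point == c) return byte;
    return 0;  // not representable in this sketch
}

// Legacy byte -> Unicode -> legacy byte.
inline std::uint8_t round_trip(std::uint8_t byte) {
    return from_unicode(to_unicode_table().at(byte));
}
```

With this table, round_trip(0xA1) yields 0xA1, but round_trip(0xF1) also
yields 0xA1, so a string literal containing 0xF1 would not pass through
unchanged if it were funneled through Unicode.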

Tom, I think the question whether there should be allowance for pass-through of
characters beyond Unicode should be up for a straw poll at the next telecon so
that we can make progress here.

> - Define "basic source character set" as a subset of the "source character set"
> with an explicit list of Unicode characters.
>
>
> There is no need for that construct -

These are the use-cases for the term "basic source character set":

- keywords are spelled in the basic source character set

- basic source characters can be represented in a single byte in plain "char" literals

- UCNs denoting characters in the basic source character set are ill-formed
[lex.charset] p2

- Timezone parsing (the table in [time.parse], flag %Z)

- do_widen / do_narrow [locale.ctype.virtuals]

So, the term seems to be useful as a descriptive tool when we're intentionally
referring to that subset.
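
As an illustration of the do_widen / do_narrow use case, the following sketch
round-trips a character through std::ctype<wchar_t> in the classic "C" locale.
The helper name round_trips_in_classic_locale is hypothetical; the library
calls (std::use_facet, ctype::widen, ctype::narrow) are the standard ones. The
expectation that the round trip is lossless is only stated for members of the
basic source character set.

```cpp
#include <cassert>
#include <locale>

// Hypothetical helper: widens a char and narrows it again using the
// ctype<wchar_t> facet of the classic "C" locale, then reports whether the
// original character came back.
inline bool round_trips_in_classic_locale(char c) {
    const auto& facet =
        std::use_facet<std::ctype<wchar_t>>(std::locale::classic());
    const wchar_t wide = facet.widen(c);
    // '?' is the fallback returned when no narrow equivalent exists.
    return facet.narrow(wide, '?') == c;
}
```

For example, round_trips_in_classic_locale('a') and
round_trips_in_classic_locale('{') are expected to hold, since both
characters belong to the basic source character set.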

> I would actually prefer we don't try to define the execution character set in terms of a basic one which is tied to the internal
> representation.

I don't think the standard needs to talk about internal "representation",
understood as specific code point values, at all, so I don't see the confusion
here.

I don't think we need an execution character set per se, but it seems worthwhile to
be able to say "for this particular small set of ASCII characters, special constraints
for the literal encoding/representation apply".

> We need however to specify that the grammar uses characters from basic latin in case anybody is confused.

Agreed.

> - Translation phase 1 is reduced to
>
> "Physical source file characters are mapped, in an implementation-defined manner,
> to the <del>basic</del> source character set (introducing new-line characters for
> end-of-line indicators) if necessary. The set of physical source file characters
> accepted is implementation-defined."
>
>
> I think the "introducing new-line characters for end-of-line indicators" is confusing for reversal in raw string literals

I'm defining the new-line question as out-of-scope for this particular
endeavor.

> - Add a new phase 4+ that translates UCNs everywhere except
> in raw string literals to (non-basic) source characters.
> (This is needed to retain the status quo behavior that a UCN
> cannot be formed by concatenating string literals.)
>
> Is there a value of not doing it for identifiers and string literals explicitly ?

"identifier" is ambiguous between phase 4 and phase 7 identifiers.
We can't translate UCNs in phase 4 (due to stringizing), but we want
a single spelling in phase 7 (so no confusion arises what goes into
linker symbols etc). Previously, the single spelling was "UCNs
everywhere"; now, the single spelling is "(extended) characters
everywhere".
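
The stringizing constraint can be seen in a small sketch. The macro name
STRINGIZE and the function stringized_ucn are made up for this example. The
behavior shown is what [cpp.stringize] describes for a string-literal argument
and what GCC and Clang are believed to do; it is not a statement about every
implementation.

```cpp
#include <cassert>
#include <string>

#define STRINGIZE(x) #x

// The macro argument is a string literal that contains a UCN. Stringizing
// escapes the quotes and the backslash, so the resulting literal spells out
// the six source characters \u00e9 (plus the two quotes) instead of the
// single character that the UCN denotes.
inline std::string stringized_ucn() {
    return STRINGIZE("\u00e9");
}
```

The returned string has eight characters: a quote, a backslash, "u00e9", and a
closing quote. Had the UCN been replaced by the character it names before
stringizing, that spelling would no longer be available.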

For string and char literals, it simplifies the treatment a bit,
because we only have to discuss an abstract "source character"
instead of branching off into "oh, and we're mapping UCNs here"
every so often.
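
A short sketch of the status quo that the proposed phase is meant to retain:
a UCN in an ordinary literal denotes one character, the same spelling in a raw
string literal is kept verbatim, and joining adjacent literals does not create
a UCN. The u8 prefix is used in the first check only so that the size does not
depend on the implementation's ordinary literal encoding.

```cpp
#include <cassert>

// Ordinary (here: UTF-8) literal: the UCN denotes one character, which takes
// two UTF-8 code units plus the terminating null.
static_assert(sizeof(u8"\u00e9") == 3, "UCN is interpreted");

// Raw string literal: the six source characters \u00e9 are kept as written.
static_assert(sizeof(R"(\u00e9)") == 7, "UCN is not interpreted");

// Concatenation: an escaped backslash followed by "u00e9" stays six
// characters; no UCN is formed when the literals are joined.
static_assert(sizeof("\\" "u00e9") == 7, "no UCN formed by concatenation");
```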

Hm... I'm wondering whether ## token concatenation can form a UCN,
e.g. via bla\ ## u ## 0301 . We should make that ill-formed
or somehow prevent interpretation as a UCN.

>> For example, in string literals, we want to allow Latin-1 encoding of umlauts expressed as a Unicode base vowel plus combining mark, if an implementation so chooses.
>
> I think people in the mailing list agreed that individual c-char should be encoded independently (i thought that was your opinion too?), which I have come to agree with.

Fine with me.
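
For the well-defined UTF-8 case, encoding each c-char independently is already
what happens, as the following sketch shows. The helper same_units and the two
array names are hypothetical. Nothing here says what an implementation does
for a Latin-1 ordinary literal encoding; that is exactly the open question.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: compares the code units of a string literal (without
// its terminating null) against an expected byte sequence.
template <typename Char, std::size_t N, std::size_t M>
constexpr bool same_units(const Char (&literal)[N],
                          const unsigned char (&expected)[M]) {
    if (N != M + 1) return false;  // the literal carries a terminating null
    for (std::size_t i = 0; i < M; ++i)
        if (static_cast<unsigned char>(literal[i]) != expected[i]) return false;
    return true;
}

// 'a' followed by U+0308 COMBINING DIAERESIS, each encoded on its own.
constexpr unsigned char decomposed_units[] = {0x61, 0xCC, 0x88};
// Precomposed U+00E4 LATIN SMALL LETTER A WITH DIAERESIS.
constexpr unsigned char precomposed_units[] = {0xC3, 0xA4};

static_assert(same_units(u8"a\u0308", decomposed_units),
              "base letter and combining mark are encoded separately");
static_assert(same_units(u8"\u00e4", precomposed_units),
              "the precomposed character is a different sequence");
```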

> - In phase 5, we should go to "literal encoding" right away:
> There is no point in discussing a "character set" here; all
> we're interested in is a (sequence of) integer values that end
> up in the execution-time scalar value or array object corresponding
> to the source-code literal.
>
>
> Yep, agreed, as long as you find a way to describe that encoding preserves semantic

What semantics? A string literal consists of a sequence of source characters.
At the end, we get a sequence of integer values in an array object.
We can certainly weave a "corresponding" into the process, but that's
essentially vacuous handwaving from a normative standpoint.
(The preceding statement only applies to char and wchar_t encodings,
of course, not to the well-defined UTF-x encodings.)
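
For the UTF-x encodings, the "sequence of integer values in an array object"
is fully determined, which the sketch below spells out for U+1F600. The
variable names utf16 and utf32 are only for this example. No comparable check
is given for char and wchar_t literals, since their values depend on the
implementation.

```cpp
#include <cassert>

// UTF-8: U+1F600 takes four code units, plus the terminating null.
static_assert(sizeof(u8"\U0001F600") == 5, "four UTF-8 code units");

// UTF-16: a surrogate pair, plus the terminating null.
constexpr char16_t utf16[] = u"\U0001F600";
static_assert(sizeof(utf16) / sizeof(utf16[0]) == 3, "two UTF-16 code units");
static_assert(utf16[0] == 0xD83D && utf16[1] == 0xDE00, "surrogate pair");

// UTF-32: a single code unit holding the code point itself.
constexpr char32_t utf32[] = U"\U0001F600";
static_assert(utf32[0] == 0x1F600 && utf32[1] == 0, "one UTF-32 code unit");
```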

Jens

Received on 2020-06-20 05:17:48