sg16: Re: [SG16] Handling of non-basic characters in early translation phases

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Sat, 20 Jun 2020 10:05:12 +0200

Hello,
Thanks Jens, feeling like we are making progress !
A few comment below

On Sat, 20 Jun 2020 at 09:31, Jens Maurer via SG16 <sg16_at_[hidden]>
wrote:

>
> I've had a look at the C99 rationale (thanks to Hubert for the hint)
> with respect to handling non-basic characters in the early
> translation phases.
>
> http://www.open-std.org/jtc1/sc22/wg14/www/C99RationaleV5.10.pdf
> section 5.2.1
>
> The terminology used is a bit outdated, for example the term
> "collation sequence" appears to refer to "code point", but the
> choice of options seems informative:
>
> A. Convert everything to UCNs in basic source characters as soon as
> possible, that is, in
> translation phase 1.
> B. Use native encodings where possible, UCNs otherwise.
> C. Convert everything to wide characters as soon as possible using an
> internal encoding that
> encompasses the entire source character set and all UCNs.
>
>
> C++ has chosen model A, C has chosen model B.
> The express intent is that which model is chosen is unobservable
> for a conforming program.
>
> Problems that will be solved with model B:
> - raw string literals don't need some funny "reversal"
> - stringizing can use the original spelling reliably
>

> - fringe characters / encodings beyond Unicode can be transparently passed
> through string literals
>

There is no such thing :)

>
> In short, C++ should switch to a model B', omitting any mention of
> "encoding"
> or "multibyte characters" for the early phases.
>
> Details:
>
> - Define "source character set" as having the following distinct elements:
>
> * all Unicode characters (where character means "as identified by a
> code point")
>
> * invented/hypothetical characters for all other Unicode code points
> (where "Unicode code point" means integer values in the range
> [0..0x10ffff],
> excluding [0xd800-0xdfff])
> Rationale: We want to be forward-compatible with future Unicode standards
> that add more characters (and thus more assigned code points).
>
> * an implementation-defined set of additional elements
> (this is empty in a Unicode-only world)
>

Again, 2 issues:
* This describes an internal encoding, not a source encoding. We should
not talk about "source" past phase 1
* There is no use case for a super set of Unicode. I described the
EBCDIC control character issue to the Unicode mailing list, it was
qualified as "daft". All characters that a C++ compiler ever have, does or
will care about has a mapping to Unicode.

> - Define "basic source character set" as a subset of the "source
> character set"
> with an explicit list of Unicode characters.
>

There is no need for that construct - I would actually prefer we don't try
to define the execution character set in terms of a basic one which is tied
to the internal
representation. We need however to specify that the grammar uses characters
from basic latin in case anybody is confused.

Strongly agreed about being explicit about unicode characters!

>
> - Translation phase 1 is reduced to
>
> "Physical source file characters are mapped, in an implementation-defined
> manner,
> to the <del>basic</del> source character set (introducing new-line
> characters for
> end-of-line indicators) if necessary. The set of physical source file
> characters
> accepted is implementation-defined."
>

I think the "introducing new-line characters for end-of-line indicators" is
confusing for reversal in raw string literals

> - Modify the "identifier" lexing treatment to handle (non-basic)
> source characters and equivalent UCNs the same; we can't fold
> UCNs to source characters just yet because of preprocessor
> stringizing, which wants to recover the "original spelling".
>

Thats a good point i hadn't consider

> - Add a new phase 4+ that translates UCNs everywhere except
> in raw string literals to (non-basic) source characters.
> (This is needed to retain the status quo behavior that a UCN
> cannot be formed by concatenating string literals.)
>

Is there a value of not doing it for identifiers and string literals
explicitly ?

>
> - Revert the order of translation phases 5 and 6: We should concatenate
> string literals first so that (e.g.) combining marks are actually next
> to the character they apply to before converting to the execution
> encoding. For example, in string literals, we want to allow Latin-1
> encoding of umlauts expressed as a Unicode base vowel plus combining mark,
> if an implementation so chooses.

+1. (I talk about that issue in https://wg21.link/p2178r0)

> For example, in string literals, we want to allow Latin-1 encoding of
umlauts expressed as a Unicode base vowel plus combining mark, if an
implementation so chooses.

I think people in the mailing list agreed that individual c-char should be
encoded independently (i thought that was your opinion too?), which I have
come to agree with.

>
> - In phase 5, we should go to "literal encoding" right away:
> There is no point in discussing a "character set" here; all
> we're interested in is a (sequence of) integer values that end
> up in the execution-time scalar value or array object corresponding
> to the source-code literal.
>

Yep, agreed, as long as you find a way to describe that encoding preserves
semantic

>
> - Any mention of "locale-dependent" during compilation should
> be removed: Either this is subsumed by "implementation-defined"
> in phase 1, or it's a concept referring to the runtime locale,
> which is purely a library I/O matter.
>

:)

>
> - Carefully review [lex] and [cpp] for further fall-out adjustments.
> The trouble is that several papers addressing [lex] are in flight,
> for example P2029, which doesn't help contain the conflicts.
>

Yep, organizationally this is a bit of a nightmare :(

>
> This approach does fix the UCN reversal in raw string literals, but does
> not fix the line splicing reversal for same. The latter is a separate
> can of worms, in my view.

A much less confusing can

As a matter of editorial clarity, we should use the prefix "Unicode" for
> any term we intend to use unmodified from the Unicode standard,
> e.g. "Unicode code point".
>

We should never refer to terms "code point" without referring to an
explicit character set, nor "code unit" without referring to an explicit
encoding

> If the term "character set" is too loaded and transports more meaning
> than the intended "(abstract) set of (abstract) characters", [lex.charset]
> needs a larger rewrite. I'm not sold on that.
>

No, it's fine

>
> Jens
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2020-06-20 03:08:35