sg16: [SG16] Handling of non-basic characters in early translation phases

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Sat, 20 Jun 2020 09:31:42 +0200

I've had a look at the C99 rationale (thanks to Hubert for the hint)
with respect to handling non-basic characters in the early
translation phases.

http://www.open-std.org/jtc1/sc22/wg14/www/C99RationaleV5.10.pdf
section 5.2.1

The terminology used is a bit outdated (for example, the term
"collation sequence" appears to mean "code point"), but the
list of options is informative:

A. Convert everything to UCNs in basic source characters as soon as possible, that is, in
translation phase 1.
B. Use native encodings where possible, UCNs otherwise.
C. Convert everything to wide characters as soon as possible using an internal encoding that
encompasses the entire source character set and all UCNs.

C++ has chosen model A; C has chosen model B.
The express intent is that the choice of model is unobservable
for a conforming program.

Problems that will be solved with model B:
- raw string literals don't need some funny "reversal"
- stringizing can use the original spelling reliably
- fringe characters / encodings beyond Unicode can be transparently passed
through string literals
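
To illustrate the raw string literal point: the observable behavior below
is the status quo that must be preserved under either model. Under model A
the UCN spelling inside a raw string literal has to be restored after
phase 1; under model B it is never touched. This is only a sketch; the
names ordinary_size and raw_size are made up for the example, and the u8
prefix is used so that the sizes do not depend on the ordinary literal
encoding.

```cpp
#include <cstddef>

// Ordinary literal: the UCN denotes the single character U+00E9.
// In UTF-8 that is two code units, plus the terminating null.
constexpr std::size_t ordinary_size = sizeof(u8"\u00E9");

// Raw literal: the six characters \ u 0 0 E 9 are kept verbatim,
// plus the terminating null.
constexpr std::size_t raw_size = sizeof(u8R"(\u00E9)");

static_assert(ordinary_size == 3, "UCN is translated in an ordinary literal");
static_assert(raw_size == 7, "UCN spelling is preserved in a raw literal");
```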

In short, C++ should switch to a model B', omitting any mention of "encoding"
or "multibyte characters" for the early phases.

Details:

- Define "source character set" as having the following distinct elements:

    * all Unicode characters (where character means "as identified by a code point")

    * invented/hypothetical characters for all other Unicode code points
      (where "Unicode code point" means integer values in the range
      [0, 0x10FFFF], excluding [0xD800, 0xDFFF])
      Rationale: We want to be forward-compatible with future Unicode standards
      that add more characters (and thus more assigned code points).

    * an implementation-defined set of additional elements
(this is empty in a Unicode-only world)

- Define "basic source character set" as a subset of the "source character set"
with an explicit list of Unicode characters.

- Translation phase 1 is reduced to

"Physical source file characters are mapped, in an implementation-defined manner,
to the <del>basic</del> source character set (introducing new-line characters for
end-of-line indicators) if necessary. The set of physical source file characters
accepted is implementation-defined."

- Modify the "identifier" lexing treatment to handle (non-basic)
source characters and equivalent UCNs the same; we can't fold
UCNs to source characters just yet because of preprocessor
stringizing, which wants to recover the "original spelling".

- Add a new phase 4+ that translates UCNs everywhere except
in raw string literals to (non-basic) source characters.
(This is needed to retain the status quo behavior that a UCN
cannot be formed by concatenating string literals.)

- Reverse the order of translation phases 5 and 6: We should concatenate
string literals first so that (e.g.) combining marks are actually next
to the character they apply to before converting to the execution
encoding. For example, in string literals, we want to allow Latin-1
encoding of umlauts expressed as a Unicode base vowel plus combining mark,
if an implementation so chooses.

- In phase 5, we should go to "literal encoding" right away:
There is no point in discussing a "character set" here; all
we're interested in is a (sequence of) integer values that end
up in the execution-time scalar value or array object corresponding
to the source-code literal.

- Any mention of "locale-dependent" during compilation should
be removed: Either this is subsumed by "implementation-defined"
in phase 1, or it's a concept referring to the runtime locale,
which is purely a library I/O matter.

- Carefully review [lex] and [cpp] for further fall-out adjustments.
The trouble is that several papers addressing [lex] are in flight,
for example P2029, which doesn't help contain the conflicts.
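
Two of the items above can be sketched in code. The first part is a
predicate for the integer values that the proposed "source character set"
covers through its Unicode elements (assigned or not); the function name
is hypothetical, and the implementation-defined additional elements are
not modeled. The second part shows why concatenation should come before
conversion: after concatenation the combining mark sits directly after
its base character, which is what an implementation would need in order
to compose the pair for, say, a Latin-1 literal encoding. UTF-8 is used
here only because its result is not implementation-defined.

```cpp
#include <cstdint>

// Hypothetical helper: true for the values the proposal calls
// "Unicode code points", i.e. [0, 0x10FFFF] without the surrogate
// range [0xD800, 0xDFFF].
constexpr bool is_source_code_point(std::uint32_t v) {
    return v <= 0x10FFFF && !(v >= 0xD800 && v <= 0xDFFF);
}

static_assert(is_source_code_point(0x00E9), "assigned character");
static_assert(is_source_code_point(0x10FFFF), "upper bound is included");
static_assert(!is_source_code_point(0xD800), "surrogates are excluded");
static_assert(!is_source_code_point(0x110000), "beyond the Unicode range");

// Adjacent literals: base vowel in one, combining diaeresis (U+0308)
// in the other. After concatenation they are neighbors.
constexpr auto& joined = u8"a" u8"\u0308";

// 'a' (1 code unit) + U+0308 (2 code units: 0xCC 0x88) + null.
static_assert(sizeof(joined) == 4, "base plus combining mark plus null");
static_assert(static_cast<unsigned char>(joined[1]) == 0xCC, "lead byte");
static_assert(static_cast<unsigned char>(joined[2]) == 0x88, "trail byte");
```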

This approach does fix the UCN reversal in raw string literals, but does
not fix the line splicing reversal for raw string literals. The latter is
a separate can of worms, in my view.
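
For reference, the line splicing behavior in question looks like this
(a small sketch; the variable names are invented for the example). Phase 2
removes backslash-newline everywhere, so for raw string literals the
removal has to be undone afterwards:

```cpp
// Ordinary literal: the backslash-newline pair is spliced away,
// leaving "ab" plus the terminating null.
constexpr auto& spliced = "a\
b";

// Raw literal: the backslash and the new-line are both part of the
// string, giving a, \, new-line, b plus the terminating null.
constexpr auto& raw_kept = R"(a\
b)";

static_assert(sizeof(spliced) == 3, "splice removed in ordinary literal");
static_assert(sizeof(raw_kept) == 5, "splice reverted in raw literal");
static_assert(raw_kept[1] == '\\' && raw_kept[2] == '\n', "spelling kept");
```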

As a matter of editorial clarity, we should use the prefix "Unicode" for
any term we intend to use unmodified from the Unicode standard,
e.g. "Unicode code point".

If the term "character set" is too loaded and carries more meaning
than the intended "(abstract) set of (abstract) characters", [lex.charset]
needs a larger rewrite. I'm not sold on that.

Jens

Received on 2020-06-20 02:34:55