
Re: [SG16] Handling of non-basic characters in early translation phases

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Sat, 20 Jun 2020 15:16:37 +0200
More thinking out loud.

There is a simple model where everything is done in the source character
set directly in phases 1 to 5.
We don't need to describe which characters are in that set, because there is
no process by which phases 1-5 create extra characters.
And we do not need to talk about UCNs before phases 4+ and 5.

Of course, we keep describing the grammar in our abstract repertoire, more
or less isomorphic to the Basic Latin block,
which I think is closer to what you describe, Jens.

Of course, identifiers will have to be converted to Unicode, and all other
grammar elements will have to be referred to in terms of Unicode,
but we preserve byte values, and avoid debates about the semantic nature of
EBCDIC control characters.

Most importantly, we avoid the weird "superset of Unicode" thing.

The main drawback is that it makes Tom's proposal of a per-header
encoding mechanism (such that each included header has its own source
encoding) a lot more difficult to realize - at least phase 5 would have to
be repeated for each included header.

And I don't think it helps describe implementations - a conversion to
Unicode does happen at some point in several implementations.
Another issue is that we would need to describe the mapping from source to
Unicode in phase 4 (for identifiers) and phase 5 (for UTF literals, notably).

On Sat, 20 Jun 2020 at 13:54, Corentin Jabot <corentinjabot_at_[hidden]> wrote:

> On Sat, 20 Jun 2020 at 12:14, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>> On 20/06/2020 10.05, Corentin Jabot wrote:
>> > On Sat, 20 Jun 2020 at 09:31, Jens Maurer via SG16 <
>> sg16_at_[hidden] <mailto:sg16_at_[hidden]>> wrote:
>> > http://www.open-std.org/jtc1/sc22/wg14/www/C99RationaleV5.10.pdf
>> > section 5.2.1
>> > A. Convert everything to UCNs in basic source characters as soon as
>> possible, that is, in
>> > translation phase 1.
>> > B. Use native encodings where possible, UCNs otherwise.
>> > C. Convert everything to wide characters as soon as possible using
>> an internal encoding that
>> > encompasses the entire source character set and all UCNs.
>> >
>> >
>> > C++ has chosen model A, C has chosen model B.
>> > The express intent is that which model is chosen is unobservable
>> > for a conforming program.
>> >
>> > Problems that will be solved with model B:
>> > - raw string literals don't need some funny "reversal"
>> > - stringizing can use the original spelling reliably
>> >
>> > - fringe characters / encodings beyond Unicode can be
>> transparently passed
>> > through string literals
>> >
>> >
>> > There is no such thing :)
>> That seems a point of disagreement. I thought Tom had a list of
>> situations
>> quite recently where "Unicode alone" isn't enough, for example for certain
>> Big5 characters or Shift-JIS encodings, if you are in a world that just
>> wants transparent pass-through in string literals.
> I agree on supporting transparent pass-through (note that not all
> compilers do that)
> There are 3 different things here, which should not be mixed up:
> - Can a character be represented in the Unicode code space? The
> exception to that is a small number of Big5 characters that no compiler
> supports (there exist many character sets called Big5; Windows' "big5" code
> page has a complete mapping to Unicode, and the same is true of HKSCS, both
> of which are widely used).
> - Can a character be represented in the Unicode code space by a
> character that has the exact same semantics? We can argue whether that is
> the case for a subset of EBCDIC characters that map to Unicode code points
> whose semantics are "application defined". But there is a mapping. The wider
> point is that Unicode control characters do not have semantics of their own,
> and in that case Unicode acts as a pass-through. The scenario in which an
> implementation wanted to support the semantics of multiple source character
> set control characters is exotic at best. Nevertheless, the good people of
> Unicode pointed me towards
> http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-048.pdf, which
> describes a general mechanism to encode arbitrary numbers of control character
> sequences, if an implementation wanted to do that.
> - Can a character UNIQUELY map TO Unicode, and can a character UNIQUELY
> map FROM Unicode? This is not the case for, for example, Shift-JIS, which for
> historical reasons has duplicated characters, and Unicode itself has
> duplicate characters. Because of these duplications, a naive implementation
> that would follow the steps in order and do a verbatim mapping source ->
> Unicode -> execution may lose information (i.e., which of the duplicated
> characters was used). The information lost is of the byte value rather
> than of the semantics.
> There is a very important question here:
> - Should the standard mandate that byte values are preserved? (I
> think this would put severe constraints on implementations.)
> If we do believe that, then talking of source encoding from phases 1 to 5
> is useful; otherwise, if we believe that preservation of byte value is in
> the domain of QoI, an implementation can choose any path through phases 1
> to 5 as long as the _semantics_ are preserved.
> In phase 1, if one source character can be represented by more than one
> Unicode code point sequence, an implementation can choose which (in a
> scenario where phase 1 is semantically preserving, which we haven't decided
> to do yet). For example, Å (Angstrom sign) can be either U+212B or U+00C5 in
> phase 1.
> Similarly, ≒ (U+2252, Approximately Equal To or the Image Of) can be
> encoded in phase 5 as either 0x8790 or 0x81e0; an implementation can choose
> which.
> If the internal character set we choose in the standard is Unicode, the
> wording would lose the source information, such that we wouldn't be able
> to prescribe byte values, but an implementation could preserve them if it
> desired.
> Overall, the source question can be abstracted away:
> Given the string literal "\u2252", assuming a Shift-JIS narrow encoding,
> what should its byte value in the program be?
> - 0x8790
> - 0x81e0
> - Implementation-defined?
>> Also, C seems to support such transparent pass-through, and I think there
>> is value in keeping the C and C++ lexing behaviors as close together as
>> possible.
>> > In short, C++ should switch to a model B', omitting any mention of
>> "encoding"
>> > or "multibyte characters" for the early phases.
>> >
>> > Details:
>> >
>> > - Define "source character set" as having the following distinct
>> elements:
>> >
>> > * all Unicode characters (where character means "as identified
>> by a code point")
>> >
>> > * invented/hypothetical characters for all other Unicode code
>> points
>> > (where "Unicode code point" means integer values in the range
>> [0..0x10ffff],
>> > excluding [0xd800-0xdfff])
>> > Rationale: We want to be forward-compatible with future Unicode
>> standards
>> > that add more characters (and thus more assigned code points).
>> >
>> > * an implementation-defined set of additional elements
>> > (this is empty in a Unicode-only world)
>> >
>> >
>> > Again, 2 issues:
>> > * This describes an internal encoding, not a source encoding. We
>> should not talk about "source" past phase 1
>> It's still "source code", maybe internally represented, as opposed to
>> compiled machine code. Given that we already have the term "(basic)
>> source
>> character set" in the standard, I don't see a need to invent something
>> new.
>> I'm particularly non-enthused about the phrase "internal encoding"
>> (internal
>> relative to what?)
> The goal is to make it clear that it is not the encoding of source files.
>> > * There is no use case for a superset of Unicode. I described the
>> EBCDIC control character issue to the Unicode mailing list; it was
>> qualified as "daft".
>> As I said earlier, it appears that Unicode says that control characters
>> are essentially out-of-scope for them (which I sympathize with, from their
>> viewpoint), so I would not turn to Unicode for insight how to handle
>> EBCDIC control characters that don't have a semantic equivalent in
>> Unicode.
>> In an EBCDIC-only world, I think there is a real conflict between
>> an EBCDIC control character mapped to a C1 control character in phase 1 and
>> the presence of a UCN naming that same control character somewhere
>> in the original source code. The presence of the UCN may or may not
>> be intentional; I would like to allow implementations to flag this
>> situation.
> Their position is that a compiler cannot know what the semantics of a
> code point which is a C0 or C1
> control character are, as such characters have no semantics of their own.
> A compiler could flag C0/C1 UCN escape sequences in literals, if it
> wanted to.
> And again, I'm trying to be pragmatic here. The work IBM is doing to get
> Clang to support EBCDIC converts that EBCDIC to UTF-8.
>> > All characters that a C++ compiler ever has, does, or will care
>> about have a mapping to Unicode.
>> But possibly not a unique mapping from Unicode back to the original
>> character,
>> which seems useful for transparent string-literal pass-through.
>> Tom, I think the question whether there should be allowance for
>> pass-through of
>> characters beyond Unicode should be up for a straw poll at the next
>> telecon so
>> that we can make progress here.
> Again, for the record, my position is "should be allowed, not mandated".
> But I agree polling might help.
>> > - Define "basic source character set" as a subset of the "source
>> character set"
>> > with an explicit list of Unicode characters.
>> >
>> >
>> > There is no need for that construct -
>> These are the use-cases for the term "basic source character set":
>> - keywords are spelled in the basic source character set
>> - basic source characters can be represented in a single byte in plain
>> "char" literals
>> - UCNs denoting characters in the basic source character set are
>> ill-formed
>> [lex.charset] p2
>> - Timezone parsing (the table in [time.parse], flag %Z)
>> - do_widen / do_narrow [locale.ctype.virtuals]
>> So, the term seems to be useful as a descriptive tool when we're
>> intentionally
>> referring to that subset.
> I agree that there is a need for a term, but many of these can be better
> described in terms of, for example, a "basic literal character set" or, more
> accurately, a "basic literal character repertoire".
>> > I would actually prefer we don't try to define the execution character
>> set in terms of a basic one which is tied to the internal
>> > representation.
>> I don't think the standard needs to talk about internal "representation",
>> understood as specific code point values, at all, so I don't see the
>> confusion
>> here.
> We talk about "source" to refer to something that is not related to the source.
>> I don't think we need an execution character set per se, but it seems
>> worthwhile to
>> be able to say "for this particular small set of ASCII characters,
>> special constraints
>> for the literal encoding/representation apply".
> Agreed.
> And I think we might need slightly different definitions for some of the
> points you cited (which I was aware of; I meant that these things would
> need to be described differently, not that removing the term would have no
> ripple effect). My goal is that literals are not defined in terms of source.
>> > We need, however, to specify that the grammar uses characters from
>> Basic Latin, in case anybody is confused.
>> Agreed.
>> > - Translation phase 1 is reduced to
>> >
>> > "Physical source file characters are mapped, in an
>> implementation-defined manner,
>> > to the <del>basic</del> source character set (introducing new-line
>> characters for
>> > end-of-line indicators) if necessary. The set of physical source
>> file characters
>> > accepted is implementation-defined."
>> >
>> >
>> > I think the "introducing new-line characters for end-of-line
>> indicators" is confusing for reversal in raw string literals
>> I'm defining the new-line question as out-of-scope for this particular
>> endeavor.
>> > - Add a new phase 4+ that translates UCNs everywhere except
>> > in raw string literals to (non-basic) source characters.
>> > (This is needed to retain the status quo behavior that a UCN
>> > cannot be formed by concatenating string literals.)
>> >
>> > Is there value in not doing it for identifiers and string literals
>> explicitly?
>> "identifier" is ambiguous between phase 4 and phase 7 identifiers.
>> We can't translate UCNs in phase 4 (due to stringizing), but we want
>> a single spelling in phase 7 (so no confusion arises what goes into
>> linker symbols etc). Previously, the single spelling was "UCNs
>> everywhere"; now, the single spelling is "(extended) characters
>> everywhere".
> Right, I guess identifiers can be handled in phase 4+ or 7-,
> which means that pp-identifiers and pp-tokens can be composed of both UCN
> escape sequences and XID_Start/XID_Continue code points.
>> For string and char literals, it simplifies the treatment a bit,
>> because we only have to discuss an abstract "source character"
>> instead of branching off into "oh, and we're mapping UCNs here"
>> every so often.
>> Hm... I'm wondering whether ## token concatenation can form a UCN,
>> e.g. via bla\ ## u ## 0301 . We should make that ill-formed
>> or somehow prevent interpretation as a UCN.
> #define CONCAT(x,y) x##y
> CONCAT(\, U0001F431);
> is valid in all implementations I tested, and implementation-defined in the
> standard.
> Do you see a reason not to allow it? In particular, as we move UCN
> handling later
> in the process, it would make sense to allow these escape sequences to be
> created in phases 2 and 4 (might be evolutionary; there is a paper).
>> >> For example, in string literals, we want to allow Latin-1 encoding of
>> umlauts expressed as a Unicode base vowel plus combining mark, if an
>> implementation so chooses.
>> >
>> > I think people on the mailing list agreed that individual c-chars should
>> be encoded independently (I thought that was your opinion too?), which I
>> have come to agree with.
>> Fine with me.
>> > - In phase 5, we should go to "literal encoding" right away:
>> > There is no point in discussing a "character set" here; all
>> > we're interested in is a (sequence of) integer values that end
>> > up in the execution-time scalar value or array object corresponding
>> > to the source-code literal.
>> >
>> >
>> > Yep, agreed, as long as you find a way to describe that the encoding
>> preserves semantics.
>> What semantic? A string literal consists of a sequence of source
>> characters.
>> At the end, we get a sequence of integer values in an array object.
>> We can certainly weave a "corresponding" into the process, but that's
>> essentially vacuous handwaving from a normative standpoint.
>> (The preceding statement only applies to char and wchar_t encodings,
>> of course, not to the well-defined UTF-x encodings.)
> In particular, an implementation can do any conversion it wants, including
> replacing characters that have no representation with another; this is
> currently implementation-defined behavior, which is something I would like
> to make ill-formed. I know it's evolutionary; there is a paper.
>> Jens

Received on 2020-06-20 08:20:01