sg16

Re: P2749 (down with "character") and P2736 (Referencing the Unicode Standard) updates

From: Corentin <corentin.jabot_at_[hidden]>
Date: Fri, 27 Jan 2023 10:32:41 +0100
On Fri, Jan 27, 2023 at 4:05 AM Fraser Gordon <fraserjgordon+cpp_at_[hidden]>
wrote:

> Hi Corentin,
>
> Thanks for your work on these papers. The multiple meanings of 'character'
> (not just in C++) is something that irritates me so I'm very happy to see
> it removed.
>
> I started replying to the paper with some minor points but then I started
> to overthink things. I've left it all here in case it's useful but the bits
> that are directly related to the paper are the first set of bullets :)
>
> ---
>
> I'm going through D2749 and have some comments/questions:
>
> - "Unicode code point" feels awkward to read (and write!). Is it worth
> defining a term so it can be shortened? In [lex.ext] you've used "code
> point" without "Unicode".
>

Code point is generally applicable to any encoding, not just to Unicode.
Where there was no possible confusion, I tried not to say "Unicode"
redundantly. At the same time, I want to make sure that we can keep talking
about a Shift-JIS code point, for example, especially in the library
section.

Generally, the wording is juggling around five different encodings,
which even SG16 members have historically struggled to tell apart.
Being exact is important here.
I think these changes look a bit dense in isolation, but if you were to read
the whole standard with these changes applied, it would not be insufferable.


>
> - [lex.charset]: Table 1 (basic character set) retains "character", as
> does the text referring to it, but Table 3 (additional control characters
> in the basic literal character set) replaces "character" with "Unicode code
> point". The text referencing Table 3 describes the contents of Table 1 as
> "abstract characters" and those of Table 3 as "control characters". The
> text also contains a note saying that the u+xxxx values don't have a
> meaning in this context. Based on that note and the normative text, I think
> the column heading on both tables and the references in the text should be
> "abstract character" consistently.
>

Yes, that change to Table 1 should be reverted for consistency, thanks.

>
> - as an alternative to "abstract character" above, I think "Unicode
> code point" would also work if used everywhere due to all of the abstract
> characters in the basic set mapping to a single codepoint. It would become
> incorrect in the (unlikely?) event that the basic set expands to add
> abstract characters that are represented by combining sequences.
>

We could. In fact, I started doing that, but found it to be maybe too
pedantic and too churny, as there is no ambiguity. But we could go further
editorially afterwards, if we want to.
For example, every time we say "a quotation character" we could say
"U+0022 quotation mark".


>
> - [diff.cpp14.lex] I'm not sure this section makes much sense now. It
> ends (with your changes) "as part of the implementation-defined mapping
> from input source file characters to Unicode." But is that still
> implementation-defined? With the removal of 'translation character set' and
> with the requirement to accept UTF-8 input files, the conversion from UTF-8
> 'characters' to Unicode is defined by the Unicode standard. Instead of
> replacing "the translation character set" with "Unicode", would it be
> better to strike the whole "as part of the..." clause from that sentence?
>
>
Yes, it is. There are now two scenarios in phase 1. Either the source file
is UTF-8, in which case you are just decoding it to codepoints, or it is
anything else (including EBCDIC, Morse, an ancient cuneiform tablet), and
reading that source file requires converting it to Unicode code points
somehow, which is left entirely implementation-defined.
Whether that conversion is to Unicode or to the translation character set
(which is isomorphic to Unicode) makes no difference.
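
To make the UTF-8 scenario concrete, here is a minimal sketch of the "just decoding to codepoints" step. The names `Decoded`, `decode_one` and `count_code_points` are hypothetical and do not come from the paper or from any implementation; the sketch assumes the input is already known to be UTF-8 and treats ill-formed sequences as errors.

```cpp
#include <cstddef>
#include <string_view>

// Result of decoding one UTF-8 sequence: the code point and the number of
// bytes consumed. length == 0 signals an ill-formed or truncated sequence.
struct Decoded {
    char32_t code_point;
    std::size_t length;
};

// Hypothetical helper: decode the UTF-8 sequence at the start of `in`.
// Rejects overlong forms, surrogates, and values above U+10FFFF.
constexpr Decoded decode_one(std::string_view in) {
    if (in.empty()) return {0, 0};
    const auto b0 = static_cast<unsigned char>(in[0]);
    if (b0 < 0x80) return {b0, 1};  // single-byte (ASCII) sequence

    std::size_t len = 0;
    char32_t cp = 0;
    char32_t lowest = 0;  // smallest code point allowed for this length
    if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; lowest = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; lowest = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; lowest = 0x10000; }
    else return {0, 0};  // stray continuation byte or invalid lead byte

    if (in.size() < len) return {0, 0};  // truncated sequence
    for (std::size_t i = 1; i < len; ++i) {
        const auto b = static_cast<unsigned char>(in[i]);
        if ((b & 0xC0) != 0x80) return {0, 0};  // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }
    if (cp < lowest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return {0, 0};
    return {cp, len};
}

// Count the code points in a UTF-8 buffer; returns -1 on ill-formed input.
constexpr long count_code_points(std::string_view in) {
    long n = 0;
    while (!in.empty()) {
        const Decoded d = decode_one(in);
        if (d.length == 0) return -1;
        in.remove_prefix(d.length);
        ++n;
    }
    return n;
}
```

Nothing in this step is a choice for the implementation: the bytes `61 6E 64 CD 93` can only decode to U+0061 U+006E U+0064 U+0353. The implementation-defined part is confined to the other scenario, where the source file is not UTF-8.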


> ---
>
> Expanding on the 2nd and 3rd bullets above: I'm on the fence in terms of
> usage of "Unicode code point" vs "abstract character" (or "Unicode abstract
> character") in some places. The code point phrasing is definitely
> appropriate where we're talking about e.g. named escape sequences and it's
> probably how everything will be implemented anyway, because the requirement
> that everything be in Normal Form C means each abstract character has a 1:1
> mapping with a combining sequence (I think?).
>
> As an example, the 2nd paragraph of [lex.pptoken] says:
>
>> [...] If any characters Unicode code point not in the basic character
>> set matches the last category, the program is ill-formed.
>
>
> I think it'd be more correct to use "abstract character" rather than "code
> point" here because even though the mechanism will be codepoints, the
> underlying thing being talked about is the abstract character.
>

Lexing is completely unaware of abstract characters, which could be
composed of multiple codepoints, for example.
We read the next 21-bit value in a sequence of 21-bit values.
Abstract characters in general do not make sense in computer programs, and
are only useful when mapping between two different encodings (both Unicode
and EBCDIC have a codepoint that represents the letter 'a'), or when
talking about characters in the abstract.
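
As a small illustration of that gap (a sketch only; the variable and function names are made up for this example), the same abstract character "e with acute accent" can be spelled as one codepoint or as two, and anything that works on codepoints sees two different sequences:

```cpp
#include <string_view>

// One abstract character, "e with acute accent", spelled two ways.
inline constexpr std::u32string_view precomposed = U"\u00E9";   // U+00E9
inline constexpr std::u32string_view decomposed  = U"e\u0301";  // U+0065 U+0301

// Code-point-wise comparison, which is all that a consumer working on
// code points can observe. No normalization is applied here.
constexpr bool same_code_points(std::u32string_view a, std::u32string_view b) {
    return a == b;
}
```

`precomposed.size()` is 1, `decomposed.size()` is 2, and `same_code_points(precomposed, decomposed)` is false, even though a human reader would call both "the same character".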


>
> Or, to put it another way: what is "and͓"?
>
> As specified & implemented, it's an identifier formed from 4 codepoints.
> So the mechanistic answer is that it's an identifier in the same way that
> "andd" is; the keyword "and" just happens to be a prefix.
>
> But for a human looking at it, they'd say either "it's different because
> 'd' and 'd͓' are qualitatively different (i.e. they're different abstract
> characters); the shared prefix with 'and' is 'an'" or "it's the same as
> 'and'; the ◌͓ mark is not significant". (We've clearly decided combining
> marks are significant in C++ so we can ignore the latter interpretation.)
>
> Having written this out, I'm mostly convinced that the mechanistic
> (codepoint) and human (abstract character) interpretations will produce the
> same results (I'd be interested in any ideas for counter-examples!) so I
> don't think there would be a behavioural change with either. I still think
> the intent matters though.
>
> This may have all been discussed before (if so, apologies for making you
> read all of that) and it was decided that codepoint was the better term
> and/or that we definitely want to specify in terms of codepoints because of
> implementation constraints (or even that abstract characters are just too
> hard to nail down in meaningful standardese). It's hard to tell though
> because the current wording uses "character" everywhere and by changing to
> "Unicode code point", we're coming down on one side of that fence and I
> want to be sure that's a deliberate decision.
>

Where it matters is that we don't want implementations to be that clever.

Lexing does (and always has; we are just clarifying) look at the next
codepoint. Looking at the next abstract character is not possible, because
there is no specification that can describe what an abstract character is
to a machine. It's a very human-centric concept. (You and I know what a 'U'
is, or pretend to at least, because we could have a very long
philosophical debate about whether 'U' actually is 'U' depending on
historical context, but the compiler only knows that U+0055 has
XID_Start=true.)

We could look at the next grapheme, which is a Unicode specification for
codepoint sequences representing whole abstract characters, but:

1. Unicode does not specify identifiers that way.
2. We don't want to force an implementation to look at grapheme boundaries:
   implementations have not historically done so, it would force extra
   lookups during lexing (with performance implications), and it would make
   the behavior of lexing dependent on the Unicode version. None of these
   are desirable properties of a lexer.
3. There is no clear benefit to behaving that way, as it would be mostly
   unobservable.

One could argue that it would produce better diagnostics, but then again,
whether a grapheme would be a valid identifier does, as you note, depend
not on any property of that grapheme but on the normalization form the
source file is in. We also found that diagnostics are more actionable when
they mention codepoints, because of invisible glyphs, normalization form
and so forth.
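
A toy sketch of what "look at the next codepoint" means for the "and͓" example from earlier in the thread. The predicates below are deliberately simplified stand-ins for the real XID_Start / XID_Continue tables, and all names are hypothetical:

```cpp
#include <cstddef>
#include <string_view>

// Toy stand-ins for the Unicode XID_Start / XID_Continue properties.
// Assumption: only ASCII letters, '_' (which C++ also accepts at the start
// of an identifier), ASCII digits, and the Combining Diacritical Marks block
// (U+0300..U+036F) are modelled. A real lexer uses the full Unicode tables.
constexpr bool toy_xid_start(char32_t c) {
    return (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z') || c == U'_';
}
constexpr bool toy_xid_continue(char32_t c) {
    return toy_xid_start(c) || (c >= U'0' && c <= U'9') ||
           (c >= 0x0300 && c <= 0x036F);
}

// Returns the number of code points in the identifier at the start of `src`
// (0 if there is none). The scan inspects exactly one code point at a time
// and never asks where a grapheme begins or ends.
constexpr std::size_t scan_identifier(std::u32string_view src) {
    if (src.empty() || !toy_xid_start(src[0])) return 0;
    std::size_t n = 1;
    while (n < src.size() && toy_xid_continue(src[n])) ++n;
    return n;
}

// Keyword check on the scanned token, again purely code-point-wise.
constexpr bool is_keyword_and(std::u32string_view token) {
    return token == U"and";
}
```

With these definitions, `scan_identifier(U"and\u0353 x")` yields a 4-codepoint token that is not the keyword `and`, while `scan_identifier(U"and x")` yields the 3-codepoint keyword. A lone U+0353 cannot start an identifier, which the scan decides from that single codepoint alone.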

Another way to look at it is that, except for the purpose of extracting
graphemes, Unicode never considers graphemes. All behaviors are specified
on codepoints, and the usefulness of graphemes is mostly limited to text
processing and counting the length of tweets.



>
> ---
>
> Because I've never actually read the lexing bits of the standard closely
> before there's a few things (completely unrelated to this paper) I noticed
> and found surprising:
>
> - [lex.ccon]/[lex.string]: *basic-c-char* allows any codepoint except
> apostrophe, backslash and newlines. Lots of interesting control and
> presentation chars (e.g. bidi overrides) are allowed! Ditto for
> *basic-s-char* but less surprising in string literals than character
> literals.
>
>
> - [lex.charset]: *n-char* is similar to the above, despite the
> restricted alphabet that Unicode uses to name characters. I assume this is
> for implementation simplicity?
>

That was changed fairly recently, see
https://cplusplus.github.io/CWG/issues/2640.html
I certainly do not think it made the implementation easier, but it changes
when diagnostics are produced when these things appear in macros.


>
> - [lex.ccon]: *hexadecimal-escape-sequence* looks wrong. Raised as a CWG
> issue <https://cplusplus.github.io/CWG/issues/2691.html>.
>

Thanks

> ---
>
> This was meant to be a short email when I started writing it :/
>
> Fraser
>
> On Thu, 26 Jan 2023 at 04:42, Corentin via SG16 <sg16_at_[hidden]>
> wrote:
>
>> Hey folks:
>>
>> I published https://isocpp.org/files/papers/P2736R1.pdf - with the
>> changes requested by SG16 yesterday as part of the forwarding poll.
>> Interesting changes were
>>
>> * to consistently mention UAX XX of The Unicode Standard
>> * __STDC_ISO_10646__ : An integer literal of the form yyyymmL (for
>> example, 199712L). If this symbol is defined, then its value is
>> implementation-defined
>> * Specifically mention UTF-8, UTF-16 and UTF-32 instead of Unicode
>> encoding
>>
>> https://isocpp.org/files/papers/D2749R0.pdf down with character:
>>
>> - Remove the footnote about old linkers
>> - Apply the character -> codepoint changes to the annexes and [diff]
>> sections
>> - remove a stale cross reference in phase 1 of translations
>> - various typos
>>
>>
>> Thanks,
>>
>> Corentin
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>
>

Received on 2023-01-27 09:32:55