Re: P2749 (down with "character") and P2736 (Referencing the Unicode Standard) updates

From: Fraser Gordon <fraserjgordon+cpp_at_[hidden]>
Date: Fri, 27 Jan 2023 08:34:26 -0500
Thanks Steve & Corentin for your detailed replies. I'll try to ask shorter
questions in future :)

Corentin - you've definitely convinced me on the "codepoint" vs "abstract
character" front. The answer to my question is that yes, it's a deliberate
decision. And the explanation you've given contains the answers if anyone
does ask if the wording change causes differences in behaviour. Would it be
worth summarising it in the paper?

The only thing I'm not 100% sure about is that [diff.cpp14.lex] might now
imply that trigraphs are disallowed in UTF-8 source files, an implication
that wasn't there before (because the UTF-8 -> Unicode mapping isn't
implementation-defined, but UTF-8 -> "translation character set" was).
As long as you're happy it doesn't say that, I'll stop asking about it
(like most people, I've only ever "used" trigraphs accidentally).

Corentin & Steve - I knew there'd be reasons but I couldn't think what they
were: the lexer is scary and full of unintended consequences.


On Fri, 27 Jan 2023 at 04:32, Corentin <corentin.jabot_at_[hidden]> wrote:

> On Fri, Jan 27, 2023 at 4:05 AM Fraser Gordon <fraserjgordon+cpp_at_[hidden]>
> wrote:
>> Hi Corentin,
>> Thanks for your work on these papers. The many meanings of
>> 'character' (not just in C++) are something that irritates me, so I'm very
>> happy to see the term removed.
>> I started replying to the paper with some minor points but then I started
>> to overthink things. I've left it all here in case it's useful but the bits
>> that are directly related to the paper are the first set of bullets :)
>> ---
>> I'm going through D2749 and have some comments/questions:
>> - "Unicode code point" feels awkward to read (and write!). Is it
>> worth defining a term so it can be shortened? In [lex.ext] you've used
>> "code point" without "Unicode".
> Code point is generally applicable to any encoding, not just to Unicode.
> Where there would be no possible confusion, I did try not to say Unicode
> redundantly. At the same time, I want to make sure that we can keep talking
> about a Shift-JIS code point, for example, especially in the library
> section.
> Generally, the wording is juggling around five different encodings, which
> even SG-16 members have historically struggled to tell apart.
> Being exact is important here.
> I think these changes look a bit dense in isolation, but if you were to
> read the whole standard with them applied, it would not be insufferable.
>> - [lex.charset]: Table 1 (basic character set) retains "character",
>> as does the text referring to it, but Table 3 (additional control
>> characters in the basic literal character set) replaces "character" with
>> "Unicode code point". The text referencing Table 3 describes the contents
>> of Table 1 as "abstract characters" and those of Table 3 as "control
>> characters". The text also contains a note saying that the u+xxxx values
>> don't have a meaning in this context. Based on that note and the normative
>> text, I think the column heading on both tables and the references in the
>> text should be "abstract character" consistently.
> Yes, that change to Table 1 should be reverted for consistency, thanks.
>> - as an alternative to "abstract character" above, I think "Unicode
>> code point" would also work if used everywhere due to all of the abstract
>> characters in the basic set mapping to a single codepoint. It would become
>> incorrect in the (unlikely?) event that the basic set expands to add
>> abstract characters that are represented by combining sequences.
> We could. In fact, I started doing that, but found it maybe too pedantic
> and too churny, as there is no ambiguity. But we could go further
> editorially afterwards, if we want to.
> For example, every time we say "a quotation character" we could instead
> say "U+0022 quotation mark".
>> - [diff.cpp14.lex] I'm not sure this section makes much sense now. It
>> ends (with your changes) "as part of the implementation-defined mapping
>> from input source file characters to Unicode." But is that still
>> implementation-defined? With the removal of 'translation character set' and
>> with the requirement to accept UTF-8 input files, the conversion from UTF-8
>> 'characters' to Unicode is defined by the Unicode standard. Instead of
>> replacing "the translation character set" with "Unicode", would it be
>> better to strike the whole "as part of the..." clause from that sentence?
> Yes, it is. There are two scenarios now in phase 1. Either the source file
> is UTF-8, in which case you are just decoding that to codepoints. Or it is
> anything else (including EBCDIC, Morse, an ancient cuneiform tablet), and
> reading that source file requires converting it to Unicode code points,
> somehow, which is left entirely implementation-defined.
> Whether that conversion is to Unicode or the translation character set
> (which is isomorphic to Unicode) makes no difference.
>> ---
>> Expanding on the 2nd and 3rd bullets above: I'm on the fence in terms of
>> usage of "Unicode code point" vs "abstract character" (or "Unicode abstract
>> character") in some places. The code point phrasing is definitely
>> appropriate where we're talking about e.g. named escape sequences and it's
>> probably how everything will be implemented anyway, because the requirement
>> that everything be in Normal Form C means each abstract character has a 1:1
>> mapping with a combining sequence (I think?).
>> As an example, the 2nd paragraph of [lex.pptoken] says:
>>> [...] If any characters Unicode code point not in the basic character
>>> set matches the last category, the program is ill-formed.
>> I think it'd be more correct to use "abstract character" rather than
>> "code point" here because even though the mechanism will be codepoints, the
>> underlying thing being talked about is the abstract character.
> Lexing is completely unaware of abstract characters, which could be
> composed of multiple codepoints, for example.
> We read the next 21-bit value in a sequence of 21-bit values.
> Abstract characters in general do not make sense in computer programs, and
> are only useful when mapping between two different encodings (both Unicode
> and EBCDIC have a codepoint that represents the letter 'a'), or when
> talking about characters in the abstract.
>> Or, to put it another way: what is "and͓"?
>> As specified & implemented, it's an identifier formed from 4 codepoints.
>> So the mechanistic answer is that it's an identifier in the same way that
>> "andd" is; the keyword "and" just happens to be a prefix.
>> But for a human looking at it, they'd say either "it's different because
>> 'd' and 'd͓' are qualitatively different (i.e they're different abstract
>> characters); the shared prefix with 'and' is 'an'" or "it's the same as
>> 'and'; the ◌͓ mark is not significant". (We've clearly decided combining
>> marks are significant in C++ so we can ignore the latter interpretation.)
>> Having written this out, I'm mostly convinced that the mechanistic
>> (codepoint) and human (abstract character) interpretations will produce the
>> same results (I'd be interested in any ideas for counter-examples!) so I
>> don't think there would be a behavioural change with either. I still think
>> the intent matters though.
>> This may have all been discussed before (if so, apologies for making you
>> read all of that) and it was decided that codepoint was the better term
>> and/or that we definitely want to specify in terms of codepoints because of
>> implementation constraints (or even that abstract characters are just too
>> hard to nail down in meaningful standardese). It's hard to tell though
>> because the current wording uses "character" everywhere and by changing to
>> "Unicode code point", we're coming down on one side of that fence and I
>> want to be sure that's a deliberate decision.
> Where it matters is that we don't want implementations to be that clever.
> Lexing does (and always has; we are just clarifying) look at the next
> codepoint. Looking at the next abstract character is not possible, because
> there is no specification that can describe what an abstract character is
> to a machine, it's a very human-centric concept (You and I know what a 'U'
> is, or pretend to at least, because we could have a very long
> philosophical debate about whether 'U' actually is 'U' depending on
> historical context - but the compiler only knows that U+0055 has
> XID_Start=true). We could look at the next grapheme, which is a Unicode
> specification for codepoint sequences representing whole abstract
> characters, but 1/ Unicode does not specify identifiers that way, and 2/
> we don't want to force an implementation to look at grapheme boundaries
> (because implementations have not historically done so, because it would
> force them to do extra lookups during lexing, which would have performance
> implications, and because it would make the behavior of lexing dependent
> on Unicode versions, none of which are desirable properties of a lexer),
> and because there is no clear benefit to behaving that way, as it would be
> mostly unobservable.
> One could argue that it would produce better diagnostics, but then again,
> whether a grapheme would be a valid identifier does, as you note, depend
> not on any property of that grapheme but on the normalization form the
> source file is in. We also found that diagnostics are more actionable when
> they mention codepoints, because of invisible glyphs, normalization form
> and so forth.
> Another way to look at it is that, except for the purpose of extracting
> graphemes themselves, Unicode never considers graphemes: all behaviors are
> specified on codepoints, and the usefulness of graphemes is mostly limited
> to text processing and counting the length of tweets.
>> ---
>> Because I've never actually read the lexing bits of the standard closely
>> before there's a few things (completely unrelated to this paper) I noticed
>> and found surprising:
>> - [lex.ccon]/[lex.string]: *basic-c-char* allows any codepoint except
>> apostrophe, backslash and newlines. Lots of interesting control and
>> presentation chars (e.g. bidi overrides) are allowed! Ditto for
>> *basic-s-char*, but less surprising in string literals than in character
>> literals.
>> - [lex.charset]: *n-char* is similar to the above, despite the
>> restricted alphabet that Unicode uses to name characters. I assume this is
>> for implementation simplicity?
> That was changed fairly recently, see
> https://cplusplus.github.io/CWG/issues/2640.html
> I certainly do not think it made the implementation easier, but it changes
> when diagnostics are produced when these things appear in macros.
>> - [lex.ccon]: *hexadecimal-escape-sequence* looks wrong. Raised as a CWG
>> issue <https://cplusplus.github.io/CWG/issues/2691.html>.
> Thanks
>> ---
>> This was meant to be a short email when I started writing it :/
>> Fraser
>> On Thu, 26 Jan 2023 at 04:42, Corentin via SG16 <sg16_at_[hidden]>
>> wrote:
>>> Hey folks:
>>> I published https://isocpp.org/files/papers/P2736R1.pdf - with the
>>> changes requested by SG16 yesterday as part of the forwarding poll.
>>> Interesting changes were:
>>> * to constantly mention UAX XX of The Unicode Standard
>>> * __STDC_ISO_10646__ : An integer literal of the form yyyymmL (for
>>> example, 199712L). If this symbol is defined, then its value is
>>> implementation-defined
>>> * Specifically mention UTF-8, UTF-16 and UTF-32 instead of Unicode
>>> encoding
>>> https://isocpp.org/files/papers/D2749R0.pdf down with character:
>>> - Remove the footnote about old linkers
>>> - Apply the character -> codepoint changes to the annexes and [diff]
>>> sections
>>> - remove a stale cross reference in phase 1 of translations
>>> - various typos
>>> Thanks,
>>> Corentin
>>> --
>>> SG16 mailing list
>>> SG16_at_[hidden]
>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16

Received on 2023-01-27 13:34:39