Re: [SG16] Stability of lexing for a fixed C++ version when combined with updates to UCS

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Wed, 30 Sep 2020 00:28:23 +0200
On Wed, 30 Sep 2020 at 00:00, Hubert Tong via SG16 <sg16_at_[hidden]> wrote:

> In general, lexing stability is not guaranteed for C++. For example, we
> added <=>.
> This means that the following program changes in behaviour between C++17
> and C++20:
> template <auto> int foo(...);
> template <typename T>
> auto bar(T &t) -> char (*)[0 ? sizeof(foo<&T::operator<=> != nullptr >
> (t)) : 42];
> template <typename T>
> void bar(const T &t);
> struct A {
> bool operator<=(const A &);
> friend decltype(nullptr) operator>(decltype(nullptr), const A &);
> } a;
> int main(void) { bar(a); }
> This was, however, a change made by the C++ committee and not by an
> external group.
> I am concerned if syntactically significant properties of characters may
> change between versions of UCS.
> The example that comes to mind is the possibility of a line separator
> character being added to UCS that would change where a C++-style //
> comment ends.

I think the intent at this point is that the set of line-breaking Unicode
code point sequences recognized after phase 1 should be explicitly listed
in the wording. It is a small list which has not changed since the 90s,
afaict. So even if new characters of that sort were introduced in Unicode,
they would not affect C++ without us noticing.
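
To illustrate what "explicitly listed" buys us, here is a minimal sketch. The
function names and the exact set of code points are my assumptions for the
sake of the example, not the proposed wording. Because the list is hard-coded,
a code point that is not on it (U+E000 stands in for one here) can never end
a // comment, whatever a later version of Unicode says about it.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical fixed list of line-terminating code points. The real list
// would be whatever the wording spells out; this one is only an assumption.
constexpr bool is_line_break(char32_t c) {
    switch (c) {
    case 0x000A: // LINE FEED
    case 0x000B: // LINE TABULATION
    case 0x000C: // FORM FEED
    case 0x000D: // CARRIAGE RETURN
    case 0x0085: // NEXT LINE
    case 0x2028: // LINE SEPARATOR
    case 0x2029: // PARAGRAPH SEPARATOR
        return true;
    default:
        return false;
    }
}

// Returns the index of the first listed line break at or after `start`,
// or src.size() if there is none. For a // comment beginning at `start`,
// that is where the comment ends.
constexpr std::size_t comment_end(std::u32string_view src, std::size_t start) {
    std::size_t i = start;
    while (i < src.size() && !is_line_break(src[i])) {
        ++i;
    }
    return i;
}

// The comment stops at the LINE FEED at index 4.
static_assert(comment_end(U"// a\nint x;", 0) == 4);
// U+E000 is not on the list, so the comment runs on to the LINE FEED.
static_assert(comment_end(U"// a\uE000b\nx", 0) == 6);
```

Nothing in this sketch consults a Unicode property, so updating the Unicode
database used by the compiler cannot change the result.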

The only thing that is currently fully delegated to Unicode is what is
considered an identifier (as of P1945).

That is not an issue, as the property is stable: non-identifiers can
become identifiers, but not the other way around. So a program which
compiles with a compiler using a Unicode 13 database might not compile
with one using a Unicode 11 database, for example, which is a reasonable
limitation.
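
A toy model of that one-way property, as a sketch only: the table, the names
and the "since" versions below are hard-coded assumptions for illustration
(in particular I am assuming U+30000..U+3134A, CJK Extension G, was first
assigned in Unicode 13), and a real implementation would consult the
XID_Start data from the Unicode Character Database instead.

```cpp
// Hypothetical miniature database: each range records the Unicode major
// version in which it is assumed to have become valid at the start of an
// identifier. The values are illustrative, not authoritative.
struct Range {
    char32_t first;
    char32_t last;
    int since;
};

constexpr Range xid_start_sample[] = {
    {0x0041, 0x005A, 1},     // A-Z
    {0x0061, 0x007A, 1},     // a-z
    {0x30000, 0x3134A, 13},  // CJK Extension G (assumed: Unicode 13)
};

// True if `c` may start an identifier according to the sample table as it
// would have looked in Unicode version `db_version`.
constexpr bool is_xid_start(char32_t c, int db_version) {
    for (const Range& r : xid_start_sample) {
        if (r.first <= c && c <= r.last && r.since <= db_version) {
            return true;
        }
    }
    return false;
}

// Accepted with the newer database, rejected with the older one.
static_assert(is_xid_start(0x30000, 13));
static_assert(!is_xid_start(0x30000, 11));
// Anything the older database accepts, the newer one accepts too.
static_assert(is_xid_start(U'A', 11) && is_xid_start(U'A', 13));
```

Since entries are only ever added, the answer for a given code point can go
from false to true as the version grows, never from true to false.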

In places where lexing does not look at properties (i.e. everything
except identifiers, as everything else is listed explicitly), whether a code
point is assigned or not is not meaningful to the semantics of the program.

However, by definition:

- In phase 1, a source character cannot be mapped to an unassigned code
point in a non-hostile implementation.
- In phase 5, unassigned code points cannot be represented in the
execution (or wide execution) encodings if that execution encoding does not
encode Unicode code points.

Does that make sense?

> Perhaps this is motivation to ban source code with unassigned characters
> outside of string and character literals.
> Maybe I'm just late to the party?
> -- HT

Received on 2020-09-29 17:28:37