On Tue, Sep 29, 2020 at 6:28 PM Corentin Jabot <corentinjabot@gmail.com> wrote:

On Wed, 30 Sep 2020 at 00:00, Hubert Tong via SG16 <sg16@lists.isocpp.org> wrote:
In general, lexing stability is not guaranteed for C++. For example, we added <=>.

This means that the following program changes in behaviour between C++17 and C++20:
template <auto> int foo(...);

template <typename T>
auto bar(T &t) -> char (*)[0 ? sizeof(foo<&T::operator<=> != nullptr > (t)) : 42];

template <typename T>
void bar(const T &t);

struct A {
  bool operator<=(const A &);
  friend decltype(nullptr) operator>(decltype(nullptr), const A &);
} a;

int main(void) { bar(a); }

This was, however, a change made by the C++ committee and not by an external group.

I am concerned that syntactically significant properties of characters may change between versions of UCS.
The example that comes to mind is the possibility of a line separator character being added to UCS that would change where a C++-style // comment ends.

I think the intent at this point is that the set of line-breaking Unicode code point sequences after phase 1 should be explicitly listed in the wording. This is a small list which has not changed since the 90s, afaict.
So even if new characters of that sort were to be introduced in Unicode, it would not impact C++ without us noticing.
Sounds good.
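
To illustrate what "explicitly listed" could look like, here is a minimal sketch. The list below is an assumption: it uses the newline functions from the Unicode newline guidelines (LF, VT, FF, CR, NEL, LS, PS), not a list taken from any proposed wording, and the names is_line_terminator and comment_end are hypothetical. The point is only that a hard-coded list cannot change behind the committee's back when a new version of Unicode is published.

```cpp
#include <cstddef>
#include <string_view>

// Assumed explicit list of line terminators. A character added to a
// later version of Unicode is not in this list, so it cannot end a comment.
constexpr bool is_line_terminator(char32_t c) {
    switch (c) {
        case 0x000A: // LINE FEED
        case 0x000B: // LINE TABULATION
        case 0x000C: // FORM FEED
        case 0x000D: // CARRIAGE RETURN
        case 0x0085: // NEXT LINE
        case 0x2028: // LINE SEPARATOR
        case 0x2029: // PARAGRAPH SEPARATOR
            return true;
        default:
            return false;
    }
}

// Returns the index of the first line terminator at or after `start`,
// or src.size() if the comment runs to the end of the input.
constexpr std::size_t comment_end(std::u32string_view src, std::size_t start) {
    std::size_t i = start;
    while (i < src.size() && !is_line_terminator(src[i])) {
        ++i;
    }
    return i;
}
```

With this sketch, a // comment ends at U+2028 exactly as it does at U+000A, and nowhere else unless the list itself is edited.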

The only thing that is currently fully handled by Unicode is what is considered an identifier (as of P1945).

That is not an issue, as the property is stable.
Non-identifiers can become identifiers, but not the other way around. So a program which compiles with a compiler using a Unicode 13 database might not compile with one using a Unicode 11 database, for example, which is a reasonable limitation.
And the non-identifiers that do become identifiers were ill-formed following P1945 before becoming identifiers.
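
A toy model of that one-way stability, as a sketch only. Both "databases" below are hypothetical stand-ins: a real implementation would consult the Unicode identifier properties, the distinction between start and continue characters is ignored, and U+E000 (a private-use code point) merely plays the role of a character that gained identifier status in the newer version.

```cpp
#include <string_view>

// Hypothetical "older" database: only a-z and '_' are identifier characters.
constexpr bool is_id_char_old(char32_t c) {
    return (c >= U'a' && c <= U'z') || c == U'_';
}

// Hypothetical "newer" database: everything the old one accepts, plus one
// more code point. Nothing is ever removed, which is the stability guarantee.
constexpr bool is_id_char_new(char32_t c) {
    return is_id_char_old(c) || c == 0xE000; // stand-in for a new identifier character
}

// True if every code point of `s` is an identifier character under `is_id_char`.
template <typename Pred>
constexpr bool is_identifier(std::u32string_view s, Pred is_id_char) {
    if (s.empty()) {
        return false;
    }
    for (char32_t c : s) {
        if (!is_id_char(c)) {
            return false;
        }
    }
    return true;
}
```

In this model, anything accepted with the old database is accepted with the new one, while an identifier using the added code point is rejected by the old database rather than being lexed as something else.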

In places where lexing does not look at properties (i.e. everything except identifiers, since everything else is listed explicitly), whether a code point is assigned or not is not meaningful to the semantics of the program.

However, by definition:

- In phase 1, a source character cannot be mapped to an unassigned code point in a non-hostile implementation.
- In phase 5, unassigned code points cannot be represented in the execution (or wide execution) encoding if that encoding does not encode Unicode code points.

Does that make sense?

Perhaps this is motivation to ban source code with unassigned characters outside of string and character literals.
Maybe I'm just late to the party?

-- HT