sg16: Re: [SG16] Stability of lexing for a fixed C++ version when combined with updates to UCS

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Tue, 29 Sep 2020 18:39:19 -0400

On Tue, Sep 29, 2020 at 6:28 PM Corentin Jabot <corentinjabot_at_[hidden]>
wrote:

>
>
> On Wed, 30 Sep 2020 at 00:00, Hubert Tong via SG16 <sg16_at_[hidden]>
> wrote:
>
>> In general, lexing stability is not guaranteed for C++. For example, we
>> added <=>.
>>
>> This means that the following program changes in behaviour between C++17
>> and C++20:
>> template <auto> int foo(...);
>>
>> template <typename T>
>> auto bar(T &t)
>>     -> char (*)[0 ? sizeof(foo<&T::operator<=> != nullptr >(t)) : 42];
>>
>> template <typename T>
>> void bar(const T &t);
>>
>> struct A {
>> bool operator<=(const A &);
>> friend decltype(nullptr) operator>(decltype(nullptr), const A &);
>> } a;
>>
>> int main(void) { bar(a); }
>>
>> This was, however, a change made by the C++ committee and not by an
>> external group.
>>
>> I am concerned if syntactically significant properties of characters may
>> change between versions of UCS.
>> The example that comes to mind is the possibility of a line separator
>> character being added to UCS that would change where a C++-style //
>> comment ends.
>>
>
> I think the intent at this point is that the set of line-breaking Unicode
> code point sequences after phase 1 should be explicitly listed by the
> wording; this is a small list which has not changed since the 90s afaict.
> So even if new characters of the sort were to be introduced in Unicode, it
> would not impact C++ without us noticing.
>
Sounds good.
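
To make the explicit-list approach concrete, here is a minimal sketch (not proposed wording). The thread does not say which code points the wording would list; the set below is an assumption taken from the Unicode newline guidelines (LF, VT, FF, CR, NEL, LS, PS). The names is_listed_line_break and line_comment_end are hypothetical, and the code assumes C++17 for std::u32string_view.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical fixed list of line-breaking code points. Because the list is
// spelled out here rather than derived from a Unicode property, a new UCS
// version cannot change the answer for any code point.
constexpr bool is_listed_line_break(char32_t c) {
    switch (c) {
        case 0x000A:  // LINE FEED
        case 0x000B:  // LINE TABULATION
        case 0x000C:  // FORM FEED
        case 0x000D:  // CARRIAGE RETURN
        case 0x0085:  // NEXT LINE
        case 0x2028:  // LINE SEPARATOR
        case 0x2029:  // PARAGRAPH SEPARATOR
            return true;
        default:
            return false;
    }
}

// Returns the index at which a // comment starting at `start` ends: the first
// listed line break at or after `start`, or src.size() if there is none.
constexpr std::size_t line_comment_end(std::u32string_view src,
                                       std::size_t start) {
    for (std::size_t i = start; i < src.size(); ++i) {
        if (is_listed_line_break(src[i])) {
            return i;
        }
    }
    return src.size();
}

// The comment starts at index 7 and ends at the LINE SEPARATOR at index 14.
static_assert(line_comment_end(U"int x; // note\u2028int y;", 7) == 14,
              "comment ends at U+2028");
```

A code point outside the list never ends the comment in this sketch, whatever properties a later UCS version gives it. That is the stability property being asked for.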

>
> The only thing that is currently fully handled by Unicode is what is
> considered an identifier (as of P1945).
>
> Which is not an issue as the property is stable.
> Non-identifiers can become identifiers, not the other way around. So a
> program which might compile with a compiler with a Unicode 13 database
> might not compile with one with a Unicode 11 database for example, which
> is a reasonable limitation.
>
And the non-identifiers that do become identifiers were ill-formed
following P1945 before becoming identifiers.
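
The "only grows" argument can be sketched as follows. This is a toy model, not P1945's actual rules: the two databases are hypothetical hand-written sets rather than real Unicode property data, the start/continue distinction is ignored, U+E000 is only a stand-in for a code point that gains identifier status in a later version, and is_identifier and only_grows are names invented here.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string_view>

using CodePointSet = std::set<char32_t>;

// True when every code point of `token` is allowed in identifiers by `db`.
bool is_identifier(std::u32string_view token, const CodePointSet& db) {
    return !token.empty() &&
           std::all_of(token.begin(), token.end(),
                       [&](char32_t c) { return db.count(c) != 0; });
}

// The stability guarantee: the newer database is a superset of the older one.
bool only_grows(const CodePointSet& older, const CodePointSet& newer) {
    return std::includes(newer.begin(), newer.end(),
                         older.begin(), older.end());
}

void check_example() {
    const CodePointSet older{U'a', U'b'};
    const CodePointSet newer{U'a', U'b', char32_t{0xE000}};
    const char32_t token[] = {U'a', char32_t{0xE000}};
    const std::u32string_view id(token, 2);

    assert(only_grows(older, newer));
    // Anything accepted by the older database is still accepted by the newer.
    assert(is_identifier(U"ab", older) && is_identifier(U"ab", newer));
    // The new identifier was rejected (ill-formed) before, never something else.
    assert(!is_identifier(id, older));
    assert(is_identifier(id, newer));
}
```

In this model a token can only move from rejected to accepted between database versions, so no previously well-formed program changes meaning.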

>
> In places where the lexing does not look at properties (i.e. everything
> except identifiers, as everything else is listed explicitly), whether a code
> point is assigned or not is not meaningful to the semantics of the program.
>
> However, by definition:
>
> - In phase 1, a source character cannot be mapped to an unassigned code
> point in a non-hostile implementation
> - In phase 5, unassigned code points cannot be represented in the
> execution (or wide execution) encodings if that execution encoding does not
> encode Unicode code points.
>
> Does that make sense?
>
Yes.

>
>
>
>> Perhaps this is motivation to ban source code with unassigned characters
>> outside of string and character literals.
>> Maybe I'm just late to the party?
>>
>> -- HT
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>
>

Received on 2020-09-29 17:39:39