sg16: Re: [SG16-Unicode] NL 029 : Disallow zero-width and control characters

From: Steve Downey <sdowney_at_[hidden]>
Date: Sat, 26 Oct 2019 19:29:25 -0400

Building a static checker wouldn't be that hard, and would mean the
compiler doesn't need to have a deep understanding of the Unicode database.
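
To make that concrete, here is a minimal sketch of such a checker. It is in
Python rather than C++ so that it can lean on the standard unicodedata
module instead of shipping its own copy of the Unicode database. The
function name find_suspicious, the set of tolerated whitespace controls, and
the choice to flag general categories Cf (format) and Cc (control) are all
assumptions for illustration, not anything SG16 has agreed on. It scans the
whole file rather than just identifiers, so it would also flag these
characters in comments and string literals.

```python
import unicodedata

# Whitespace controls that ordinary source files legitimately contain.
# (Assumption: everything else in Cc/Cf is worth reporting.)
ALLOWED_CONTROLS = {"\t", "\n", "\r", "\f", "\v"}


def find_suspicious(source: str):
    """Return (line, column, code point, name) for each format/control char.

    Line and column are 1-based and counted in code points, not bytes.
    """
    findings = []
    for line_no, line in enumerate(source.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in ALLOWED_CONTROLS:
                continue
            # Cf covers zero-width joiners and bidi overrides; Cc covers
            # the C0/C1 control characters.
            if unicodedata.category(ch) in ("Cf", "Cc"):
                name = unicodedata.name(ch, "<unnamed>")
                findings.append((line_no, col, f"U+{ord(ch):04X}", name))
    return findings


if __name__ == "__main__":
    # A zero-width joiner hidden after the identifier `i`.
    sample = "int i\u200d = 0;\nint j = 1;\n"
    for finding in find_suspicious(sample):
        print(finding)
```

A tool like this could run as a pre-commit hook or a review bot, which keeps
the policy out of the compiler entirely.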

That's a reason to not use emoji, too, unfortunately.

On Sat, Oct 26, 2019, 11:37 JF Bastien <cxx_at_[hidden]> wrote:

>
>
> On Fri, Oct 25, 2019 at 7:07 AM Steve Downey <sdowney_at_[hidden]> wrote:
>
>> We also should consider Unicode Technical Report #36 UNICODE SECURITY
>> CONSIDERATIONS. Although my first thought was that allowing confusing
>> characters in an identifier is just a developer causing problems for
>> themselves, it is actually a problem in code review. For example, someone
>> could use punning names to disguise that a local `i` is not shadowing an
>> outer-scope `i`, and use that to inject an exploitable buffer attack. If I
>> spend some black hat time, I could probably craft something even "better".
>>
>
> IMO: The above security considerations seem like something that compiler
> diagnostics should try out non-normatively before we try to standardize
> anything.
>
> I think TR31 should be in the standard in some form. One goal is that all
> compilers end up supporting Unicode the same way. The current situation is
> pretty silly.
>
>
>> TR31 has some discussion on normalization. I think canonicalization is
>> probably the right thing to do, as anything else leads to tools lying to
>> you without intending to. It should not matter how my editor decides to
>> craft a letter with a diacritic, even if the source code takes a round trip
>> through some rich text or word processor. This is an implementation burden,
>> though. Really anything beyond the current white list is, in any case.
>>
>> We'd also probably need to clarify that this means an even stronger
>> requirement on the internal representation of source code. The input text
>> has to be converted into code points (universal character names) and all of
>> the operations we are talking about apply to that representation.
>> Representation of the code points is implementation specific.
>>
>> I'm going to end up writing this paper, aren't I?
>>
>>
>>
>> On Fri, Oct 25, 2019 at 3:10 AM Corentin <corentin.jabot_at_[hidden]>
>> wrote:
>>
>>>
>>>
>>> On Fri, 25 Oct 2019 at 08:58, Corentin <corentin.jabot_at_[hidden]> wrote:
>>>
>>>>
>>>>
>>>> On Fri, Oct 25, 2019, 02:18 Zach Laine <whatwasthataddress_at_[hidden]>
>>>> wrote:
>>>>
>>>>> Is this a real problem that is biting people right now? Are people
>>>>> using these characters in identifiers and causing great upheaval? This
>>>>> seems of the lowest possible priority to me, and not at all C++20-related.
>>>>>
>>>>
>>>>
>>>> Completely agree, with both of you.
>>>> I would be deeply unsatisfied with a solution that would:
>>>>
>>>> * Not follow TR31 recommendations
>>>> * Not address the fact that you can only have Unicode identifiers if
>>>> the compiler knows that your file id
>>>>
>>>
>>> I sent the previous mail too fast, sorry about the noise.
>>> As I was saying
>>>
>>> Completely agree, with both of you.
>>> I would be deeply unsatisfied with a solution that would:
>>>
>>> * Not follow TR31 recommendations
>>> * Not address the fact that you can only have Unicode identifiers if the
>>> compiler knows that your file is UTF encoded (same issue as u8 literals,
>>> we talked about that - P1880)
>>> * Fail to address concerns related to mangling if a normalization form
>>> is not specified
>>> * Fail to recognize that we will want to reflect on the name of these
>>> things (std::meta::name_of) and that would require reflection to be able to
>>> deal with that, both in terms of providing a UTF-8 API AND a specified
>>> normalization form
>>>
>>> All of that requires careful consideration.
>>>
>>>
>>> Corentin
>>>
>>>
>>>
>>>
>>>>
>>>>> Zach
>>>>>
>>>>> On Thu, Oct 24, 2019 at 5:25 PM Steve Downey <sdowney_at_[hidden]>
>>>>> wrote:
>>>>>
>>>>>> SG16 has an NB comment to deal with! Tom has already scheduled it for
>>>>>> Belfast. It's basically that the list of allowed code points has some
>>>>>> interesting control characters like zero width joiners and RTL modifiers.
>>>>>>
>>>>>> https://github.com/cplusplus/nbballot/issues/28
>>>>>>
>>>>>> There's also an issue that JF raised earlier:
>>>>>> https://github.com/sg16-unicode/sg16/issues/48
>>>>>> Improve support for Unicode characters in identifiers
>>>>>>
>>>>>> Relevant Unicode standard:
>>>>>> https://unicode.org/reports/tr31/ UNICODE IDENTIFIER AND PATTERN
>>>>>> SYNTAX
>>>>>>
>>>>>> Which is complicated because it allows things like identifiers
>>>>>> written in Farsi, which requires ZWJ for disambiguation, and suggests regex
>>>>>> to detect particular allowed identifiers. It's fairly dense, and I haven't
>>>>>> digested it yet, but it looks like there might be allowed ways to exclude
>>>>>> that.
>>>>>>
>>>>>> Plus tailoring would be needed because C++ disallows some characters
>>>>>> such as '$' which might otherwise be allowed. This is also discussed in
>>>>>> TR31.
>>>>>>
>>>>>>
>>>>>> My feeling on the comment is that it's not a new issue for C++20, so
>>>>>> it's not clear that it has to be fixed for C++20. I believe it should be
>>>>>> fixed, but it ought to be fixed in a principled manner, and that likely
>>>>>> means TR31.
>>>>>>
>>>>>> We would also have to discuss if emoji are allowed in identifiers.
>>>>>> TR31 does not strictly disallow them. The TonyTable shall be interesting.
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> SG16 Unicode mailing list
>>>>>> Unicode_at_[hidden]
>>>>>> http://www.open-std.org/mailman/listinfo/unicode
>>>>>>
>>>>>
>>
>

Received on 2019-10-27 01:29:40