sg16: Re: [SG16-Unicode] NL 029 : Disallow zero-width and control characters

From: JF Bastien <cxx_at_[hidden]>
Date: Sat, 26 Oct 2019 16:32:07 -0700

On Sat, Oct 26, 2019 at 4:29 PM Steve Downey <sdowney_at_[hidden]> wrote:

> Building a static checker wouldn't be that hard, and would mean the
> compiler doesn't need to have a deep understanding of the unicode database.
>
> That's a reason to not use emoji, too, unfortunately.
>

I don't particularly care which tool addresses security concerns, I'm just
saying that IMO a tool should do so before the committee considers anything.
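
To give a sense of scale for the "static checker" idea, here is a minimal
sketch, not a proposal. It assumes that flagging every code point in general
category Cf (format) or Cc (control) is a reasonable first approximation; the
function name `suspicious_code_points` is made up for illustration.

```python
import unicodedata

def suspicious_code_points(identifier):
    """Return (index, U+XXXX, name) for each format or control character."""
    findings = []
    for index, ch in enumerate(identifier):
        # Cf covers ZWJ, ZWNJ and the bidi marks; Cc covers control characters.
        if unicodedata.category(ch) in ("Cf", "Cc"):
            # Control characters have no Unicode name, so supply a placeholder.
            name = unicodedata.name(ch, "<unnamed>")
            findings.append((index, f"U+{ord(ch):04X}", name))
    return findings

# Hypothetical identifier with a zero width joiner hidden in the middle.
print(suspicious_code_points("cou\u200dnt"))
```

A blanket rule like this would also reject the scripts that legitimately need
joiners, so a real tool would need the context rules from TR31 rather than a
flat category test.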

> On Sat, Oct 26, 2019, 11:37 JF Bastien <cxx_at_[hidden]> wrote:
>
>>
>>
>> On Fri, Oct 25, 2019 at 7:07 AM Steve Downey <sdowney_at_[hidden]> wrote:
>>
>>> We also should consider Unicode Technical Report #36 UNICODE SECURITY
>>> CONSIDERATIONS. Although my first thought was that allowing confusing
>>> characters in an identifier is just a developer causing problems for
>>> themselves, it is actually a problem in code review. Using punning names to
>>> disguise that a local `i` is not shadowing an outer scope `i` and using
>>> that to inject an exploitable buffer attack, for example. If I spend some
>>> black hat time, I could probably craft something even "better".
>>>
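
To make the punning `i` concrete, here is a rough sketch of how a review tool
might notice it. Python's standard library has no script property, so this
assumes the first word of a character's Unicode name is a usable stand-in for
its script; `script_hints` and `is_mixed_script` are hypothetical helpers, not
an existing API.

```python
import unicodedata

def script_hints(identifier):
    """Crude script guess: the first word of each letter's Unicode name."""
    hints = set()
    for ch in identifier:
        if ch.isalpha():
            hints.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return hints

def is_mixed_script(identifier):
    """True when the letters of one identifier appear to span several scripts."""
    return len(script_hints(identifier)) > 1

latin_i = "i"            # U+0069 LATIN SMALL LETTER I
cyrillic_i = "\u0456"    # U+0456, renders almost identically in most fonts

print(latin_i == cyrillic_i)                           # two distinct names
print(script_hints(latin_i), script_hints(cyrillic_i))
print(is_mixed_script("s\u0456ze"))                    # Latin letters plus one Cyrillic
```

Note that an identifier made entirely of the Cyrillic letter is not mixed, so
this check alone would miss the one-letter case; catching that needs
confusables data of the kind UTS #39 provides.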
>>
>> IMO: The above security considerations seem like something that compiler
>> diagnostics should try out non-normatively before we try to standardize
>> anything.
>>
>> I think TR31 should be in the standard in some form. One goal is that all
>> compilers end up supporting Unicode the same way. The current situation is
>> pretty silly.
>>
>>
>>> TR31 has some discussion on normalization. I think canonicalization is
>>> probably the right thing to do, as anything else leads to tools lying to
>>> you without intending to. It should not matter how my editor decides to
>>> craft a letter with a diacritic, even if the source code takes a round trip
>>> through some rich text or word processor. This is an implementation burden,
>>> though. Really anything beyond the current white list is, in any case.
>>>
>>> We'd also probably need to clarify that this means an even stronger
>>> requirement on the internal representation of source code. The input text
>>> has to be converted into code points (universal character names) and all of
>>> the operations we are talking about apply to that representation.
>>> Representation of the code points is implementation specific.
>>>
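
A small illustration of the normalization point, working on code points as
described above. It assumes NFC as the canonical form purely for the example
(the thread has not settled on a form), and the helper names are invented.

```python
import unicodedata

precomposed = "caf\u00e9"    # e-acute as the single code point U+00E9
decomposed = "cafe\u0301"    # plain e followed by U+0301 COMBINING ACUTE ACCENT

def canonical(identifier):
    """Normalize to NFC so that equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", identifier)

def code_points(text):
    """Show the text as a list of U+XXXX code points."""
    return [f"U+{ord(ch):04X}" for ch in text]

print(precomposed == decomposed)                        # raw comparison
print(canonical(precomposed) == canonical(decomposed))  # after normalization
print(code_points(decomposed), code_points(canonical(decomposed)))
```

Both spellings render the same in an editor, which is why comparing the raw
code point sequences would make tools disagree with what the reader sees.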
>>> I'm going to end up writing this paper, aren't I.
>>>
>>>
>>>
>>> On Fri, Oct 25, 2019 at 3:10 AM Corentin <corentin.jabot_at_[hidden]>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Fri, 25 Oct 2019 at 08:58, Corentin <corentin.jabot_at_[hidden]>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Fri, Oct 25, 2019, 02:18 Zach Laine <whatwasthataddress_at_[hidden]>
>>>>> wrote:
>>>>>
>>>>>> Is this a real problem that is biting people right now? Are people
>>>>>> using these characters in identifiers and causing great upheaval? This
>>>>>> seems of the lowest possible priority to me, and not at all C++20-related.
>>>>>>
>>>>>
>>>>>
>>>>> Completely agree, with both of you.
>>>>> I would be deeply unsatisfied with a solution that would:
>>>>>
>>>>> * Not follow TR31 recommendations
>>>>> * Not address the fact that you can only have Unicode identifiers if
>>>>> the compiler knows that your file id
>>>>>
>>>>
>>>> I sent the previous mail too fast, sorry about the noise.
>>>> As I was saying
>>>>
>>>> Completely agree, with both of you.
>>>> I would be deeply unsatisfied with a solution that would:
>>>>
>>>> * Not follow TR31 recommendations
>>>> * Not address the fact that you can only have Unicode identifiers if
>>>> the compiler knows that your file is UTF encoded (the same issue as with
>>>> u8 literals, we talked about that - P1880)
>>>> * Fail to address concerns related to mangling if a normalization form
>>>> is not specified
>>>> * Fail to recognize that we will want to reflect on the name of these
>>>> things (std::meta::name_of) and that would require reflection to be able to
>>>> deal with that, both in terms of providing a UTF-8 API AND a specified
>>>> normalization form
>>>>
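
The encoding point in the list above can be shown in a few lines. This is only
a sketch: the source line is invented, and Latin-1 is picked as an arbitrary
example of a wrong guess about the file's encoding.

```python
# A hypothetical source line, saved to disk as UTF-8.
source_bytes = "int \u00e9t\u00e9 = 0;".encode("utf-8")

# The same bytes read under two different assumptions about the encoding.
as_utf8 = source_bytes.decode("utf-8")
as_latin1 = source_bytes.decode("latin-1")

print(as_utf8)
print(as_latin1)
print(len(as_utf8), len(as_latin1))
```

Unless the encoding is known, the identifier the compiler sees is not the one
the author typed, which is the same concern as with u8 literals.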
>>>> All of that requires careful consideration
>>>>
>>>>
>>>> Corentin
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>>> Zach
>>>>>>
>>>>>> On Thu, Oct 24, 2019 at 5:25 PM Steve Downey <sdowney_at_[hidden]>
>>>>>> wrote:
>>>>>>
>>>>>>> SG16 has an NB comment to deal with! Tom has already scheduled it
>>>>>>> for Belfast. It's basically that the list of allowed code points has some
>>>>>>> interesting control characters like zero width joiners and RTL modifiers.
>>>>>>>
>>>>>>> https://github.com/cplusplus/nbballot/issues/28
>>>>>>>
>>>>>>> There's also an issue that JF raised earlier:
>>>>>>> https://github.com/sg16-unicode/sg16/issues/48
>>>>>>> Improve support for Unicode characters in identifiers
>>>>>>>
>>>>>>> Relevant unicode standard:
>>>>>>> https://unicode.org/reports/tr31/ UNICODE IDENTIFIER AND PATTERN
>>>>>>> SYNTAX
>>>>>>>
>>>>>>> Which is complicated because it allows things like identifiers
>>>>>>> written in Farsi which requires zwj for disambiguation, and suggests regex
>>>>>>> to detect particular allowed identifiers. It's fairly dense, and I haven't
>>>>>>> digested it yet, but it looks like there might be allowed ways to exclude
>>>>>>> that.
>>>>>>>
>>>>>>> Plus tailoring would be needed because C++ disallows some characters
>>>>>>> such as '$' which might otherwise be allowed. This is also discussed in
>>>>>>> TR31.
>>>>>>>
>>>>>>>
>>>>>>> My feeling on the comment is that it's not a new issue for C++20, so
>>>>>>> it's not clear that it has to be fixed for C++20. I believe it should be
>>>>>>> fixed, but it ought to be fixed in a principled manner, and that likely
>>>>>>> means TR31.
>>>>>>>
>>>>>>> We would also have to discuss if emoji are allowed in identifiers.
>>>>>>> TR31 does not strictly disallow them. The TonyTable shall be interesting.
>>>>>>>
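
For a quick feel of what a TR31-style identifier rule accepts, Python's
`str.isidentifier()` can serve as a rough proxy. That is an assumption made for
illustration: it applies Python's own XID_Start/XID_Continue-based rules, not
anything C++ specifies, and the sample strings are made up.

```python
def looks_like_identifier(text):
    """Apply Python's UAX #31-based identifier rules to the text."""
    return text.isidentifier()

samples = {
    "accented Latin": "\u00e9t\u00e9",
    "Arabic-script letters": "\u0646\u0627\u0645",
    "contains a dollar sign": "a$b",
    "single emoji": "\U0001F600",
}

for label, text in samples.items():
    print(label, looks_like_identifier(text))
```

Here the dollar sign and the emoji are rejected while the letters from both
scripts are accepted, so allowing either of the former in C++ would be a
deliberate tailoring decision.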
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> SG16 Unicode mailing list
>>>>>>> Unicode_at_[hidden]
>>>>>>> http://www.open-std.org/mailman/listinfo/unicode
>>>>>>>
>>

Received on 2019-10-27 01:32:21