Subject: Re: Emojis in identifiers
From: Corentin Jabot (corentinjabot_at_[hidden])
Date: 2020-06-23 09:27:32
On Tue, 23 Jun 2020 at 15:20, Marcos Bento via SG16 <sg16_at_[hidden]> wrote:
> On Mon, Jun 22, 2020 at 6:24 PM Tom Honermann via SG16 <
> sg16_at_[hidden]> wrote:
>> On 6/19/20 11:07 AM, Steve Downey via SG16 wrote:
>> While we don't exclude scripts generally, by not doing script analysis,
>> the lack of ZWJ and ZWNJ makes some words in Indic scripts problematic. The
>> examples in
>> https://unicode.org/reports/tr31/#Layout_and_Format_Control_Characters are
>> relevant. Zero Width Joiner and Zero Width Non-Joiner are used in
>> Farsi, Malayalam, and Sinhala.
>> Perhaps a revision of the paper can note the possibility of such script
>> analysis as a possible future direction?
>> Wikipedia https://en.wikipedia.org/wiki/Zero-width_joiner#Examples mentions
>> Devanagari and Kannada, although it appears that recent editions of Unicode
>> may have added explicit characters in Devanagari to alleviate the problem.
>> It isn't clear to me that we have a good list of scripts that are
>> (partially) excluded by P1949R4. Is it reasonable to identify that set and
>> include it in a revision? Perhaps noting that Unicode is evolving to
>> better handle them such that they will not be excluded in the future?
>> Script recognition would also be necessary to identify the "emoji" script
>> to allow sequences, as well as expanding the repertoire of allowed
>> characters to include the currently explicitly disallowed emoji, the ones
>> that were known at the time the allowed character ranges in C++ was put
>> And finally, perhaps the next revision can acknowledge this as a
>> possibility and attempt to qualify the technical impact? I suspect JF will
>> want to poll inclusion of emoji, so the more we can do to inform EWG on the
>> consequences of doing so, the better.
> Since the telecon, like Martinho, I've been closely following the
> discussion to better understand the points raised by the authors of the
> My main concern is not Emoji per se, but other relevant scripts that might
> fall through the cracks of UAX#31. I like the approach that anyone that
> wants Emoji should try and change UAX#31, but the limitations that are
> being suggested should be clear.
That is not what is suggested.
Emojis are not just code points; they are defined by a complex grammar and
cannot simply be "allowed": they need non-trivial support. In fact, the only
sane way to support them would be to allow only the "recommended for
general interchange" emojis from a specific list.
See http://unicode.org/reports/tr51/ for details.
But again, it should be driven by an analysis of use cases for emojis in
identifiers and of the impact on future evolution (emojis are symbols). For
example, Swift found itself in a situation where some emojis are considered
identifiers and others custom operators.
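
To make the "specific list" idea concrete, here is a minimal sketch in Python
of checking an identifier against an allow-list of whole emoji sequences
rather than individual code points. The three entries and the function name
are hypothetical, chosen only for illustration; a real list would have to be
derived from the UTS #51 data, and the greedy matching is an assumption of
this sketch, not something the paper or UTS #51 prescribes.

```python
# Hypothetical allow-list of complete emoji sequences (illustrative only).
ALLOWED_EMOJI = {
    "\U0001F600",                                  # grinning face, one code point
    "\U0001F44D\U0001F3FD",                        # thumbs up + skin tone modifier
    "\U0001F468\u200D\U0001F469\u200D\U0001F467",  # family, joined with ZWJ
}


def split_emoji_identifier(text):
    """Split text into allow-listed sequences, or return None if impossible.

    Uses greedy longest-match, so a sequence is accepted only as a whole:
    a stray ZWJ or a lone modifier is rejected.
    """
    candidates = sorted(ALLOWED_EMOJI, key=len, reverse=True)
    parts = []
    pos = 0
    while pos < len(text):
        for seq in candidates:
            if text.startswith(seq, pos):
                parts.append(seq)
                pos += len(seq)
                break
        else:
            # No allow-listed sequence starts at this position.
            return None
    return parts if parts else None
```

Under these assumptions the family sequence is accepted as one unit, while
its first two code points alone (man followed by ZWJ) are rejected, which is
the kind of grammar awareness that allowing code point ranges cannot give.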
The same goes for other scripts: individual letters are allowed but ZWNJ is
not. UAX#31 lists these scenarios. A quick survey seems to show that there is
no demand for Farsi, for example, because neither tools nor people like to
deal with mixed-direction languages. Is that a chicken-and-egg problem? Maybe
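
As a rough illustration of "letters are allowed but ZWNJ is not", the sketch
below approximates identifier characters by Unicode general category. That
approximation is an assumption made to keep the example short; it is not the
XID_Start/XID_Continue definition the paper relies on. The function name and
the sample word (a Persian word that I believe is conventionally written with
a ZWNJ) are likewise only illustrative.

```python
import unicodedata

# Simplified stand-ins for identifier start/continue sets (assumption:
# general categories only, not the real XID_Start/XID_Continue properties).
START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
CONTINUE = START | {"Mn", "Mc", "Nd", "Pc"}


def rejected_code_points(word):
    """Return (index, code point) pairs that the simplified rule rejects."""
    bad = []
    for i, ch in enumerate(word):
        allowed = START if i == 0 else CONTINUE
        if unicodedata.category(ch) not in allowed:
            bad.append((i, "U+%04X" % ord(ch)))
    return bad


# Persian letters with a ZWNJ (U+200C, general category Cf) after the
# second letter.
with_zwnj = "\u0645\u06CC\u200C\u062E\u0648\u0627\u0647\u0645"
without_zwnj = with_zwnj.replace("\u200C", "")
```

With this rule every letter passes on its own, and only the ZWNJ at index 2
is reported, so the word is excluded purely because of the format character.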
Other partially supported scripts are those which use a virama
https://en.wikipedia.org/wiki/Virama , which is not
My understanding is that it is not customary for Brahmic scripts to be used
in programming languages, because of poor IDE or input support, or cultural
I would definitely love to see a proposal for this, but it should
ultimately be driven by people familiar with these scripts and who
understand the demand for them.
And again, we need to adopt the proposal as presented so that we start from
a clean slate on which we can build iteratively as/if demand arises.
> Although, I still don't think the paper should be accepted *as-is*, but I
> do agree that the clarifications proposed above by Tom would go a long way
> to convince me (and maybe others?) that UAX#31 is the only sensible way
> to go.