On Tue, 23 Jun 2020 at 15:20, Marcos Bento via SG16 <sg16@lists.isocpp.org> wrote:
On Mon, Jun 22, 2020 at 6:24 PM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:
On 6/19/20 11:07 AM, Steve Downey via SG16 wrote:
While we don't exclude scripts generally, not doing script analysis means ZWJ and ZWNJ are unavailable, which makes some words in Indic scripts problematic. The examples in https://unicode.org/reports/tr31/#Layout_and_Format_Control_Characters are relevant. Zero Width Joiner and Zero Width Non-Joiner are used in Farsi, Malayalam, and Sinhala.
Perhaps a revision of the paper can note the possibility of such script analysis as a possible future direction?

Wikipedia https://en.wikipedia.org/wiki/Zero-width_joiner#Examples mentions Devanagari and Kannada, although it appears that recent editions of Unicode may have added explicit characters in Devanagari to alleviate the problem.

It isn't clear to me that we have a good list of scripts that are (partially) excluded by P1949R4.  Is it reasonable to identify that set and include it in a revision?  Perhaps noting that Unicode is evolving to better handle them such that they will not be excluded in the future?


Script recognition would also be necessary to identify the "emoji" script to allow sequences, as well as to expand the repertoire of allowed characters to include the currently explicitly disallowed emoji, that is, the ones that were known at the time the allowed character ranges in C++ were put together.

And finally, perhaps the next revision can acknowledge this as a possibility and attempt to qualify the technical impact?  I suspect JF will want to poll inclusion of emoji, so the more we can do to inform EWG on the consequences of doing so, the better.

Tom.

Since the telecon, like Martinho, I've been closely following the discussion to better understand the points raised by the authors of the paper.

My main concern is not Emoji per se, but other relevant scripts that might fall through the cracks of UAX#31. I like the approach that anyone that wants Emoji should try and change UAX#31, but the limitations that are being suggested should be clear.

That is not what is suggested.

Emojis are not just code points; they follow a complex grammar, so they cannot simply be "allowed". Supporting them requires non-trivial work. In fact, the only sane way to support them would be to allow only the "recommended for general interchange" emojis from a specific list.
See http://unicode.org/reports/tr51/ for details.
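To make the "not just code points" remark concrete, here is a minimal Python sketch using only the standard `unicodedata` module. The family emoji is just an illustrative choice of mine, not an example taken from the paper, and `describe` is a hypothetical helper name.

```python
import unicodedata

# Illustrative example (an assumption, not from P1949): the
# "family: man, woman, girl" emoji is a ZWJ sequence of five code points.
FAMILY = "\U0001F468\u200D\U0001F469\u200D\U0001F467"


def describe(text):
    """Return (code point, general category, name) for each code point."""
    return [
        (
            f"U+{ord(ch):04X}",
            unicodedata.category(ch),
            unicodedata.name(ch, "<unnamed>"),
        )
        for ch in text
    ]


for row in describe(FAMILY):
    print(row)
```

What renders as a single glyph is three symbol code points (category So) glued together by two ZERO WIDTH JOINERs (category Cf), so allowing a few code point ranges is not enough to accept or reject it as a unit.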

But again, it should be driven by an analysis of use cases for emojis in identifiers and of the impact on future evolution (emojis are symbols). For example, Swift found itself in a situation where some emojis are considered identifiers and others custom operators.

The same goes for other scripts: individual letters are allowed, but ZWNJ is not.
UAX#31 lists these scenarios. A quick survey seems to show that there is no demand for Farsi, for example, because neither tools nor people like to deal with mixed-direction text. Is that a chicken-and-egg problem? Maybe.
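As a rough illustration of "letters are allowed but ZWNJ is not", here is a Python sketch. It is only an approximation: it stands in for XID_Start / XID_Continue using general categories, ignores the Other_ID_* adjustments and NFKC closure, and is not how P1949 or any compiler implements the check. The function name and the Persian sample word are my own assumptions.

```python
import unicodedata

# Simplified stand-in for XID_Start / XID_Continue (an assumption for
# illustration; the real properties are derived with extra adjustments).
START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
CONTINUE = START | {"Mn", "Mc", "Nd", "Pc"}


def is_identifier_approx(name):
    """Approximate a UAX#31-style identifier check via general categories."""
    if not name:
        return False
    first, rest = name[0], name[1:]
    if first != "_" and unicodedata.category(first) not in START:
        return False
    return all(unicodedata.category(ch) in CONTINUE for ch in rest)


# A Persian word normally written with ZWNJ (U+200C) after the prefix.
WITH_ZWNJ = "\u0645\u06CC\u200C\u062E\u0648\u0627\u0647\u0645"
WITHOUT_ZWNJ = WITH_ZWNJ.replace("\u200C", "")

print(is_identifier_approx(WITHOUT_ZWNJ))  # True: every code point is a letter (Lo)
print(is_identifier_approx(WITH_ZWNJ))     # False: U+200C has category Cf
```

The letters themselves pass, so the script is not excluded as such; it is the correctly spelled word that gets rejected.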

Other partially supported scripts are those which use a virama (https://en.wikipedia.org/wiki/Virama), which is not semantically meaningful.

My understanding is that it is not customary for Brahmic scripts to be used in programming languages, whether because of poor IDE or input support or for cultural reasons.
I would definitely love to see a proposal for this, but it should ultimately be driven by people familiar with these scripts and who understand the demand for them.

And again, we need to adopt the proposal as presented so that we start from a clean slate on which we can build iteratively as/if demand arises.

 

I still don't think the paper should be accepted as-is, but I do agree that the clarifications proposed above by Tom would go a long way toward convincing me (and maybe others?) that UAX#31 is the only sensible way to go.

-Marcos
--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16