Personally, I'm all for fixing that, though probably with some restrictions:

1/ Let's not change the character set acceptable in include directives. Mapping non-ASCII characters to file names is a portability nightmare.

2/ I am not comfortable with allowing that in module names (in part for the same reasons), but I don't think it should be restricted at the language level either.

3/ TR31 is a very good start but probably too lenient about mixed-script identifiers (see for example http://perl11.org/blog/unicode-identifiers.html ). However, we should defer to the Unicode TR as much as possible rather than pretend we have a better understanding of the issue than they do.
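
To make the mixed-script concern in 3/ concrete, here is a rough sketch in Python (chosen only because its standard library exposes some Unicode data). It assumes that `str.isidentifier()`, which is built on XID_Start / XID_Continue, is a close enough stand-in for the TR31 default identifier rule. The script detection is a heuristic based on character names, not the real Unicode Script property, and the helper names are made up for illustration.

```python
import unicodedata


def is_default_identifier(name):
    # Python's str.isidentifier() is built on XID_Start / XID_Continue
    # (plus "_" as a start character), so it only approximates TR31.
    return name.isidentifier()


def rough_scripts(name):
    # Heuristic, NOT the real Script property: use the first word of each
    # letter's Unicode character name ("LATIN", "CYRILLIC", "GREEK", ...).
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in name
        if ch.isalpha()
    }


def is_mixed_script(name):
    # An identifier drawing letters from more than one script is suspicious.
    return len(rough_scripts(name)) > 1


# Second letter is CYRILLIC SMALL LETTER ES, which renders like Latin "c".
spoof = "s\u0441ope"
print(is_default_identifier(spoof), sorted(rough_scripts(spoof)))
```

The spoofed name passes the identifier check while mixing Latin and Cyrillic letters, which is the kind of case a stricter profile would presumably reject. A real implementation would use the Script and Script_Extensions properties (for example through ICU) rather than character names.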

Overall, it's something that I wish were implemented but that I think people should not use except as a novelty.

Implementations would have to respect TR31, which implies using ICU until we actually ship Unicode support.

I'm afraid such changes will make it harder for developers to work on systems that do not support Unicode, which is a good reason to mandate it, especially given it would have no bearing on the platforms the code can run on :)

As for module names, I don't think we get to choose.

There is some file with a given name, which is a bag of bytes and which we don't get to rename.

There is some module with some name, which is in the basic character set (or, with your proposal at hand, a Unicode identifier), and which we don't get to rename.

The two must match _somehow_

So either we limit the identifiers to ASCII, or we "enforce" (by way of the TR) that file names must be valid UTF-8-encoded Unicode matching the module name.

Unfortunately, not all file systems will support that.
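
As a rough illustration of what "matching" could mean for a tool, here is a sketch in Python. It treats the file name as raw bytes, requires the stem to be strictly valid UTF-8, and compares it to the module name after NFC normalization. The `.cppm` extension, the choice of NFC, and the function name are all assumptions made for the example, not something any TR specifies.

```python
import unicodedata


def filename_matches_module(raw_name, module_name, ext=b".cppm"):
    # raw_name is the file name exactly as stored: a bag of bytes.
    if not raw_name.endswith(ext):
        return False
    stem = raw_name[: len(raw_name) - len(ext)]
    try:
        decoded = stem.decode("utf-8")  # strict: invalid sequences raise
    except UnicodeDecodeError:
        return False
    # Assumption: compare under NFC so that precomposed and decomposed
    # spellings of the same name are treated as equal.
    return (unicodedata.normalize("NFC", decoded)
            == unicodedata.normalize("NFC", module_name))


# Same module name, three ways a file system might hand the bytes back.
precomposed = "caf\u00e9.cppm".encode("utf-8")
decomposed = "cafe\u0301.cppm".encode("utf-8")
latin1 = b"caf\xe9.cppm"
for raw in (precomposed, decomposed, latin1):
    print(raw, filename_matches_module(raw, "caf\u00e9"))
```

The Latin-1 bytes are not valid UTF-8, so that file can never match under this rule. The decomposed form matches only because of the normalization step, which is one of the places where file systems differ.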

The limitation has more to do with file systems than with C++, and we have virtually no control besides restricting the set of file systems that are able to store C++. Which I am all for, but I don't think people will go for that.

Alternatively, we don't try to impose any restrictions, and people will ultimately figure out what does and doesn't work, or tools will set their own restrictions. That doesn't help the ecosystem at all, but it's basically what we have always done.

On Fri, 10 May 2019 at 18:43, JF Bastien <cxx@jfbastien.com> wrote:

Hi C++ પกٱƈѻɗﻉ ḟäṅṡ 👋!

The current list of valid identifier characters is pretty silly (see [lex.name] 5.10 Identifiers or cppreference summary). It allows characters such as zero-width joiner and zero-width space among a few silly things (see how bad this can get, h/t Richard Kogelnig).
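
A small sketch of why the zero-width characters mentioned above are a problem, written in Python purely for illustration. It lists the format-category (Cf) code points inside a name. The function name is hypothetical, and the check says nothing about what a C++ compiler actually accepts.

```python
import unicodedata


def invisible_format_chars(identifier):
    # Collect (index, character name) for every Cf (format) character.
    return [
        (i, unicodedata.name(ch, hex(ord(ch))))
        for i, ch in enumerate(identifier)
        if unicodedata.category(ch) == "Cf"
    ]


plain = "counter"
sneaky = "coun\u200bter"  # contains ZERO WIDTH SPACE
joined = "coun\u200dter"  # contains ZERO WIDTH JOINER

for name in (plain, sneaky, joined):
    # The three names typically render identically but are distinct strings.
    print(len(name), invisible_format_chars(name))
```

Names that look the same on screen but compare unequal are exactly what makes these code points a poor fit for identifiers.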

I asked where it came from, and IIUC John looked at Unicode and cobbled the list of valid ranges manually. That ain't great.

Is this group interested in fixing things?

There's already an existing standard for this, maybe it's a thing we can adopt as-is or use as a starting point:
https://unicode.org/reports/tr31/

Further, the tooling group was just talking about module names. I think we should allow any valid identifier name as module name, and look at how this could map to file names for a tooling TR's purpose.

Thanks,

JF
_______________________________________________
SG16 Unicode mailing list
Unicode@isocpp.open-std.org
http://www.open-std.org/mailman/listinfo/unicode