sg16: Re: [SG16-Unicode] Identifiers in C++

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Fri, 10 May 2019 19:35:23 +0200

Personally, I'm all for fixing that, though probably with some restrictions:

1/ Let's not change the character set acceptable in include directives.
Mapping non-ASCII names to filenames is a portability nightmare.
2/ I am not comfortable with allowing that in module names (in part for the
same reasons), but I don't think it should be restricted at the language
level either.
3/ TR31 is a very good start but probably too lenient about mixed-script
identifiers (see for example http://perl11.org/blog/unicode-identifiers.html ).
However, we should defer to the Unicode TR as much as possible rather than
pretend we have a better understanding of the issue than they do.
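
To make the mixed-script concern in 3/ concrete, here is a deliberately tiny
sketch of the kind of check I have in mind. Everything in it is an assumption
for illustration: the names script_of and is_mixed_script are made up, and the
hard-coded ranges are rough stand-ins for the real Unicode Script property,
which an actual implementation would take from the Unicode data (for example
through ICU) rather than from a hand-written table.

```cpp
#include <string_view>

// Toy classification; the ranges below only approximate the real
// Unicode Script property and are NOT taken from any standard table.
enum class Script { Common, Latin, Greek, Cyrillic, Other };

constexpr Script script_of(char32_t c) {
    if ((c >= U'0' && c <= U'9') || c == U'_') return Script::Common;
    if ((c >= U'A' && c <= U'Z') || (c >= U'a' && c <= U'z')) return Script::Latin;
    if (c >= 0x00C0 && c <= 0x024F) return Script::Latin;     // approximation
    if (c >= 0x0370 && c <= 0x03FF) return Script::Greek;     // approximation
    if (c >= 0x0400 && c <= 0x04FF) return Script::Cyrillic;  // approximation
    return Script::Other;  // everything else is lumped together in this sketch
}

// Returns true when the identifier draws letters from more than one script.
// Digits and '_' (Common) are treated as compatible with every script.
constexpr bool is_mixed_script(std::u32string_view id) {
    Script seen = Script::Common;
    for (char32_t c : id) {
        const Script s = script_of(c);
        if (s == Script::Common) continue;
        if (seen == Script::Common) {
            seen = s;              // first "real" script encountered
        } else if (s != seen) {
            return true;           // a second, different script
        }
    }
    return false;
}
```

With this, is_mixed_script(U"p\u0430ypal") flags the Cyrillic letter hiding in
an otherwise Latin name, while a purely Greek or purely Latin identifier
passes. The actual policy (which scripts may legitimately be combined) is
exactly the part we should take from the Unicode TRs rather than invent.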

Overall, it's something that I wish was implemented but that I think people
should not use outside of novelty.

Implementations would have to respect TR31, which implies using ICU until
we actually ship Unicode support.

I'm afraid such changes will make it harder for developers to work on
systems that do not support Unicode, which is a good reason to mandate it,
especially given it would have no bearing on the platforms the code can run
on :)

As for module names, I don't think we get to choose.

There is some file, with a given name which is a bag of bytes and which we
don't get to rename
There is some module with some name drawn from the basic character set (or,
with your proposal at hand, a Unicode identifier), which we don't get to
rename.
The two must match _somehow_

So either we limit the identifiers to ASCII or we "enforce" (by way of
the TR) that filenames must be valid UTF-8-encoded Unicode matching the
module name.
Unfortunately, not all file systems will support that.
The limitation is more related to file systems than it is related to C++,
and we have virtually no control besides restricting the set of file systems
that are able to store C++. Which I am all for, but I don't think people
will go for that.
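
As a rough sketch of what the "filenames are the UTF-8 encoding of the module
name" option could look like: the function names and the ".cppm" suffix below
are hypothetical, chosen only for illustration, and nothing here comes from an
actual proposal or tooling TR. The sketch also assumes the input is a sequence
of valid Unicode scalar values and ignores normalization entirely, which a
real mapping could not afford to do.

```cpp
#include <string>
#include <string_view>

// Appends the UTF-8 encoding of one code point to `out`.
// Assumes `c` is a valid Unicode scalar value (no surrogate or range checks).
inline void append_utf8(std::string& out, char32_t c) {
    if (c < 0x80) {
        out += static_cast<char>(c);
    } else if (c < 0x800) {
        out += static_cast<char>(0xC0 | (c >> 6));
        out += static_cast<char>(0x80 | (c & 0x3F));
    } else if (c < 0x10000) {
        out += static_cast<char>(0xE0 | (c >> 12));
        out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (c & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (c >> 18));
        out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (c & 0x3F));
    }
}

// Hypothetical mapping: module name -> UTF-8 bytes plus a ".cppm" suffix.
inline std::string module_file_name(std::u32string_view module_name) {
    std::string bytes;
    for (char32_t c : module_name) append_utf8(bytes, c);
    return bytes + ".cppm";
}

// True when the name stays within ASCII, i.e. the "limit to ASCII" option.
inline bool is_ascii_name(std::u32string_view module_name) {
    for (char32_t c : module_name) {
        if (c >= 0x80) return false;
    }
    return true;
}
```

A tool could use is_ascii_name to warn about names that may not survive every
file system, and module_file_name to compute the bytes it expects on disk.
Whether the file system stores or compares those bytes unchanged is precisely
the part we do not control.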

Alternatively, we don't try to put any restrictions in place, and people
will ultimately realize what does and doesn't work, or tools will set their
own restrictions. That doesn't help the ecosystem at all, but it's basically
what we have always done.

On Fri, 10 May 2019 at 18:43, JF Bastien <cxx_at_[hidden]> wrote:

> Hi C++ પกٱƈѻɗﻉ ḟäṅṡ 👋!
>
> The current list of valid identifier characters is pretty silly (see
> [lex.name] 5.10 Identifiers or cppreference summary
> <https://en.cppreference.com/w/cpp/language/identifiers>). It allows
> characters such as zero-width joiner and zero-width space among a few silly
> things (see how bad this can get <https://godbolt.org/z/sBJk1k>,
> h/t Richard Kogelnig).
>
> I asked where it came from, and IIUC John looked at Unicode and cobbled
> the list of valid ranges manually. That ain't great.
>
> Is this group interested in fixing things?
>
> There's already an existing standard for this, maybe it's a thing we can
> adopt as-is or use as a starting point:
>
> https://unicode.org/reports/tr31/
>
>
> Further, the tooling group was just talking about module names. I think we
> should allow any valid identifier name as module name, and look at how this
> could map to file names for a tooling TR's purpose.
>
> Thanks,
>
> JF
> _______________________________________________
> SG16 Unicode mailing list
> Unicode_at_[hidden]
> http://www.open-std.org/mailman/listinfo/unicode
>

Received on 2019-05-10 19:35:36