Date: Fri, 10 May 2019 19:35:43 -0400
Notionally, this is already a problem. The standard says that we translate
to universal character names really early.
As a practical matter, if you use ones outside the basic set you are just
making life difficult for everyone. Normalization just makes everything
worse, too, if you do anything 'interesting'. Some filesystems don't
actually even require well-formed code unit sequences.
It's probably something to mention in the TR.
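
To make the normalization point concrete, here is a minimal sketch. The name "café" and the two helper functions are made up for illustration, and it assumes a filesystem that compares names byte for byte, which not every filesystem does. The same visible name has two different UTF-8 encodings, so a module name and a file name can look identical and still fail to match unless both sides agree on a normalization form.

```python
import unicodedata

# Two spellings of the same user-perceived name (hypothetical example).
composed = "caf\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT


def same_bytes(a: str, b: str) -> bool:
    """Compare UTF-8 encodings byte for byte (assumed filesystem behaviour)."""
    return a.encode("utf-8") == b.encode("utf-8")


def same_after_nfc(a: str, b: str) -> bool:
    """Compare after normalizing both strings to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(same_bytes(composed, decomposed))      # False
print(same_after_nfc(composed, decomposed))  # True
```

Whether a tool should normalize before comparing, and to which form, is exactly the kind of choice the TR would have to spell out.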
On Fri, May 10, 2019, 13:35 Corentin Jabot <corentinjabot_at_[hidden]> wrote:
> I'm all for fixing that personally with restrictions probably
>
> 1/ Let's not change the character set acceptable in include directives.
> Mapping non-ascii to filename is a portability nightmare
> 2/ I am not comfortable with allowing that in module names (in part for
> the same reasons), but I don't think it should be restricted at the
> language level either
> 3/ TR31 is a very good start but probably too lenient about mixed script
> identifiers (see for example
> http://perl11.org/blog/unicode-identifiers.html ) - however we should
> defer to the Unicode TR as much as possible rather than pretend we have a
> better understanding of the issue than they do
>
> Overall, it's something that I wish was implemented but that I think
> people should not use outside of novelty.
>
> Implementations would have to respect TR31, which implies using ICU until
> we actually ship Unicode support.
>
> I'm afraid such changes will make it harder for developers to work on
> systems that do not support Unicode, which is a good reason to mandate it,
> especially given it would have no bearing on the platforms the code can run
> on :)
>
>
>
> As for module names, I don't think we get to choose
>
> There is some file, with a given name which is a bag of bytes and which we
> don't get to rename
> There is some module with some name which is the basic character set (or
> with your proposal at hand a Unicode identifier), which we don't get to
> rename.
> The two must match _somehow_
>
> So either we limit the identifiers to ASCII or we "enforce" (by the way of
> the TR) that filenames must be valid utf8-encoded Unicode matching the
> module name.
> Unfortunately, not all file-systems will support that.
> The limitation is more related to file systems than it is related to C++
> and we have virtually no control besides restricting the set of filesystems
> that are able to store
> C++. Which I am all for but I don't think people will go for that.
>
> Alternatively we don't try to put any restrictions and people will
> ultimately realize what does and what doesn't work or let tools set their own
> restrictions. Which doesn't help
> the ecosystem at all - but it's basically what we have always done
>
> On Fri, 10 May 2019 at 18:43, JF Bastien <cxx_at_[hidden]> wrote:
>
>> Hi C++ પกٱƈѻɗﻉ ḟäṅṡ 👋!
>>
>> The current list of valid identifier characters is pretty silly (see [lex.name]
>> 5.10 Identifiers or cppreference summary
>> <https://en.cppreference.com/w/cpp/language/identifiers>). It allows
>> characters such as zero-width joiner and zero-width space among a few silly
>> things (see how bad this can get <https://godbolt.org/z/sBJk1k>,
>> h/t Richard Kogelnig).
>>
>> I asked where it came from, and IIUC John looked at Unicode and cobbled
>> the list of valid ranges manually. That ain't great.
>>
>> Is this group interested in fixing things?
>>
>> There's already an existing standard for this, maybe it's a thing we can
>> adopt as-is or use as a starting point:
>>
>> https://unicode.org/reports/tr31/
>>
>>
>> Further, the tooling group was just talking about module names. I think
>> we should allow any valid identifier name as module name, and look at how
>> this could map to file names for a tooling TR's purpose.
>>
>> Thanks,
>>
>> JF
>> _______________________________________________
>> SG16 Unicode mailing list
>> Unicode_at_[hidden]
>> http://www.open-std.org/mailman/listinfo/unicode
>>
> _______________________________________________
> SG16 Unicode mailing list
> Unicode_at_[hidden]
> http://www.open-std.org/mailman/listinfo/unicode
>
Received on 2019-05-11 01:35:59