Date: Thu, 11 Aug 2022 13:09:49 +0000
Hi Dennis,
Thank you for raising this.
In my view, it is *highly* desirable for the C++ standard to stick as closely as possible to UAX #31. "What Unicode characters are suitable for use in identifiers" is an appropriate question to be dealt with by experts in text and abstract character semantics, i.e. the Unicode Consortium. It's highly beneficial for C++ and the wider programming language ecosystem to have a common, shared understanding of what well-formed identifiers look like, to enable cross-language interoperability and common understanding, and I have been pleased to see several programming languages converging on UAX #31 conformance in recent years.
Therefore my recommendation to you is to engage closely with the Unicode Technical Committee (and Mark Davis in particular) to update or improve UAX #31. When a new revision of UAX #31 is available, I will be very happy to do whatever work is needed to bring the C++ IS up to date with it. In the meantime, I am sure that implementers will be interested in adding compiler extensions to assist backward compatibility.
Best regards,
Peter
-----Original Message-----
From: SG16 <sg16-bounces_at_lists.isocpp.org> On Behalf Of Ogiermann, Dennis via SG16
Sent: 11 August 2022 12:49
To: sg16_at_[hidden]cpp.org
Cc: Ogiermann, Dennis <dennis.ogiermann_at_[hidden]-bochum.de>
Subject: [SG16] P1949 negative impact on math heavy code
Dear SG16 members,
I have been referred to this mailing list to share some concerns regarding the adoption of P1949. Some preliminary discussion can be found at a Github issue that I have opened https://urldefense.com/v3/__https://github.com/sg16-unicode/sg16/issues/77__;!!EHscmS1ygiU1lA!Er45KwbhhoVwqL2Y9MxZrma-WcePTHIiHXHdAp_ze9sGuvRchdQ0ysW-8jJGALVOBpllBNMDWN9A-1E$ . Let me summarize this so far and respond here.
Initial Post:
In the scientific computing and numerical community, the use of Unicode identifiers has gained some popularity in recent years, because it helps keep the code close to the theory. Take for example this piece of math TeX:
$$
\Delta t_{n+1} = \varepsilon_{n+1}^{\beta_1/k} \cdot \varepsilon_{n}^{\beta_2/k} \cdot \varepsilon_{n-1}^{\beta_3/k} \cdot \Delta t_{n}
$$
Pre-P1949, we were able to implement this as
```
const auto Δtₙ₊₁ = std::pow(εₙ₊₁, β₁/k) * std::pow(εₙ, β₂/k) * std::pow(εₙ₋₁, β₃/k) * Δtₙ;
```
which reads like the formula, thus reducing mental overhead when writing code. I know this is a simple and short example where it may not be directly clear that this is helpful, but once formulas get longer and the implementation spans on the order of hundreds of lines of code, then this can really simplify development, maintenance and debugging a lot. https://urldefense.com/v3/__https://godbolt.org/z/nG7o5K141__;!!EHscmS1ygiU1lA!Er45KwbhhoVwqL2Y9MxZrma-WcePTHIiHXHdAp_ze9sGuvRchdQ0ysW-8jJGALVOBpllBNMDPwAOf_g$ is an example - I also want to note for completeness that MSVC seems not to have supported such constructs in the first place.
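For comparison, here is a minimal sketch of the same update written with ASCII-only identifiers. The function name `next_step_size` and the parameter names are hypothetical and chosen only for this illustration; the arithmetic is the formula above, nothing more.

```cpp
#include <cmath>

// Hypothetical ASCII-only spelling of the step size update above:
//   dt_{n+1} = eps_{n+1}^(beta1/k) * eps_n^(beta2/k) * eps_{n-1}^(beta3/k) * dt_n
// All names are assumptions made for this sketch.
double next_step_size(double dt_n,
                      double eps_np1, double eps_n, double eps_nm1,
                      double beta1, double beta2, double beta3,
                      double k) {
    return std::pow(eps_np1, beta1 / k)
         * std::pow(eps_n,   beta2 / k)
         * std::pow(eps_nm1, beta3 / k)
         * dt_n;
}
```

The suffixes `_np1`, `_n` and `_nm1` stand in for the subscripts n+1, n and n-1, which is exactly the kind of manual translation the Unicode spelling avoids.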
I appreciate the time to work on the standard, but I think this proposal is the exact opposite of what we developers in the numerical community need. [...] I also noticed that super-/subscript letters and some super-/subscript symbols seem to be still valid, causing some weird inconsistency. [...]
The current direction also raises more questions from my side:
1. Are there plans to restrict allowed characters further, especially the ones used in the computational sciences/basic math notation?
2. Is there the possibility to bring back the numerical super-/subscripts?
3. Related to this, and I know that the Unicode Consortium does not want to hear this, but since this is really useful for the numerical community: is there any possibility to at least allow the very basic standard letters (Greek and Latin) in super-/subscripts, either directly via Unicode or through some extra mechanism in the language/editors? Yes, I read the opinions on this (and I am absolutely not a fan of them, still less do I agree), but from a user's perspective, having just some characters available is really weird.
Jens Maurer and Tom Honermann have answered some of my initial questions in the linked GitHub issue, which I also quickly summarize:
* Identifier definition has been outsourced to the Unicode Consortium (Unicode Standard Annex 31 defines a recommendation)
* SG16 is at least aware of the problem mentioned above
* A member of the Unicode Consortium is currently working on improvements to Unicode to support source code as text
* This mailing list is the correct medium to discuss issues concerning SG16 (in contrast to the GitHub org and associated repos)
Now, my understanding of Annex 31 is that it gives only recommendations for a formal grammar to describe identifiers, which is fine and, I think, also a necessary step. Also, from my understanding, P1949R7 is the corresponding paper describing how Annex 31 is adopted into the C++ standard. Trying to throw in something constructive, I think it might make sense to extend the alphabet for either XID_Start or XID_Continue with 2080..208E and 2090..209C. Also, including 00B2,00B3,00B9,2070,2071,2073..207E might make sense, but this may require more discussion as to whether such characters might be blocked for other purposes. However, my thoughts might be a bit short-sighted from a language evolution perspective, because this is merely a hotfix for what is helpful for writing math-heavy code (e.g. 2090..209C is an incomplete character range).
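To make the suggested ranges concrete, here is a minimal sketch of a predicate over exactly the code points listed above. The function names are hypothetical; they are not part of P1949, UAX #31 or any library, and they say nothing about actual XID_Start or XID_Continue membership.

```cpp
// Hypothetical helpers encoding only the code points suggested above.
// They do not model XID_Continue as defined by UAX #31.

// 2080..208E: subscript digits 0-9 and the subscript signs + - = ( )
// 2090..209C: Latin subscript small letters (an incomplete alphabet)
constexpr bool is_suggested_subscript(char32_t c) {
    return (c >= 0x2080 && c <= 0x208E)
        || (c >= 0x2090 && c <= 0x209C);
}

// 00B2, 00B3, 00B9, 2070, 2071 and 2073..207E, exactly as listed above
constexpr bool is_suggested_superscript(char32_t c) {
    return c == 0x00B2 || c == 0x00B3 || c == 0x00B9
        || c == 0x2070 || c == 0x2071
        || (c >= 0x2073 && c <= 0x207E);
}

static_assert(is_suggested_subscript(0x2081), "U+2081 SUBSCRIPT ONE is in the list");
static_assert(!is_suggested_subscript(0x208F), "U+208F is outside both subscript ranges");
static_assert(is_suggested_superscript(0x00B2), "U+00B2 SUPERSCRIPT TWO is in the list");
```

A lexer extension could consult such a predicate in addition to the standard identifier properties, but whether that is desirable is precisely the open question.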
Thank you for your time and best regards,
Dennis
--
SG16 mailing list
SG16_at_lists.isocpp.org
https://urldefense.com/v3/__https://lists.isocpp.org/mailman/listinfo.cgi/sg16__;!!EHscmS1ygiU1lA!Er45KwbhhoVwqL2Y9MxZrma-WcePTHIiHXHdAp_ze9sGuvRchdQ0ysW-8jJGALVOBpllBNMD95LCr4U$
Received on 2022-08-11 13:09:55