sg16: Re: [SG16] P2071 - Named universal character escapes

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Wed, 15 Sep 2021 21:49:35 +0200

On Wed, Sep 15, 2021 at 9:39 PM Tom Honermann via SG16 <
sg16_at_[hidden]> wrote:

> I attached a draft R1 with changes I had previously worked on. I briefly
> looked at it and I don't think I left it in a half-baked state, but it
> would be worth diffing it against the R0 revision with a reasonable HTML
> diffing tool to make sure. The "Changes since P2071R0" section suggests I
> addressed the issues raised in Prague.
>
> The todo list I have includes:
>
> - Add discussion regarding the use of \N{...} in identifiers.
> - Add a proposal option to allow use of \N{...} in identifiers.
> - Rebase wording on the current WD; particularly due to the adoption
> of P2029.
> - Implement the proposal.
>
> Richard Smith had requested that \N{...} be allowed in identifiers for
> consistency with \u and \U. We should, of course, just acknowledge that
> Richard is always right and do that :)
>

I do, however, question its usefulness :)
It doesn't cost much in terms of wording/implementation.
I could probably whip up a Clang prototype fairly rapidly if that's useful.

I think by some heroics we will get Jens's paper approved at October's
plenary and can rebase the wording on top of it then.

>
> Wording changes may additionally be needed for P2314. Maybe for one or
> more of Corentin's recent papers as well.
>
> Tom.
>
> On 9/15/21 3:21 PM, Steve Downey wrote:
>
> https://github.com/cplusplus/papers/issues/798#issuecomment-585750666 has
> notes from JF
>
> EWG Prague Thursday afternoon:
>
> We’re interested in supporting named universal character escapes.
> SF F N A SA
> 14 5 0 0 0
>
> This should further support aliases.
> SF F N A SA
> 18 2 1 0 0
>
> It should further be case insensitive.
> SF F N A SA
> 0 6 6 9 2
>
> It should further support UAX44-LM2 with arbitrary spaces and dashes.
> SF F N A SA
> 1 4 5 8 5
>
> The paper is *not* tentatively ready yet. We want to see the updated
> paper before marking it as tentatively ready.
>
>
> I missed Prague, but this might be enough, if you don't have any more
> detailed notes. I can check the wiki as well.
>
> On Wed, Sep 15, 2021 at 2:46 PM Tom Honermann <tom_at_[hidden]> wrote:
>
>> On 9/15/21 2:31 PM, Steve Downey wrote:
>> > If I am reading the github correctly, EWG would like to see some
>> > revision before picking it up again. Is there something I can
>> > help with? This looks like it's really close and desired and quite
>> > possible for 23?
>>
>> Yes. I recall there not being much to do, but I need to find my list of
>> what that is. I would very much appreciate the help. I'll hunt down that
>> list and try to get it to you later today or tomorrow.
>>
>> Tom.
>>
>>
> Document Number: D2071R1 *Draft*
> Date: 2020-06-04
> Audience: SG16, EWG
> Reply-to: Tom Honermann <tom_at_[hidden]>
> R. Martinho Fernandes <rmf_at_[hidden]>
> Peter Bindels <peterbindels_at_[hidden]>
> Corentin Jabot <corentin.jabot_at_[hidden]>
>
> Named universal character escapes
>
> - Introduction
> - Changes since P2071R0
> - History
> - Motivation
> - Design considerations
>   - Syntax
>   - Name sources
>   - Name matching
>   - Portable names
>   - Existing practice
>   - Backward compatibility
>   - Implementor impact
>   - Design alternatives
> - Proposal
> - Possible future extensions
> - Implementation experience
> - Acknowledgements
> - References
> - Core wording
>
> Introduction
>
> This proposal continues the effort R. Martinho Fernandes initiated that
> culminated in P1097R2 <https://wg21.link/p1097r2> [P1097R2]. This proposal
> does not deviate from the general design intent in Fernandes' work, but
> does deviate in a few details. See the History and Proposal sections for
> more information.
>
> C++ programmers have been able to portably use characters outside of the
> basic source character set in character and string literals since the
> introduction of *universal-character-name*
> <http://eel.is/c++draft/lex.charset#nt:universal-character-name>s in
> C++11. For example:
>
> U'\u0100' // UTF-32 character literal with U+0100 {LATIN CAPITAL LETTER A WITH MACRON}
> u8"\u0100\u0300" // UTF-8 string literal with U+0100 {LATIN CAPITAL LETTER A WITH MACRON} U+0300 {COMBINING GRAVE ACCENT}
>
> This proposal enables the above literals to be written using Unicode
> assigned names instead of Unicode code point values.
>
> U'\N{LATIN CAPITAL LETTER A WITH MACRON}' // Equivalent to U'\u0100'
> u8"\N{LATIN CAPITAL LETTER A WITH MACRON}\N{COMBINING GRAVE ACCENT}" // Equivalent to u8"\u0100\u0300"
>
> This paper discusses and links to work completed by Corentin Jabot that
> investigates implementation impact, though an implementation has not yet
> been completed in an existing compiler. This paper also includes discussion
> regarding alternative design possibilities.
>
> Changes since P2071R0 <https://wg21.link/p2071r0>
>
> - Updated the proposal to match the EWG design consensus reached in
> Prague. Removed the proposal options section.
> - Moved some content previously in the introduction section into a new
> history section.
> - Added results of SG16 and EWG polls taken in Prague.
> - Updated the existing practice section to correctly describe the name
> matching behavior of other languages where the behavior was previously
> uncertain.
> - Updated uses of U+NNNN to correctly follow Unicode notational
> conventions.
>
> History
>
> Prior presentations of P1097 to EWG-I and EWG received strong
> encouragement and useful design feedback:
>
> - Review of P1097R1 <https://wg21.link/p1097r1> by EWG-I in San Diego,
>   2018 <http://wiki.edg.com/bin/view/Wg21sandiego2018/P1097R1>:
>   - *Do we want named escape sequences?*
>     SF F N A SA
>     5 9 7 0 0
>   - *Do we want to support name aliases?*
>     SF F N A SA
>     12 8 1 0 0
>   - *Do we want case-insensitive matching?*
>     SF F N A SA
>     5 7 4 4 1
>   - *Do we want full UAX #44 LM2 name matching?*
>     SF F N A SA
>     0 0 7 7 7
> - Review of P1097R2 <https://wg21.link/p1097r2> by EWG in Belfast,
>   2019 <http://wiki.edg.com/bin/view/Wg21belfast/P1097-EWG>:
>   - *EWG wants to encourage further work in this area*
>     SF F N A SA
>     8 16 8 1 1
>     Motion passes.
>   - *Accept P1097 as presented for C++23*
>     SF F N A SA
>     2 9 13 5 1
>     No consensus. Author encouraged to do further work.
>
> Two areas of concern were raised during discussion in EWG in Belfast, 2019
> <http://wiki.edg.com/bin/view/Wg21belfast/P1097-EWG>:
>
> - *Implementation impact*
> The Unicode name database (names and aliases), in text form, is ~1.5
> MiB and a naive implementation could significantly impact the size of
> compiler distributions. This was of particular concern to organizations
> that distribute compilers as part of a distributed build process.
> - *Design concerns*
> One EWG member strongly preferred a library based design that would
> have a smaller impact on the core language. For example, a string
> interpolation based design.
>
> The implementation concerns prompted Corentin Jabot to explore
> implementation strategies as described in the Implementation experience
> section.
>
> Despite the clear negative feedback from EWG-I with regard to use of
> UAX44-LM2 <https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2>
> to match character names, P2071R0 <https://wg21.link/p2071r0> proposed
> using UAX44-LM2
> <https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2>. This was
> motivated solely by Corentin Jabot's use of that algorithm in his
> implementation experiments.
>
> Presentation of P2071R0 <https://wg21.link/p2071r0> to SG16 and EWG in
> Prague received strong encouragement and consensus for design direction.
>
> - SG16 in Prague, 2020
>   <http://wiki.edg.com/bin/view/Wg21prague/SG16P2071R0>:
>   - *What is our preferred name matching algorithm?*
>     In favor  Name match algorithm
>     6         Exact match
>     6         Case insensitive
>     4         Full UAX44-LM2
>     No consensus for the UAX44-LM2 algorithm.
>   - *We should support case-insensitive matching as opposed to exact
>     match?*
>     SF F N A SA
>     2 3 2 1 2
>     Consensus? No
>     SF: Matches implementations in other languages.
>     SF: Mixed case is more legible than UPPERCASE.
>     SA: This is an identifier in a case-sensitive language.
>     SA: Increases maintenance in large code bases due to different
>     style preferences; want one way to spell things.
>     N: Want UAX44-LM2 because I'll constantly have to look up correct
>     names.
>   - *Preferred syntax: (vote for 1)*
>     In favor  Syntax
>     8         Use "\N{XXX}"
>     0         Use "\u{XXX}" and "\U{XXX}"
>     Strong consensus for the originally proposed syntax.
>     F: Want to reserve \u for other potential extensions.
>     F: Matches other languages like Python.
>   - *Match name aliases?*
>     SF F N A SA
>     8 2 0 0 0
>     Consensus? Yes
>   - *Include support for ISO/IEC 10646 named sequences?*
>     SF F N A SA
>     0 0 1 6 1
>     Consensus? No
>     SA: Adds implementation complexity for little benefit.
>     A: Can be added later.
>   - *Forward to EWG with: no UAX44-LM2 matching, no support for named
>     sequences, use of \N, and no recommendation regarding case-sensitivity.*
>     SF F N A SA
>     7 3 0 0 0
>     Consensus? Yes
> - EWG in Prague, 2020
>   <http://wiki.edg.com/bin/view/Wg21prague/P2071R0-EWG>:
>   - *We are interested in supporting named universal character
>     escapes*
>     SF F N A SA
>     14 5 0 0 0
>   - *This should further support aliases*
>     SF F N A SA
>     18 2 1 0 0
>   - *It should further be case insensitive*
>     SF F N A SA
>     0 6 6 9 2
>   - *It should further support UAX44-LM2 with arbitrary spaces and
>     dashes*
>     SF F N A SA
>     1 4 5 8 5
>
> Here again, clear negative feedback was provided with regard to use of the
> UAX44-LM2 <https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2>
> name matching algorithm. Additionally, the clearest guidance obtained so
> far was provided with regard to case-insensitivity. Corentin Jabot
> experimented and found that use of UAX44-LM2
> <https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2> only
> reduced data size by about 9K; this delta is not significant. Revision
> P2071R1 <https://wg21.link/p2071r1> was therefore modified to match the
> EWG consensus to require exact name matches only.
>
> Motivation
>
> The introduction of *universal-character-name*
> <http://eel.is/c++draft/lex.charset#nt:universal-character-name>s in
> C++11 benefitted programmers by allowing them to portably encode characters
> outside of the basic source character set without having to resort to use
> of octal or hexadecimal *escape-sequence*
> <http://eel.is/c++draft/lex.ccon#nt:escape-sequence>s to explicitly
> encode code units. However, Unicode code points by themselves do not
> clearly communicate to readers of the code which character is to be
> encoded; hence the code comments included with the code examples in the
> introduction. Allowing programmers to directly use Unicode assigned
> character names avoids the need for side channel communications, like code
> comments, that might get out of sync over time.
>
> Use of UTF-8 as the encoding for source files has increased over time, but
> impediments to adoption remain. For example, Microsoft Visual C++ still
> defaults to a locale dependent encoding and that encourages limiting source
> files to ASCII. If the C++ community were to migrate en masse to UTF-8,
> then one might question whether *universal-character-name*
> <http://eel.is/c++draft/lex.charset#nt:universal-character-name>s would
> become a legacy backward compatibility feature since programmers could
> reliably type the intended character in their source code directly. And if
> *universal-character-name*
> <http://eel.is/c++draft/lex.charset#nt:universal-character-name>s were to
> become an anachronism, then what use would be served by introducing a named
> character escape?
>
> Unicode defines a number of characters that, even when they can be typed
> directly, can result in confusion. These include invisible characters such
> as U+200B {ZERO WIDTH SPACE}, combining characters such as U+0300
> {COMBINING GRAVE ACCENT}, visually indistinct characters such as U+003B
> {SEMICOLON} and U+037E {GREEK QUESTION MARK}, and characters with RTL
> (right-to-left) directionality. Consider how the following string literals
> containing these characters are rendered. In cases like these, use of
> escape sequences improves clarity; thus motivation for use of Unicode
> escape sequences will remain.
> "​"  // U+200B {ZERO WIDTH SPACE}
> "‏"  // U+200F {RIGHT-TO-LEFT MARK}
> "̀"  // U+0300 {COMBINING GRAVE ACCENT}
> ";"  // U+003B {SEMICOLON}
> ";"  // U+037E {GREEK QUESTION MARK}
> "´"  // U+00B4 {ACUTE ACCENT}
> "́"  // U+0301 {COMBINING ACUTE ACCENT}
> "´"  // U+1FFD {GREEK OXIA}
> "Ω"  // U+03A9 {GREEK CAPITAL LETTER OMEGA}
> "Ω"  // U+2126 {OHM SIGN}
> "A"  // U+0041 {LATIN CAPITAL LETTER A}
> "Α"  // U+0391 {GREEK CAPITAL LETTER ALPHA}
> "А"  // U+0410 {CYRILLIC CAPITAL LETTER A}
> "Ꭺ"  // U+13AA {CHEROKEE LETTER GO}
> "ꓮ"  // U+A4EE {LISU LETTER A}
> "𐊠"  // U+102A0 {CARIAN LETTER A}
> "𖽀"  // U+16F40 {MIAO LETTER ZZYA}
>
> Named character escapes are supported in various forms in other
> programming languages. The following is the result of a brief survey of
> various languages. For languages that include such support, more details
> can be found in the Design considerations section.
> Language      Named character escape support
> C#            No
> D             Yes; HTML 5 named character references
> Go            No
> Java          No
> Javascript    No
> Perl          Yes; Unicode names, aliases, and named sequences
> PHP           No
> Python        Yes; Unicode names and aliases
> Raku          Yes; Unicode names, aliases, named sequences, and emoji sequences
> Ruby          No
> Rust          No
> Swift         No
> Visual Basic  No
>
> Design considerations
>
> There are numerous choices for how support for named characters can be
> integrated into C++. Useful questions for making design choices include:
>
> - Which names will be recognized? Can multiple names for the same
> character exist?
> - How will names be matched? Must they be exact? Case insensitive?
> - How will support for new names affect backward compatibility?
> - How will the requirement for a name database impact implementations?
> - What syntax to use?
> - What is existing practice in other languages?
>
> This section analyzes the various options considered for this proposal.
>
> Syntax
>
> Named character escapes are proposed as a more readable alternative to
> universal-character-name
> <http://eel.is/c++draft/lex.charset#nt:universal-character-name>s. As
> such, it is desirable that they be similar in syntax to
> universal-character-name
> <http://eel.is/c++draft/lex.charset#nt:universal-character-name>s and
> other existing escape sequences.
>
> The syntax proposed by Fernandes in P1097R2 <https://wg21.link/p1097r2>
> [P1097R2] is modeled after the syntax adopted for Python and consists of a
> \N escape introducer followed by a name enclosed in curly brackets. For
> example:
>
> '\N{LATIN CAPITAL LETTER A}'
> "\N{LATIN CAPITAL LETTER A WITH MACRON}"
>
> Other choices for the escape introducer are possible; the Backward
> compatibility section discusses some possible motivation for preferring \u
> and/or \U.
>
> Options for recognized names and how to match them are discussed in
> subsequent sections.
>
> As proposed, only one name is allowed per named character escape, but that
> is an artificial limitation. Raku allows a sequence of comma separated
> names to be specified in a single escape. This is a natural extension if
> names are permitted to identify sequences of characters instead of a single
> character. The following would all be equivalent. This proposal leaves this
> option to a future extension; see the Possible future extensions section.
>
> "\N{LATIN CAPITAL LETTER A WITH MACRON, COMBINING GRAVE ACCENT}"
> "\N{LATIN CAPITAL LETTER A WITH MACRON}\N{COMBINING GRAVE ACCENT}"
> "\u0100\u0300"
>
> Perl and Raku both allow Unicode code point numbers to be specified as
> character names. Following suit would enable a syntax that avoids the
> strict 4 or 8 digit requirements of universal-character-name
> <http://eel.is/c++draft/lex.charset#nt:universal-character-name>s and
> could allow the natural U+NNNN style frequently used to identify Unicode
> characters. The following could all be equivalent. This proposal also
> leaves this option for a future extension as discussed in the Possible
> future extensions section.
>
> "\N{U+0100}"
> "\N{U+100}"
> "\N{U+000100}"
> "\N{0x0100}"
> "\N{256}"
> "\u0100"
>
> Name sources
>
> A named character escape feature is not particularly useful unless
> accompanied by at least one source of character names. The following list
> contains sources of character names that are consulted by at least one
> implementation of named character escapes in another programming language.
>
> - Unicode assigned names (synchronized with ISO/IEC 10646)
> https://www.unicode.org/Public/12.0.0/ucd/NamesList.txt
> - Unicode aliases (synchronized with ISO/IEC 10646)
> https://www.unicode.org/Public/12.0.0/ucd/NameAliases.txt
> - Unicode named sequences (synchronized with ISO/IEC 10646)
> https://www.unicode.org/Public/12.0.0/ucd/NamedSequences.txt
> - Emoji ZWJ sequences
> https://www.unicode.org/Public/emoji/4.0/emoji-zwj-sequences.txt
> - Emoji sequences
> https://www.unicode.org/Public/emoji/4.0/emoji-sequences.txt
> - HTML named character references
> https://html.spec.whatwg.org/multipage/named-characters.html#named-character-references
>
> The first three are defined by the Unicode Consortium, part of the Unicode
> standard, and synchronized with ISO/IEC 10646. The names specified in each
> are designed in concert, share a common namespace, are immutable once
> published, and Unicode guarantees no conflicts between them. See the Unicode
> character encoding stability policy
> <https://www.unicode.org/policies/stability_policy.html> [UCESP] for more
> details. These sources are consulted for named character escapes in Perl,
> Python, and Raku.
>
> The next two sources specify emoji character sequences. Though produced by
> the Unicode Consortium, they are not part of the Unicode standard, and are
> not covered by the Unicode character encoding stability policy
> <https://www.unicode.org/policies/stability_policy.html> [UCESP]. These two
> sources don't technically
> provide names; they provide optional descriptions. The provided
> descriptions use characters, particularly : and ,, that are disallowed in
> the names provided by the first three sources. These sources are consulted
> for named character escapes in Raku.
>
> The last source is the specification of names recognized for use as named
> character references in HTML documents. This source is used for the
> implementation of named character escapes in the D programming language.
>
> The stability guarantees offered by the Unicode standard are a strong
> motivator for their use and, as such, this proposal adopts them as the name
> sources to use.
>
> The list of Unicode assigned names associates at most one name with each
> character. There are some characters that are not assigned a name in this
> list, for example, U+0080 is simply listed as a <control> character with
> no name. In some of these cases, the Unicode aliases list provides one or
> more names. For example, U+0080 has assigned aliases of PADDING CHARACTER
> (a figment alias) and PAD (an abbreviation alias).
>
> Unicode aliases provide another critical service. As mentioned above, once
> assigned, names are immutable. Corrections are only offered by providing an
> alias. Aliases come in five varieties:
>
> - *correction*
> Aliases for cases where an incorrect assigned name was published. For
> example, U+FE18 has an assigned name of PRESENTATION FORM FOR VERTICAL
> RIGHT WHITE LENTICULAR BRAKCET and a correction alias of PRESENTATION
> FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET (note the typo
> correction).
> - *control*
> Aliases for various control characters. For example, NULL for U+0000.
> - *alternate*
> Aliases for widely used alternate names. For example, BYTE ORDER MARK
> for U+FEFF.
> - *figment*
> Aliases for names that were documented, but never accepted in a
> standard. For example, HIGH OCTET PRESET for U+0081.
> - *abbreviation*
> Aliases for common abbreviations. For example, NBSP for U+00A0.
>
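The relationship between assigned names and aliases described above can be sketched as a single exact-match lookup in which both kinds of name share one namespace. This is only an illustration: NameEntry, kNames, and lookup_exact are hypothetical names, and the eight-entry table is hand-written here, whereas a real implementation would be generated from the Unicode name data.

```cpp
#include <array>
#include <optional>
#include <string_view>

// Hypothetical miniature name table (an assumption for illustration only).
// Assigned names and aliases live in the same namespace, so one table
// can hold both kinds of entry.
struct NameEntry {
    std::string_view name;
    char32_t code_point;
};

inline constexpr std::array<NameEntry, 8> kNames{{
    {"LATIN CAPITAL LETTER A WITH MACRON", 0x0100},  // assigned name
    {"COMBINING GRAVE ACCENT", 0x0300},              // assigned name
    {"ZERO WIDTH SPACE", 0x200B},                    // assigned name
    {"NO-BREAK SPACE", 0x00A0},                      // assigned name
    {"NBSP", 0x00A0},                                // abbreviation alias
    {"ZERO WIDTH NO-BREAK SPACE", 0xFEFF},           // assigned name
    {"BYTE ORDER MARK", 0xFEFF},                     // alternate alias
    {"NULL", 0x0000},                                // control alias
}};

// Case-sensitive, whitespace-sensitive exact match; no loose matching.
inline std::optional<char32_t> lookup_exact(std::string_view name) {
    for (const NameEntry& entry : kNames) {
        if (entry.name == name) {
            return entry.code_point;
        }
    }
    return std::nullopt;
}
```

With this table, lookup_exact("NBSP") and lookup_exact("NO-BREAK SPACE") yield the same code point, while lookup_exact("nbsp") yields no match because the comparison is case-sensitive.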
> It is conceivable that implementors could desire, or be requested to,
> support additional implementation-defined names; perhaps including from the
> additional sources listed above. Since new characters and names will
> continue to be added to the Unicode standard, caution is warranted to avoid
> the possibility of introducing conflicting names over time. The description
> of the UAX44-LM2
> <https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2> name
> matching algorithm describes a historical case of how such a conflict once
> occurred. Any support for additional names should ensure that they occupy a
> non-overlapping namespace with the Unicode assigned names. Out of caution,
> this proposal disallows additional implementation-defined names.
>
> Name matching
>
> Names can be finicky things. Having to remember whether a name is, for
> example, ZERO WIDTH SPACE or ZERO-WIDTH SPACE is likely to frustrate
> programmers. Some programmers might prefer zero width space.
>
> Unicode provides a straightforward algorithm for matching names with
> various allowances including case-insensitivity, omission of some hyphens
> (-), and substitution of underscore (_) for space characters. UAX44-LM2
> <https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2> is included
> in the Unicode standard via Unicode Standard Annex #44
> <https://www.unicode.org/reports/tr44/tr44-24.html> [UAX#44].
>
> The UAX44-LM2
> <https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2> matching
> rule would accept any of the following names as a match for U+200B {ZERO
> WIDTH SPACE}
>
> ZERO WIDTH SPACE
> ZERO-WIDTH SPACE
> zero-width space
> ZERO width S P_A_C E
>
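A minimal sketch of loose matching in the spirit of UAX44-LM2 follows, to make the allowances above concrete. It is a simplification under stated assumptions: loose_key and loosely_equal are hypothetical helper names, a hyphen is treated as medial when it sits directly between two ASCII alphanumerics in the input, and the rule's special case for U+1180 HANGUL JUNGSEONG O-E is deliberately not handled.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

// Builds a comparison key: case, spaces, underscores, and medial hyphens
// are ignored. Simplified; the U+1180 HANGUL JUNGSEONG O-E exception in
// UAX44-LM2 is not implemented here.
inline std::string loose_key(std::string_view name) {
    auto is_alnum = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) != 0;
    };
    std::string key;
    for (std::size_t i = 0; i < name.size(); ++i) {
        const char c = name[i];
        if (c == ' ' || c == '_') {
            continue;  // spaces and underscores never contribute to the key
        }
        const bool medial_hyphen = c == '-' && i > 0 && i + 1 < name.size() &&
                                   is_alnum(name[i - 1]) && is_alnum(name[i + 1]);
        if (medial_hyphen) {
            continue;  // "ZERO-WIDTH" and "ZERO WIDTH" produce the same key
        }
        key += static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
    return key;
}

// Two spellings match loosely when their keys are identical.
inline bool loosely_equal(std::string_view a, std::string_view b) {
    return loose_key(a) == loose_key(b);
}
```

All four spellings listed above reduce to the same key, while a non-medial hyphen stays significant: TIBETAN LETTER A and TIBETAN LETTER -A remain distinct names.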
> Portable names
>
> Portably using named character escapes will require implementations to
> agree on a minimum version of the name sources.
>
> Thanks to the adoption of P1025R1 <https://wg21.link/p1025r1> [P1025R1]
> in Rapperswil, 2019, the C++ standard has a normative floating reference
> to ISO/IEC 10646 <https://www.iso.org/standard/69119.html> [ISO/IEC10646],
> the ISO/IEC standard that specifies a subset of what is specified in the
> Unicode standard and is kept synchronized with it. ISO/IEC 10646:2017
> includes the Unicode assigned names (in section 33), name aliases (in
> section 33), and named character sequences (in section 27).
>
> The floating reference to ISO/IEC 10646 indicates a dependence on the
> version that is current at the time of standardization. Thus, conformance
> with the C++ standard will require conformance with the latest available
> publication of ISO/IEC 10646.
>
> Implementors must be allowed, and encouraged, to conform to more recent
> versions of ISO/IEC 10646 as they are published.
>
> Existing practice
>
> Support for named escape sequences exists in several programming
> languages. The following details of existing practice were obtained from
> these documentation sources.
> Language  Documentation link
> D         https://dlang.org/spec/lex.html#StringLiteral
> Perl      https://perldoc.perl.org/charnames.html
> Python    https://docs.python.org/3.8/reference/lexical_analysis.html#literals
> Raku      https://docs.raku.org/language/unicode#Entering_unicode_codepoints_and_codepoint_sequences
>
> Capabilities vary across languages:
> - D
>   Name sources: HTML 5
>   Comma separated names: No
>   Name matching: Case-sensitive and whitespace-sensitive.
>   Matches code point numbers: No
> - Perl
>   Name sources: Unicode names, Unicode name aliases, Unicode named
>   sequences, registered custom aliases
>   Comma separated names: No
>   Name matching: By default, case-sensitive and whitespace-sensitive exact
>   match. Optionally, script qualified short names with
>   use charnames ':short';. Optionally, UAX44-LM2
>   <https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2> with
>   use charnames ':loose'; (case insensitive, ignore underscore, most
>   spaces, and most non-medial hyphens).
>   Matches code point numbers: Yes
> - Python
>   Name sources: Unicode names, Unicode name aliases
>   Comma separated names: No
>   Name matching: Case-insensitive, but whitespace-sensitive
>   Matches code point numbers: No
> - Raku
>   Name sources: Unicode names, Unicode name aliases, Unicode named
>   sequences, emoji ZWJ sequences, emoji sequences
>   Comma separated names: Yes
>   Name matching: Case-insensitive, but whitespace-sensitive
>   Matches code point numbers: Yes
>
> Examples:
> Language Code
> D
>
> "\&Amacr;"
>
> Perl
>
> "\N{LATIN CAPITAL LETTER A WITH MACRON}"
> "\N{U+0100}"
>
> Python
>
> "\N{LATIN CAPITAL LETTER A WITH MACRON}"
>
> Raku
>
> "\c[LATIN CAPITAL LETTER A WITH MACRON]"
> "\c[256]"
> "\c[LATIN CAPITAL LETTER A WITH MACRON,COMBINING GRAVE ACCENT]"
> "\c[LATIN CAPITAL LETTER A WITH MACRON AND GRAVE]"
>
> Backward compatibility
>
> Escape sequences beyond those required in the standard are
> conditionally-supported ([lex.ccon]p7
> <http://eel.is/c++draft/lex.ccon#7.sentence-3>). For implementations that
> currently define a meaning for \N in character or string literals, the
> use of \N in this proposal is technically a breaking change.
>
> Gcc, Clang, and Microsoft Visual C++ all accept \N as an escape sequence
> with the semantic effect of substituting N such that "\N{xxx}" is
> equivalent to "N{xxx}". However, they each emit a warning regarding an
> unrecognized escape sequence, so reliance on this behavior is not likely to
> be common. Still, there are likely to be some uses in the wild (probably
> some percentage of which were intended to be \n).
>
> Another option would be to reuse the \u and/or \U introducer used for
> *universal-character-name*
> <http://eel.is/c++draft/lex.charset#nt:universal-character-name>s. Gcc
> and Clang both reject code like "\u{xxx}" and "\U{xxx}" as containing
> ill-formed *universal-character-name*
> <http://eel.is/c++draft/lex.charset#nt:universal-character-name>s.
> However, Microsoft Visual C++ accepts such uses without a warning and
> treats them as equivalent to "u{xxx}" and "U{xxx}" respectively.
>
> The implementation divergence that occurs for the \u and \U cases above
> suggests that repurposing them may reduce the potential for backward
> compatibility impact. Use of \u and/or \U would potentially require more
> wording changes to distinguish named character escapes from
> *universal-character-name*
> <http://eel.is/c++draft/lex.charset#nt:universal-character-name>s, but
> would be unlikely to pose a significant additional impact to implementors.
>
> For now, this proposal adheres to Fernandes' original design and retains
> use of \N as the introducer for named character escapes.
>
> Implementor impact
>
> The sources of character names listed in the Name sources section do not
> constitute big data by today's standards, but that does not mean that the
> volume of data and potential for impact to compiler distributions and
> compiler performance is
> insignificant. As mentioned earlier, some organizations have valid
> technical reasons to be sensitive to the size of the compiler distributions
> they use; in a distributed build environment that distributes compilers,
> the size of the distribution impacts latency and can therefore negatively
> impact build times.
>
> The combined size of the Unicode 12.0 text files containing the Unicode
> assigned names, aliases, and named character sequences is approximately 1.5
> MiB. A naive implementation might contribute 2+ MiB of code/data to a
> compiler. Some EWG members indicated that amount of increase is a cause for
> concern.
>
> Fortunately, naive implementations are not the only option. Corentin Jabot
> has done some excellent work to demonstrate that an implementation should
> be possible that increases the code/data size of a compiler by less than
> 300 KiB. See the Implementation experience section for details. Corentin's
> approach is promising, but the additional complexity carries additional
> implementation cost and maintenance.
>
> Staying up to date with new Unicode releases will also, of course, pose an
> additional cost on implementors.
>
> Design alternatives
>
> As indicated previously, at least one EWG member in Belfast was strongly
> interested in a more general core language feature, presumably a string
> interpolation facility, that would allow named character escapes to be
> implemented as a library feature. Such a feature could take many forms, but
> might look something like the following where \{ is an escape sequence
> followed by a call to a constexpr function named nce with arguments
> passed in some form.
>
> "\{nce(LATIN CAPITAL LETTER A WITH GRAVE)}"
>
> Such a feature could certainly be implemented, but would seem to
> necessarily be more verbose and would necessitate inclusion of appropriate
> headers; headers that would be quite large in the case of a named character
> database or that would make use of a compiler intrinsic; which would put
> the complexity back in the compiler (though in implementation-defined
> territory rather than in standard core language). The verbosity concern
> could potentially be reduced by introducing core language sugar for
> lowering the proposed syntax to the example string interpolation syntax
> above.
>
> Proposal
>
> The wording included in this proposal is for the following design:
>
> - Context:
> - Named character escapes are valid only in character and string
> literals (not in identifiers).
> - Syntax:
> - \N{xxx} where xxx is the name of the character.
> - Name sources:
> - ISO/IEC 10646 assigned names.
> - ISO/IEC 10646 assigned name aliases.
> - No allowance for additional implementation-defined names.
> - Name matching:
> - case-sensitive and whitespace-sensitive exact matches.
> - Feature test macro:
> - __cpp_named_character_escapes
>
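One way the proposed feature test macro might be used is sketched below: code selects the named escapes when the implementation advertises them and otherwise falls back to the equivalent universal-character-names. The variable name kAMacronGrave is hypothetical, and the sketch assumes the macro is spelled exactly as listed above.

```cpp
#include <cstddef>

// Prefer named escapes when the implementation advertises support through
// the feature test macro; otherwise use the equivalent
// universal-character-names. kAMacronGrave is an illustrative name only.
#if defined(__cpp_named_character_escapes)
constexpr auto& kAMacronGrave =
    u8"\N{LATIN CAPITAL LETTER A WITH MACRON}\N{COMBINING GRAVE ACCENT}";
#else
constexpr auto& kAMacronGrave = u8"\u0100\u0300";
#endif

// Either spelling names the same two code points: two UTF-8 code units
// each, plus the terminating null.
static_assert(sizeof(kAMacronGrave) == 5, "unexpected UTF-8 length");
```

Because the name must match exactly under this design, a misspelled or differently cased name in the first branch would be ill-formed rather than silently matched.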
> Possible future extensions
>
> The following options are *not* currently proposed but could be
> considered for future extension.
>
> 1. Allow comma separated names. For example:
> - "\N{LATIN CAPITAL LETTER A WITH MACRON, COMBINING GRAVE ACCENT}"
> // Equivalent to "\u0100\u0300"
> 2. Allow code point numbers as names. For example:
> - "\N{U+00C0}" // Equivalent to "\u00C0"
> - "\N{0x00C0}" // Equivalent to "\u00C0"
> - "\N{192}" // Equivalent to "\u00C0"
> 3. Allow names to match ISO/IEC 10646 named sequences such that the
> following would be equivalent:
> - "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"
> - "\N{LATIN CAPITAL LETTER A WITH MACRON}\N{COMBINING GRAVE ACCENT}"
> - "\u0100\u0300"
> 4. Allow names to match Unicode emoji named sequences. For example:
> - "\N{keycap: #}" // Equivalent to
> "\u0023\uFE0F\u20E3"
> - "\N{Czech Republic}" // Equivalent to
> "\U0001F1E8\U0001F1FF"
> - "\N{waving hand: medium skin tone}" // Equivalent to
> "\U0001F44B\U0001F3FD"
> 5. Allow names to match Unicode emoji ZWJ named sequences. For
> example:
> - "\N{man shrugging: medium skin tone}" // Equivalent to
> "\U0001F937\U0001F3FD\u200D\u2642\uFE0F"
> - "\N{rainbow flag}" // Equivalent to
> "\U0001F3F3\uFE0F\u200D\U0001F308"
> 6. Allow names to match HTML 5 named character references by
> surrounding them with & and ;. For example:
> - "\N{À}" // Equivalent to "\u00C0"
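To make option 2 above concrete, here is a minimal sketch of how the three spellings of a code point number could be parsed. None of this is proposed wording: `parse_code_point`, `digit_value`, and `kInvalid` are hypothetical names, C++17 is assumed for `std::string_view`, and the sketch deliberately does not reject surrogate code points.

```cpp
#include <string_view>

// Sentinel for "not a code point number"; 0x110000 is outside the Unicode
// code space. Hypothetical name, for illustration only.
constexpr char32_t kInvalid = 0x110000;

// Value of a decimal or hexadecimal digit, or -1 if c is not a digit.
constexpr int digit_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Accepts "U+hhhh" and "0xhhhh" (hexadecimal) or a plain decimal number.
constexpr char32_t parse_code_point(std::string_view s) {
    unsigned base = 10;
    if (s.size() > 2 && ((s[0] == 'U' && s[1] == '+') ||
                         (s[0] == '0' && s[1] == 'x'))) {
        base = 16;
        s.remove_prefix(2);
    }
    if (s.empty()) return kInvalid;
    char32_t value = 0;
    for (char c : s) {
        const int d = digit_value(c);
        if (d < 0 || static_cast<unsigned>(d) >= base) return kInvalid;
        value = value * base + static_cast<char32_t>(d);
        if (value > 0x10FFFF) return kInvalid;  // beyond the code space
    }
    return value;
}
```

All three spellings from the list map to the same value, so `parse_code_point("U+00C0")`, `parse_code_point("0x00C0")`, and `parse_code_point("192")` each yield 0xC0.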
>
> Implementation experience
>
> This proposal has not yet been implemented in an existing compiler.
> However, the implementation concerns raised in Belfast prompted Corentin
> Jabot to conduct an experiment to determine how far the implementation
> overhead, in terms of data and code within the compiler, could be
> reduced. His blog post <https://cor3ntin.github.io/posts/cp_to_name>[CJ_BLOG]
> on the experiment reported that he was able to implement a function
> (cp_from_name
> <https://github.com/cor3ntin/ext-unicode-db/blob/name_to_cp/name_to_cp.hpp#L215-L260>)
> that accepts a Unicode 12.0 name or name alias and returns a code point
> value in under 300 KiB. His implementation is available in the name_to_cp
> branch of his ext-unicode-db GitHub repository at
> https://github.com/cor3ntin/ext-unicode-db/tree/name_to_cp [CJ_IMPL].
> Acknowledgements
>
> Thank you to R. Martinho Fernandes for taking the initiative to research
> and first propose support for named character escapes and for contributing
> his considerable expertise in general to SG16.
>
> Thank you to Corentin Jabot for the excellent work he did experimenting
> with and analyzing implementation impact. Without his work, the data
> necessary to respond to the implementation concerns raised in Belfast would
> not have been available at this time, thereby delaying further progress on
> this proposal.
>
> Thank you to Peter Bindels and Corentin Jabot for providing feedback on an
> initial draft that I delivered to them less than two hours before the
> Prague pre-meeting mailing deadline!
> References
> [CJ_BLOG] Corentin Jabot, "Storing Unicode: Character Name to Codepoint
> Mapping", 2019.
> https://cor3ntin.github.io/posts/cp_to_name
> [CJ_IMPL] Corentin Jabot, "ext-unicode-db", 2019.
> https://github.com/cor3ntin/ext-unicode-db/tree/name_to_cp
> [ISO/IEC10646] "Information technology — Universal Coded Character Set
> (UCS)", ISO/IEC 10646:2017, 2017.
> https://www.iso.org/standard/69119.html
> [N4835] "Working Draft, Standard for Programming Language C++", N4835,
> 2019.
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/n4835.pdf
> [P1025R1] Steve Downey, et al. "Update The Reference To The Unicode
> Standard", P1025R1, 2018.
> https://wg21.link/p1025r1
> [P1097R1] R. Martinho Fernandes, "Named character escapes", P1097R1, 2018.
> https://wg21.link/p1097r1
> [P1097R2] R. Martinho Fernandes, "Named character escapes", P1097R2, 2019.
> https://wg21.link/p1097r2
> [P2029R1] Tom Honermann, "Proposed resolution for core issues 411, 1656,
> and 2333; numeric and universal character escapes in character and string
> literals", P2029R1, 2020.
> https://wg21.link/p2029r1
> [UCESP] "Unicode Character Encoding Stability Policies", 2017.
> https://www.unicode.org/policies/stability_policy.html
> [UAX#44] Ken Whistler and Laurențiu Iancu, "Unicode Standard Annex #44 -
> Unicode Character Database", Revision 24, Unicode 12.0.0, 2019.
> https://www.unicode.org/reports/tr44/tr44-24.html
> Core wording
>
> These changes are relative to N4835
> <http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/n4835.pdf>[N4835].
>
> If P2029R1 <https://wg21.link/p2029r1>[P2029R1] is adopted, substantial
> wording updates will be required.
>
> Change in 5.2 [lex.phases] paragraph 5
> <http://eel.is/c++draft/lex.phases#1.5>:
>
> Each basic source character set member in a character literal or a string
> literal, as well as each escape sequence, *universal-character-name*
> <http://eel.is/c++draft/lex.charset#nt:universal-character-name>, and
> *named-escape-sequence*
> <http://eel.is/c++draft/lex.ccon#nt:named-escape-sequence> in a character
> literal or a non-raw string literal, is converted to the corresponding
> member of the execution character set ([lex.ccon]
> <http://eel.is/c++draft/lex.ccon>, [lex.string]
> <http://eel.is/c++draft/lex.string>); if there is no corresponding
> member, it is converted to an implementation-defined member other than the
> null (wide) character. 8 <http://eel.is/c++draft/lex.phases#footnote-8>
>
> Change in 5.13.3 [lex.ccon] <http://eel.is/c++draft/lex.ccon>:
>
> character-literal: <http://eel.is/c++draft/lex.ccon#nt:character-literal>
> encoding-prefix <http://eel.is/c++draft/lex.ccon#nt:encoding-prefix>opt '
> c-char-sequence <http://eel.is/c++draft/lex.ccon#nt:c-char-sequence> '
>
> encoding-prefix: <http://eel.is/c++draft/lex.ccon#nt:encoding-prefix> one
> of
> u8 u U L
>
> c-char-sequence: <http://eel.is/c++draft/lex.ccon#nt:c-char-sequence>
> c-char <http://eel.is/c++draft/lex.ccon#nt:c-char>
> c-char-sequence <http://eel.is/c++draft/lex.ccon#nt:c-char-sequence>
> c-char <http://eel.is/c++draft/lex.ccon#nt:c-char>
>
> c-char: <http://eel.is/c++draft/lex.ccon#nt:c-char>
> any member of the basic source character set except the single-quote ',
> backslash \, or new-line <http://eel.is/c++draft/cpp.pre#nt:new-line>
> character
> escape-sequence <http://eel.is/c++draft/lex.ccon#nt:escape-sequence>
> universal-character-name
> <http://eel.is/c++draft/lex.charset#nt:universal-character-name>
>
> escape-sequence: <http://eel.is/c++draft/lex.ccon#nt:escape-sequence>
> simple-escape-sequence
> <http://eel.is/c++draft/lex.ccon#nt:simple-escape-sequence>
> octal-escape-sequence
> <http://eel.is/c++draft/lex.ccon#nt:octal-escape-sequence>
> hexadecimal-escape-sequence
> <http://eel.is/c++draft/lex.ccon#nt:hexadecimal-escape-sequence>
> named-escape-sequence
> <http://eel.is/c++draft/lex.ccon#nt:named-escape-sequence>
>
> simple-escape-sequence:
> <http://eel.is/c++draft/lex.ccon#nt:simple-escape-sequence> one of
> \' \" \? \\
> \a \b \f \n \r \t \v
>
> octal-escape-sequence:
> <http://eel.is/c++draft/lex.ccon#nt:octal-escape-sequence>
> \ octal-digit <http://eel.is/c++draft/lex.icon#nt:octal-digit>
> \ octal-digit <http://eel.is/c++draft/lex.icon#nt:octal-digit> octal-digit
> <http://eel.is/c++draft/lex.icon#nt:octal-digit>
> \ octal-digit <http://eel.is/c++draft/lex.icon#nt:octal-digit> octal-digit
> <http://eel.is/c++draft/lex.icon#nt:octal-digit> octal-digit
> <http://eel.is/c++draft/lex.icon#nt:octal-digit>
>
> hexadecimal-escape-sequence:
> <http://eel.is/c++draft/lex.ccon#nt:hexadecimal-escape-sequence>
> \x hexadecimal-digit
> <http://eel.is/c++draft/lex.icon#nt:hexadecimal-digit>
> hexadecimal-escape-sequence
> <http://eel.is/c++draft/lex.ccon#nt:hexadecimal-escape-sequence>
> hexadecimal-digit <http://eel.is/c++draft/lex.icon#nt:hexadecimal-digit>
>
> named-escape-sequence:
> <http://eel.is/c++draft/lex.ccon#nt:named-escape-sequence>
> \N{ n-char-sequence <http://eel.is/c++draft/lex.ccon#nt:n-char-sequence> }
>
> n-char-sequence: <http://eel.is/c++draft/lex.ccon#nt:n-char-sequence>
> n-char <http://eel.is/c++draft/lex.ccon#nt:n-char>
> n-char <http://eel.is/c++draft/lex.ccon#nt:n-char> n-char-sequence
> <http://eel.is/c++draft/lex.ccon#nt:n-char-sequence>
>
> n-char: <http://eel.is/c++draft/lex.ccon#nt:n-char> one of
> A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
> a b c d e f g h i j k l m n o p q r s t u v w x y z
> 0 1 2 3 4 5 6 7 8 9
> - space
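The n-char grammar above can be checked mechanically. The following is a minimal sketch, not part of the proposed wording: `is_n_char` and `is_n_char_sequence` are hypothetical helper names, and C++17 is assumed for `std::string_view`. Passing this check only means the text is lexically an n-char-sequence; it does not mean that any character has that name.

```cpp
#include <string_view>

// Every character admitted by the n-char production: uppercase and
// lowercase Latin letters, decimal digits, hyphen-minus, and space.
constexpr std::string_view n_chars =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789- ";

constexpr bool is_n_char(char c) {
    return n_chars.find(c) != std::string_view::npos;
}

// An n-char-sequence is one or more n-chars.
constexpr bool is_n_char_sequence(std::string_view s) {
    if (s.empty()) return false;
    for (char c : s) {
        if (!is_n_char(c)) return false;
    }
    return true;
}
```

Under this grammar, "LATIN CAPITAL LETTER A WITH MACRON" and "CJK UNIFIED IDEOGRAPH-4E00" are accepted, while "keycap: #" is rejected because neither ':' nor '#' is an n-char. The emoji sequence names listed under possible future extensions would therefore need a wider grammar.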
>
> Change in 5.13.3 [lex.ccon] paragraph 7
> <http://eel.is/c++draft/lex.ccon#7>:
>
> Certain non-graphic characters, the single quote ', the double quote ",
> the question mark ?,19 <http://eel.is/c++draft/lex.ccon#footnote-19> and
> the backslash \, can be represented according to Table 8
> <http://eel.is/c++draft/lex.ccon#tab:lex.ccon.esc>. The double quote "
> and the question mark ?, can be represented as themselves or by the
> escape sequences \" and \? respectively, but the single quote ' and the
> backslash \ shall be represented by the escape sequences \' and \\
> respectively. Escape sequences in which the character following the
> backslash is not listed in Table 8
> <http://eel.is/c++draft/lex.ccon#tab:lex.ccon.esc> are
> conditionally-supported, with implementation-defined semantics. An escape
> sequence specifies a single character.
>
> Table 8 <http://eel.is/c++draft/lex.ccon#tab:lex.ccon.esc>: Escape
> sequences [tab:lex.ccon.esc]
> new-line NL(LF) \n
> horizontal tab HT \t
> vertical tab VT \v
> backspace BS \b
> carriage return CR \r
> form feed FF \f
> alert BEL \a
> backslash \ \\
> question mark ? \?
> single quote ' \'
> double quote " \"
> octal number ooo \ooo
> hex number hhh \xhhh
> named escape sequence named character \N{xxx}
>
> Add a new paragraph (X) after 5.13.3 [lex.ccon] paragraph 9
> <http://eel.is/c++draft/lex.ccon#9>:
> *Drafting Note:* Associated character names and character name aliases
> are listed in section 33 of ISO/IEC 10646:2017.
>
> A *named-escape-sequence*
> <http://eel.is/c++draft/lex.ccon#nt:named-escape-sequence> is translated
> to the encoding, in the appropriate execution character set, of the
> character associated with the ISO/IEC 10646 associated character name or character
> name alias that matches the name specified by the *n-char-sequence*
> <http://eel.is/c++draft/lex.ccon#nt:n-char-sequence>. Matching of names
> is case-sensitive and whitespace-sensitive. If no name is matched, then the
> program is ill-formed.
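The matching rule in the new paragraph can be sketched as a plain exact-match lookup. This is only an illustration under stated assumptions: `NameEntry`, `kNames`, `kNoMatch`, and `lookup_name` are hypothetical names, the table is a four-entry excerpt rather than the full set of ISO/IEC 10646 names and aliases, and a linear search stands in for whatever compact data structure a compiler would really use. C++17 is assumed.

```cpp
#include <string_view>

struct NameEntry {
    std::string_view name;
    char32_t code_point;
};

// Hypothetical excerpt: three character names and one name alias.
constexpr NameEntry kNames[] = {
    {"COMBINING GRAVE ACCENT", 0x0300},
    {"LATIN CAPITAL LETTER A WITH GRAVE", 0x00C0},
    {"LATIN CAPITAL LETTER A WITH MACRON", 0x0100},
    {"LINE FEED", 0x000A},  // alias of U+000A
};

// Sentinel for "no name matched"; 0x110000 is outside the Unicode code
// space. In a compiler this case would make the program ill-formed.
constexpr char32_t kNoMatch = 0x110000;

// Exact comparison: no case folding and no whitespace normalization.
constexpr char32_t lookup_name(std::string_view name) {
    for (const NameEntry& entry : kNames) {
        if (entry.name == name) return entry.code_point;
    }
    return kNoMatch;
}
```

With this table, `lookup_name("LATIN CAPITAL LETTER A WITH GRAVE")` yields 0xC0, while the lowercase spelling or a spelling with a doubled space yields `kNoMatch`, which mirrors the case-sensitive and whitespace-sensitive rule.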
>
> Change in 5.13.5 [lex.string] paragraph 14
> <http://eel.is/c++draft/lex.string#14>:
>
> Escape sequences and *universal-character-name*
> <http://eel.is/c++draft/lex.charset#nt:universal-character-name>s in
> non-raw string literals have the same meaning as in character literals
> <http://eel.is/c++draft/lex.ccon> ([lex.ccon]
> <http://eel.is/c++draft/lex.ccon>), except that the single quote ' is
> representable either by itself or by the escape sequence \', and the
> double quote " shall be preceded by a \, and except that a
> *universal-character-name*
> <http://eel.is/c++draft/lex.charset#nt:universal-character-name> or
> *named-escape-sequence*
> <http://eel.is/c++draft/lex.ccon#nt:named-escape-sequence> in a UTF-16
> string literal may yield a surrogate pair. In a narrow string literal, a
> *universal-character-name*
> <http://eel.is/c++draft/lex.charset#nt:universal-character-name> or
> *named-escape-sequence*
> <http://eel.is/c++draft/lex.ccon#nt:named-escape-sequence> may map to
> more than one char or char8_t element due to *multibyte encoding*
> <http://eel.is/c++draft/lex.string#def:encoding,multibyte>. The size of a
> char32_t or wide string literal is the total number of escape sequences,
> *universal-character-name*
> <http://eel.is/c++draft/lex.charset#nt:universal-character-name>s,
> *named-escape-sequence*
> <http://eel.is/c++draft/lex.ccon#nt:named-escape-sequence>s, and other
> characters, plus one for the terminating U'\0' or L'\0'. The size of a
> UTF-16 string literal is the total number of escape sequences,
> *universal-character-name*
> <http://eel.is/c++draft/lex.charset#nt:universal-character-name>s,
> *named-escape-sequence*
> <http://eel.is/c++draft/lex.ccon#nt:named-escape-sequence>s, and other
> characters, plus one for each character requiring a surrogate pair, plus
> one for the terminating u'\0'. [ *Note:* The size of a char16_t string
> literal is the number of code units, not the number of characters. — *end
> note* ] Within char32_t and char16_t string literals, any
> *universal-character-name*
> <http://eel.is/c++draft/lex.charset#nt:universal-character-name>s shall
> be within the range 0x0 to 0x10FFFF. The size of a narrow string literal
> is the total number of escape sequences and other characters, plus at least
> one for the multibyte encoding of each *universal-character-name*
> <http://eel.is/c++draft/lex.charset#nt:universal-character-name> or
> *named-escape-sequence*
> <http://eel.is/c++draft/lex.ccon#nt:named-escape-sequence>, plus one for
> the terminating '\0'.
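The size rules above can be checked at compile time. The sketch below uses \u and \U escapes so that it compiles without the proposed feature; under the wording above, a \N{...} escape naming the same character would be expected to produce the same code units. `code_units` is a hypothetical helper, not something from the paper.

```cpp
#include <cstddef>

// Number of array elements in a string literal, including the terminator.
template <class CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N;
}

// U+00C0 is in the Basic Multilingual Plane; U+1F937 is not.
static_assert(code_units(U"\U0001F937") == 2, "one UTF-32 code unit + terminator");
static_assert(code_units(u"\u00C0") == 2, "one UTF-16 code unit + terminator");
static_assert(code_units(u"\U0001F937") == 3, "surrogate pair + terminator");
static_assert(code_units(u8"\u00C0") == 3, "two UTF-8 code units + terminator");
static_assert(code_units(u8"\U0001F937") == 5, "four UTF-8 code units + terminator");
```

The UTF-16 case is the "plus one for each character requiring a surrogate pair" clause in action, and the u8 cases show a single escape mapping to more than one element.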
>
> Change in table 17 of 15.11 [cpp.predefined] paragraph 1.8
> <http://eel.is/c++draft/cpp.predefined#1.8>:
> *Drafting note:* the final value for the __cpp_named_character_escapes
> feature test macro will be selected by the project editor to reflect the
> date of approval.
>
> Table 17 — Feature-test macros [tab:cpp.predefined.ft]
> Macro name Value
> […] […]
> __cpp_modules 201907L
> __cpp_named_character_escapes XXXXXXL *** placeholder ***
> __cpp_namespace_attributes 201411L
> […] […]
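A feature test macro like this would typically be used to fall back to a numeric escape on compilers that lack the feature. A minimal sketch, assuming C++17 for `std::string_view`; `a_with_grave` is a hypothetical function name, and only the presence of the macro is tested because its final value is still a placeholder above.

```cpp
#include <string_view>

// Returns U+00C0 in the narrow literal encoding, spelled by name where the
// compiler advertises support and by code point otherwise.
constexpr std::string_view a_with_grave() {
#ifdef __cpp_named_character_escapes
    return "\N{LATIN CAPITAL LETTER A WITH GRAVE}";
#else
    return "\u00C0";
#endif
}
```

Both branches are intended to produce the same string, so callers do not need to know which spelling was compiled.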
>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2021-09-15 14:49:51