ISOCPP sg16 List: Re: Draft comment fixing Annex E to match our current understanding

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Thu, 26 May 2022 09:35:20 +0200

On 26/05/2022 08.57, Corentin Jabot wrote:
> See https://corp.unicode.org/pipermail/unicode/2021-March/009400.html
>
> There was no intent for Pattern_White_Space to ever apply to C++-like languages.

Robin, active within Unicode, said otherwise in yesterday's SG16 telecon.
He claimed that the wording was just bad, and there are editorial updates
in the works for Unicode 15 to make this apply to lexing of programming
language source code, too.

https://www.unicode.org/L2/L2022/22072r-uax9-uax31-amd.pdf

The proposal below adjusts the Annex E wording to the future (Unicode 15+)
state.

LRM / RLM are expressly desired as almost-whitespace, so that e.g. Hebrew
identifiers surrounded by characters without strong directionality
(e.g. "+" or digits) can be handled. We agreed on addressing
this with a paper for C++26.

I note that the last sentence in the above e-mail reads
"So maybe just TAB, LF, CR, space (0020), and possibly wide space (3000),
plus also LRM/RLM/ALM at certain boundaries?"

which is very close to Pattern_White_Space, I think.
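
To make "very close" concrete, here is a minimal sketch comparing the two
sets. The Pattern_White_Space code points are the 11 listed in the Unicode
PropList.txt; the "suggested" set is my reading of the sentence quoted above
(taking "wide space" as U+3000 and LRM/RLM/ALM as U+200E/U+200F/U+061C).
The names kPatternWhiteSpace, kSuggested, and contains are made up for
illustration and are not proposed wording or an existing API.

```cpp
#include <array>
#include <cstddef>

// Pattern_White_Space (Unicode PropList.txt): 11 code points.
constexpr std::array<char32_t, 11> kPatternWhiteSpace = {
    0x0009, 0x000A, 0x000B, 0x000C, 0x000D,  // TAB, LF, VT, FF, CR
    0x0020,                                  // SPACE
    0x0085,                                  // NEL
    0x200E, 0x200F,                          // LRM, RLM
    0x2028, 0x2029};                         // LINE/PARAGRAPH SEPARATOR

// Assumed reading of the set suggested in the quoted e-mail.
constexpr std::array<char32_t, 8> kSuggested = {
    0x0009, 0x000A, 0x000D,  // TAB, LF, CR
    0x0020,                  // SPACE
    0x3000,                  // IDEOGRAPHIC SPACE ("wide space")
    0x200E, 0x200F, 0x061C}; // LRM, RLM, ALM

// Linear membership test; the sets are tiny, so nothing fancier is needed.
template <std::size_t N>
constexpr bool contains(const std::array<char32_t, N>& set, char32_t c) {
    for (std::size_t i = 0; i < N; ++i) {
        if (set[i] == c) return true;
    }
    return false;
}

constexpr bool is_pattern_white_space(char32_t c) {
    return contains(kPatternWhiteSpace, c);
}

constexpr bool is_suggested(char32_t c) {
    return contains(kSuggested, c);
}

// Shared by both sets: TAB, LF, CR, SPACE, LRM, RLM.
static_assert(is_pattern_white_space(0x200E) && is_suggested(0x200E),
              "LRM is in both sets");
// Only in the suggested set: U+3000 and ALM.
static_assert(!is_pattern_white_space(0x3000) && is_suggested(0x3000),
              "U+3000 is not Pattern_White_Space");
// Only in Pattern_White_Space: VT, FF, NEL, U+2028, U+2029.
static_assert(is_pattern_white_space(0x0085) && !is_suggested(0x0085),
              "NEL is Pattern_White_Space only");
```

Under that reading, the two sets share six code points and differ in seven:
U+3000 and U+061C on one side; VT, FF, NEL, U+2028, and U+2029 on the other.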

Jens

> Just saying: "C++ does not claim conformance with this requirement." would be sufficient and more correct. Otherwise, something along the lines of "C++ is not a pattern language so this requirement is not applicable". Or saying nothing, that would work too.
>
> On Thu, May 26, 2022 at 8:09 AM Jens Maurer via SG16 <sg16_at_[hidden]> wrote:
>
> On 26/05/2022 02.19, Hubert Tong via SG16 wrote:
> > Suggestion:
> > UAX #31 describes what characters formal languages, such as computer
> > languages, should choose for use as whitespace and syntactically
> > significant characters during the process of lexical analysis. C++
> > does not claim conformance with this requirement.
>
> Sounds good to me.
>
> Jens
>
>
>
> > In particular, the "should describe and implement" wording implies
> > more comprehensive and broader advice than is given by UAX #31. Also,
> > lexing produces tokens from characters; past that, we are dealing with
> > tokens (not characters). Commas also added as an editorial change.
> >
> > On Wed, May 25, 2022 at 6:05 PM Steve Downey via SG16
> > <sg16_at_[hidden]> wrote:
> >>
> >> E.4 R3 Pattern_White_Space and Pattern_Syntax characters [uaxid.pattern]
> >>
> >> 1 UAX #31 describes how languages that use or interpret patterns of characters, such as regular expressions or number formats, may describe that syntax with Unicode properties.
> >> 2 C++ does not do this as part of the language, deferring to library components for such usage of patterns. This requirement does not apply to C++.
> >>
> >> 1 UAX#31 describes how formal languages such as computer languages should describe and implement their use of whitespace and syntactically significant characters during the processes of lexing and parsing. C++ does not claim conformance with this requirement.
> >> --
> >> SG16 mailing list
> >> SG16_at_[hidden]
> >> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2022-05-26 07:35:24