Re: [SG16] D1949R4 - Unicode Identifiers

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 27 May 2020 15:12:34 -0400
On 5/27/20 1:55 PM, Steve Downey via SG16 wrote:
> Most of the work in detecting text that is not in NFC is a lookup of
> code points in a fairly small table. For the case of identifiers, it
> looks like some additional optimization may be possible, as many
> characters are already excluded.
> Adding:
> Detection of un-normalized text is fairly straight-forward, and GCC 10
> already produces a warning. Unicode Annex 15, Unicode Normalization
> Forms, provides a quick check algorithm to test if a string is in
> one of the normalization forms, driven by tables in the Unicode
> Character Database. See
> [Detecting_Normalization_Forms](https://unicode.org/reports/tr15/#Detecting_Normalization_Forms)
> in [@UAX15]. The tables are available at
> [DerivedNormalizationProps.txt](http://www.unicode.org/Public/UCD/latest/ucd/DerivedNormalizationProps.txt).
> The check algorithm will sometimes need to normalize short ranges of
> text where the quick check returns MAYBE, that is, where a definite YES
> or NO cannot be determined from a single code point.
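
For readers following along, the quick check described above can be sketched in C++ roughly as follows. This is a minimal sketch under stated assumptions, not the paper's or GCC's implementation: the names `QuickCheck`, `CodePointProps`, `kSampleTable`, `lookup`, and `nfc_quick_check` are hypothetical, and the table is a tiny hand-picked excerpt (canonical combining class and NFC_Quick_Check for five code points) standing in for the full data in the Unicode Character Database. Any code point not listed is treated as ccc 0 and NFC_QC=Yes, a simplification a real implementation cannot make.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Possible answers of the UAX #15 quick check.
enum class QuickCheck { Yes, No, Maybe };

// Per-code-point data the check needs: canonical combining class (ccc)
// and the NFC_Quick_Check property value.
struct CodePointProps {
    char32_t code_point;
    std::uint8_t ccc;
    QuickCheck nfc_qc;
};

// ASSUMPTION: illustrative excerpt only. A real checker would generate
// this table from DerivedNormalizationProps.txt and UnicodeData.txt.
constexpr CodePointProps kSampleTable[] = {
    {0x0301, 230, QuickCheck::Maybe},  // COMBINING ACUTE ACCENT
    {0x0323, 220, QuickCheck::Maybe},  // COMBINING DOT BELOW
    {0x0327, 202, QuickCheck::Maybe},  // COMBINING CEDILLA
    {0x0340, 230, QuickCheck::No},     // COMBINING GRAVE TONE MARK
    {0x212B, 0, QuickCheck::No},       // ANGSTROM SIGN
};

// Hypothetical helper: linear search of the excerpt. Unlisted code
// points default to ccc 0 and NFC_QC=Yes (valid for this sketch only).
CodePointProps lookup(char32_t cp) {
    for (const CodePointProps& entry : kSampleTable) {
        if (entry.code_point == cp) return entry;
    }
    return {cp, 0, QuickCheck::Yes};
}

// Quick check for NFC over a sequence of code points, following the
// structure of the algorithm in UAX #15.
QuickCheck nfc_quick_check(const std::u32string& text) {
    std::uint8_t last_ccc = 0;
    QuickCheck result = QuickCheck::Yes;
    for (char32_t cp : text) {
        const CodePointProps props = lookup(cp);
        // Combining marks out of canonical order: definitely not NFC.
        if (last_ccc > props.ccc && props.ccc != 0) return QuickCheck::No;
        if (props.nfc_qc == QuickCheck::No) return QuickCheck::No;
        // Maybe is sticky unless a later code point forces No.
        if (props.nfc_qc == QuickCheck::Maybe) result = QuickCheck::Maybe;
        last_ccc = props.ccc;
    }
    return result;
}
```

With this excerpt, `nfc_quick_check(U"caf\u00E9")` yields `Yes`, `nfc_quick_check(U"cafe\u0301")` yields `Maybe` (the case where a real checker would normalize that short range and compare), and `nfc_quick_check(U"\u212B")` yields `No`.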

The added "Detecting_Normalization_Forms" link doesn't work for me; a
local href is generated.

Tom.

>
>
> On Wed, May 27, 2020 at 1:33 PM Tom Honermann via SG16
> <sg16_at_[hidden]> wrote:
>
> On 5/27/20 1:03 PM, Zach Laine via SG16 wrote:
> > On Wed, May 27, 2020 at 12:01 PM Jens Maurer via SG16
> > <sg16_at_[hidden]> wrote:
> >> On 26/05/2020 22.51, Steve Downey via SG16 wrote:
> >>> Find attached a draft of the UAX31 paper for discussion.
> >>> Viewable at http://htmlpreview.github.io/?https://github.com/steve-downey/papers/blob/master/generated/p1949.html
> >>> Source at https://github.com/steve-downey/papers/blob/master/p1949.md
> >> I had asked earlier for some prose-text statement on the difficulty
> >> of checking NFC.
> >>
> >> I can only find
> >>
> >> "Detection of un-normalized text is fairly straight-forward, and GCC 10
> >> already produces a warning. Normalizing to NFC is not much more difficult."
> >>
> >> which is lacking a bit of depth.
> >>
> >> What exactly do I have to do to check for NFC? Check some bits in the
> >> code points? Consult some Unicode tables? Something else?
> > You have to look up each adjacent pair of code points in a table, and
> > verify that they form a valid NFC sequence.
>
> I think Jens' point is that the paper doesn't state that (and that it
> should).
>
> Tom.
>
> >
> > Zach
>
>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
>


Received on 2020-05-27 14:15:49