Date: Wed, 27 May 2020 13:55:04 -0400
Most of the work in detecting text that is not in NFC is looking up code
points in a fairly small table. For identifiers, some additional
optimization looks possible, as many characters are already excluded.
Adding:
Detection of un-normalized text is fairly straightforward, and GCC 10
already produces a warning. Unicode Standard Annex #15, Unicode
Normalization Forms, provides a quick check algorithm to test whether a
string is in one of the normalization forms, driven by tables in the
Unicode Character Database. See
[Detecting_Normalization_Forms](
https://unicode.org/reports/tr15/#Detecting_Normalization_Forms) in
[@UAX15]. The tables are available at [DerivedNormalizationProps.txt](
http://www.unicode.org/Public/UCD/latest/ucd/DerivedNormalizationProps.txt).
The quick check returns MAYBE when a single code point cannot be
classified as YES or NO by itself; in that case the checker needs to
normalize a short range of the surrounding text to get a definite answer.
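
To make the shape of the check concrete, here is a rough C++ sketch of the
quick check loop as described in UAX #15. It is an illustration only, not
text for the paper and not what GCC does. The names `QuickCheck`, `Props`,
`lookup`, and `nfc_quick_check` are made up for this example, and `lookup`
is a tiny hand-copied excerpt standing in for tables that a real
implementation would generate from the Unicode Character Database
(Canonical_Combining_Class plus NFC_Quick_Check from
DerivedNormalizationProps.txt).

```cpp
#include <cstdint>
#include <string_view>

// Result values mirroring YES / NO / MAYBE in UAX #15.
enum class QuickCheck { Yes, No, Maybe };

// The two per-code-point properties the loop consults.
struct Props {
    std::uint8_t ccc;   // Canonical_Combining_Class
    QuickCheck nfc_qc;  // NFC_Quick_Check
};

// ASSUMPTION: hypothetical stand-in for the real generated tables.
// Only four code points are listed; everything else is treated as a
// starter with NFC_Quick_Check=Yes, which is NOT true of the full UCD.
inline Props lookup(char32_t cp) {
    switch (cp) {
        case 0x0301: return {230, QuickCheck::Maybe};  // COMBINING ACUTE ACCENT
        case 0x0323: return {220, QuickCheck::Maybe};  // COMBINING DOT BELOW
        case 0x0340: return {230, QuickCheck::No};     // COMBINING GRAVE TONE MARK
        case 0x212B: return {0, QuickCheck::No};       // ANGSTROM SIGN
        default:     return {0, QuickCheck::Yes};
    }
}

// One pass over the code points, one table lookup per code point.
inline QuickCheck nfc_quick_check(std::u32string_view text) {
    std::uint8_t last_ccc = 0;
    QuickCheck result = QuickCheck::Yes;
    for (char32_t cp : text) {
        const Props p = lookup(cp);
        // Combining marks out of canonical order: definitely not normalized.
        if (p.ccc != 0 && last_ccc > p.ccc) return QuickCheck::No;
        // The code point can never appear in NFC text.
        if (p.nfc_qc == QuickCheck::No) return QuickCheck::No;
        // Might compose with a neighbour; remember that a slow path is needed.
        if (p.nfc_qc == QuickCheck::Maybe) result = QuickCheck::Maybe;
        last_ccc = p.ccc;
    }
    return result;
}
```

With this excerpt, `nfc_quick_check(U"abc")` gives Yes,
`nfc_quick_check(U"\u212B")` gives No, and `nfc_quick_check(U"e\u0301")`
gives Maybe. Only the Maybe case needs the short range around the
offending code point to be normalized and compared against the input.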
On Wed, May 27, 2020 at 1:33 PM Tom Honermann via SG16 <
sg16_at_[hidden]> wrote:
> On 5/27/20 1:03 PM, Zach Laine via SG16 wrote:
> > On Wed, May 27, 2020 at 12:01 PM Jens Maurer via SG16
> > <sg16_at_[hidden]> wrote:
> >> On 26/05/2020 22.51, Steve Downey via SG16 wrote:
> >>> Find attached a draft of the UAX31 paper for discussion.
> >>> Viewable at
> http://htmlpreview.github.io/?https://github.com/steve-downey/papers/blob/master/generated/p1949.html
> >>> Source at https://github.com/steve-downey/papers/blob/master/p1949.md
> >> I had asked earlier for some prose-text statement on the difficulty
> >> of checking NFC.
> >>
> >> I can only find
> >>
> >> "Detection of un-normalized text is fairly straight-forward, and GCC 10
> already produces a warning. Normalizing to NFC is not much more difficult."
> >>
> >> which is lacking a bit of depth.
> >>
> >> What exactly do I have to do to check for NFC? Check some bits in the
> code points?
> >> Consult some Unicode tables? Something else?
> > You have to look up each adjacent pair of code points in a table, and
> > verify that they form a valid NFC sequence.
>
> I think Jens' point is that the paper doesn't state that (and that it
> should).
>
> Tom.
>
> >
> > Zach
Received on 2020-05-27 12:58:21