Date: Wed, 27 May 2020 15:30:09 -0400
On 5/27/20 3:12 PM, Tom Honermann via SG16 wrote:
> On 5/27/20 1:55 PM, Steve Downey via SG16 wrote:
>> Most of the work for detecting non-normalized NFC is lookup of
>> codepoints in a fairly small table. For the case of identifiers, it
>> looks like some additional optimization may be possible, as many
>> characters are already excluded.
>> Adding
>> Detection of un-normalized text is fairly straight-forward, and GCC
>> 10 already produces a warning. Unicode Annex 15, Unicode
>> Normalization Forms, provides a quick check algorithm to test if a
>> string is in one of the normalization forms, driven by tables in the
>> Unicode database. See
>> [Detecting_Normalization_Forms](https://unicode.org/reports/tr15/#Detecting_Normalization_Forms)
>> in [@UAX15]. The tables are available at
>> [DerivedNormalizationProps.txt](http://www.unicode.org/Public/UCD/latest/ucd/DerivedNormalizationProps.txt).
>> The check algorithm will sometimes need to normalize short ranges of
>> text where detection of YES or NO is not possible for the single
>> codepoint.
>
> The added "Detecting_Normalization_Forms" link doesn't work for me; a
> local href is generated.
>
Actually, I think that only happens with the github rendered preview.
The last rendered version sent to the mailing list links properly. Ignore.
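
For reference, here is a rough sketch of the quick check the quoted text
describes, following the outline given in UAX #15. It is only an
illustration, not code from the paper or from GCC. The names QuickCheck,
CodePointProps, kSampleTable, lookup, and nfc_quick_check are all made up
for this example, and the table is a four-entry excerpt; a real
implementation would generate the full table from
DerivedNormalizationProps.txt and the canonical combining classes in the
Unicode database.

```cpp
#include <cstdint>
#include <string_view>

// Result of the quick check: Maybe means the table alone cannot decide.
enum class QuickCheck { Yes, No, Maybe };

// Hypothetical per-code-point record: canonical combining class (ccc)
// and the NFC_Quick_Check property value.
struct CodePointProps {
    char32_t code_point;
    std::uint8_t ccc;
    QuickCheck nfc_qc;
};

// Tiny illustrative excerpt (an assumption of this sketch, not a full table).
constexpr CodePointProps kSampleTable[] = {
    {0x0301, 230, QuickCheck::Maybe},  // COMBINING ACUTE ACCENT
    {0x0323, 220, QuickCheck::Maybe},  // COMBINING DOT BELOW
    {0x0340, 230, QuickCheck::No},     // COMBINING GRAVE TONE MARK
    {0x212B, 0, QuickCheck::No},       // ANGSTROM SIGN
};

// Code points absent from the excerpt are treated as ccc 0, NFC_QC = Yes.
constexpr CodePointProps lookup(char32_t cp) {
    for (const auto& entry : kSampleTable) {
        if (entry.code_point == cp) return entry;
    }
    return {cp, 0, QuickCheck::Yes};
}

// One pass over the code points: reject out-of-order combining marks and
// any code point whose NFC_Quick_Check value is No; remember any Maybe.
constexpr QuickCheck nfc_quick_check(std::u32string_view text) {
    QuickCheck result = QuickCheck::Yes;
    std::uint8_t last_ccc = 0;
    for (char32_t cp : text) {
        const CodePointProps props = lookup(cp);
        if (props.ccc != 0 && last_ccc > props.ccc) return QuickCheck::No;
        if (props.nfc_qc == QuickCheck::No) return QuickCheck::No;
        if (props.nfc_qc == QuickCheck::Maybe) result = QuickCheck::Maybe;
        last_ccc = props.ccc;
    }
    return result;
}
```

With this excerpt, U"caf\u00E9" gives Yes, U"\u212B" gives No, and
U"cafe\u0301" gives Maybe. The Maybe result is the case mentioned in the
quoted text: the checker then has to normalize that short range and compare
it with the input to get a definite answer.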
>
> Tom.
>
>>
>>
>> On Wed, May 27, 2020 at 1:33 PM Tom Honermann via SG16
>> <sg16_at_[hidden]> wrote:
>>
>> On 5/27/20 1:03 PM, Zach Laine via SG16 wrote:
>> > On Wed, May 27, 2020 at 12:01 PM Jens Maurer via SG16
>> > <sg16_at_[hidden]> wrote:
>> >> On 26/05/2020 22.51, Steve Downey via SG16 wrote:
>> >>> Find attached a draft of the UAX31 paper for discussion.
>> >>> Viewable at http://htmlpreview.github.io/?https://github.com/steve-downey/papers/blob/master/generated/p1949.html
>> >>> Source at https://github.com/steve-downey/papers/blob/master/p1949.md
>> >> I had asked earlier for some prose-text statement on the difficulty
>> >> of checking NFC.
>> >>
>> >> I can only find
>> >>
>> >> "Detection of un-normalized text is fairly straight-forward, and
>> >> GCC 10 already produces a warning. Normalizing to NFC is not much
>> >> more difficult."
>> >>
>> >> which is lacking a bit of depth.
>> >>
>> >> What exactly do I have to do to check for NFC? Check some bits in
>> >> the code points?
>> >> Consult some Unicode tables? Something else?
>> > You have to look up each adjacent pair of code points in a table, and
>> > verify that they form a valid NFC sequence.
>>
>> I think Jens' point is that the paper doesn't state that (and that it
>> should).
>>
>> Tom.
>>
>> >
>> > Zach
>>
>>
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>
>>
>
>
Received on 2020-05-27 14:33:15