sg16: Re: [SG16] A UTF-8 environment specification; an alternative to assuming UTF-8 based on choice of literal encoding

From: Peter Brett <pbrett_at_[hidden]>
Date: Mon, 9 Aug 2021 12:19:07 +0000

Hi Charlie,

Now that we have real-world evidence that UTF-8 can be validated in less than one instruction per byte on commodity platforms (and somewhat faster than memcpy()-ing those bytes), when do you think we can stop being so concerned about the performance impact of UTF-8 validation?

    Validating UTF-8 in less than one instruction per byte
    J. Keiser and D. Lemire
    https://doi.org/10.1002/spe.2920
    https://arxiv.org/abs/2010.03090

Best regards,

                  Peter

> -----Original Message-----
> From: SG16 <sg16-bounces_at_lists.isocpp.org> On Behalf Of Charlie Barto via
> SG16
> Sent: 29 July 2021 23:33
> To: Corentin Jabot <corentinjabot_at_gmail.com>
>
> yes, and in any case if we wanted to ensure the parameters were actually
> utf-8 the runtime startup code would have to do that check. If users are
> checking they defer or omit validity checks in some cases. This can be
> important, to check that the string is actually well formed you need to
> _actually look_ at every single byte and then do a sequence of probably a
> few dozen instructions to decide if it's valid. Sometimes it's OK to just
> assume it _is_ valid if you don't do anything that actually requires the
> whole thing be valid. For example you may linearly search for delimiters
> then parse text between them, as long as you are careful about validating
> the text between the delimiters it doesn't matter if some other part of
> the string is bogus, and you never have to execute the instructions that
> would check those other bits of the string.

Received on 2021-08-09 07:19:13