SG16

Subject: Re: A UTF-8 environment specification; an alternative to assuming UTF-8 based on choice of literal encoding
From: Peter Brett (pbrett_at_[hidden])
Date: 2021-08-09 07:19:07


Hi Charlie,

Now that we have real-world evidence that UTF-8 can be validated in less than one instruction per byte on commodity platforms (and somewhat faster than memcpy()-ing those bytes), when do you think we can stop being so concerned about the performance impact of UTF-8 validation?

    Validating UTF-8 in less than one instruction per byte
    J. Keiser and D. Lemire
    https://doi.org/10.1002/spe.2920
    https://arxiv.org/abs/2010.03090

Best regards,

                  Peter

> -----Original Message-----
> From: SG16 <sg16-bounces_at_[hidden]> On Behalf Of Charlie Barto via
> SG16
> Sent: 29 July 2021 23:33
> To: Corentin Jabot <corentinjabot_at_[hidden]>
>
> Yes, and in any case, if we wanted to ensure the parameters were actually
> UTF-8, the runtime startup code would have to do that check. If users do
> the checking themselves, they can defer or omit validity checks in some
> cases. This can be important: to check that the string is actually well
> formed you need to _actually look_ at every single byte and then run a
> sequence of probably a few dozen instructions to decide if it's valid.
> Sometimes it's OK to just assume it _is_ valid if you don't do anything
> that actually requires the whole thing to be valid. For example, you may
> linearly search for delimiters and then parse the text between them. As
> long as you are careful about validating the text between the delimiters,
> it doesn't matter if some other part of the string is bogus, and you never
> have to execute the instructions that would check those other bits of the
> string.
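
To make the "search for delimiters, validate only what you parse" idea concrete, here is a minimal sketch in C++17. It is an illustration only: the names is_valid_utf8 and checked_field are hypothetical, the delimiter is assumed to be a single ASCII byte, and the validator is a plain scalar loop following the well-formed byte ranges in the Unicode Standard, not the SIMD algorithm from the Keiser and Lemire paper.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper: scalar check that `s` is well-formed UTF-8.
// Rejects overlong forms, surrogates (ED A0..BF), values above U+10FFFF,
// stray continuation bytes, and truncated sequences.
bool is_valid_utf8(std::string_view s) {
    const auto* p = reinterpret_cast<const unsigned char*>(s.data());
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = p[i];
        if (b0 < 0x80) { ++i; continue; }        // ASCII
        std::size_t len = 0;
        unsigned char lo = 0x80, hi = 0xBF;      // allowed range for byte 2
        if (b0 >= 0xC2 && b0 <= 0xDF)      { len = 2; }
        else if (b0 == 0xE0)               { len = 3; lo = 0xA0; }  // no overlongs
        else if (b0 == 0xED)               { len = 3; hi = 0x9F; }  // no surrogates
        else if (b0 >= 0xE1 && b0 <= 0xEF) { len = 3; }
        else if (b0 == 0xF0)               { len = 4; lo = 0x90; }  // no overlongs
        else if (b0 >= 0xF1 && b0 <= 0xF3) { len = 4; }
        else if (b0 == 0xF4)               { len = 4; hi = 0x8F; }  // <= U+10FFFF
        else return false;                       // 80..C1 and F5..FF never lead
        if (n - i < len) return false;           // truncated sequence
        if (p[i + 1] < lo || p[i + 1] > hi) return false;
        for (std::size_t k = 2; k < len; ++k)
            if (p[i + k] < 0x80 || p[i + k] > 0xBF) return false;
        i += len;
    }
    return true;
}

// Hypothetical helper: return the `index`-th field of `s`, split on the
// ASCII byte `delim`, but only if that one field is well-formed UTF-8.
// Bytes outside the requested field are scanned for the delimiter but
// never validated.
std::optional<std::string_view> checked_field(std::string_view s, char delim,
                                              std::size_t index) {
    std::size_t start = 0;
    for (std::size_t k = 0;; ++k) {
        const std::size_t end = s.find(delim, start);
        if (k == index) {
            const std::string_view field =
                s.substr(start, end == std::string_view::npos
                                    ? std::string_view::npos
                                    : end - start);
            if (is_valid_utf8(field)) return field;
            return std::nullopt;
        }
        if (end == std::string_view::npos) return std::nullopt;  // no such field
        start = end + 1;
    }
}
```

With an input such as "name=\xC3\xA9;bad=\xFF\xFE;ok=1" and ';' as the delimiter, fields 0 and 2 come back as valid, while asking for field 1 yields std::nullopt. The invalid bytes in field 1 cost nothing when only the other fields are requested, which is the saving described above; whether that saving still matters given the throughput reported in the paper is the open question.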

