sg16: Re: [SG16] A UTF-8 environment specification; an alternative to assuming UTF-8 based on choice of literal encoding

From: Charlie Barto <Charles.Barto_at_[hidden]>
Date: Mon, 9 Aug 2021 14:51:35 +0000

It's not just performance (although my point about actually having to look at each byte of the input string still stands). As far as I can tell, the "less than one instruction per byte" vectorized approach described in the linked paper does not tell you _where_ the error occurred, so it is not useful if you need to handle valid UTF-8 that follows invalid UTF-8.
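
To make the "where" point concrete, here is a minimal scalar sketch (not the vectorized algorithm from the paper) of a validator that reports the byte offset of the first ill-formed sequence. The name first_invalid_utf8 is hypothetical; the accepted byte ranges follow the well-formed UTF-8 byte sequence table in the Unicode Standard.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper: returns the byte offset of the first ill-formed
// sequence, or std::nullopt if the whole input is well-formed UTF-8.
std::optional<std::size_t> first_invalid_utf8(std::string_view s) {
    const auto* p = reinterpret_cast<const unsigned char*>(s.data());
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = p[i];
        if (b <= 0x7F) { ++i; continue; }      // ASCII
        std::size_t len = 0;
        unsigned char lo = 0x80, hi = 0xBF;    // allowed range for the 2nd byte
        if (b >= 0xC2 && b <= 0xDF)      { len = 2; }
        else if (b == 0xE0)              { len = 3; lo = 0xA0; } // no overlongs
        else if (b == 0xED)              { len = 3; hi = 0x9F; } // no surrogates
        else if (b >= 0xE1 && b <= 0xEF) { len = 3; }
        else if (b == 0xF0)              { len = 4; lo = 0x90; } // no overlongs
        else if (b >= 0xF1 && b <= 0xF3) { len = 4; }
        else if (b == 0xF4)              { len = 4; hi = 0x8F; } // <= U+10FFFF
        else return i;                         // stray continuation, C0/C1, F5..FF
        if (n - i < len) return i;             // truncated sequence
        if (p[i + 1] < lo || p[i + 1] > hi) return i;
        for (std::size_t k = 2; k < len; ++k)
            if (p[i + k] < 0x80 || p[i + k] > 0xBF) return i;
        i += len;
    }
    return std::nullopt;
}
```

With the offset in hand, a caller can skip or replace the bad sequence and resume validation on the remainder of the string, which a plain yes/no answer does not allow.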

In any case, my main issue isn't with performance; it's that _forcing_ validation at program startup makes some perfectly reasonable programs (such as "cp" on Windows) impossible to write.

If we offer a "std::arguments"-type interface, it's probably a good idea to offer both the UTF-8 and WTF-8 interfaces. This is, in fact, exactly what Rust does, and I think they got this right. For the validated version they also validate each argument individually, so, for example, your program won't die if it's put in a directory that has bogus UTF-8 in its path (argv[0]).
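
A rough sketch of what per-argument validation could look like follows. This is an illustration only, not a proposed std::arguments design: Argument, classify_arguments and is_utf8 are hypothetical names, and the raw bytes are always kept so that a program like "cp" can still pass them through untouched.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: true if s is well-formed UTF-8 (decode-and-check).
bool is_utf8(std::string_view s) {
    static constexpr char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    while (i < s.size()) {
        const auto b = static_cast<unsigned char>(s[i]);
        const std::size_t len = b < 0x80          ? 1
                              : (b >> 5) == 0x06  ? 2
                              : (b >> 4) == 0x0E  ? 3
                              : (b >> 3) == 0x1E  ? 4 : 0;
        if (len == 0 || s.size() - i < len) return false;
        char32_t cp = len == 1 ? b : b & (0xFF >> (len + 1));
        for (std::size_t k = 1; k < len; ++k) {
            const auto c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;   // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates, and values above U+10FFFF.
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += len;
    }
    return true;
}

// Hypothetical argument record: raw bytes plus a per-argument validity flag.
struct Argument {
    std::string raw;   // bytes exactly as received from the environment
    bool valid_utf8;   // result of validating this argument alone
};

// Validates each argument independently; one bad argument (for example a
// bogus argv[0]) does not affect the others or abort the program.
std::vector<Argument> classify_arguments(int argc, const char* const* argv) {
    std::vector<Argument> out;
    for (int i = 0; i < argc; ++i)
        out.push_back({argv[i], is_utf8(argv[i])});
    return out;
}
```

A caller that only needs argv[1] as text can check that one flag and ignore the state of argv[0] entirely.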

Charlie.

-----Original Message-----
From: Peter Brett <pbrett_at_[hidden]>
Sent: Monday, August 9, 2021 5:19 AM
To: Charlie Barto <Charles.Barto_at_[hidden]>
Cc: sg16_at_[hidden]
Subject: RE: [SG16] A UTF-8 environment specification; an alternative to assuming UTF-8 based on choice of literal encoding

Hi Charlie,

Now that we have real-world evidence that UTF-8 can be validated in less than one instruction per byte on commodity platforms (and somewhat faster than memcpy()-ing those bytes), when do you think we can stop being so concerned about the performance impact of UTF-8 validation?

    Validating UTF-8 in less than one instruction per byte
    J. Keiser and D. Lemire

Best regards,

                  Peter

> -----Original Message-----
> From: SG16 <sg16-bounces_at_[hidden]> On Behalf Of Charlie Barto via SG16
> Sent: 29 July 2021 23:33
> To: Corentin Jabot <corentinjabot_at_[hidden]>
>
> Yes, and in any case, if we wanted to ensure the parameters were
> actually UTF-8, the runtime startup code would have to do that check.
> If users do the checking themselves, they can defer or omit validity
> checks in some cases. This can be important: to check that the string
> is actually well formed you need to _actually look_ at every single
> byte and then do a sequence of probably a few dozen instructions to
> decide if it's valid. Sometimes it's OK to just assume it _is_ valid
> if you don't do anything that actually requires the whole thing be
> valid. For example, you may linearly search for delimiters and then
> parse the text between them. As long as you are careful about
> validating the text between the delimiters, it doesn't matter if some
> other part of the string is bogus, and you never have to execute the
> instructions that would check those other bits of the string.
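
As a rough illustration of the deferred-checking idea in the quoted text, the sketch below scans for an ASCII delimiter and validates only the requested field. The name validated_field is hypothetical, and the validity check is supplied by the caller. The byte-wise search is safe because bytes below 0x80 never occur inside a multi-byte UTF-8 sequence.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper: returns field number `index` (0-based) of `line`,
// split on the ASCII byte `delim`, provided that field alone passes
// `is_valid`. Other fields are scanned for the delimiter but never validated.
template <class Validator>
std::optional<std::string_view> validated_field(std::string_view line,
                                                char delim,
                                                std::size_t index,
                                                Validator&& is_valid) {
    std::size_t start = 0;
    for (std::size_t n = 0; n < index; ++n) {
        const std::size_t pos = line.find(delim, start);
        if (pos == std::string_view::npos) return std::nullopt; // too few fields
        start = pos + 1;
    }
    const std::size_t end = line.find(delim, start);
    const std::string_view field = line.substr(
        start, end == std::string_view::npos ? std::string_view::npos
                                             : end - start);
    if (!is_valid(field)) return std::nullopt;  // only this field is checked
    return field;
}
```

Requesting field 1 of a line whose field 0 contains bogus bytes still succeeds, since the validator only ever sees the bytes of field 1.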

Received on 2021-08-09 09:51:40